Specifically, use a union rather than pointer-based type punning to convert the Long value returned by GetContentOrigin() to a Point. This makes for more readable code and should also generate somewhat better assembly code.
DoReadTCP automatically disposes the old read buffer handle before doing the read, and then locks the new read buffer handle before returning. This removes the need for most explicit locking/unlocking of the handle. A macro is provided to dispose of the handle earlier (useful in a few cases where it may be big, and another read isn't immediately performed).
Also, set displayInProgress to FALSE in NextRect, rather than in code for each encoding.
They are now called only every 16 lines, instead of every line. This is enough to maintain reasonable responsiveness, while providing a measurable speedup.
This is only a win if we can use the optimized case a reasonable proportion of the time (~40% or more), but that should be the case for most real screen images.
The equality comparisons are written with XORs because that produces better assembly code.