So, I started playing around with the graphics drawing code today. I was basically looking at the low-level software bitmap copy for sprites. (Sprites are bitmaps with a transparent background, or an irregular shape, such as vehicles, that are overlaid on top of some other background.)
For some time I had suspected the sprite copy routines could be rewritten to increase their speed, mostly because of a function call made for each scanline drawn in the bitmap. I figured a double loop with inline code instead of a function call could be faster. It would require more code overall though, since the part that iterates over the scanline is shared by a number of different types of bitmap copies, and inlining means duplicating it for each type.
In short, I have some code that seems to run about 14% faster. Mind you, this doesn't result in a noticeable speed difference while playing, since not a whole lot of time is spent in the sprite drawing code. For each frame drawn, measurements suggested that almost 240000 processor cycles are spent with the original code, and a little over 200000 cycles are spent with the modified code. The measurements are of course subject to error due to paging and task switching effects. Improving the non-sprite drawing code could possibly result in more of a speedup. This would be the code that draws the screen background, for instance. As it covers much more area than the sprites drawn on top of it, it would likely account for more CPU time.
The object of this exercise was basically to write a replacement function. It takes a struct parameter with the following layout:
BitmapCopyInfo
--------------
Offset  Size  Type            Name                  Notes
0x0     4     void*           sourceImageBuffer
0x4     4     void*           destImageBuffer
0x8     4     void*           overlayMask           [blight or day and night]
0xC     4     int             overlayMaskBitOffset
0x10    4     int             width
0x14    4     int             height
0x18    4     int             sourcePitch
0x1C    4     int             destPitch
0x20    4     enum DrawMethod drawMethod
0x24    4     union
0x24    4       short[256]*   darkPal16             [current light level]
0x24    4       int           bitXOffset            [For 1 bpp images]
0x28    4     int             blightOverlay         [0..15]
0x2C    4     short[256]*     lightPal16            [full daylight]
---------------
The function prototype would be:
void __cdecl DrawSprite(BitmapCopyInfo* bitmapCopyInfo);
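As a rough sketch, the layout above could be written as a C struct along these lines. The `Ptr32` typedef, the union name `u`, and the use of `int32_t` for the enum field are assumptions made here so that the offsets can be checked on any host; the game itself is a 32-bit binary where these fields are ordinary pointers and an enum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: pointers are modelled as 32-bit addresses so the offsets
   match the table even when compiled on a 64-bit host. */
typedef uint32_t Ptr32;

typedef struct BitmapCopyInfo {
    Ptr32   sourceImageBuffer;    /* 0x00 */
    Ptr32   destImageBuffer;      /* 0x04 */
    Ptr32   overlayMask;          /* 0x08: blight or day and night */
    int32_t overlayMaskBitOffset; /* 0x0C */
    int32_t width;                /* 0x10 */
    int32_t height;               /* 0x14 */
    int32_t sourcePitch;          /* 0x18 */
    int32_t destPitch;            /* 0x1C */
    int32_t drawMethod;           /* 0x20: enum DrawMethod (values not listed) */
    union {                       /* 0x24 */
        Ptr32   darkPal16;        /* short[256]*, current light level */
        int32_t bitXOffset;       /* for 1 bpp images */
    } u;
    int32_t blightOverlay;        /* 0x28: 0..15 */
    Ptr32   lightPal16;           /* 0x2C: short[256]*, full daylight */
} BitmapCopyInfo;

/* Compile-time checks against the offsets used by the assembly code. */
static_assert(offsetof(BitmapCopyInfo, width) == 0x10, "width");
static_assert(offsetof(BitmapCopyInfo, height) == 0x14, "height");
static_assert(offsetof(BitmapCopyInfo, sourcePitch) == 0x18, "sourcePitch");
static_assert(offsetof(BitmapCopyInfo, destPitch) == 0x1C, "destPitch");
static_assert(offsetof(BitmapCopyInfo, drawMethod) == 0x20, "drawMethod");
static_assert(offsetof(BitmapCopyInfo, u) == 0x24, "union");
static_assert(sizeof(BitmapCopyInfo) == 0x30, "total size");
```

The static assertions cover the offsets that both assembly listings read, such as `[EBP+10]` for the width and `[EBP+24]` for the palette pointer.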
The original code was the following:
00586CEF > >PUSH EBP ; Function: DrawSprite(BitmapCopyInfo* bitmapCopyInfo)
00586CF0 >MOV EBP,ESP
00586CF2 >PUSHAD
00586CF3 >MOV EBP,DWORD PTR SS:[EBP+8] ; EBP := [param1] BitmapCopyInfo* bitmapCopyInfo
00586CF6 >MOV ECX,DWORD PTR SS:[EBP+20] ; ECX := BitmapCopyInfo.drawMethod
00586CF9 >SHL ECX,2 ; [ECX := drawMethod * 4]
00586CFC >MOV EDI,DWORD PTR DS:[ECX+586CDF]
00586D02 >SUB EDI,Outpost2.00586D3C ; [* Calculate offset from end of CALL instruction *]
00586D08 >MOV DWORD PTR DS:[586D38],EDI ; [* Hardcode CALL address into instruction *]
00586D0E >MOV EAX,DWORD PTR SS:[EBP+18] ; EAX := BitmapCopyInfo.sourcePitch
00586D11 >MOV EBX,DWORD PTR SS:[EBP+10] ; EBX := BitmapCopyInfo.sourceWidth
00586D14 >MOV EDX,EBX
00586D16 >MOV ESI,DWORD PTR SS:[EBP+1C] ; ESI := BitmapCopyInfo.destPitch
00586D19 >SHL EDX,1
00586D1B >SUB ESI,EDX
00586D1D >SUB EAX,EBX ; EAX := sourcePitch - sourceWidth
00586D1F >MOV DWORD PTR DS:[<int destPitchDelta>],ESI ; [global] int destPitchDelta := ESI
00586D25 >MOV DWORD PTR DS:[<int sourcePitchDelta>],EAX ; [global] int sourcePitchDelta := EAX
00586D2A >MOV ESI,DWORD PTR SS:[EBP] ; ESI := BitmapCopyInfo.sourceImageBuffer*
00586D2D >MOV EDI,DWORD PTR SS:[EBP+4] ; EDI := BitmapCopyInfo.destImageBuffer*
00586D30 >MOV ECX,DWORD PTR SS:[EBP+10] ; [LoopStart]: ECX := BitmapCopyInfo.sourceWidth
00586D33 >PUSH EBP ; [Save EBP]
00586D34 >MOV EBP,DWORD PTR SS:[EBP+24] ; EBP := BitmapCopyInfo.darkPal16*
00586D37 >CALL 128CC3B4
00586D3C >POP EBP ; [Restore EBP]
00586D3D >ADD ESI,DWORD PTR DS:[<int sourcePitchDelta>] ; ESI := sourceImageBuffer* + sourcePitchDelta
00586D43 >ADD EDI,DWORD PTR DS:[<int destPitchDelta>] ; EDI := destImageBuffer* + destPitchDelta
00586D49 >DEC DWORD PTR SS:[EBP+14] ; BitmapCopyInfo.sourceHeight--
00586D4C ^ >JNZ SHORT Outpost2.00586D30 ; -> LoopStart
00586D4E >POPAD
00586D4F >LEAVE
00586D50 >RETN
00586D51 > >MOV EDX,EBP ; Function: DrawScanline8Pal16Transparent0
00586D53 >SHR ECX,1
00586D55 >JNB SHORT Outpost2.00586D6D
00586D57 >MOVZX EAX,BYTE PTR DS:[ESI] ; EAX := *sourceImageBuffer
00586D5A >ADD EAX,EAX ; [EAX := sourcePixel * 2]
00586D5C >JE SHORT Outpost2.00586D65
00586D5E >MOV AX,WORD PTR DS:[EAX+EDX] ; AX := pal16[sourcePixel * 2]
00586D62 >MOV WORD PTR DS:[EDI],AX ; *destImageBuffer := AX
00586D65 >INC ESI ; ESI := sourceImageBuffer*++
00586D66 >ADD EDI,2 ; EDI := destImageBuffer* += 2
00586D69 >OR ECX,ECX
00586D6B >JE SHORT Outpost2.00586D96 ; -> Return
00586D6D >XOR EBX,EBX ; [LoopStart]:
00586D6F >XOR EAX,EAX
00586D71 >MOV BL,BYTE PTR DS:[ESI+1] ; BL := *(sourceImageBuffer + 1)
00586D74 >MOV AL,BYTE PTR DS:[ESI] ; AL := *sourceImageBuffer
00586D76 >ADD EBX,EBX ; [EBX := sourcePixel2 * 2]
00586D78 >JE SHORT Outpost2.00586D97
00586D7A >ADD EAX,EAX ; [EAX := sourcePixel1 * 2]
00586D7C >JE SHORT Outpost2.00586DAC
00586D7E >MOV BX,WORD PTR DS:[EBX+EDX] ; [DrawBothPixels]: BX := pal16[sourcePixel2 * 2]
00586D82 >ADD ESI,2 ; ESI := sourceImageBuffer* += 2
00586D85 >MOV AX,WORD PTR DS:[EAX+EDX] ; AX := pal16[sourcePixel1 * 2]
00586D89 >SHL EBX,10 ; EBX := destPixel2 << 16
00586D8C >OR EAX,EBX ; EAX := destPixel1 | (destPixel2 << 16)
00586D8E >MOV DWORD PTR DS:[EDI],EAX ; *destImageBuffer := EAX
00586D90 >ADD EDI,4 ; EDI := destImageBuffer* += 4
00586D93 >DEC ECX
00586D94 ^ >JNZ SHORT Outpost2.00586D6D ; -> LoopStart
00586D96 >RETN
00586D97 >ADD EAX,EAX ; [EAX := sourcePixel1 * 2]
00586D99 >JE SHORT Outpost2.00586DA2 ; -> Skip drawing pixels
00586D9B >MOV BX,WORD PTR DS:[EAX+EDX] ; [DrawFirstPixelOnly]:
00586D9F >MOV WORD PTR DS:[EDI],BX ; *destImageBuffer := BX
00586DA2 >ADD ESI,2 ; [LoopEpilog]: ESI := sourceImageBuffer* += 2
00586DA5 >ADD EDI,4 ; EDI := destImageBuffer* += 4
00586DA8 >DEC ECX
00586DA9 ^ >JNZ SHORT Outpost2.00586D6D ; -> LoopStart
00586DAB >RETN
00586DAC >MOV AX,WORD PTR DS:[EBX+EDX] ; [DrawSecondPixelOnly]: AX := pal16[sourcePixel2 * 2]
00586DB0 >ADD ESI,2 ; ESI := sourceImageBuffer* += 2
00586DB3 >MOV WORD PTR DS:[EDI+2],AX ; *(destImageBuffer + 2) := AX
00586DB7 >ADD EDI,4 ; EDI := destImageBuffer* += 4
00586DBA >DEC ECX
00586DBB ^ >JNZ SHORT Outpost2.00586D6D ; -> LoopStart
00586DBD >RETN
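For readers who don't want to trace the disassembly, here is an approximate C rendering of what the listing above appears to do. The name `DrawSpriteOriginal`, the `ScanlineFn` typedef, and the flattened parameter list are made up for illustration. The function pointer stands in for the CALL whose target is patched in at run time, and the combined 32-bit store assumes a little-endian machine such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One scanline: 8-bit palette indices in, 16-bit pixels out, index 0 is
   transparent. Pixels are handled in pairs, as in the disassembly. */
static void DrawScanline8Pal16Transparent0(const uint8_t *src, uint16_t *dst,
                                           int width, const uint16_t *pal16)
{
    if (width & 1) {                    /* odd width: draw one pixel first */
        if (*src != 0) *dst = pal16[*src];
        src++;
        dst++;
    }
    for (int pairs = width >> 1; pairs > 0; pairs--) {
        uint8_t p1 = src[0], p2 = src[1];
        if (p1 != 0 && p2 != 0) {
            /* Both opaque: one 32-bit store, first pixel in the low half
               (assumes little-endian byte order). */
            uint32_t both = (uint32_t)pal16[p1] | ((uint32_t)pal16[p2] << 16);
            memcpy(dst, &both, sizeof both);
        } else if (p1 != 0) {
            dst[0] = pal16[p1];         /* first pixel only */
        } else if (p2 != 0) {
            dst[1] = pal16[p2];         /* second pixel only */
        }
        src += 2;
        dst += 2;
    }
}

typedef void (*ScanlineFn)(const uint8_t *, uint16_t *, int, const uint16_t *);

/* Outer loop: one indirect call per scanline. Pitches are in bytes. */
static void DrawSpriteOriginal(const uint8_t *src, uint16_t *dst,
                               int width, int height,
                               int sourcePitch, int destPitch,
                               const uint16_t *pal16, ScanlineFn drawScanline)
{
    assert((destPitch & 1) == 0);       /* keeps 16-bit pixels aligned */
    for (int y = 0; y < height; y++) {
        drawScanline(src, dst, width, pal16);
        src += sourcePitch;
        dst = (uint16_t *)((uint8_t *)dst + destPitch);
    }
}
```

The per-scanline indirect call is the overhead the rewrite was trying to remove.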
This was replaced with the following new code, which expects the BitmapCopyInfo pointer to already be in ECX:
// About 14% faster
__asm
{
    sub esp, 0x8
    push ebx
    push esi
    push edi
    push ebp
    ; Make sure we have something to draw
    mov edx, [ecx + 0x14]       ; height
    mov ebx, [ecx + 0x10]       ; width
    or edx, edx
    jz Return
    or ebx, ebx
    jz Return
    ; Precalculate
    lea esi, [ebx * 2]          ; width*2
    mov eax, [ecx + 0x18]       ; sourcePitch
    mov edi, [ecx + 0x1C]       ; destPitch
    sub eax, ebx                ; sourcePitch - width
    sub edi, esi                ; destPitch - width*2
    mov [esp + 0x10], eax       ; sourcePitchDelta
    mov [esp + 0x14], edi       ; destPitchDelta
    ; Cache values in registers
    mov esi, [ecx]              ; sourceImage*
    mov edi, [ecx + 0x4]        ; destImage*
    mov ebp, [ecx + 0x24]       ; palette16*
    mov ecx, ebx                ; width
DrawPixelLoopStart:
    mov al, [esi]
    inc esi
    ;lodsb
    or al, al
    jz SkipPixelDraw
    movzx eax, al
    mov ax, [ebp + eax*2]
    ;stosw
    mov [edi], ax
SkipPixelDraw:
    add edi, 2
    dec ecx
    jnz DrawPixelLoopStart
    mov ecx, ebx                ; width
    add esi, [esp + 0x10]       ; sourcePitchDelta
    add edi, [esp + 0x14]       ; destPitchDelta
    dec edx                     ; heightRemaining
    jnz DrawPixelLoopStart
Return:
    pop ebp
    pop edi
    pop esi
    pop ebx
    add esp, 0x8
}
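The same double loop can be sketched in C for comparison. The name `DrawSpriteInline` and the flattened parameter list are invented for this example; it follows the structure of the assembly above (precalculated pitch deltas, one pixel per inner iteration) but is not the code that was measured.

```c
#include <assert.h>
#include <stdint.h>

/* Double loop with the per-pixel work inline: 8-bit palette indices in,
   16-bit pixels out, index 0 is transparent. Pitches are in bytes. */
static void DrawSpriteInline(const uint8_t *src, uint16_t *dst,
                             int width, int height,
                             int sourcePitch, int destPitch,
                             const uint16_t *pal16)
{
    /* The assembly only tests for zero; negative sizes are also rejected
       here as an extra safety check. */
    if (width <= 0 || height <= 0) return;
    assert((destPitch & 1) == 0);       /* keeps 16-bit pixels aligned */

    /* Bytes to skip from the end of one scanline to the start of the next. */
    int sourcePitchDelta = sourcePitch - width;
    int destPitchDelta = destPitch - width * 2;

    for (int rows = height; rows > 0; rows--) {
        for (int x = width; x > 0; x--) {
            uint8_t index = *src++;
            if (index != 0) {
                *dst = pal16[index];    /* opaque pixel: palette lookup */
            }
            dst++;                      /* transparent pixels are skipped */
        }
        src += sourcePitchDelta;
        dst = (uint16_t *)((uint8_t *)dst + destPitchDelta);
    }
}
```

Whether a compiler turns this into something as tight as the hand-written loop depends on the compiler and its settings.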
I had a few other ideas that I tried. For instance, I tried blocking memory reads and writes to take advantage of the full 32-bit register size. I also tried a few memory alignment tricks, since access to a 32-bit value that's aligned on a 32-bit boundary is faster than an unaligned access. Unfortunately, these changes led to quite an increase in code size and a lot of added complexity. The simpler code was both faster and easier to write and debug. This is possibly due to code caching effects, as the simpler code has a much smaller loop that is more likely to fit in the instruction cache. It may also be due to the simpler loop structure, which might have led to fewer pipeline stalls due to branching.
I still have a few tricks up my sleeve that I'd like to try though. For one, this could be an excellent excuse to try using MMX instructions and registers. I wanted to write normal integer code first that didn't rely on MMX, though, since some early Pentiums don't have MMX and I wanted something a little more universal. Another thing I wanted to try was to modify the output blocking code so that it avoids branches by reading the destination bitmap first. The original destination value could then be written back for transparent pixels as part of a full block write, instead of branching to partial block writing code. This would probably work best with the MMX idea.
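The branch-avoiding idea can be illustrated with plain integer C, one pixel at a time. This is only a sketch of the read-modify-write trick, with an invented function name, and it has not been timed. An MMX version would apply the same mask-and-merge step to several 16-bit pixels per register, though the palette lookup itself would still have to be done per pixel.

```c
#include <assert.h>
#include <stdint.h>

/* One scanline without a per-pixel branch: the destination pixel is read,
   and a mask selects either the palette colour or the old value. */
static void DrawScanlineBranchless(const uint8_t *src, uint16_t *dst,
                                   int width, const uint16_t *pal16)
{
    assert(width >= 0);
    for (int x = 0; x < width; x++) {
        uint8_t index = src[x];
        /* 0xFFFF for an opaque pixel, 0x0000 for transparent index 0. */
        uint16_t mask = (uint16_t)-(index != 0);
        /* Keep the new colour where mask is set, the old pixel elsewhere. */
        dst[x] = (uint16_t)((pal16[index] & mask) | (dst[x] & ~mask));
    }
}
```

The trade-off is an extra read of every destination pixel, including the ones that end up unchanged, so it is not obvious ahead of time that this beats the branching version.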