The SSE2 code has been ported to 64-bit using the Yasm assembler. Most of the work went into getting the call stack, local variables, and prologue/epilogue right. I kept rereading the same pages of the Yasm manual, especially the one that shows the stack frames and the one on register-saving conventions. The 64-bit calling convention is different from the 32-bit one, but it's not too hard once you get it, and it's certainly more efficient. Once the stack is set up, the actual SSE2 code is essentially identical to the 32-bit version, as you might expect, except that more stuff stays in registers.
I'm probably not making optimal use of the extra registers in 64-bit mode. I'm only processing two pixels at once on each core, but with twice as many SSE registers (sixteen instead of eight), it should in theory be possible to process four at once per core (for a whopping total of sixteen pixels at once on an i7). The basic idea: each 128-bit register holds two 64-bit pixels, and the CPU has two pipes, so if you carefully pair the instructions, each pipe can always be working on its own pair of pixels, provided neither pair depends on the results of the other. Good luck with that.
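To make the pairing idea concrete, here is a rough sketch in C with SSE2 intrinsics rather than Yasm. It is not my actual code: the per-pixel kernel (iterating x = x*x + c) and the names `step_scalar`, `step_two`, and `step_four` are made up for illustration. `step_two` is the two-at-once shape, and `step_four` interleaves two independent register chains so that neither one ever waits on the other.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical per-pixel kernel, for illustration only:
   iterate x = x*x + c a fixed number of times. */
static double step_scalar(double c, int iters)
{
    double x = 0.0;
    for (int i = 0; i < iters; i++)
        x = x * x + c;
    return x;
}

/* Two pixels at once: one 128-bit register holds two 64-bit doubles. */
static void step_two(const double *c, double *out, int iters)
{
    __m128d vc = _mm_loadu_pd(c);
    __m128d x  = _mm_setzero_pd();
    for (int i = 0; i < iters; i++)
        x = _mm_add_pd(_mm_mul_pd(x, x), vc);
    _mm_storeu_pd(out, x);
}

/* Four pixels at once: two independent chains (a and b), interleaved.
   No instruction in chain a reads a result from chain b or vice versa,
   so the CPU is free to overlap them. */
static void step_four(const double *c, double *out, int iters)
{
    __m128d ca = _mm_loadu_pd(c);      /* pixels 0 and 1 */
    __m128d cb = _mm_loadu_pd(c + 2);  /* pixels 2 and 3 */
    __m128d xa = _mm_setzero_pd();
    __m128d xb = _mm_setzero_pd();
    for (int i = 0; i < iters; i++) {
        __m128d sa = _mm_mul_pd(xa, xa);
        __m128d sb = _mm_mul_pd(xb, xb);
        xa = _mm_add_pd(sa, ca);
        xb = _mm_add_pd(sb, cb);
    }
    _mm_storeu_pd(out, xa);
    _mm_storeu_pd(out + 2, xb);
}
```

Calling `step_four(c, out, n)` on four inputs should give the same results as four calls to `step_scalar`. Whether the interleaving actually buys anything depends on how the compiler and the CPU schedule it, which is the hard part in hand-written assembly too.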
Meanwhile, the two-at-once code is good enough. I don't expect it to perform much differently from the 32-bit version, though that remains to be tested. The only significant remaining 64-bit hurdle is the BmpToAvi DLL.