I'm getting ready to ship another version of Fractice. I got the 64-bit SSE (shallow zoom) code to run in about two-thirds of the time it took previously, by taking advantage of having twice as many XMM registers and unrolling the innermost loop to process four pixels at once instead of two. Let's see, on a quad-core i7 that's 16 pixels at once, or 32 at once with Hyper-Threading, not too shabby! It's still using packed doubles (i.e. each register contains two 64-bit values), so four pixels means two registers' worth per pass. It isn't quite twice as fast, because the instructions don't always "pair" properly. I use quotes because pairing is actually an outdated concept from the original Pentium, as I recently discovered. The infamous "U and V pipes" are long gone. Instead the more recent Intel CPUs execute out of order, do "dynamic fusing" of instructions, and can get as many as 6 uops running at once in a single core. How many actually run together depends on what resources the uops need, e.g. even in i7 chips there's still only one floating-point multiplier per core AFAIK.
The rules are more complicated than ever before, but the bottom line is that unrolling is still generally a good thing up to a point, and rearranging instructions to minimize dependency chains also helps. And of course there's plenty of other stuff to avoid, like branches in loops, partial registers, memory accesses less than 64 bits wide, and so much more. I've been having fun wading through the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Highly recommended! And free! Excellent late-night reading (if you're an insomniac like me).
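To make the unrolling idea concrete, here's a minimal sketch in C with SSE2 intrinsics. It is not Fractice's actual code: the function name `mandel4`, the escape-count convention, and the plain Mandelbrot iteration z = z*z + c are all assumptions for illustration. It keeps four pixels in flight as two pairs of packed doubles (the `a` pair and the `b` pair), and interleaves them so each pair forms its own dependency chain.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (always available on x86-64) */

/* Hypothetical helper, not from Fractice: iterate z = z*z + c for four
   pixels at once. Pixels 0-1 live in the "a" registers and pixels 2-3 in
   the "b" registers. The a and b computations never depend on each other,
   so an out-of-order core is free to overlap them. out[k] receives the
   number of iterations pixel k stayed within radius 2, capped at max_iter. */
void mandel4(const double cre[4], const double cim[4],
             int max_iter, int out[4])
{
    const __m128d four = _mm_set1_pd(4.0);
    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d cxa = _mm_loadu_pd(cre),     cya = _mm_loadu_pd(cim);
    const __m128d cxb = _mm_loadu_pd(cre + 2), cyb = _mm_loadu_pd(cim + 2);
    __m128d xa = _mm_setzero_pd(), ya = xa, na = xa;  /* z and count, pair a */
    __m128d xb = xa, yb = xa, nb = xa;                /* z and count, pair b */
    double n[4];
    int i, k;

    for (i = 0; i < max_iter; i++) {
        /* Squares for both pairs, interleaved a/b. */
        __m128d xxa = _mm_mul_pd(xa, xa), xxb = _mm_mul_pd(xb, xb);
        __m128d yya = _mm_mul_pd(ya, ya), yyb = _mm_mul_pd(yb, yb);

        /* Mask lanes are all ones while |z|^2 <= 4. Escaped lanes keep
           growing (eventually to inf/NaN), and those compare as false. */
        __m128d ma = _mm_cmple_pd(_mm_add_pd(xxa, yya), four);
        __m128d mb = _mm_cmple_pd(_mm_add_pd(xxb, yyb), four);

        /* The only branch in the loop: leave once all four have escaped. */
        if (_mm_movemask_pd(_mm_or_pd(ma, mb)) == 0)
            break;

        /* Branch-free counting: add 1.0 only in lanes still inside. */
        na = _mm_add_pd(na, _mm_and_pd(ma, one));
        nb = _mm_add_pd(nb, _mm_and_pd(mb, one));

        /* z = z*z + c:  x' = x^2 - y^2 + cx,  y' = 2xy + cy */
        __m128d xya = _mm_mul_pd(xa, ya), xyb = _mm_mul_pd(xb, yb);
        xa = _mm_add_pd(_mm_sub_pd(xxa, yya), cxa);
        xb = _mm_add_pd(_mm_sub_pd(xxb, yyb), cxb);
        ya = _mm_add_pd(_mm_add_pd(xya, xya), cya);
        yb = _mm_add_pd(_mm_add_pd(xyb, xyb), cyb);
    }

    _mm_storeu_pd(n, na);
    _mm_storeu_pd(n + 2, nb);
    for (k = 0; k < 4; k++)
        out[k] = (int)n[k];
}
```

Calling it with c = 0, 1, 2+2i and -1 and a cap of 100 should give counts of 100, 3, 1 and 100. Whether the interleaving actually buys anything depends on the compiler and the CPU, so treat it as something to measure rather than a guaranteed win.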