Wednesday, February 24, 2010

more SSE optimizations in 64-bit version

I'm getting ready to ship another version of Fractice. I got the 64-bit SSE (shallow zoom) code to take about two-thirds the time it did previously, by taking advantage of having twice as many XMM registers and unrolling the innermost loop to process four pixels at once instead of two. Let's see, on an i7 that's 16 pixels at once, or nominally 32 with Hyperthreading (though the two threads on each core share its execution units), not too shabby! It's still using packed doubles (i.e. each register contains two 64-bit values), so it isn't quite twice as fast, because the instructions don't always "pair" properly. I use quotes because pairing is actually an outdated concept from the original Pentium, as I recently discovered. The infamous "U and V pipes" are long gone. Instead the more recent Intel CPUs fuse instructions dynamically ("micro-fusion" and "macro-fusion" in Intel's terminology), and can get as many as 6 uops running at once in a single core. It depends on what resources the uops need, e.g. even in i7 chips there's still only one floating-point multiplier per core AFAIK.

The rules are more complicated than ever before, but the bottom line is that unrolling is still generally a good thing up to a point, and rearranging instructions to minimize dependency chains also helps. And of course there's plenty of other stuff to avoid, like branches in loops, partial register stalls, memory accesses narrower than 64 bits, and so much more. I've been having fun wading through Intel's Intel® 64 and IA-32 Architectures Optimization Reference Manual. Highly recommended! And free! Excellent late-night reading (if you're an insomniac like me).
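
To make the four-at-once idea concrete, here's a minimal sketch of one iteration step using SSE2 intrinsics. The shipping code is hand-written assembler, so the function and variable names below are purely illustrative; the point is that the A and B pixel pairs are independent, which shortens dependency chains and keeps more execution units busy.

#include <emmintrin.h>

// One Mandelbrot iteration (z = z^2 + c) for four pixels: each __m128d
// holds two pixels' doubles, and the A/B pairs don't depend on each other.
// Bailout tests and iteration counting are omitted for brevity.
void IteratePairs(__m128d& zrA, __m128d& ziA, __m128d crA, __m128d ciA,
	__m128d& zrB, __m128d& ziB, __m128d crB, __m128d ciB)
{
	__m128d zr2A = _mm_mul_pd(zrA, zrA);	// zr^2, pair A
	__m128d zr2B = _mm_mul_pd(zrB, zrB);	// zr^2, pair B
	__m128d zi2A = _mm_mul_pd(ziA, ziA);	// zi^2, pair A
	__m128d zi2B = _mm_mul_pd(ziB, ziB);	// zi^2, pair B
	__m128d zrziA = _mm_mul_pd(zrA, ziA);	// zr*zi, pair A
	__m128d zrziB = _mm_mul_pd(zrB, ziB);	// zr*zi, pair B
	zrA = _mm_add_pd(_mm_sub_pd(zr2A, zi2A), crA);	// zr' = zr^2 - zi^2 + cr
	zrB = _mm_add_pd(_mm_sub_pd(zr2B, zi2B), crB);
	ziA = _mm_add_pd(_mm_add_pd(zrziA, zrziA), ciA);	// zi' = 2*zr*zi + ci
	ziB = _mm_add_pd(_mm_add_pd(zrziB, zrziB), ciB);
}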

Wednesday, January 27, 2010

Fractice bug fix for slow shallow zooms

The latest release of Fractice (1.0.05) fixes a bug that made shallow (non-deep) zooms take twice as long as they should have. The bug was introduced in the previous version (1.0.04). It was pretty simple: SSE2 support was accidentally disabled for shallow zooms. Deep zooms were unaffected. Checking the "Use SSE2" checkbox in Options/Engine made no difference. The bug was present in both the 32-bit and 64-bit versions, and in both the server and the client. Sorry! All better now.

On a lighter note, the real-time colormapping/downsampling function that's used in full-screen Exclusive (Mixer) mode is now much better optimized, and runs two or three times faster on all platforms. This leaves more CPU time for rendering. This particular bit of code is critical because it runs at the output frame rate (typically 30 FPS) and is also single-threaded. The optimization was in the pixel averaging, and consisted of replacing three divides (one for each color channel) with a single SSE packed multiply.
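
In SSE intrinsics, the divide-to-multiply trick looks roughly like this (a sketch; the function and its interface are assumptions, not Fractice's actual code):

#include <emmintrin.h>

// Average per-channel sums over count samples with a single packed
// multiply by the reciprocal, instead of one divide per color channel.
inline __m128i AveragePixel(__m128i sums, int count)
{
	__m128 recip = _mm_set1_ps(1.0f / count);	// compute 1/count once
	__m128 avg = _mm_mul_ps(_mm_cvtepi32_ps(sums), recip);
	return _mm_cvtps_epi32(avg);	// round back to packed integers
}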

All users should download the latest version.

Saturday, January 23, 2010

64-bit versions of round and trunc, using SSE intrinsics

It took me two hours to figure this out. Not sure if that's a good thing or a bad thing, but it's a thing.

#include "intrin.h"

inline int round(double x)
{
	return(_mm_cvtsd_si32(_mm_set_sd(x)));
}

inline int trunc(double x)
{
	return(_mm_cvttsd_si32(_mm_set_sd(x)));
}

inline __int64 round64(double x)
{
	return(_mm_cvtsd_si64x(_mm_set_sd(x)));
}

inline __int64 trunc64(double x)
{
	return(_mm_cvttsd_si64x(_mm_set_sd(x)));
}

I benchmarked these carefully on an Intel Xeon E5520. The intrinsic SSE round is almost twice as fast as the conditional offset and trunc method (one caveat: cvtsd2si rounds according to the current MXCSR rounding mode, which defaults to round-to-nearest-even, so exact halves round to even instead of away from zero as the offset method does):

inline int round(double x)
{
	return(int(x > 0 ? x + 0.5 : x - 0.5));
}

That's understandable, since the intrinsic SSE round compiles to a single instruction:

cvtsd2si eax,xmm0

whereas the conditional offset and trunc method compiles to:

xorpd xmm7,xmm7 ; xmm7 = 0
movsd xmm2,0.5 ; xmm2 = 0.5
comisd xmm1,xmm7 ; x > 0?
jbe $1 ; n, skip to neg case
movsd xmm0,xmm1
addsd xmm0,xmm2 ; x += 0.5
jmp $2
$1:
movsd xmm0,xmm1
subsd xmm0,xmm2 ; x -= 0.5
$2:
cvttsd2si eax,xmm0 ; eax = trunc(x)

Saturday, January 9, 2010

64-bit SSE2 code working

The SSE2 code is ported to 64-bit, via the Yasm assembler. Most of the work went into getting the call stack, local variables and prologue/epilogue right. I kept reading the same pages of the Yasm manual over and over, especially the one that shows the stack frames, and also the register saving conventions. It's different, but it's not too hard once you get it, and it's certainly more efficient. Once the stack gets set up, the actual SSE2 code is essentially identical to the 32-bit version, as you might expect, except that more stuff is in registers.

I'm probably not making optimal use of the extra registers in 64-bit mode. I'm only processing two pixels at once in each core, but with twice as many SSE registers, in theory it should be possible to process four at once per core (for a whopping total of sixteen pixels at once on an i7). The basic idea: since the CPU has two pipes, and each register is 128 bits, you can carefully pair the instructions so that you always have two 64-bit pixels in each pipe, provided neither pipe depends on the results of the other. Good luck with that.

Meanwhile the two-at-once code is good enough. I don't expect it to perform much differently than the 32-bit version, though that remains to be tested. The only significant remaining 64-bit hurdle is the BmpToAvi DLL.

Wednesday, January 6, 2010

64-bit MPIR twice as fast as 32-bit GMP

I'm finally making real progress on the 64-bit version of Fractice. MPIR is clearly the way to go: it ships with VC++ solutions and carefully optimized 64-bit assembler, and as an added bonus the MPIR developers aren't hostile to Windows users. FractServ is already ported to MPIR, with essentially zero code impact, and the initial benchmarks are very impressive (see below). Bottom line: on almost every Core2 / Vista64 machine I tried, the 64-bit MPIR version of FractServ is more than twice as fast as the 32-bit GMP version.

32-bit vs. 64-bit FractServ benchmarks

32-bit code: GMP 4.1.2 with P4 assembler
64-bit code: MPIR 1.3.0 rc3 with Core2 assembler
test project: bench mpir64.frp
Record mode, only two frames, one to local machine, one to server being benchmarked
AFAIK all machines are Core2 running Vista64 except wzenge (i7 running Vista64)

PC       32-bit   64-bit   gain (32-bit time / 64-bit time)
ckeny      466      186    2.51
nstone     523      219    2.39
dtopp      525     1064    0.49  <-- WTF?!? better check this one
dpeder     545      225    2.42
bbetts     463      193    2.40
sfreed     521      215    2.42
jperre     683      535    1.28  <-- ? another mystery
wzenge     251       99    2.54  <-- i7

Porting Fractice itself will take more time. It took quite a bit of fussing to get the code compiled cleanly in 64-bit. Some of the more common issues are described succinctly in Intel's article Code Cleaning MFC/ATL Applications for 64-Bit Intel Architecture: basically polymorphic data types (e.g. INT_PTR), the DoModal return value, the SendMessage return value, failure to use WPARAM/LPARAM in prototypes, and item data (e.g. SetItemData). Some issues they don't mention: CArray's GetSize now returns a 64-bit INT_PTR, and the prototype of OnTimer changed.

The key to my solution is this block of code, which is included by stdafx.h:

#ifdef _WIN64
#define INT64TO32(x) static_cast<int>(x)
#define UINT64TO32(x) static_cast<UINT>(x)
#define GCL_HBRBACKGROUND GCLP_HBRBACKGROUND
typedef INT_PTR W64INT;
typedef UINT_PTR W64UINT;
#include "ArrayEx.h"
typedef CArrayEx<DWORD, DWORD> CDWordArrayEx;
#define CDWordArray CDWordArrayEx
typedef CArrayEx<void*, void*> CPtrArrayEx;
#define CPtrArray CPtrArrayEx
typedef CArrayEx<BYTE, BYTE> CByteArrayEx;
#define CByteArray CByteArrayEx
#else
typedef int W64INT;
typedef UINT W64UINT;
#define INT64TO32(x) x
#define UINT64TO32(x) x
#endif

Anywhere I need to cast to 32-bit, I use INT64TO32(x) or UINT64TO32(x), which makes the changes compact and easy to find. For example, getting item data from a list control is a very common case (iItem being the item's index):

int idx = INT64TO32(m_List.GetItemData(iItem));

Another common case is ON_MESSAGE handlers, or any other handler where the arguments are generic WPARAM/LPARAM and you're using them as int or UINT.
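
For instance (a hypothetical handler; the message and its payload are made up for illustration):

// the handler's prototype must use WPARAM/LPARAM in 64-bit builds
LRESULT CMainFrame::OnRenderDone(WPARAM wParam, LPARAM lParam)
{
	int FrameIdx = INT64TO32(wParam);	// payload is really an int
	UINT Flags = UINT64TO32(lParam);	// payload is really a UINT
	OnFrameDone(FrameIdx, Flags);	// hypothetical; use the 32-bit values as usual
	return 0;
}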

Since I already have a wrapper (CArrayEx), and I don't require arrays with more than 2 billion elements, I can get away with overriding GetSize to return a 32-bit int. I also redefine CDWordArray, CPtrArray, CByteArray etc. as CArrayEx instances, so that they inherit the 32-bit GetSize. This avoids LOTS of tedious rewriting of code that doesn't need to be 64-bit anyway.
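
The trick amounts to something like this (a sketch; the real CArrayEx has more to it):

#include <afxtempl.h>

// Hide CArray's 64-bit GetSize behind a 32-bit version; safe so long
// as no array ever exceeds two billion elements.
template<class TYPE, class ARG_TYPE>
class CArrayEx : public CArray<TYPE, ARG_TYPE>
{
public:
	int GetSize() const
	{
		return INT64TO32(CArray<TYPE, ARG_TYPE>::GetSize());
	}
};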

I define polymorphic types using W64INT or W64UINT. Yes, I could use INT_PTR/DWORD_PTR, but the extra indirection doesn't hurt a bit and I'm getting tired of M$ changing the rules. Once burned, twice shy. The most common cases are (there's a sketch after the list):

Timer instances
OnTimer nIDEvent argument
DoModal return value
SerializeElements nCount argument
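
For example, declaring OnTimer with W64UINT satisfies both MFC prototypes (UINT in older 32-bit MFC, UINT_PTR in 64-bit) with a single declaration. The class and timer names here are hypothetical:

void CFracticeView::OnTimer(W64UINT nIDEvent)
{
	if (nIDEvent == RENDER_TIMER)	// hypothetical timer ID
		UpdateRender();	// hypothetical helper
	CView::OnTimer(nIDEvent);
}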

The only serious bug so far was in CDib::Serialize, which reads/writes a BITMAP struct. The issue is that one of BITMAP's members (bmBits) is a pointer, which means the struct had to grow in 64-bit Windows, from 24 bytes to 32 bytes. My code doesn't use bmBits, but that doesn't matter. I guess I should have known better, but BITMAP has been around so long I tend to think of it as bedrock. It's certainly not an internal struct or anything. I guess this is what you might call a "breaking change" in Windows. I could have stored the size of the struct in the archive, but I didn't. I could have made my own struct and copied the BITMAP members I care about to/from it, but I was too lazy. So I'm left with a minor kludge:

void CDib::Serialize(CArchive& ar)
{
	// The BITMAP struct got bigger in 64-bit Windows, due to the bmBits member
	// being a pointer. To keep our archives compatible with 32-bit Windows, we
	// must use the original size of BITMAP. The 64-bit load case leaves bmBits
	// uninitialized, but since we don't use bmBits here it doesn't matter.
#ifdef _WIN64
	static const int BITMAP_SIZE = 24;	// size of BITMAP in 32-bit Windows
#else
	static const int BITMAP_SIZE = sizeof(BITMAP);
#endif
	if (ar.IsStoring()) {
		BITMAP bmp;
		if (m_pBits == NULL || !GetBitmap(&bmp))
			AfxThrowArchiveException(CArchiveException::genericException, ar.m_strFileName);
		ar.Write(&bmp, BITMAP_SIZE);
		ar.Write(m_pBits, bmp.bmWidthBytes * bmp.bmHeight);
	} else {
		BITMAP bmp;
		ar.Read(&bmp, BITMAP_SIZE);
		if (!Create(bmp.bmWidth, bmp.bmHeight, bmp.bmBitsPixel))
			AfxThrowArchiveException(CArchiveException::genericException, ar.m_strFileName);
		ar.Read(m_pBits, bmp.bmWidthBytes * bmp.bmHeight);
	}
}

There are still some outstanding problems.

1. Fractice movie recording depends on the BmpToAvi DLL, which is 32-bit code. I sure as hell don't want to deal with porting all that nasty DirectShow filter code to 64-bit. Instead I plan to run BmpToAvi as a separate 32-bit application. The DLL will just be a 64-bit proxy for the 32-bit app. The DLL will send commands to the app using registered messages. The commands will show the compressor dialog, open an AVI, add a frame to the AVI, close the AVI, etc. It won't be without its difficulties, but it's got to be easier than debugging 64-bit filter chains.

2. Since inline assembler isn't supported in 64-bit VC++, the Mandelbrot/Mandelbar SSE2 code is a problem. The options are either to rewrite it using intrinsics, or to port it to YASM and make it an external function with "C" linkage. I don't have to bench the intrinsics to know that they would generate horribly inefficient code; I'll take Lee Avery's word for it. We're talking about the innermost triple-nested loop code here; critical path is an understatement. So really the external YASM is the only option. That's a significant project but also a highly worthwhile one, and not just because I'll be able to put 64-bit assembler on my resume.
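
The C++/YASM boundary would be just an ordinary external function with "C" linkage, declared along these lines (the name and parameters are illustrative, not the actual interface):

// declared in a C++ header; the definition lives in a .asm file
// assembled by YASM and linked like any other object file
extern "C" int MandelbrotSSE2(const double* cr, const double* ci, int maxIter);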

Monday, January 4, 2010

MPIR 64-bit FractServ test

doc: bench mpir64.frs
res: 640x480
q: 4096
aa: 1x
PCs: CHRISK, ENG-WZENGERLE64

32-bit production code:
ET: 274.974 s.
CHRISK: 83
ENG-WZ: 397

64-bit MPIR:
ET: 246.165 s.
CHRISK: 74
ENG-WZ: 406


Roughly a 10% improvement in elapsed time; not too impressive, but better than nothing.
Hypothesis: maybe the zoom isn't deep enough for MPIR to show much improvement. Try using z = 1E100.