Wednesday, January 6, 2010

64-bit MPIR twice as fast as 32-bit GMP

I'm finally making real progress on the 64-bit version of Fractice. MPIR is clearly the way to go: it ships with VC++ solutions and carefully optimized 64-bit assembler, and as an added bonus the MPIR developers aren't hostile to Windows users. FractServ is already ported to MPIR, with essentially zero code impact, and the initial benchmarks (below) are very impressive. Bottom line: on almost every Core2 / Vista64 machine I tried, the 64-bit MPIR version of FractServ is more than twice as fast as the 32-bit GMP version.

32-bit vs. 64-bit FractServ benchmarks

32-bit code: GMP 4.1.2 with P4 assembler
64-bit code: MPIR 1.3.0 rc3 with Core2 assembler
test project: bench mpir64.frp
Record mode, only two frames, one to local machine, one to server being benchmarked
AFAIK all machines are Core2 running Vista64 except wzenge (i7 running Vista64)

PC       32-bit   64-bit   gain
ckeny    466      186      2.51
nstone   523      219      2.39
dtopp    525      1064     0.49   <-- WTF?!? better check this one
dpeder   545      225      2.42
bbetts   463      193      2.40
sfreed   521      215      2.42
jperre   683      535      1.28   <-- ? another mystery
wzenge   251      99       2.54   <-- i7

Porting Fractice itself will take more time. It took quite a bit of fussing to get the code to compile cleanly in 64-bit. Some of the more common issues are described succinctly in Intel's article Code Cleaning MFC/ATL Applications for 64-Bit Intel Architecture: basically polymorphic data types (e.g. INT_PTR), DoModal return value, SendMessage return value, failure to use WPARAM/LPARAM in prototypes, and item data (e.g. SetItemData). Some issues they don't mention: CArray's GetSize now returns a 64-bit value, and the prototype of OnTimer changed.

The key to my solution is this block of code, which is included by stdafx.h:

#ifdef _WIN64
#define INT64TO32(x) static_cast<int>(x)
#define UINT64TO32(x) static_cast<UINT>(x)
#define GCL_HBRBACKGROUND GCLP_HBRBACKGROUND
typedef INT_PTR W64INT;
typedef UINT_PTR W64UINT;
#include "ArrayEx.h"
typedef CArrayEx<DWORD, DWORD> CDWordArrayEx;
#define CDWordArray CDWordArrayEx
typedef CArrayEx<void*, void*> CPtrArrayEx;
#define CPtrArray CPtrArrayEx
typedef CArrayEx<BYTE, BYTE> CByteArrayEx;
#define CByteArray CByteArrayEx
#else
typedef int W64INT;
typedef UINT W64UINT;
#define INT64TO32(x) (x)
#define UINT64TO32(x) (x)
#endif

Anywhere I need to cast to 32-bit, I use INT64TO32(x) or UINT64TO32(x) which makes the changes compact and easy to find. For example getting item data from a control is a very common case:

int idx = INT64TO32(m_List.GetItemData());

Another common case is ON_MESSAGE handlers, or any other handler where the arguments are generic WPARAM/LPARAM and you're using them as int or UINT.
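Here's a minimal sketch of that handler case. It isn't lifted from Fractice: CProgressState and OnProgress are hypothetical names, and the WPARAM/LPARAM/LRESULT/UINT typedefs are stand-ins so the snippet compiles without <windows.h>. The macros are the 64-bit versions from the block above.

```cpp
#include <cstdint>

// Stand-ins for the Windows typedefs (assumption: real code gets these
// from <windows.h>, where they are also pointer-sized).
typedef std::uintptr_t WPARAM;
typedef std::intptr_t  LPARAM;
typedef std::intptr_t  LRESULT;
typedef unsigned int   UINT;

// The 64-bit flavor of the cast macros described in the post.
#define INT64TO32(x)  static_cast<int>(x)
#define UINT64TO32(x) static_cast<UINT>(x)

// Hypothetical message receiver: the handler keeps the generic
// WPARAM/LPARAM prototype and narrows explicitly at the point of use.
struct CProgressState {
    UINT m_Job = 0;
    int  m_Percent = 0;

    LRESULT OnProgress(WPARAM wParam, LPARAM lParam)
    {
        m_Job = UINT64TO32(wParam);     // value is known to fit in 32 bits
        m_Percent = INT64TO32(lParam);  // sign is preserved by the cast
        return 0;
    }
};
```

The point is that the prototype stays 64-bit clean while every narrowing is a greppable macro call.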

Since I already have a wrapper (CArrayEx), and I don't require arrays with more than 2 billion elements, I can get away with overriding GetSize to return a 32-bit int. I also redefine the CDWordArray, CPtrArray, CByteArray etc. as CArrayEx instances, so that they inherit the 32-bit GetSize. This avoids LOTS of tedious rewriting of code that doesn't need to be 64-bit anyway.
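Roughly, the GetSize trick looks like the sketch below. This is not the real ArrayEx.h: CArrayStandIn is a hypothetical substitute for MFC's CArray (built on std::vector so it compiles anywhere), and CArrayEx32 and SumBytes are illustrative names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for MFC's CArray in a 64-bit build:
// GetSize returns a pointer-sized integer.
template <class TYPE>
class CArrayStandIn {
public:
    std::ptrdiff_t GetSize() const
    {
        return static_cast<std::ptrdiff_t>(m_Data.size());
    }
    void Add(const TYPE& x) { m_Data.push_back(x); }
    TYPE& operator[](std::ptrdiff_t i)
    {
        return m_Data[static_cast<std::size_t>(i)];
    }
private:
    std::vector<TYPE> m_Data;
};

// The wrapper hides the base GetSize with one that returns a 32-bit int.
// Only valid while the array holds fewer than 2^31 elements.
template <class TYPE>
class CArrayEx32 : public CArrayStandIn<TYPE> {
public:
    int GetSize() const
    {
        return static_cast<int>(CArrayStandIn<TYPE>::GetSize());
    }
};

typedef CArrayEx32<unsigned char> CByteArrayEx32;

// Existing int-based loops keep compiling without casts or warnings.
int SumBytes(CByteArrayEx32& arr)
{
    int sum = 0;
    for (int i = 0; i < arr.GetSize(); i++)
        sum += arr[i];
    return sum;
}
```

One caveat with this kind of name hiding: GetSize isn't virtual, so code that reaches the array through a base-class reference still gets the 64-bit version.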

I define polymorphic types using W64INT or W64UINT. Yes I could use INT_PTR/DWORD_PTR but the extra indirection doesn't hurt a bit and I'm getting tired of M$ changing the rules. Once burned, twice shy. The most common cases are:

Timer instances
OnTimer nIDEvent argument
DoModal return value
SerializeElements nCount argument

The only serious bug so far was in CDib::Serialize, which reads/writes a BITMAP struct. The issue is that one of BITMAP's members (bmBits) is a pointer, which means the struct had to grow in 64-bit Windows, from 24 bytes to 32 bytes. My code doesn't use bmBits, but that doesn't matter. I guess I should have known better, but BITMAP has been around so long I tend to think of it as bedrock. It's certainly not an internal struct or anything. I guess this is what you might call a "breaking change" in Windows. I could have stored the size of the struct in the archive, but I didn't. I could have made my own struct and copied the BITMAP members I care about to/from it, but I was too lazy. So I'm left with a minor kludge:

void CDib::Serialize(CArchive& ar)
{
    // The BITMAP struct got bigger in 64-bit Windows, due to the bmBits member
    // being a pointer. To keep our archives compatible with 32-bit Windows, we
    // must use the original size of BITMAP. The 64-bit load case leaves bmBits
    // uninitialized, but since we don't use bmBits here it doesn't matter.
#ifdef _WIN64
    static const int BITMAP_SIZE = 24; // size of BITMAP in 32-bit Windows
#else
    static const int BITMAP_SIZE = sizeof(BITMAP);
#endif
    if (ar.IsStoring()) {
        BITMAP bmp;
        if (m_pBits == NULL || !GetBitmap(&bmp))
            AfxThrowArchiveException(CArchiveException::genericException, ar.m_strFileName);
        ar.Write(&bmp, BITMAP_SIZE);
        ar.Write(m_pBits, bmp.bmWidthBytes * bmp.bmHeight);
    } else {
        BITMAP bmp;
        ar.Read(&bmp, BITMAP_SIZE);
        if (!Create(bmp.bmWidth, bmp.bmHeight, bmp.bmBitsPixel))
            AfxThrowArchiveException(CArchiveException::genericException, ar.m_strFileName);
        ar.Read(m_pBits, bmp.bmWidthBytes * bmp.bmHeight);
    }
}
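For comparison, the "make my own struct" alternative mentioned above might look something like this. It's a sketch, not what Fractice does: DibFileHeader, ToFileHeader and FromFileHeader are hypothetical names, and BitmapStandIn mimics the layout of the Windows BITMAP struct so the snippet compiles without <windows.h>.

```cpp
#include <cstdint>

// Stand-in for the Windows BITMAP struct (assumption: the real one has
// the same member order, with 32-bit LONGs and a pointer-sized bmBits).
struct BitmapStandIn {
    std::int32_t  bmType;
    std::int32_t  bmWidth;
    std::int32_t  bmHeight;
    std::int32_t  bmWidthBytes;
    std::uint16_t bmPlanes;
    std::uint16_t bmBitsPixel;
    void*         bmBits;   // 4 bytes in 32-bit, 8 bytes (plus padding) in 64-bit
};

// Hypothetical on-disk header: fixed-width members only, so it is
// 24 bytes in both 32-bit and 64-bit builds.
struct DibFileHeader {
    std::int32_t  bmType;
    std::int32_t  bmWidth;
    std::int32_t  bmHeight;
    std::int32_t  bmWidthBytes;
    std::uint16_t bmPlanes;
    std::uint16_t bmBitsPixel;
    std::uint32_t reserved;  // occupies the slot of the old 32-bit pointer
};
static_assert(sizeof(DibFileHeader) == 24, "on-disk header must stay 24 bytes");

// Copy the members we care about into the fixed-size header before writing.
DibFileHeader ToFileHeader(const BitmapStandIn& bmp)
{
    DibFileHeader hdr;
    hdr.bmType = bmp.bmType;
    hdr.bmWidth = bmp.bmWidth;
    hdr.bmHeight = bmp.bmHeight;
    hdr.bmWidthBytes = bmp.bmWidthBytes;
    hdr.bmPlanes = bmp.bmPlanes;
    hdr.bmBitsPixel = bmp.bmBitsPixel;
    hdr.reserved = 0;
    return hdr;
}

// Rebuild the in-memory struct after reading; bmBits is set to null
// rather than left uninitialized.
BitmapStandIn FromFileHeader(const DibFileHeader& hdr)
{
    BitmapStandIn bmp;
    bmp.bmType = hdr.bmType;
    bmp.bmWidth = hdr.bmWidth;
    bmp.bmHeight = hdr.bmHeight;
    bmp.bmWidthBytes = hdr.bmWidthBytes;
    bmp.bmPlanes = hdr.bmPlanes;
    bmp.bmBitsPixel = hdr.bmBitsPixel;
    bmp.bmBits = nullptr;
    return bmp;
}
```

The static_assert is the useful part: if the header ever changes size, the build breaks instead of the archives.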

There are still some outstanding problems.

1. Fractice movie recording depends on the BmpToAvi DLL, which is 32-bit code. I sure as hell don't want to deal with porting all that nasty DirectShow filter code to 64-bit. Instead I plan to run BmpToAvi as a separate 32-bit application. The DLL will just be a 64-bit proxy for the 32-bit app. The DLL will send commands to the app using registered messages. The commands will show the compressor dialog, open an AVI, add a frame to the AVI, close the AVI, etc. It won't be without its difficulties but it's got to be easier than debugging 64-bit filter chains.

2. Since VC++ doesn't support inline assembler in 64-bit builds, the Mandelbrot/Mandelbar SSE2 code is a problem. The options are either to rewrite it using intrinsics, or to port it to YASM and make it an external function with "C" linkage. I don't have to bench the intrinsics to know that they would generate horribly inefficient code; I'll take Lee Avery's word for it. We're talking about the innermost triple-nested loop code here; critical path is an understatement. So really the external YASM is the only option. That's a significant project but also a highly worthwhile one, and not just because I'll be able to put 64-bit assembler on my resume.
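For readers who haven't seen the intrinsics option, here's roughly what it would look like: a minimal sketch that iterates two Mandelbrot points at once in SSE2 double precision. It is not Fractice's code and makes no claim about speed either way; Mandel2 and its parameters are hypothetical, and it assumes an x86 compiler that provides <emmintrin.h>.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Iterate z = z^2 + c for two points at once (one per SSE2 lane).
// cx/cy hold the real/imaginary parts of c; out receives, per lane,
// the number of iterations that started with |z|^2 <= 4.
void Mandel2(const double cx[2], const double cy[2], int maxIter, int out[2])
{
    const __m128d cr = _mm_loadu_pd(cx);
    const __m128d ci = _mm_loadu_pd(cy);
    const __m128d four = _mm_set1_pd(4.0);
    const __m128d one = _mm_set1_pd(1.0);
    __m128d zr = _mm_setzero_pd();
    __m128d zi = _mm_setzero_pd();
    __m128d count = _mm_setzero_pd();
    for (int i = 0; i < maxIter; i++) {
        __m128d zr2 = _mm_mul_pd(zr, zr);
        __m128d zi2 = _mm_mul_pd(zi, zi);
        // Per-lane mask: all ones where the point is still inside.
        __m128d inside = _mm_cmple_pd(_mm_add_pd(zr2, zi2), four);
        if (_mm_movemask_pd(inside) == 0)
            break;  // both lanes have escaped
        // Only lanes that are still inside get their count bumped.
        count = _mm_add_pd(count, _mm_and_pd(inside, one));
        __m128d zri = _mm_mul_pd(zr, zi);
        zr = _mm_add_pd(_mm_sub_pd(zr2, zi2), cr);  // re = zr^2 - zi^2 + cr
        zi = _mm_add_pd(_mm_add_pd(zri, zri), ci);  // im = 2*zr*zi + ci
    }
    double c[2];
    _mm_storeu_pd(c, count);
    out[0] = static_cast<int>(c[0]);
    out[1] = static_cast<int>(c[1]);
}
```

Whether the compiler turns this into anything close to hand-scheduled assembler is exactly the question at issue above.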
