Thanks virtualcoin, that's a perfect comparison.
The 8% speedup from 32-bit Windows (2310k) to 32-bit Linux (2500k) is probably from the newer version of GCC on Linux (4.4.3 vs 3.4.5).
The 15% speedup from 32-bit to 64-bit Linux is more of a mystery. The code is completely 32-bit.
Hmm, I think the 8 extra registers added by x86-64 must be what's helping. That would make a significant difference to SHA if it could hold most of the 16 state variables in registers.