4 hashes parallel on SSE2 CPUs for 0.3.6

View all posts

External link

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
Is it 2x fast on AMD and 1/2 fast on Intel?

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compiletime?
Tried that, but it doesn't work for things on the stack.  I ran some tests.

It doesn't even cause an error, it just doesn't align it.
CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.
Sorry about that.  CRITICAL_BLOCK isn't perfect.  You have to be careful not to break or continue out of it.  There's an assert that catches and warns about break.  I can be criticized for using it, but the syntax would be so much more bloated and error prone without it.

Is there a chance the SSE2 code is slow on Intel because of some quirk that could be worked around?  For instance, if something works but is slow if it's not aligned, or thrashing the cache, or one type of instruction that's really slow?  I'm not sure how available it is, but I think Intel used to have a profiler for profiling on a per instruction level.  I guess if tcatm doesn't have a system with the slow processor to test with, there's not much hope.  But it would be really nice if this was working on most CPUs.
That big of a difference in speed, by a factor of 4 or 6, feels like it's likely to be some quirky weak spot or instruction that the old chip is slow with.  Unless it's a touted feature of the i5 that they made SSE2 six times faster.

A quick summary:
Xeon Quad        41% slower
Core 2 Duo        55% slower
Core 2 Duo        same (vess)
Core 2 Quad      50% slower
Core i5            200% faster (nelisky)
Core i5            100% faster (vess)
AMD Opteron    105% faster

My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,

on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:> for instructions.
make: *** [obj/sha256.o] Error 1
If you haven't already, try aligning thash.  It might matter.  Couldn't hurt.

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
No help from -O0, same error.

MinGW is GCC 3.4.5.  Probably the problem.

I'll see if I can get a newer version of MinGW.

Got the test working on 32-bit with MinGW GCC 4.5.  Exactly 50% slower than stock with Core 2.
MinGW GCC 4.5.0:
Crypto++ doesn't work, X86_SHA256_HashBlocks() never returns
I only got 4-way working with test.cpp but not when called by BitcoinMiner

MinGW GCC 4.4.1:
Crypto++ works

GCC is definitely not aligning __m128i.

Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.

By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only:
#define Ch(b, c, d)  ((b & c) ^ (~b & d))
#define Maj(b, c, d)  ((b & c) ^ (b & d) ^ (c & d))
#define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n))
#define SHR(x, n)  _mm_srli_epi32(x, n)

But that's with -O0.

On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner.  So now it doesn't look like it's the version of GCC, it's something else, maybe just the luck of how the stack is aligned.

I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.

I found the problem with Crypto++ on MinGW 4.5.0.  Here's the patch for that:
--- \old\sha.cpp Mon Jul 26 13:31:11 2010
+++ ew\sha.cpp Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
  ROUND(14, 0, eax, ecx, edi, edx)
  ROUND(15, 0, ecx, eax, edx, edi)
- ASL(1)
+    ASL(label1)   // Bitcoin: fix for MinGW GCC 4.5
  AS2(add WORD_REG(si), 4*16)
  ROUND(0, 1, eax, ecx, edi, edx)
  ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
  ROUND(14, 1, eax, ecx, edi, edx)
  ROUND(15, 1, ecx, eax, edx, edi)
  AS2( cmp WORD_REG(si), K_END)
- ASJ( jne, 1, b)
+    ASJ(    jne,    label1,  )   // Bitcoin: fix for MinGW GCC 4.5
  AS2( mov WORD_REG(dx), DATA_SAVE)
  AS2( add WORD_REG(dx), 64)