BitcoinTalk

4 hashes parallel on SSE2 CPUs for 0.3.6

BitcoinTalk

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

July 31, 2010 at 00:29:20 UTC

That's amazing...

So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I've wondered about that for a long time, but I didn't think it would be possible due to addition carrying into the neighbour's value.
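The carry concern can be checked with a small sketch. SSE2's packed 32-bit add (`_mm_add_epi32`) treats the register as four independent lanes, so an overflow in one lane wraps modulo 2^32 and does not spill into its neighbour. The helper name `add4_lanes` is hypothetical and is not taken from the patch under discussion.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: adds four pairs of 32-bit integers with one packed add.
// Each 32-bit lane wraps on its own; no carry reaches the next lane.
inline void add4_lanes(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    // Unaligned loads and stores, so the sketch does not depend on caller alignment.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i sum = _mm_add_epi32(va, vb);  // packed 32-bit add (PADDD)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
}
```

Feeding 0xFFFFFFFF + 1 into lane 0 leaves lane 0 at zero while lane 1 still holds only its own sum.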

BitcoinTalk

#33

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 2, 2010 at 19:02:46 UTC

Is it 2x as fast on AMD and half as fast on Intel?

Quote from: tcatm on July 31, 2010, 10:12:38 AM

Btw. Why are you using this alignup<16> function when __attribute__ ((aligned (16))) will tell the compiler to align at compiletime?

Tried that, but it doesn't work for things on the stack. I ran some tests.

It doesn't even cause an error, it just doesn't align it.
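For readers unfamiliar with the manual approach, here is a minimal sketch of rounding a pointer up to a 16-byte boundary at run time, which works regardless of how the compiler places stack variables. The names `align_up` and `is_aligned` are hypothetical; this is not the `alignup<16>` source from the thread, only an illustration of the same idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: rounds a pointer up to the next multiple of N bytes.
// N must be a power of two.
template <std::size_t N, typename T>
inline T* align_up(T* p)
{
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    addr = (addr + (N - 1)) & ~static_cast<std::uintptr_t>(N - 1);
    return reinterpret_cast<T*>(addr);
}

// Hypothetical helper: true if p is aligned to N bytes.
template <std::size_t N>
inline bool is_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % N == 0;
}
```

A caller would over-allocate by up to N bytes, for example `unsigned char raw[64 + 16];`, and then work through `align_up<16>(raw)`. Checking `is_aligned<16>` on a variable declared with the attribute is one way to reproduce the kind of test described above.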

BitcoinTalk

#65

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 7, 2010 at 21:16:01 UTC

Quote from: impossible7 on August 06, 2010, 11:37:20 AM

CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. There is nothing like that in the original main.cpp.

Sorry about that. CRITICAL_BLOCK isn't perfect. You have to be careful not to break or continue out of it. There's an assert that catches and warns about break. I can be criticized for using it, but the syntax would be so much more bloated and error prone without it.
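To illustrate why a `break` trips an assertion, here is a rough sketch of a lock macro built from nested for loops. It is not the Bitcoin macro: `SKETCH_CRITICAL_BLOCK` and `add_under_lock` are hypothetical names, and `std::mutex` stands in for whatever lock type the real code uses. The inner loop holds the lock for one pass; if the body breaks out, the inner loop's increment never runs, the flag stays true, and the outer loop's increment expression asserts.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical macro, not the Bitcoin source.
// Outer loop: runs once; its increment asserts that the inner loop ended normally.
// Inner loop: constructs a lock guard and runs the body exactly once.
#define SKETCH_CRITICAL_BLOCK(mtx)                                          \
    for (bool sketch_once = true; sketch_once;                              \
         assert(!sketch_once && "break caught by SKETCH_CRITICAL_BLOCK"),   \
         sketch_once = false)                                               \
        for (std::lock_guard<std::mutex> sketch_guard(mtx); sketch_once;    \
             sketch_once = false)

// Hypothetical usage: the braces after the macro form the guarded body.
inline void add_under_lock(std::mutex& m, int& counter, int delta)
{
    SKETCH_CRITICAL_BLOCK(m)
    {
        counter += delta;  // a `break` here would fire the assert above
    }
}
```

In this sketch a `continue` inside the body is not caught: it jumps to the inner increment and silently skips the rest of the body, which is one reason to avoid both statements inside such a block.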

Is there a chance the SSE2 code is slow on Intel because of some quirk that could be worked around? For instance, if something works but is slow if it's not aligned, or thrashing the cache, or one type of instruction that's really slow? I'm not sure how available it is, but I think Intel used to have a profiler for profiling on a per instruction level. I guess if tcatm doesn't have a system with the slow processor to test with, there's not much hope. But it would be really nice if this was working on most CPUs.

BitcoinTalk

#72

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 12, 2010 at 22:07:23 UTC

That big of a difference in speed, by a factor of 4 or 6, feels like it's likely to be some quirky weak spot or instruction that the old chip is slow with. Unless it's a touted feature of the i5 that they made SSE2 six times faster.

A quick summary:
Xeon Quad 41% slower
Core 2 Duo 55% slower
Core 2 Duo same (vess)
Core 2 Quad 50% slower
Core i5 200% faster (nelisky)
Core i5 100% faster (vess)
AMD Opteron 105% faster

aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,

impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
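One way to read "a factor of 4 or 6" is to convert the reported percentages into throughput ratios relative to stock. The sketch below assumes "N% faster" means (1 + N/100) times stock and "N% slower" means (1 - N/100) times stock; that interpretation and the function names are assumptions, not something stated in the reports.

```cpp
#include <cassert>
#include <cmath>

// Throughput relative to stock, under the assumed reading of the percentages.
inline double ratio_from_percent_faster(double pct) { return 1.0 + pct / 100.0; }
inline double ratio_from_percent_slower(double pct) { return 1.0 - pct / 100.0; }

// Percent slowdown computed from before/after hash rates.
inline double percent_slower(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

Under that reading, a Core 2 at 50% slower runs at 0.5x stock and an i5 at 200% faster runs at 3.0x, a ratio of 6; the 100% faster i5 at 2.0x gives a ratio of 4. Likewise, `percent_slower(7100, 4200)` comes to about 41, matching the Xeon line in the summary.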

BitcoinTalk

#80

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 14, 2010 at 00:49:18 UTC

MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I"/boost" -I"/db/build_unix" -I"/openssl/include" -I"/wxwidgets/lib/gcc_lib/mswud" -I"/wxwidgets/include" -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int __vector__ Ch(long long int __vector__, long long int __vector__, long long int __vector__)':
sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://www.mingw.org/bugs.shtml> for instructions.
make: *** [obj/sha256.o] Error 1

BitcoinTalk

#83

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 14, 2010 at 04:22:29 UTC

If you haven't already, try aligning thash. It might matter. Couldn't hurt.

Quote from: tcatm on August 14, 2010, 12:53:07 AM

Looks like we're triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

No help from -O0, same error.

MinGW is GCC 3.4.5. Probably the problem.

I'll see if I can get a newer version of MinGW.

BitcoinTalk

#84

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 14, 2010 at 17:55:37 UTC

Got the test working on 32-bit with MinGW GCC 4.5. Exactly 50% slower than stock with Core 2.

BitcoinTalk

#85

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 14, 2010 at 22:06:13 UTC

MinGW GCC 4.5.0:
Crypto++ doesn't work, X86_SHA256_HashBlocks() never returns
I only got 4-way working with test.cpp but not when called by BitcoinMiner

MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV

GCC is definitely not aligning __m128i.

Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.

By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only:
#define Ch(b, c, d) ((b & c) ^ (~b & d))
#define Maj(b, c, d) ((b & c) ^ (b & d) ^ (c & d))
#define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n))
#define SHR(x, n) _mm_srli_epi32(x, n)

But that's with -O0.
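The defines above apply `&`, `^`, `~` and `|` directly to `__m128i`, which relies on GCC's vector-type operators. As a hedged alternative sketch, the same operations can be written with explicit SSE2 intrinsics and checked lane by lane against scalar code. The names `Ch4`, `Maj4`, `Rotr4`, `rotr32` and `lane` are hypothetical and are not from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// 4-way Ch: (b & c) ^ (~b & d). _mm_andnot_si128(b, d) computes (~b) & d.
inline __m128i Ch4(__m128i b, __m128i c, __m128i d)
{
    return _mm_xor_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

// 4-way Maj: (b & c) ^ (b & d) ^ (c & d).
inline __m128i Maj4(__m128i b, __m128i c, __m128i d)
{
    return _mm_xor_si128(_mm_xor_si128(_mm_and_si128(b, c), _mm_and_si128(b, d)),
                         _mm_and_si128(c, d));
}

// Rotate each 32-bit lane right by N bits (0 < N < 32), with N a compile-time constant.
template <int N>
inline __m128i Rotr4(__m128i x)
{
    return _mm_or_si128(_mm_srli_epi32(x, N), _mm_slli_epi32(x, 32 - N));
}

// Scalar reference rotate, for comparison (0 < n < 32).
inline uint32_t rotr32(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// Extracts lane i (0..3) of v through an unaligned store.
inline uint32_t lane(__m128i v, int i)
{
    uint32_t tmp[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(tmp), v);
    return tmp[i];
}
```

This does not address the alignment problem described above; it only removes the dependence on operator support for vector types and makes it easy to compare each lane with a plain 32-bit computation.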

BitcoinTalk

#87

From:

satoshi

Subject:

Re: 4 hashes parallel on SSE2 CPUs for 0.3.6

Date:

August 15, 2010 at 03:40:29 UTC

On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner. So now it doesn't look like it's the version of GCC, it's something else, maybe just the luck of how the stack is aligned.

I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.

I found the problem with Crypto++ on MinGW 4.5.0. Here's the patch for that:

Code:

--- \old\sha.cpp Mon Jul 26 13:31:11 2010
+++ \new\sha.cpp Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
ROUND(14, 0, eax, ecx, edi, edx)
ROUND(15, 0, ecx, eax, edx, edi)

- ASL(1)
+ ASL(label1) // Bitcoin: fix for MinGW GCC 4.5
AS2(add WORD_REG(si), 4*16)
ROUND(0, 1, eax, ecx, edi, edx)
ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
ROUND(14, 1, eax, ecx, edi, edx)
ROUND(15, 1, ecx, eax, edx, edi)
AS2( cmp WORD_REG(si), K_END)
- ASJ( jne, 1, b)
+ ASJ( jne, label1, ) // Bitcoin: fix for MinGW GCC 4.5

AS2( mov WORD_REG(dx), DATA_SAVE)
AS2( add WORD_REG(dx), 64)