C# - Parallel Framework and avoiding false sharing
Recently, I answered a question about optimizing a likely-parallelizable method that generates every permutation of arbitrary base numbers. I posted an answer similar to the **Parallelized, poor implementation** code block below, and someone pointed out:

> This is pretty much guaranteed to give you false sharing and will probably be many times slower. (credit gjvdkamp)

And they were right: it was dead slow. That said, I researched the topic and found some interesting material and suggestions for combating it. If I understand it correctly, false sharing occurs when threads access contiguous memory (say, the array that backs a `ConcurrentStack`).
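False sharing is easy to reproduce in isolation. The sketch below (my own illustration, not part of the original question; all names are mine) has two threads hammering two `long` counters: first adjacent elements, which almost certainly share a 64-byte cache line, then elements 128 bytes apart. On most machines the adjacent case is noticeably slower, though exact timings vary by hardware:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class FalseSharingDemo
{
    public const int Iterations = 100_000_000;

    // Increment slots[i0] and slots[i1] from two threads and time it.
    public static long Run(long[] slots, int i0, int i1)
    {
        var sw = Stopwatch.StartNew();
        var t0 = new Thread(() => { for (int i = 0; i < Iterations; i++) slots[i0]++; });
        var t1 = new Thread(() => { for (int i = 0; i < Iterations; i++) slots[i1]++; });
        t0.Start(); t1.Start();
        t0.Join(); t1.Join();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        var slots = new long[32];
        // Adjacent elements: almost certainly on the same 64-byte cache line.
        Console.WriteLine($"adjacent: {Run(slots, 0, 1)} ms");
        // Elements 16 longs (128 bytes) apart: different cache lines.
        Console.WriteLine($"padded:   {Run(slots, 0, 16)} ms");
    }
}
```

Note that each thread writes a distinct array element, so the program is correct either way; false sharing is purely a performance effect, not a correctness bug.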
In the code below, `Bytes` is:

```csharp
struct Bytes
{
    public byte A; public byte B; public byte C; public byte D;
    public byte E; public byte F; public byte G; public byte H;
}
```
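For context on why this struct is relevant to false sharing: it occupies 8 bytes, so a typical 64-byte cache line holds 8 of them packed back to back. A quick check (my addition, not part of the original post; the explicit `StructLayout` is an assumption to pin down the layout):

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Bytes
{
    public byte A; public byte B; public byte C; public byte D;
    public byte E; public byte F; public byte G; public byte H;
}

class SizeCheck
{
    static void Main()
    {
        // Eight 1-byte fields, sequential layout: 8 bytes total.
        Console.WriteLine(Marshal.SizeOf<Bytes>()); // 8
    }
}
```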
For my own testing, I wanted a parallel version of this that was genuinely faster, so I created a simple example based on the original code. `6` as `limits[0]` was a lazy choice on my part - my computer has 6 cores.
**Single-threaded block** - average run time: 10s0059ms
```csharp
var data = new List<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

for (byte a = 0; a < limits[0]; a++)
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
    data.Add(new Bytes { A = a, B = b, C = c, D = d,
                         E = e, F = f, G = g, H = h });
```
**Parallelized, poor implementation** - run time avg: 81s729ms, ~8700 contentions
```csharp
var data = new ConcurrentStack<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

Parallel.For(0, limits[0], (a) =>
{
    for (byte b = 0; b < limits[1]; b++)
    for (byte c = 0; c < limits[2]; c++)
    for (byte d = 0; d < limits[3]; d++)
    for (byte e = 0; e < limits[4]; e++)
    for (byte f = 0; f < limits[5]; f++)
    for (byte g = 0; g < limits[6]; g++)
    for (byte h = 0; h < limits[7]; h++)
        data.Push(new Bytes { A = (byte)a, B = b, C = c, D = d,
                              E = e, F = f, G = g, H = h });
});
```
**Parallelized, ?? implementation** - run time avg: 5s833ms, 92 contentions
```csharp
var data = new ConcurrentStack<List<Bytes>>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

Parallel.For(0, limits[0],
    () => new List<Bytes>(),
    (a, loop, localList) =>
    {
        for (byte b = 0; b < limits[1]; b++)
        for (byte c = 0; c < limits[2]; c++)
        for (byte d = 0; d < limits[3]; d++)
        for (byte e = 0; e < limits[4]; e++)
        for (byte f = 0; f < limits[5]; f++)
        for (byte g = 0; g < limits[6]; g++)
        for (byte h = 0; h < limits[7]; h++)
            localList.Add(new Bytes { A = (byte)a, B = b, C = c, D = d,
                                      E = e, F = f, G = g, H = h });
        return localList;
    },
    x => { data.Push(x); });
```
I'm glad I have an implementation that is faster than the single-threaded version. I expected a result closer to 10s / 6, or around 1.6 seconds, but that is probably a naive expectation.

My question: for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that may be applied to this operation? I'm wondering about optimizations related to the parallelization, not improvements to the algorithm used to compute the values. Specifically:
- I know about the optimization to store and populate a `struct` instead of a `byte[]`, but it's not related to the parallelization (or is it?)
- I know a desired value could be lazily evaluated with a ripple-carry adder, but same as the `struct` optimization.
---

First off, my initial assumption regarding `Parallel.For()` and `Parallel.ForEach()` was wrong.
The poor parallel implementation very likely has 6 threads attempting to write to a single `ConcurrentStack` at once. The implementation using thread locals (explained more below) only accesses the shared variable once per task, all but eliminating the contention.
When using `Parallel.For()` or `Parallel.ForEach()`, you cannot simply replace a `for` or `foreach` loop with them in-line. That's not to say it could never be a blind improvement, but without examining the problem and instrumenting it, using them is throwing multithreading at a problem because it *might* make it faster.
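To make "instrumenting it" concrete, here is a minimal wall-clock harness (my own sketch; the workload, names, and iteration count are all assumptions) that compares a plain loop against a naive `Parallel.For` on the same work. The point is only to measure before committing, since whether the parallel version wins depends on how much work each iteration does:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class Benchmark
{
    static long Time(string label, Action work)
    {
        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int n = 10_000_000;
        var results = new double[n];

        Time("sequential", () =>
        {
            for (int i = 0; i < n; i++) results[i] = Math.Sqrt(i);
        });

        // Each iteration writes a distinct element, so this is safe;
        // whether it is *faster* is exactly what the timing tells you.
        Time("Parallel.For", () =>
        {
            Parallel.For(0, n, i => results[i] = Math.Sqrt(i));
        });
    }
}
```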
`Parallel.For()` and `Parallel.ForEach()` have overloads that allow you to create local state for the `Task` they ultimately create, and to run an expression before and after each iteration's execution.
If you have an operation you would like to parallelize with `Parallel.For()` or `Parallel.ForEach()`, it's probably a good idea to use this overload:

```csharp
public static ParallelLoopResult For<TLocal>(
    int fromInclusive,
    int toExclusive,
    Func<TLocal> localInit,
    Func<int, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally
)
```
For example, calling `For()` to sum the integers from 1 to 100:

```csharp
var total = 0;

Parallel.For(0, 101,
    () => 0,                   // <-- localInit
    (i, state, localTotal) =>  // <-- body
    {
        localTotal += i;
        return localTotal;
    },
    localTotal =>              // <-- localFinally
    {
        Interlocked.Add(ref total, localTotal);
    });

Console.WriteLine(total);
```
`localInit` should be a lambda that initializes the local state type, which is then passed to the `body` and `localFinally` lambdas. Please note I am not recommending implementing a sum of 1 to 100 using parallelization; it is simply here to keep the example short.