c# - Parallel Framework and avoiding false sharing -


Recently, I answered a question about optimizing a highly parallelizable method for the generation of every permutation of arbitrary base numbers. I posted an answer similar to the Parallelized, poor implementation code block below, and someone pointed out:

This is pretty much guaranteed to give you false sharing and will probably be many times slower. (credit gjvdkamp)

And they were right: it was death, slow. That said, I researched the topic and found some interesting material and suggestions for combating it. If I understand it correctly, when threads access contiguous memory (in, say, the array that's backing that ConcurrentStack), false sharing likely occurs.
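To make the effect concrete, here is a minimal, hypothetical benchmark (not from the original post; timings vary by machine): two threads each increment their own counter, but when the counters are adjacent `long`s they share a cache line, and the writes invalidate each other's caches.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    // Times two threads incrementing separate counters that are either
    // adjacent (stride 1, same cache line) or ~128 bytes apart (stride 16).
    public static long TimeIncrements(int stride)
    {
        const int Iterations = 10_000_000;
        var counters = new long[stride * 2];
        var sw = Stopwatch.StartNew();
        Parallel.For(0, 2, t =>
        {
            int index = t * stride; // each thread owns its own element
            for (int i = 0; i < Iterations; i++)
                counters[index]++;
        });
        sw.Stop();
        if (counters[0] != Iterations || counters[stride] != Iterations)
            throw new InvalidOperationException("unexpected counts");
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // The adjacent case typically runs noticeably slower than the padded one.
        Console.WriteLine($"adjacent: {TimeIncrements(1)} ms");
        Console.WriteLine($"padded:   {TimeIncrements(16)} ms");
    }
}
```

Neither thread ever touches the other's counter; the slowdown in the stride-1 case comes purely from cache-line ownership bouncing between cores.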


For the code below the horizontal rule, Bytes is:

struct Bytes {
    public byte A; public byte B; public byte C; public byte D;
    public byte E; public byte F; public byte G; public byte H;
}

For my own testing, I wanted a parallel version of this running that was genuinely faster, so I created a simple example based on the original code. Choosing 6 as limits[0] was a lazy choice on my part - my computer has 6 cores.

Single-threaded block, average run time: 10s0059ms

var data = new List<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

for (byte a = 0; a < limits[0]; a++)
for (byte b = 0; b < limits[1]; b++)
for (byte c = 0; c < limits[2]; c++)
for (byte d = 0; d < limits[3]; d++)
for (byte e = 0; e < limits[4]; e++)
for (byte f = 0; f < limits[5]; f++)
for (byte g = 0; g < limits[6]; g++)
for (byte h = 0; h < limits[7]; h++)
    data.Add(new Bytes {
        A = a, B = b, C = c, D = d,
        E = e, F = f, G = g, H = h
    });

Parallelized, poor implementation, run time avg: 81s729ms, ~8700 contentions

var data = new ConcurrentStack<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

Parallel.For(0, limits[0], (a) => {
    for (byte b = 0; b < limits[1]; b++)
    for (byte c = 0; c < limits[2]; c++)
    for (byte d = 0; d < limits[3]; d++)
    for (byte e = 0; e < limits[4]; e++)
    for (byte f = 0; f < limits[5]; f++)
    for (byte g = 0; g < limits[6]; g++)
    for (byte h = 0; h < limits[7]; h++)
        data.Push(new Bytes {
            A = (byte)a, B = b, C = c, D = d,
            E = e, F = f, G = g, H = h
        });
});

Parallelized, ?? implementation, run time avg: 5s833ms, 92 contentions

var data = new ConcurrentStack<List<Bytes>>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

Parallel.For(0, limits[0], () => new List<Bytes>(),
    (a, loop, localList) => {
        for (byte b = 0; b < limits[1]; b++)
        for (byte c = 0; c < limits[2]; c++)
        for (byte d = 0; d < limits[3]; d++)
        for (byte e = 0; e < limits[4]; e++)
        for (byte f = 0; f < limits[5]; f++)
        for (byte g = 0; g < limits[6]; g++)
        for (byte h = 0; h < limits[7]; h++)
            localList.Add(new Bytes {
                A = (byte)a, B = b, C = c, D = d,
                E = e, F = f, G = g, H = h
            });
        return localList;
    }, x => {
        data.Push(x);
    });

I'm glad I got an implementation that is faster than the single-threaded version. I expected a result closer to 10s / 6, around 1.6 seconds, but that's probably a naive expectation.

My question is: for the parallelized implementation that is faster than the single-threaded version, are there further optimizations that could be applied to this operation? I'm wondering about optimizations related to the parallelization, not improvements to the algorithm used to compute the values. Specifically:

  • I know an optimization would be to store and populate a struct instead of a byte[], but it's not related to the parallelization (or is it?)
  • I know the desired value could be lazily evaluated with a ripple-carry adder, but the same caveat applies as for the struct optimization.
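The ripple-carry idea in the second bullet can be sketched as a mixed-radix counter (a hypothetical helper, not the asker's code): instead of nesting eight loops and materializing every value up front, each next value is produced lazily from the previous one, with the carry rippling from the least significant digit leftward.

```csharp
using System;

static class MixedRadixCounter
{
    // Advances digits to the next value in place; returns false once all
    // combinations are exhausted (the counter wraps back to all zeros).
    public static bool Increment(byte[] digits, byte[] limits)
    {
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            if (++digits[i] < limits[i]) return true; // no carry: done
            digits[i] = 0;                            // carry ripples left
        }
        return false;
    }
}

class Demo
{
    static void Main()
    {
        // Small example: radices { 2, 3 } enumerate 2 * 3 = 6 values.
        var digits = new byte[] { 0, 0 };
        var limits = new byte[] { 2, 3 };
        int count = 1; // the initial all-zeros value counts as the first
        while (MixedRadixCounter.Increment(digits, limits))
            count++;
        Console.WriteLine(count); // prints 6
    }
}
```

With the post's limits { 6, 16, 16, 16, 32, 8, 8, 8 } this visits the same values as the nested loops, one at a time, without allocating the whole list.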

First off, my initial assumption regarding Parallel.For() and Parallel.ForEach() was wrong.

The poor parallel implementation has 6 threads attempting to write to a single ConcurrentStack at once. The implementation using thread locals (explained more below) only accesses the shared variable once per task, nearly eliminating the contention.

When using Parallel.For() and Parallel.ForEach(), you cannot simply replace a for or foreach loop with them in-line. That's not to say it couldn't be a blind improvement, but without examining the problem and instrumenting it, using them is throwing multithreading at a problem just because it might make it faster.

Parallel.For() and Parallel.ForEach() have overloads that allow you to create local state for the Tasks they create, and to run an expression before and after each Task's execution.

If you have an operation you want to parallelize with Parallel.For() or Parallel.ForEach(), it's probably a good idea to use this overload:

public static ParallelLoopResult For<TLocal>(
    int fromInclusive,
    int toExclusive,
    Func<TLocal> localInit,
    Func<int, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally
)

For example, calling For() to sum the integers from 1 to 100:

var total = 0;

Parallel.For(0, 101, () => 0,  // <-- localInit
(i, state, localTotal) => {    // <-- body
    localTotal += i;
    return localTotal;
}, localTotal => {             // <-- localFinally
    Interlocked.Add(ref total, localTotal);
});

Console.WriteLine(total);

localInit should be a lambda that initializes the local state type, which is then passed to the body and localFinally lambdas. Please note I am not recommending implementing summing from 1 to 100 using parallelization; I simply needed an operation this basic to keep the example short.
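Parallel.ForEach() follows the same pattern. Here is a small sketch of my own (the method name and workload are made up for illustration): each task accumulates into its own local counter, and the shared total is only touched once per task in localFinally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ForEachLocalState
{
    // Counts the even numbers in source, using per-task local state so the
    // shared variable is only written once per task, not once per element.
    public static int CountEvens(IEnumerable<int> source)
    {
        int total = 0;
        Parallel.ForEach(source,
            () => 0,                      // localInit: per-task counter
            (n, state, localCount) =>     // body: runs once per element
                n % 2 == 0 ? localCount + 1 : localCount,
            localCount =>                 // localFinally: runs once per task
                Interlocked.Add(ref total, localCount));
        return total;
    }

    static void Main()
    {
        Console.WriteLine(CountEvens(Enumerable.Range(1, 100))); // prints 50
    }
}
```

As with the For() example above, the only synchronization is the single Interlocked.Add per task, which is why the thread-local version in the question shows so few contentions.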

