Dataflow performance issues -


i'm aware update made cdf service few weeks ago (default worker type & attached pd changed), , made clear make batch jobs slower. however, performance of our jobs has degraded beyond point of them fulfilling our business needs.

for example, 1 of our jobs in particular: reads ~2.7 million rows table in bigquery, has 6 side inputs (bq tables), simple string transformations, , writes multiple outputs (3) bigquery. used take 5-6 minutes , takes anywhere between 15-20 mins - not matter how many vm's chuck @ it.

is there can speeds used see?

here stats:

  1. reading bq table 2,744,897 rows (294mb)
  2. 6 bq side inputs
  3. 3 multi-outputs bq, 2 of 2,744,897 , other 1,500 rows
  4. running in zone asia-east1-b
  5. times below include worker pool spin , tear down

10 vms (n1-standard-2) 16 mins 5 sec 2015-04-22_19_42_20-4740106543213058308

10 vms (n1-standard-4) 17 min 11 sec 2015-04-22_20_04_58-948224342106865432

10 vms (n1-standard-1) 18 min 44 sec 2015-04-22_19_42_20-4740106543213058308

20 vms (n1-standard-2) 22 min 53 sec 2015-04-22_21_26_53-18171886778433479315

50 vms (n1-standard-2) 17 min 26 sec 2015-04-22_21_51_37-16026777746175810525

100 vms (n1-standard-2) 19 min 33 sec 2015-04-22_22_32_13-9727928405932256127

the evidence seems indicate there issue how pipeline handles side inputs. specifically, it's quite side inputs may getting re-read bigquery again , again, every element of main input. orthogonal changes type of virtual machines used dataflow workers, described below.

this closely related changes made in dataflow sdk java, version 0.3.150326. in release, changed side input api apply per window. calls sideinput() return values in specific window corresponding window of main input element, , not whole side input pcollectionview. consequently, sideinput() can no longer called startbundle , finishbundle of dofn because window not yet known.

for example, following code snippet has issue cause re-reading side input every input element.

@override public void processelement(processcontext c) throws exception {   iterable<string> uniqueids = c.sideinput(iterableview);    (string item : uniqueids) {     [...]   }    c.output([...]); } 

this code can improved caching side input list member variable of transform (assuming fits memory) during first call processelement, , use cached list instead of side input in subsequent calls.

this workaround should restore performance seeing before, when side inputs have been called startbundle. long-term, work on better caching side inputs. (if doesn't resolve issue, please reach out via email , share relevant code snippets.)


separately, there was, indeed, update cloud dataflow service around 4/9/15 changed default type of virtual machines used dataflow workers. specifically, reduced default number of cores per worker because our benchmarks showed cost effective typical jobs. not slowdown in dataflow service of kind -- runs less resources per worker, default. users still given options override both number of workers type of virtual machine used workers.


Popular posts from this blog

c# - ODP.NET Oracle.ManagedDataAccess causes ORA-12537 network session end of file -

matlab - Compression and Decompression of ECG Signal using HUFFMAN ALGORITHM -

utf 8 - split utf-8 string into bytes in python -