scala - How to deal with error SPARK-5063 in Spark


I get the error message SPARK-5063 on the line with println:

d.foreach { x => for (i <- 0 until x.length) println(m.lookup(x(i))) }

Here d is an RDD[Array[String]] and m is an RDD[(String, String)]. Is there a way to print it the way I want? Or how can I convert d from RDD[Array[String]] to Array[String]?

SPARK-5063 relates to better error messages when trying to nest RDD operations, which is not supported.

It's a usability issue, not a functional one. The root cause is the nesting of RDD operations, and the solution is to break that nesting up.

Here we are trying a join of dRDD and mRDD. If mRDD is large, rdd.join is the recommended way. Otherwise, if mRDD is small, i.e. it fits in the memory of each executor, we can collect it, broadcast it, and do a 'map-side' join.

Join

A simple join would go like this:

val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
val flat = rdd.flatMap(_.toSeq).keyBy(x => x)
val res = flat.join(map).map { case (k, v) => v }
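The join above can be sketched with plain Scala collections to show what happens to the data, without needing a SparkContext (object and value names here are illustrative, not part of the Spark API):

```scala
object JoinSketch {
  // Stand-ins for the two RDDs: nested arrays of keys, and a lookup table
  val data = Seq(Array("one", "two", "three"), Array("four", "five", "six"))
  val table = Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6)

  // flatMap(_.toSeq).keyBy(x => x): flatten, then pair each element with itself as the key
  val flat = data.flatMap(_.toSeq).map(x => (x, x))

  // join on the key and keep only the joined values, like flat.join(map).map { case (k, v) => v }
  val tableMap = table.toMap
  val res = flat.collect { case (k, v) if tableMap.contains(k) => (v, tableMap(k)) }
}
```

In the real RDD version the join triggers a shuffle so that pairs with the same key meet on the same partition; the local sketch only mirrors the resulting values.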

If we use a broadcast, we first need to collect the values of the resolution table locally in order to broadcast them to the executors. Note that the RDD to be broadcasted must fit in the memory of the driver and of each executor.

Map-side join with broadcast variable

val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
val bcTable = sc.broadcast(map.collectAsMap)
val res2 = rdd.flatMap { arr => arr.map(elem => (elem, bcTable.value(elem))) }
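The map-side join can likewise be sketched locally: a plain Map stands in for the broadcast variable, and each array is resolved against it directly, with no shuffle (names below are illustrative):

```scala
object MapSideJoinSketch {
  // Stand-in for the large RDD of arrays
  val data = Seq(Array("one", "two", "three"), Array("four", "five", "six"))

  // Stand-in for bcTable.value: the small table collected into memory
  val bcTable = Map("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6)

  // Each element is looked up in the in-memory table, mirroring
  // rdd.flatMap { arr => arr.map(elem => (elem, bcTable.value(elem))) }
  val res2 = data.flatMap(arr => arr.map(elem => (elem, bcTable(elem))))
}
```

This is exactly why the broadcast table must fit in memory: every executor holds a full copy and does its lookups locally.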
