How to deal with error SPARK-5063 in Spark
I get the error message SPARK-5063 in the line with println:

    d.foreach { x => for (i <- 0 until x.length) println(m.lookup(x(i))) }

where d is an RDD[Array[String]] and m is an RDD[(String, String)]. Is there a way to print it the way I want? Or how can I convert d from RDD[Array[String]] to Array[String]?
SPARK-5063 relates to better error messages when trying to nest RDD operations, which is not supported.

It's a usability issue, not a functional one. The root cause is the nesting of RDD operations, and the solution is to break that up.
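For the code in the question, a minimal sketch of "breaking it up" could look like this, assuming both m and d are small enough to collect to the driver (the names come from the question; the idea mirrors the broadcast option discussed below):

    // Collect the lookup table to the driver as a plain Map,
    // so no RDD is referenced inside another RDD's closure.
    val localMap = m.collectAsMap()
    // Bring d to the driver as well, so println actually prints locally.
    d.collect().foreach { arr =>
      arr.foreach(key => println(localMap.get(key)))  // prints Option values; localMap(key) would unwrap
    }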
Here we are trying a join of dRDD and mRDD. If the size of mRDD is large, rdd.join would be the recommended way; otherwise, if mRDD is small, i.e. fits in the memory of each executor, we can collect it, broadcast it and do a 'map-side' join.
Join

A simple join would go like this:
    val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
    val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
    val flat = rdd.flatMap(_.toSeq).keyBy(x => x)
    val res = flat.join(map).map { case (k, v) => v }
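To inspect the result on the driver, one could collect it (an illustrative check only; the ordering depends on partitioning):

    res.collect().foreach(println)   // e.g. (one,1), (two,2), ...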
If we use a broadcast, we first need to collect the value of the resolution table locally in order to broadcast it to all executors. Note that the RDD to be broadcasted MUST fit in the memory of the driver as well as of each executor.
Map-side join with broadcast variable
    val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
    val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
    val bcTable = sc.broadcast(map.collectAsMap)
    val res2 = rdd.flatMap { arr =>
      arr.map(elem => (elem, bcTable.value(elem)))
    }
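This version avoids a shuffle entirely, since each executor holds the whole lookup table in memory. As above, a quick sanity check on the driver (output order not guaranteed):

    res2.collect().foreach(println)   // e.g. (one,1), (two,2), ...

One caveat: bcTable.value(elem) throws NoSuchElementException for keys missing from the table; bcTable.value.get(elem) would be the safer variant if missing keys are possible.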