scala - How to deal with error SPARK-5063 in Spark
I get the error message SPARK-5063 in the line with println:

    d.foreach { x => for (i <- 0 until x.length) println(m.lookup(x(i))) }

where d is an RDD[Array[String]] and m is an RDD[(String, String)]. Is there a way to print the lookups the way I want? Or how can I convert d from RDD[Array[String]] to Array[String]?
SPARK-5063 relates to better error messages when trying to nest RDD operations, which is not supported.

It's a usability issue rather than a functional one. The root cause is the nesting of RDD operations, and the solution is to break that up.
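For illustration, a minimal sketch of the nested pattern that triggers the error (hypothetical names, not from the original post):

    // Invoking an RDD operation (here lookup) inside the closure of another
    // RDD operation raises a SparkException that references SPARK-5063.
    val outer = sc.parallelize(Seq("one", "two"))
    val inner = sc.parallelize(Seq("one" -> 1, "two" -> 2))
    outer.map(w => inner.lookup(w)).collect() // fails at runtime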
Here you are trying a join of dRDD and mRDD. If mRDD is large, rdd.join would be the recommended way; otherwise, if mRDD is small, i.e. it fits in the memory of each executor, we can collect it, broadcast it and do a 'map-side' join.

Join

A simple join would go like this:
    val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
    val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
    val flat = rdd.flatMap(_.toSeq).keyBy(x => x)
    val res = flat.join(map).map { case (k, v) => v }
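To actually print the joined values, bring them back to the driver first; a minimal sketch using res from above:

    res.collect().foreach(println) // prints the (word, value) pairs on the driver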
If we want to use broadcast, we first need to collect the value of the resolution table locally in order to broadcast it to all executors. NOTE the RDD to be broadcast MUST fit in the memory of the driver as well as of each executor.

Map-side join using a broadcast variable
    val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
    val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
    val bcTable = sc.broadcast(map.collectAsMap)
    val res2 = rdd.flatMap { arr => arr.map(elem => (elem, bcTable.value(elem))) }
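Applied to the original d: RDD[Array[String]] and m: RDD[(String, String)], a sketch along the same lines (assuming m is small enough to collect; note that collectAsMap keeps only one value per key, whereas m.lookup returns all values for a key):

    val bcM = sc.broadcast(m.collectAsMap) // m must fit in driver and executor memory
    // println inside d.foreach would run on the executors; collect first if the
    // output should appear on the driver. bcM.value.get returns an Option.
    d.collect().foreach(arr => arr.foreach(x => println(bcM.value.get(x))))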