Merging files in Scala

By | February 1, 2017

Scala can be used in an ETL context. An important ETL task is merging two files: data arrive from different sources and must be combined into a single file. As an example, think of two files, one containing a number and a name, the other containing a number and an age.
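Before involving Spark, the merge idea itself can be sketched with plain Scala on two in-memory "files"; the sample lines below are made up for illustration:

```scala
// Two tiny in-memory "files" (made-up sample data): one with name,number,
// one with number,age. We merge them on the shared number.
val namen    = Seq("Alice,1", "Bob,2")   // name,number
val leeftijd = Seq("1,34", "2,28")       // number,age

// Index the ages by their number, then look up each name's number.
val ages = leeftijd.map(_.split(',')).map(p => p(0) -> p(1)).toMap
val merged = namen.map(_.split(',')).collect {
  case Array(name, nr) if ages.contains(nr) => s"$nr,$name,${ages(nr)}"
}
merged.foreach(println)  // 1,Alice,34 and 2,Bob,28
```

This keyed lookup is essentially what the Spark join below does, only distributed over a cluster.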
In a previous post, I showed how two files can be merged in PySpark. In this post, I show how the same can be done in Scala.
As a first step, I read the two files. I then split each line on the delimiter and identify the columns:

val namen = sc.textFile("/user/hdfs/namen")

case class MatchData(id1: String, id2: Int)

def parseNamen(line: String): MatchData = {
      val pieces = line.split(',')
      val id1 = pieces(0)
      val id2 = pieces(1).toInt
      MatchData(id1, id2)
}

// toDF() needs the implicits in scope, e.g. import sqlContext.implicits._
val parsedNamen = namen.map(line => parseNamen(line)).toDF()

val leeftijd = sc.textFile("/user/hdfs/leeftijd")

case class MatchLeeftijd(id3: Int, id4: Int)

def parseLeef(line: String): MatchLeeftijd = {
      val pieces = line.split(',')
      val id3 = pieces(0).toInt
      val id4 = pieces(1).toInt
      MatchLeeftijd(id3, id4)
}

val parsedLeef = leeftijd.map(line => parseLeef(line)).toDF()
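The parsing functions themselves are plain Scala and can be checked without a cluster. A minimal sketch, with a made-up sample line in each file's format:

```scala
// Standalone copies of the parse logic, exercised on made-up sample lines.
case class MatchData(id1: String, id2: Int)
case class MatchLeeftijd(id3: Int, id4: Int)

def parseNamen(line: String): MatchData = {
  val pieces = line.split(',')
  MatchData(pieces(0), pieces(1).toInt)
}

def parseLeef(line: String): MatchLeeftijd = {
  val pieces = line.split(',')
  MatchLeeftijd(pieces(0).toInt, pieces(1).toInt)
}

println(parseNamen("Alice,1"))  // MatchData(Alice,1)
println(parseLeef("1,34"))      // MatchLeeftijd(1,34)
```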

I can then join the two objects with:

val samen = parsedLeef.join(parsedNamen, parsedNamen("id2") === parsedLeef("id3"))

The idea is that a column can be accessed with parsedNamen("id2") or parsedLeef("id3"). For me, the non-trivial item is how these columns are compared: Spark uses the === operator, which builds an equality expression over the two columns, rather than Scala's ordinary ==.
An alternative is a Cartesian product that is subsequently filtered:

val samen = parsedLeef.join(parsedNamen)  // Cartesian product; newer Spark versions use crossJoin
samen.filter("id2 - id3 = 0").show()
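The same Cartesian-then-filter idea can be sketched without Spark, as a for-comprehension over two in-memory sequences (sample data made up for this sketch):

```scala
// Made-up sample data: (name, number) pairs and (number, age) pairs.
val namen    = Seq(("Alice", 1), ("Bob", 2))
val leeftijd = Seq((1, 34), (2, 28))

// Cartesian product of the two sequences, filtered on matching numbers.
val samen = for {
  (nr, age)  <- leeftijd
  (name, id) <- namen
  if id == nr
} yield (name, nr, age)

println(samen)  // List((Alice,1,34), (Bob,2,28))
```

This is conceptually what Spark does, but a Cartesian product grows with the product of the two sizes, which is why the keyed join above is usually the better choice.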