Another PySpark script

January 5, 2017

In this note, I show yet another PySpark script, with slightly different filtering methods. The idea is that the file is read into an RDD and then cleaned. The cleaning step first removes lines that are too short (20 characters or fewer). Each remaining line is then split on the character found at its twentieth position, and lines that do not split into exactly 14 fields are filtered out. After that, only a few columns are kept. Finally, the values of each record are concatenated into a comma-delimited line and written to disk.
We have:

devstatus = sc.textFile("/loudacre/devicestatus.txt")

# Filter out lines with 20 characters or fewer, use the 20th character as the delimiter, split the line, and filter out malformed records
cleanstatus = devstatus. \
    filter(lambda line: len(line) > 20). \
    map(lambda line: line.split(line[19:20])). \
    filter(lambda values: len(values) == 14)
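The delimiter trick works because whatever character separates the fields of a record (a comma, a pipe, and so on) also happens to sit at a fixed position in the line, so `line[19:20]` recovers it per record. A minimal sketch of that cleaning step outside Spark, using invented sample lines (the field values and formats below are hypothetical, merely shaped so the 20th character is the delimiter):

```python
# Hypothetical devicestatus-style lines: the field delimiter (',' or '|')
# also appears as the 20th character (index 19) of each line.
lines = [
    "2014-03-15:10:10:20,Sorrento F41L,8cc3b47e-bd01-4482-b500-28f2342679af,0,1,2,3,4,5,6,7,8,33.6894754264,-117.543308253",
    "2014-03-15:10:10:20|MeeToo 1.0|ef8c7564-0a1a-4650-a655-c8bbd5f8f943|0|1|2|3|4|5|6|7|8|37.4321088904|-121.485029632",
]

cleaned = [
    values
    for line in lines
    if len(line) > 20                        # drop lines that are too short
    for values in [line.split(line[19:20])]  # split on the 20th character
    if len(values) == 14                     # keep only well-formed records
]
```

Each surviving element of `cleaned` is a 14-field list, regardless of which delimiter the original line used.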

# Create a new RDD containing date, manufacturer, device ID, latitude and longitude
devicedata = cleanstatus. \
    map(lambda words: (words[0], words[1].split(' ')[0], words[2], words[12], words[13]))
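The manufacturer is taken as the first space-separated token of the model field (`words[1]`). A quick illustration with a hypothetical model string:

```python
# Hypothetical model string from the status file; the manufacturer is
# assumed to be the first space-separated token.
model = "Sorrento F41L"
manufacturer = model.split(' ')[0]
```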

# Save to a CSV file as a comma-delimited string (trim parenthesis from tuple toString)
devicedata. \
    map(lambda values: ','.join(values)). \
    saveAsTextFile("/loudacre/devicestatus_etl")
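The final `','.join(values)` is what produces a plain comma-delimited line; writing the tuple directly would make `saveAsTextFile` emit its string representation, parentheses and quotes included. A local sketch with an invented record:

```python
# Hypothetical cleaned record (values invented for illustration).
record = ("2014-03-15:10:10:20", "Sorrento", "8cc3b47e", "33.6894754264", "-117.543308253")

# str(record) would yield "('2014-03-15:10:10:20', 'Sorrento', ...)",
# parentheses and quotes included; joining produces clean CSV instead.
line = ','.join(record)
```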