A Python script with many steps

PySpark is the Python API for Spark. It therefore offers a convenient combination: Spark, with its ability to go beyond the limitations imposed by the MapReduce framework, and Python, which is relatively simple.

The scheme below shows some steps that might be used.

sc.textFile reads a file and exposes it as an RDD (resilient distributed dataset): a dataset that is distributed over the nodes and can be recreated quickly if part of it is lost.
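As a sketch of what this step produces, the following plain-Python analogue writes a small sample file and reads it back line by line; in Spark, `rdd = sc.textFile(path)` would do the same thing distributedly, with each line of the file becoming one RDD element. The file content here is invented for illustration.

```python
import os
import tempfile

# Write a small sample file, then read it line by line — a local
# analogue of rdd = sc.textFile(path), where every line of the
# file becomes one element of the resulting RDD.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("apple pear apple\nfig pear apple\n")
    path = f.name

with open(path) as f:
    lines = f.read().splitlines()
os.remove(path)

print(lines)  # ['apple pear apple', 'fig pear apple']
```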


flatMap creates multiple output elements from one input line.
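A minimal plain-Python analogue of this step: splitting each line into words yields several output elements per input line, and the results are flattened into a single sequence rather than a list of lists.

```python
lines = ["apple pear apple", "fig pear apple"]

# Analogue of rdd.flatMap(lambda line: line.split()):
# each line produces several words, flattened into one list.
words = [w for line in lines for w in line.split()]

print(words)  # ['apple', 'pear', 'apple', 'fig', 'pear', 'apple']
```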

map processes one element at a time. From each word, two fields are created: the word itself and its length.
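In plain Python this step looks like a one-to-one transformation of each word into a (word, length) pair, which is what `rdd.map(lambda w: (w, len(w)))` would produce in PySpark.

```python
words = ["apple", "pear", "fig"]

# Analogue of rdd.map(lambda w: (w, len(w))):
# each word becomes a pair of the word and its length.
pairs = [(w, len(w)) for w in words]

print(pairs)  # [('apple', 5), ('pear', 4), ('fig', 3)]
```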

filter keeps only the elements that satisfy a predicate.
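The predicate used here is an invented example (keep words longer than three characters); the point is only that filter drops the elements for which the predicate is false, as `rdd.filter(lambda p: p[1] > 3)` would in PySpark.

```python
pairs = [('apple', 5), ('pear', 4), ('fig', 3)]

# Analogue of rdd.filter(lambda p: p[1] > 3):
# only pairs whose length field exceeds 3 are kept.
kept = [p for p in pairs if p[1] > 3]

print(kept)  # [('apple', 5), ('pear', 4)]
```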

groupByKey aggregates the elements by the first field, which acts as the key.
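A plain-Python sketch of the grouping: with the word as the first field, all values that share a word end up under that word, the way `rdd.groupByKey()` yields one (key, iterable-of-values) pair per distinct key.

```python
from collections import defaultdict

pairs = [('apple', 5), ('pear', 4), ('apple', 5)]

# Analogue of rdd.groupByKey(): collect all values per key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

print(dict(groups))  # {'apple': [5, 5], 'pear': [4]}
```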

map then translates each aggregate into something that is human readable.
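One possible formatting is sketched below (the exact output string is an invented example): each (key, values) aggregate is mapped to a readable summary line, as a second `rdd.map(...)` would do.

```python
groups = [('apple', [5, 5]), ('pear', [4])]

# Analogue of a final rdd.map(...): turn each (key, values)
# aggregate into a human-readable string.
readable = [f"{word} appears {len(lengths)} time(s)" for word, lengths in groups]

print(readable)  # ['apple appears 2 time(s)', 'pear appears 1 time(s)']
```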

collect returns the results to the driver program as a list, where they can be displayed.
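The whole chain above can be sketched end to end in plain Python. In Spark the transformations are lazy and nothing runs until collect() is called; here every step runs eagerly, and the input text, the filter threshold, and the output format are all invented for illustration.

```python
from collections import defaultdict

text = "apple pear apple\nfig pear apple"

lines = text.splitlines()                             # sc.textFile
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, len(w)) for w in words]                  # map
pairs = [p for p in pairs if p[1] > 3]                # filter

groups = defaultdict(list)                            # groupByKey
for key, value in pairs:
    groups[key].append(value)

readable = sorted(                                    # map
    f"{word}: {len(lengths)} occurrence(s)" for word, lengths in groups.items()
)

print(readable)                                       # collect
# ['apple: 3 occurrence(s)', 'pear: 2 occurrence(s)']
```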