The 1000th wordcount example

By | January 3, 2017

I just discovered the 1000th wordcount example. It is based on Pyspark. The idea is actually quite simple. One creates a script. This script can be written in any editor. The programme can then be run from the terminal by spark-submit [programme]. As an example, one may start the programme below with: spark-submit –master yarn-cluster prog1.py /user/hdfs/fam /user/hdfs/prog1A. The option –master yarn-cluster indicates how the programme is run. In this case the cluster is used. It can also be run locally; in that case –master local is used.

import sys
from pyspark import SparkContext
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print >> sys.stderr, "Usage: WordCount <file>"
        exit(-1)
sc = SparkContext()
counts = sc.textFile(sys.argv[1]) \
.flatMap(lambda line: line.split(',')) \
.map(lambda word: (word,1)) \
.reduceByKey(lambda v1,v2: v1+v2)
counts.saveAsTextFile(sys.argv[2])
sc.stop()