Reading an HDFS file in Python

In this note, I show how to get data from an HDFS platform into a Python programme. We have data on HDFS and would like to use them in Python. So, we must connect to HDFS from within the programme, read the data, transform them and store them.

One way to achieve this is to use Spark. Its Python shell is started with the command “pyspark” on Linux. Hopefully, you see something like:

[screenshot: the pyspark start-up screen]

All in all, getting a file into Python with Spark is a bit complicated. First, we import the library that allows us to set up a connection to the HDFS platform; we can then access the platform. The pyspark shell implicitly creates a context, available as “sc”. From that object, the HDFS file can be retrieved. The lines in the file must then be split into fields. Once that is done, the content can be shown with the method “collect”. The code is:

from pywebhdfs.webhdfs import PyWebHdfsClient
hdfs = PyWebHdfsClient(user_name="hdfs", port=50070, host="192.168.2.59")
data = sc.textFile("/Chapter5/uit2")  # sc: the SparkContext created by the pyspark shell
parts = data.map(lambda r: r.split(';'))  # split each line into its fields
datalines = parts.filter(lambda x: x)  # keep rows whose list of fields is not empty
for i in datalines.collect():
    print(i)

tabel = datalines.toDF(['nummer', 'naam'])  # name the columns: the RDD becomes a data frame
tabel.select("nummer", "naam").show()
tabel.filter(tabel['nummer'] == '1').show()

The first line (“from pywebhdfs” etc.) links the programme to the webhdfs service. Within HDFS, one may verify whether this works by looking at the properties of HDFS. One should see something like:
[screenshot: the HDFS properties]

This is taken from Hortonworks, which I currently use. The webhdfs service allows you to communicate with the HDFS platform. As an example, if you want to know the status of the files in an HDFS directory (here /user/hdfs), you can open any browser and type: “http://server:50070/webhdfs/v1/user/hdfs?op=LISTSTATUS”, with server the HDFS server.
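The same request can be made from Python instead of a browser. The sketch below uses only the standard library. The helper names (“liststatus_url”, “names_from_liststatus”, “list_directory”) are my own, and the sample response is a hand-written illustration of the JSON layout that webhdfs returns for LISTSTATUS, not output from a real cluster.

```python
import json
from urllib.request import urlopen


def liststatus_url(server, path, port=50070):
    """Build the webhdfs URL that lists the directory `path`."""
    return "http://{}:{}/webhdfs/v1{}?op=LISTSTATUS".format(server, port, path)


def names_from_liststatus(json_text):
    """Return the entry names found in a LISTSTATUS JSON response."""
    statuses = json.loads(json_text)["FileStatuses"]["FileStatus"]
    return [entry["pathSuffix"] for entry in statuses]


def list_directory(server, path, port=50070):
    """Ask the server for the listing (needs a reachable cluster)."""
    with urlopen(liststatus_url(server, path, port)) as response:
        return names_from_liststatus(response.read().decode("utf-8"))


# Hand-written sample in the LISTSTATUS response layout (hypothetical content).
sample = '{"FileStatuses": {"FileStatus": [{"pathSuffix": "uit2", "type": "FILE"}]}}'
print(liststatus_url("server", "/user/hdfs"))
print(names_from_liststatus(sample))
```

Only “list_directory” touches the network, so the two other helpers can be tried without a cluster.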

So, the from statement allows the Python programme to communicate with HDFS.

The second statement (“PyWebHdfsClient”) creates a wrapper around the webhdfs service. This wrapper lets us specify the server and the user.
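The client object can also fetch a file by itself, without Spark. A minimal sketch, assuming that the client's “read_file” method returns the file content (as bytes or as text) and that pywebhdfs expects the path without a leading slash; both points are worth checking against the pywebhdfs documentation. The helper “read_rows” and the stand-in class “FakeClient” are hypothetical; the stand-in only lets the sketch run without a cluster.

```python
def read_rows(client, path, sep=";"):
    """Read a small text file through a webhdfs client and split it into fields."""
    raw = client.read_file(path)
    # Accept both bytes and str, since the return type is an assumption here.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return [line.split(sep) for line in text.splitlines() if line.strip()]


class FakeClient:
    """Hypothetical stand-in that mimics read_file, so the sketch runs offline."""

    def read_file(self, path):
        return b"1;tom\n2;ine\n"


rows = read_rows(FakeClient(), "Chapter5/uit2")
print(rows)
```

With a real cluster, one would pass the “hdfs” object instead of “FakeClient()”. This route reads the whole file into memory, which is fine for small files; for large files Spark remains the better choice.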

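The third line (“sc.textFile”) reads the file. Here “sc” is the SparkContext that the pyspark shell created at start-up. Spark resolves the path against its own configured file system, so this call does not pass through the “hdfs” client object from the second line. We must indicate which file to read.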

The fourth line (“map(lambda r: ”) contains a lambda, which is a very concise way of creating a function. Before the colon, the parameter is named. Whatever we see after the colon is the transformation, the expression that yields the outcome. The idea is that a row (such as “1;tom”) is split at the separator in the row, here a semicolon.
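To see what the lambda does without a cluster, the same function can be tried on an ordinary Python list. The two sample rows are made up for illustration, and the built-in “map” stands in for the map method of the RDD.

```python
# A lambda is an unnamed function: parameter before the colon, result after it.
split_row = lambda r: r.split(';')


# The same function written out with def, for comparison.
def split_row_def(r):
    return r.split(';')


rows = ["1;tom", "2;ine"]            # hypothetical sample lines
parts = list(map(split_row, rows))   # apply the function to every row
print(parts)
```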

The “for” loop displays the different rows in the result we have so far.

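The line just before the loop (“filter(lambda”) is an elegant way of filtering the rows: it keeps only the rows for which the lambda returns a true value, that is, the rows for which the preceding split produced a non-empty list of fields.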
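One detail is worth checking by hand: “split” always returns a list with at least one element, so a filter on truthiness alone lets every row through, including empty lines. The sketch below shows this on a plain Python list. It also shows a stricter filter that keeps only rows with exactly two fields, which is what “toDF” with two column names expects. The stricter filter is an alternative of my own, not part of the code above, and the sample lines are made up.

```python
lines = ["1;tom", "", "broken line", "2;ine"]   # hypothetical input
parts = [line.split(';') for line in lines]

# Same test as filter(lambda x: x): a non-empty list always counts as true.
truthy = list(filter(lambda x: x, parts))

# Stricter alternative: keep only rows that split into exactly two fields.
two_fields = list(filter(lambda x: len(x) == 2, parts))

print(len(truthy))
print(two_fields)
```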

The next line (“toDF”) changes the RDD into a DF. RDD stands for Resilient Distributed Dataset, which is more of a black box of data: because the operations that can be performed against it are not very constrained, Spark cannot optimize them. DF stands for Data Frame, which is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.

The next-to-last line selects some columns and shows them to the user, whereas the last line filters on a certain value: it keeps the rows where “nummer” equals '1'.
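The code above relies on the “sc” object that the pyspark shell provides. In a standalone script no such object exists, so the session has to be created explicitly. A minimal sketch, assuming that pyspark is installed and that Spark is configured so that the path resolves to HDFS; the function names, the application name and the two-field check are my own choices.

```python
def parse_line(line, sep=";"):
    """Split one line into fields; pure Python, usable without Spark."""
    return line.split(sep)


def show_table(path="/Chapter5/uit2"):
    """Read the file with Spark and show it as a data frame."""
    # Imported here so that parse_line can be used where pyspark is absent.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-hdfs-file").getOrCreate()
    try:
        data = spark.sparkContext.textFile(path)
        # Keep only rows with exactly two fields, matching the two column names.
        datalines = data.map(parse_line).filter(lambda x: len(x) == 2)
        tabel = datalines.toDF(["nummer", "naam"])
        tabel.select("nummer", "naam").show()
        tabel.filter(tabel["nummer"] == "1").show()
    finally:
        spark.stop()
```

Calling “show_table()” from a script then performs the same steps as the shell session.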