With Python in Hive

By | November 28, 2016

This short note describes how an HDFS file can be stored in a Hive context. Once it is stored in a Hive context, it can be accessed from outside via ODBC. It is also possible to access the data as an SQL-compliant database. The idea is that an abstraction is created on top of the HDFS datasets, so that one may access them much like an ordinary database.
We will use the Python language via Spark. This avoids the MapReduce bottleneck, because the Hive statements are executed by Spark's engine instead of being translated into MapReduce jobs.
One starts Python via Spark with the command “pyspark”. If everything goes well, the Spark startup banner appears, followed by the Python prompt.
Two names are important: sc, the SparkContext that serves as an anchor point for the methods available within Spark, and HiveContext, which can be used as a starting point for Hive methods.

We first import the relevant libraries and create the context:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

Then the table is defined:

sqlContext.sql("CREATE TABLE IF NOT EXISTS HiveTom (key STRING, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'")
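To make the row format concrete, the sketch below mimics in plain Python how one line of a ';'-delimited file would map onto the two columns (key, value). It is only an illustration under stated assumptions, not Hive's actual parser: the function name split_row and the sample lines are made up, and the handling of short and long lines (missing fields become NULL, extra fields are dropped) reflects how Hive's default delimited format is generally understood to behave.

```python
def split_row(line, delimiter=";", n_columns=2):
    """Split one text line into n_columns fields, roughly as a
    delimited Hive table with that many columns would see it.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split(delimiter)
    # Assumption: fields beyond the declared columns are ignored.
    fields = fields[:n_columns]
    # Assumption: missing trailing fields show up as NULL (None here).
    fields += [None] * (n_columns - len(fields))
    return tuple(fields)


# Made-up sample lines, not taken from the file used in this note.
for sample in ["a;1\n", "b", "c;3;extra"]:
    print(split_row(sample))
```

Every field stays a string, which matches the STRING types declared in the table definition.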

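In the last step, an existing HDFS file is loaded into that table. Note that LOAD DATA INPATH moves the file from its original HDFS location into the Hive warehouse directory rather than copying it: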

sqlContext.sql("LOAD DATA INPATH 'hdfs:/Chapter5/uit2' INTO TABLE HiveTom")
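A quick way to verify the load from the same PySpark session is to query the table back. The helper below is a minimal sketch: preview_table is a hypothetical name, and it assumes it is handed the sqlContext created earlier and that the table has the key and value columns defined above.

```python
def preview_table(sql_context, table_name, limit=5):
    """Return the first rows of a Hive table as a list.

    sql_context is expected to be the HiveContext created above;
    preview_table itself is a hypothetical helper, not part of Spark.
    """
    query = "SELECT key, value FROM {} LIMIT {}".format(table_name, int(limit))
    # sql() returns a DataFrame; collect() brings the rows to the driver.
    return sql_context.sql(query).collect()


# In the pyspark shell one would call, for example:
#   for row in preview_table(sqlContext, "HiveTom"):
#       print(row)
```

Keeping the limit small matters here, since collect() pulls every selected row into the driver process.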

We may now approach this dataset as a table named HiveTom. One possibility is to access the table via ODBC, for which an ODBC connector must be downloaded. Each distribution (Cloudera, MapR, Hortonworks) has an ODBC connector. Once it is installed, we may retrieve the data in any ODBC-compliant tool. As an example, this can be done in Excel.
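The same ODBC route can also be taken from Python instead of Excel. The following is a sketch under several assumptions: it uses the third-party pyodbc package, "HiveDSN" is a hypothetical data source name that would have to be configured for the installed connector, and the function names are made up for this note. Because fetch_rows only relies on the standard Python DB-API (cursor, execute, fetchall), it works with any connection object that follows that interface.

```python
def open_hive_connection(dsn="HiveDSN"):
    """Open an ODBC connection to Hive.

    Assumptions: pyodbc is installed and a DSN with this (hypothetical)
    name has been set up for the distribution's Hive ODBC connector.
    """
    import pyodbc  # imported lazily so the rest of the sketch runs without it
    # Hive ODBC drivers typically do not support transactions,
    # hence autocommit is switched on.
    return pyodbc.connect("DSN={}".format(dsn), autocommit=True)


def fetch_rows(connection, table_name, limit=10):
    """Read up to `limit` rows from a table over a DB-API connection."""
    if not table_name.isidentifier():
        # Guard against pasting arbitrary SQL into the query string.
        raise ValueError("unexpected table name: {!r}".format(table_name))
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT * FROM {} LIMIT {}".format(table_name, int(limit)))
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Example use, once the DSN exists:
#   conn = open_hive_connection()
#   print(fetch_rows(conn, "HiveTom"))
```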