Monthly Archives: November 2016

With Python in Hive

In this small note, I describe how an HDFS file can be stored in a Hive context. Once stored in a Hive context, it can be accessed from outside via ODBC. It is also possible to access the data as a SQL compliant database. The idea is that an abstraction is created on top of the HDFS datasets. One may then access the HDFS datasets much like an ordinary database.
We will use the Python language via Spark. This avoids the bottleneck that MapReduce creates.
One starts Python via Spark with the command “pyspark”. If everything goes correctly, we see:
Two variables are important: sc, which is an anchor point for methods that can be used within Spark, and HiveContext, which can be used as a starting point for Hive methods.

We first import the relevant libraries and create the context:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

Then the table is defined:
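A plausible definition, assuming the semicolon-separated layout with the fields nummer and naam that is used later in these notes, might look like this in HiveQL (the column names and types are assumptions):

```sql
-- Hypothetical table definition; columns assumed from the uit2 file layout
CREATE TABLE IF NOT EXISTS HiveTom (
  nummer INT,
  naam   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';';
```

In pyspark such a statement would be passed through sqlContext.sql(...), just like the LOAD DATA statement.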


In the last step, an existing HDFS file is connected to that table definition:

sqlContext.sql("LOAD DATA INPATH 'hdfs:/Chapter5/uit2' INTO TABLE HiveTom")

We may now approach this dataset as a table. The table name is HiveTom. A possibility is to access the table via ODBC. We can download an ODBC connector; each distribution (Cloudera, MapR, Hortonworks) has one. Once installed, we may retrieve the data in an ODBC compliant tool. As an example, we may do this in Excel:

Oracle ODI

The successor to OWB is the Oracle Data Integrator. This tool has more functionality than OWB. Next to that, it has an interface that more or less steers the user through a series of steps.
The idea is that one starts with a technical view where the file locations, databases and schemas are declared. Once that is done, one creates a set of logical names that provide the user with an understandable name that can be used to create a so-called mapping. Once the mapping is created, it can be run to see if it works.
Let me first introduce the technical aspects. This can be done in a widget that is called “topology”, where the technical architecture can be defined. One may see a set of technologies that allow one to specify a technical object. As an example, one may look at the technical definition of a file. The object file_tom specifies how a file is accessed; there, the driver is provided. In the object beneath, one may indicate the directory where the files are stored.



Similarly, the technology “Oracle” has an object “Oratom” where the JDBC driver is provided. Beneath it, another object is created where the schema is provided.


Let us switch to the logical design. Here we have the possibility to define logical handles that link to the technical concepts. We have “File_Tom”, which is logically linked to the File_Tom that we defined in the physical architecture. The same holds for “ODI_Tom”, which is the logical handle for the physically defined schema.


A crucial link between the logical handles and the technical objects is through the idea of a “context”. This provides a mapping in which logical handles are linked to physical entities.



This is almost genius in its triviality. If we have more contexts, we may create development, test and production contexts. Promotion into production is then just a change of context.

We are now in a situation where technical concepts like a schema or a file location are translated into logical concepts. In a subsequent step, actual files and tables can be accessed. This is done in the designer tab, under the models. We can reverse engineer the file definitions and table definitions from the physical environment. We then get something like this:


Here, we have the logical components (“Names” as file definition and “Names” as table definition) that are added to the models. The file definition “Names” is added to “File_Tom”, which is derived from the logical component that we created. Likewise, the table definition for “Names” is added to FlatFile2Table, which is linked to the logical definition for the schema that we created.

The logical definitions for the file and the table can be used in a mapping. We now have a logical definition for a source and a target. In this example the mapping between source and target is almost trivial:


The mapping can be run in several ways; I believe there are about four. When a mapping is run, one has to fill out the LKM (Loading Knowledge Module).



This can then be run and the content of the file is loaded into the table.








Dataflow in Oracle Warehouse Builder

I know that Oracle Warehouse Builder (OWB) is at end of life. On the other hand, I encounter OWB quite often and it is interesting to see how it works.
To investigate how it works, I created a dataflow. It is a trivial one: it consists of a file that must be read into Oracle.
The first step is to create a file and assign it to OWB. The file is quite simple. I created this flow:


The metadata on this file is added to OWB. One must have a so-called location that indicates where the file can be found. From that location, the metadata on the file can be imported. During that process, the screen looks like:
A file is imported via the construct of an external table. This is an object in the database that allows one to read a flat file from within the Oracle database.
The external table is created from within Oracle Warehouse Builder. It must be stored within the metadata on the database. After that, code can be generated and deployed on the database.
Once it is deployed on the database, one may verify the existence of that table with, say, SQL Developer. Note that the code and its deployment are steered from OWB.
Another important element is the target table. Let us assume we have a target table in the database. The metadata on that table can be imported. If that is done, one may see in OWB:
The sources and the targets are now known. These are the key elements of a mapping. The mapping in this case is almost trivial.
Once the mapping is created, it can be generated as a PL/SQL script. This can be deployed on Oracle. The package can be run from any client tool (SQL Developer, SQL*Plus, OWB). This package is like any other package that you created yourself.

Docker container

Only this weekend I downloaded a Docker package. This package allows you to run very small, lightweight containers on your server that act as components to perform a certain task. In a way, it looks like a virtual machine: it has no direct connection to the host machine and it runs in isolation.
Once it is installed, one only notices a small icon in the task bar that shows its presence.
To turn it into something useful, one must create a directory with a so-called “Dockerfile” in it. In that Dockerfile one may store a series of commands that indicate what the container is supposed to do. An example of such a file can be found online. On my machine, the directory looks like this:
The container is then created with:

docker build -t mysql .

The container can then be started from the command line with:

docker run --name mysql -e MYSQL_ROOT_PASSWORD=bonvegni -d mysql:8

To get access to the container, one uses this command:

docker exec -it mysql bash

This is as far as I got. I have a MySQL database running. I can access this database from within the container. That looks normal. But the challenge is how to communicate with this database. Something to figure out.

Reading an HDFS file in Python

In this note, I show you how to get data from an HDFS platform into a Python programme. The idea is that we have data on HDFS and we would like to use these data in a Python programme. So, we must connect to HDFS from within a Python programme, read the data, transform them and store them.

One may achieve this with Spark. This can be started with the command “pyspark” in Linux. Hopefully, you see something like:


All in all, getting a file into Python with Spark is a bit complicated. First, we need to import the right library that allows us to set up a connection to the HDFS platform. We can then access the HDFS platform. Implicitly, a context is created. From that object, the HDFS file can be retrieved. The lines in the file must then be split into fields. Once that is done, the content can be shown with the method “collect”. The code is:

from pywebhdfs.webhdfs import PyWebHdfsClient
# set up the connection to the webhdfs service (host name left empty here)
hdfs = PyWebHdfsClient(user_name="hdfs", port=50070, host="")
# read the file as an RDD of lines
data = sc.textFile("/Chapter5/uit2")
# split each line into fields on the semicolon
parts = r: r.split(';'))
# keep only the rows for which the split produced a result
datalines = parts.filter(lambda x: x)
for i in datalines.collect():
    print(i)

# convert the RDD into a DataFrame with two named columns
tabel = datalines.toDF(['nummer', 'naam'])"nummer", "naam").show()
tabel.filter(tabel['nummer'] == '1').show()

The first line (“from pywebhdfs” etc.) links the programme to the webhdfs service. Within HDFS, one may verify whether this works by looking at the properties of HDFS. One should see something like:

This is taken from Hortonworks, which I currently use. The webhdfs service allows you to communicate with the HDFS platform. As an example, if you want to know the status of HDFS, one may use any browser and type “http://server:50070/webhdfs/v1/user/hdfs?op=LISTSTATUS”, where server is the HDFS server.
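The shape of such a webhdfs URL can be sketched in plain Python; the server name “myserver” is a placeholder, and 50070 is the default webhdfs port in this setup:

```python
# Build a webhdfs REST URL; "myserver" is a placeholder for the real host.
def webhdfs_url(server, path, op, port=50070):
    return "http://%s:%d/webhdfs/v1%s?op=%s" % (server, port, path, op)

url = webhdfs_url("myserver", "/user/hdfs", "LISTSTATUS")
print(url)  # http://myserver:50070/webhdfs/v1/user/hdfs?op=LISTSTATUS
```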

So, the from statement allows the Python programme to communicate with hdfs.

The second statement (PyWebHdfsClient) is a wrapper around the webhdfs service. This wrapper extends the service and allows one to provide the server and the user.

The third line (“sc.textFile”) reads the file from the service stream that we just opened. We must indicate which file to read.

The fourth line (“map(lambda r:”) is a very concise way of creating a function. Whatever comes after the colon is the transformation; before the colon, the input parameter is given. The idea is that a row (“1;tom”) is split on the semicolon in the row.

The for loop allows us to display the different rows in the result we have so far.

The subsequent line (“filter(lambda”) is an elegant way of filtering the rows for which the preceding split was successful.

The next line (“toDF”) is used to change from an RDD to a DF. RDD stands for Resilient Distributed Dataset, which is more of a black box of data that cannot be optimised, as the operations that can be performed against it are not as constrained. DF stands for Data Frame, which is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.

The line with “select” picks some columns and shows them to the user, whereas the last line filters on a certain value.
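The split, filter and column-naming steps can be mimicked with plain Python lists, without Spark; this is only a sketch, and the sample rows are made up:

```python
# Sample rows as they might come out of the HDFS file.
rows = ["1;tom", "2;jan", "3;piet"]

# map(lambda r: r.split(';')) -- split each row into fields.
parts = [r.split(';') for r in rows]

# filter(lambda x: x) -- keep only non-empty results.
datalines = [x for x in parts if x]

# toDF(['nummer', 'naam']) -- attach column names to each row.
tabel = [dict(zip(['nummer', 'naam'], row)) for row in datalines]

# The equivalent of filtering on nummer == '1'.
selected = [row for row in tabel if row['nummer'] == '1']
print(selected)
```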


Putting a file on HDFS

Putting a file on HDFS is relatively easy. There are a few steps to take. Let us assume the file is on a Linux system. The first step is to copy the file to an area where it can be stored with the hdfs user as its owner. On my system, /tmp has 777 privileges: files with any owner can be stored here. So I copy my file to that area:

cp /root/uit /tmp

Now I have the file in an area where the user that I use on my HDFS platform can be set as owner of the file.
The user I have on my HDFS platform is “hdfs”. I make “hdfs” the owner of the file with:

chown hdfs:hdfs /tmp/uit

Now the file uit has hdfs as owner. I can now put it on HDFS with:

sudo -u hdfs hadoop fs -put /tmp/uit /Chapter5/uit

Estimating with Python

It is relatively easy to do an estimation with a Python script. This is due to the fact that Python works with matrices, and such matrices can be used as input to an estimation procedure. I created an example where a dataset is retrieved from Oracle. Then the dataset is translated into a matrix. Subsequently, some columns from that matrix are taken as dependent and independent variables, which are then used as input to an OLS estimation procedure.
The code:

import cx_Oracle
import numpy as np
import statsmodels.api as sm
# connect to Oracle (password omitted here)
con = cx_Oracle.connect('scott/')
cur = con.cursor()
cur.execute('select * from departments where manager_id is not null')
res = cur.fetchall()
print(res)
# translate the result set into a numpy matrix
A = np.array(res)
# first column as dependent variable, columns 2 and 3 as independent variables
y = A[:, 0].astype(np.float64)
x = A[:, [2, 3]].astype(np.float64)
# ordinary least squares estimation
lmRegModel = sm.OLS(y, x)
result =
print(result.summary())

I use the numpy library to translate the data as retrieved from Oracle into a matrix. I am then able to retrieve one column as the dependent variable and two columns as independent variables. I explicitly cast them into floats. I am then able to start the estimation process.

Read and write by Python

Python seems to be a very convenient way to transfer data to and from Oracle. It has capabilities to set up a connection, and it seems quite capable of transferring a matrix into a table and vice versa. The next code shows this. It first retrieves the content of a table. In a second step, some data from a matrix are stored into a table.

import cx_Oracle
con = cx_Oracle.connect('scott/')
cur = con.cursor()
cur.execute('select * from departments order by department_id')
res = cur.fetchall()
print(res)
rows = [ (1, "First" ),
         (2, "Second" ),
         (3, "Third" ),
         (4, "Fourth" ),
         (5, "Fifth" ),
         (6, "Sixth" ),
         (7, "Seventh" ) ]
cur = con.cursor()
# reserve input sizes: an integer and a string of up to 20 characters
cur.setinputsizes(int, 20)
cur.executemany("insert into mytable (id, data) values (:1, :2)", rows)
# make the inserts permanent
con.commit()
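The same fetch/insert pattern can be tried without an Oracle instance, using Python's built-in sqlite3 module; this is only a sketch, with made-up table contents, and sqlite3 uses ? placeholders where cx_Oracle uses :1, :2:

```python
import sqlite3

# An in-memory database stands in for the Oracle connection.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table mytable (id integer, data text)")

rows = [(1, "First"), (2, "Second"), (3, "Third")]
# executemany works as in cx_Oracle, one parameter tuple per row.
cur.executemany("insert into mytable (id, data) values (?, ?)", rows)
con.commit()

cur.execute("select * from mytable order by id")
res = cur.fetchall()
print(res)  # [(1, 'First'), (2, 'Second'), (3, 'Third')]
```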