Sqoop

Sqoop is a tool that allows you to ship data from an RDBMS to a Hadoop platform. Let us take an example to clarify this. One may have some data in a MySQL table persons, within database tom. This database is stored on server 62.131.51.999. The data can be accessed with knowledge of the server (62.131.51.999), the database (tom), the table name (persons), the user ID (thom) and a password. One may then transfer these data with this sqoop command:

sqoop import --connect jdbc:mysql://62.131.51.999/tom --table persons -m 1 --username thom --password thom24257

On the Hadoop platform, one may then look at the data:

[pivhdsne:sqoop]$ hadoop dfs -cat /user/gpadmin/persons/part-m-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Peter,Griffin,35
Glenn,Quagmire,33
tom,vanmaanen,99
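
As the warning in the output indicates, the hadoop dfs script is deprecated; the same file can be read with the hdfs command instead:

hdfs dfs -cat /user/gpadmin/persons/part-m-00000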

After this, I tried to retrieve data from Oracle into a file on HDFS. This proved quite similar, the only differences being that another jar was needed (understandable) and that the columns had to be mentioned explicitly (less obvious). I got it working via:

sqoop import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.26)(port=1521))(connect_data=(service_name=xe)))" --table scott.employees --columns "FIRST_NAME, LAST_NAME"  --num-mappers 1 --username scott --password binvegni

By that time, I had collected two jars in the lib directory where Sqoop stores its jars, enabling connections to both MySQL and Oracle:

-rw-r--r-- 1 gpadmin gpadmin  459094 Oct 18  2006 mysql-connector-java-3.1.14-bin.jar
-rw-r--r-- 1 gpadmin gpadmin 2739670 Sep 26 16:30 ojdbc6.jar

In another environment I undertook a similar exercise, transferring an Oracle table. The first step was to store an ojdbc jar in the correct location. I put this jar in the same directory as the mysql*.jar:

lrwxrwxrwx 1 root root 40 Aug 27  2015 /usr/lib/sqoop/lib/mysql-connector-java.jar 
-rwxrwxrwx 1 root root 3698857 Dec  4 11:25 /usr/lib/sqoop/lib/ojdbc7.jar

I then issued this command to transfer the data:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--m 1 \
--target-dir /loudacre/Oraaccounts \
--null-non-string '\\N'

I understood that the addition "--m 1" was necessary because the source table did not have a primary key. In that case Sqoop cannot split the work over several mappers on its own, so one mapper must be specified.
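
Had the table contained a suitable key-like column, the work could have been split over several mappers by naming that column explicitly. A sketch, assuming ACCT_NUM is numeric and reasonably evenly distributed (the target directory is made up for the illustration and must not exist yet):

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--table accounts \
--split-by ACCT_NUM \
--num-mappers 4 \
--target-dir /loudacre/Oraaccounts_split \
--null-non-string '\\N'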
I then added one record to the Oracle table. This one record could be copied to HDFS with:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--incremental append \
--null-non-string '\\N' \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--m 1 \
--target-dir /loudacre/Oraaccounts \
--check-column acct_num \
--last-value 129764

I could verify this by looking at the HDFS environment, where a new file was created with the content of that one added record.
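
For instance, a listing of the target directory shows the extra part file, which can then be inspected (the name of the new part file will vary):

hdfs dfs -ls /loudacre/Oraaccounts
hdfs dfs -cat /loudacre/Oraaccounts/part-m-00001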
Another nice facility is that it is possible to issue a query directly against the Oracle table, like:

sqoop eval \
--query "SELECT * FROM accounts where ACCT_NUM < 10" \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni 

which prints the matching rows to the console.
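The eval tool is not limited to SELECT statements; it can fire any SQL at the database. As an illustrative sketch (assuming the remaining columns of accounts are nullable), a test record could be inserted with:

sqoop eval \
--query "INSERT INTO accounts (ACCT_NUM, FIRST_NAME, LAST_NAME) VALUES (999999, 'Test', 'Record')" \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni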
I also became aware that we may apply a small twist to cope with awkward characters (such as a comma inside a string): choose a different field delimiter. We then apply a command like:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--table webpage \
--columns "WEB_PAGE_NUM,WEB_PAGE_FILE_NAME,ASSOCIATED_FILES" \
--m 1 \
--null-non-string '\\N' \
--target-dir /loudacre/Orawebpage \
--fields-terminated-by "\t"
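
An alternative to switching the delimiter is to enclose each field in quotes, which Sqoop also supports. A sketch along the same lines (the target directory is made up for the illustration):

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--table webpage \
--columns "WEB_PAGE_NUM,WEB_PAGE_FILE_NAME,ASSOCIATED_FILES" \
--m 1 \
--null-non-string '\\N' \
--target-dir /loudacre/Orawebpage_quoted \
--enclosed-by '"'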

It is also possible to include a filter, thus limiting the number of records read into HDFS:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--where "CITY='GOUDA'" \
--m 1 \
--target-dir /loudacre/Oraaccounts_Current \
--null-non-string '\\N'
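
The same filter can also be written as a free-form query. Sqoop then insists on the literal token $CONDITIONS in the WHERE clause (it substitutes its own split predicate there) and on an explicit --target-dir; note the backslash that keeps the shell from expanding it inside double quotes. The directory name is again made up:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--query "SELECT ACCT_NUM, FIRST_NAME, LAST_NAME FROM accounts WHERE CITY='GOUDA' AND \$CONDITIONS" \
--m 1 \
--target-dir /loudacre/Oraaccounts_Gouda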

Introducing SQL Server meant another jar had to be added. It is the sqljdbc4.jar that can be downloaded from Microsoft. There are two such files: the sqljdbc.jar class library, which provides support for JDBC 3.0, and the sqljdbc4.jar class library, which provides support for JDBC 4.0. After downloading this jar file and storing it in the Sqoop lib directory, I could fire off the next line:

sqoop import --connect "jdbc:sqlserver://192.168.2.26:1433;database=master;username=sa;password=binvegni" --table tomtest -m 1

This allowed me to download data from table tomtest, which is stored in the master database of a SQL Server that resides on 192.168.2.26, listening on port 1433.
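
A passing remark on all these examples: a password on the command line (or in the connect string) ends up in the shell history and the process list. Sqoop can prompt for it instead with -P, for example:

sqoop import --connect "jdbc:sqlserver://192.168.2.26:1433;database=master" --username sa -P --table tomtest -m 1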
And this is how you get data from a PostgreSQL server (using the postgresql-9.4-1203-jdbc4.jar):

sqoop import --connect jdbc:postgresql://192.168.2.26:5432/postgres --table testtom --username postgres --password binvegni  --direct
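
The --direct flag tells Sqoop to bypass JDBC for the bulk transfer and delegate it to the database's native path (for PostgreSQL, the COPY facility), which is generally faster. Since no --target-dir is given, the data lands under the user's home directory on HDFS; assuming the defaults and the gpadmin user from earlier, something like:

hdfs dfs -cat /user/gpadmin/testtom/part-m-00000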

Appending new data on top of what is already on the Hadoop platform needs:

sqoop import --connect jdbc:postgresql://192.168.2.26:5432/postgres --table testtom --username postgres --password binvegni  --incremental append --check-column nummer
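
Without a --last-value, Sqoop has no previous state to resume from; to pick up only rows beyond a known point, the highest value already imported can be supplied, just as in the Oracle example above (the value 100 below is merely an illustration):

sqoop import --connect jdbc:postgresql://192.168.2.26:5432/postgres --table testtom --username postgres --password binvegni  --incremental append --check-column nummer --last-value 100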
