
Oops, how much tablespace is left?

A few days ago, I was asked to load some tables in Oracle. A rather trivial task, but I wasn't sure whether enough tablespace was left. From the table definition, I found out which tablespace was used (a sketch of that lookup follows the query). After that I ran the query below to see how much tablespace was actually left. I want to retain this code as it might be handy in the future.

column "Tablespace" format a13
column "Used MB"    format 99,999,999
column "Free MB"    format 99,999,999
column "Total MB"   format 99,999,999
select
   fs.tablespace_name                          "Tablespace",
   (df.totalspace - fs.freespace)              "Used MB",
   fs.freespace                                "Free MB",
   df.totalspace                               "Total MB",
   round(100 * (fs.freespace / df.totalspace)) "Pct. Free"
from
   (select
      tablespace_name,
      round(sum(bytes) / 1048576) TotalSpace
   from
      dba_data_files
   group by
      tablespace_name
   ) df,
   (select
      tablespace_name,
      round(sum(bytes) / 1048576) FreeSpace
   from
      dba_free_space
   group by
      tablespace_name
   ) fs
where
   df.tablespace_name = fs.tablespace_name;
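
As an aside, the lookup mentioned above (which tablespace a table uses) can be read from the data dictionary. A minimal sketch, with myuser/mypassword and EMPLOYEES as placeholders for the real credentials and table name:

sqlplus -s myuser/mypassword <<'EOF'
select tablespace_name
from   user_tables
where  table_name = 'EMPLOYEES';
EOF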

Sqoop

Sqoop is a tool that allows you to ship data from an RDBMS to a Hadoop platform. Let us take an example to clarify this. One may have some data in a MySQL table persons, within the database tom. This database is stored on server 62.131.51.999. The data can be accessed with knowledge of the server (62.131.51.999), the database (tom), the table name (persons), the userid (thom) and a password. One may then transfer this data with this sqoop command:

sqoop import --connect jdbc:mysql://62.131.51.999/tom --table persons -m 1 --username thom --password thom24257

On the Hadoop platform, one can then inspect the data:

[pivhdsne:sqoop]$ hadoop dfs -cat /user/gpadmin/persons/part-m-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Peter,Griffin,35
Glenn,Quagmire,33
tom,vanmaanen,99
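
As the deprecation warning suggests, the same can be done with the hdfs command itself; for example, listing and reading the part files that the import produced could look like this (a sketch using the same path as above):

hdfs dfs -ls /user/gpadmin/persons
hdfs dfs -cat /user/gpadmin/persons/part-m-00000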

After this, I tried to retrieve data from Oracle into a file on HDFS. This proved quite similar. The only differences were that another jar was needed (understandable) and that the columns had to be listed explicitly (less obvious).
However, I got it working via:

sqoop import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.26)(port=1521))(connect_data=(service_name=xe)))" --table scott.employees --columns "FIRST_NAME, LAST_NAME"  --num-mappers 1 --username scott --password binvegni

By that time, I had collected two jars in the lib directory where Sqoop stores its jars, one for connecting to MySQL and one for Oracle:

-rw-r--r-- 1 gpadmin gpadmin  459094 Oct 18  2006 mysql-connector-java-3.1.14-bin.jar
-rw-r--r-- 1 gpadmin gpadmin 2739670 Sep 26 16:30 ojdbc6.jar

In another environment I undertook a similar exercise, transferring an Oracle table. The first step was to put an ojdbc jar in the correct location. I stored this jar in the same directory as the mysql*.jar:

lrwxrwxrwx 1 root root 40 Aug 27  2015 /usr/lib/sqoop/lib/mysql-connector-java.jar 
-rwxrwxrwx 1 root root 3698857 Dec  4 11:25 /usr/lib/sqoop/lib/ojdbc7.jar

I then issued this command to transfer the data:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--m 1 \
--target-dir /loudacre/Oraaccounts \
--null-non-string '\\N'

I understood that the addition "--m 1" was necessary because the source table did not have a primary key. Without a primary key Sqoop cannot split the import across mappers, so a single mapper has to be used (see the sketch below for the alternative).
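
As far as I understand it, the alternative is to tell Sqoop explicitly which column to split on, so that more than one mapper can be used even without a primary key. A sketch of that idea (the split column, mapper count and target directory are just illustrations):

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--split-by ACCT_NUM \
--num-mappers 4 \
--target-dir /loudacre/Oraaccounts_split \
--null-non-string '\\N'
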
I then added one record to the Oracle table. This one record could be copied to HDFS with:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--incremental append \
--null-non-string '\\N' \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--m 1 \
--target-dir /loudacre/Oraaccounts \
--check-column acct_num \
--last-value 129764

I could verify this by looking at the HDFS environment, where a new file had been created containing the one record that was added.
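
As a side note (a sketch, not something from the original setup): instead of passing --last-value by hand each time, Sqoop can remember it in a saved job; the job name incr_accounts is made up:

sqoop job --create incr_accounts -- import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott \
--table accounts \
--incremental append \
--check-column acct_num \
--last-value 129764 \
--num-mappers 1 \
--target-dir /loudacre/Oraaccounts

sqoop job --exec incr_accounts

The saved job keeps track of the last imported value itself; Sqoop will normally prompt for the password when the job is executed.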
Another nice facility is that it is possible to issue a query directly against the Oracle table, like:

sqoop eval \
--query "SELECT * FROM accounts where ACCT_NUM < 10" \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni 

This prints the result of the query directly to the console.
I also became aware that we may apply a small twist to allow for awkward characters (such as a comma inside a string), for example by choosing a different field delimiter. We then apply a command like:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--table webpage \
--columns "WEB_PAGE_NUM,WEB_PAGE_FILE_NAME,ASSOCIATED_FILES" \
--m 1 \
--null-non-string '\\N' \
--target-dir /loudacre/Orawebpage \
--fields-terminated-by "\t"
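
A related option (a sketch, not one of the commands I actually ran): instead of switching to tabs, Sqoop can wrap field values in quotes, so a comma inside a string is not mistaken for a field separator:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--table webpage \
--columns "WEB_PAGE_NUM,WEB_PAGE_FILE_NAME,ASSOCIATED_FILES" \
--m 1 \
--null-non-string '\\N' \
--target-dir /loudacre/Orawebpage_quoted \
--fields-terminated-by "," \
--optionally-enclosed-by '\"'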

It is also possible to include a filter, which limits the number of records that are read into HDFS:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--table accounts \
--columns "ACCT_NUM, ACCT_CREATE_DT, ACCT_CLOSE_DT, FIRST_NAME, LAST_NAME, ADDRESS, CITY, STATE, ZIPCODE, PHONE_NUMBER, CREATED, MODIFIED" \
--where "CITY='GOUDA'" \
--m 1 \
--target-dir /loudacre/Oraaccounts_Current \
--null-non-string '\\N'
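
Another way to limit what is read (a sketch, not taken from the commands above) is a free-form query instead of --table; Sqoop then insists on the literal token $CONDITIONS in the where clause and on an explicit target directory:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password bunvegni \
--query "SELECT ACCT_NUM, FIRST_NAME, LAST_NAME, CITY FROM accounts WHERE CITY = 'GOUDA' AND \$CONDITIONS" \
--num-mappers 1 \
--target-dir /loudacre/Oraaccounts_Gouda \
--null-non-string '\\N'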

Introducing SQL Server meant another jar had to be added: the sqljdbc4.jar that can be downloaded from Microsoft. There are two such files, the sqljdbc.jar class library that provides support for JDBC 3.0 and the sqljdbc4.jar class library that provides support for JDBC 4.0. After downloading this jar file and storing it in the Sqoop lib directory, I could run the next line:

sqoop import --connect "jdbc:sqlserver://192.168.2.26:1433;database=master;username=sa;password=binvegni" --table tomtest -m 1

This allowed me to download data from the table tomtest that is stored in the master database of a SQL Server instance residing on 192.168.2.26, listening on port 1433.
And this is how you get data from a PostgreSQL server (using the postgresql-9.4-1203-jdbc4.jar):

sqoop import --connect jdbc:postgresql://192.168.2.26:5432/postgres --table testtom --username postgres --password binvegni  --direct

and adding data to the Hadoop platform incrementally, on the basis of the data that is already there, needs:

sqoop import --connect jdbc:postgresql://192.168.2.26:5432/postgres --table testtom --username postgres --password binvegni  --incremental append --check-column nummer
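
As with the Oracle example earlier, I would combine --incremental append with an explicit --last-value, so only rows with a higher value in the check column are appended (the value 100 is just an illustration):

sqoop import --connect jdbc:postgresql://192.168.2.26:5432/postgres --table testtom --username postgres --password binvegni --incremental append --check-column nummer --last-value 100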

Pushing files via Netcat

Netcat is a Unix utility for investigating network connections. It has been ported to Windows, which allows us to query network connections on a Windows platform with netcat (nc) as well. A nice possibility is to push files via nc from one machine to another. Assume for the moment that both machines have netcat installed; if not, it can be downloaded online.
On the receiving machine a listener process is started with:

nc -l -p 1234 > romme.txt

This process listens on local port 1234 and stores everything it receives in the file "romme.txt".
On the sending machine, we reach out to the receiving machine and start a data stream with the contents of the file that we want to send across. This can be done via:

type invoer.txt | nc 192.168.2.22 1234

The file is turned into a stream (via type), which is passed through the pipe to the nc command. nc reaches out to machine 192.168.2.22, where the listener process has been started; everything is handled via port 1234. On the receiving machine, the data arrive on port 1234, are captured and are stored in the file romme.txt.
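
On a Linux sender, cat plays the role of type; comparing checksums on both sides is a simple way to verify that the file arrived intact (a sketch):

cat invoer.txt | nc 192.168.2.22 1234
md5sum invoer.txt     # compare with md5sum romme.txt on the receiving machine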

Flume

Flume allows messages to be transferred directly into a file. It even allows such files to be stored on Hadoop. This opens a way to capture messages in a file that is stored on Hadoop, ready to be analysed. The example here is a series of events collected from a log; the resulting file is then transferred to another platform (say Hadoop) to be processed further.
I got Flume working on a Cloudera sandbox. It looks as if three related parameters must be provided: one refers to a flume.conf file (found at /etc/flume-ng/conf.empty/flume.conf); one refers to the name of the agent, which can be found in the flume.conf file and happens to be sandbox in my case; and a third refers to a conf directory, whose settings are applied via flume-env.sh.

 flume-ng agent --conf-file /etc/flume-ng/conf.empty/flume.conf --name sandbox  --conf /opt/examples/flume/conf

Another example is the next statement, more or less similar to the command given above:

sudo flume-ng agent -c /etc/gphd/flume/conf -f /etc/gphd/flume/conf/flume.conf -n agent

This one uses a conf file stored at /etc/gphd/flume/conf/flume.conf. The name of the agent is "agent"; the conf directory is /etc/gphd/flume/conf. From the conf file, we know that a netcat source is set up that listens on port 44444. We use this to start a terminal session that sends a stream on the local Linux platform:

[pivhdsne:~]$ nc localhost 44444
testing
OK
1
OK
3
OK
4
OK

From the conf file, we know that these streams are stored on HDFS in the directory /user/flume. If we look there, we see this file:

[pivhdsne:~]$ hadoop dfs -cat /user/flume/FlumeData.1442949290540
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

testing
1
3
4
[pivhdsne:~]$ 

We see that a file has been created on HDFS that stores the streamed data from the Linux platform. This is a way to transfer data from Linux to HDFS. A final example is given below, with the same netcat listener:

cat test | nc localhost 44444

Via cat, the file test is turned into a stream. The stream is sent to the netcat process, which submits it to port 44444. There it is caught by Flume and stored in files on HDFS.
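
For reference, a minimal sketch of what a flume.conf with such a netcat source and an HDFS sink could look like; the names (agent, netcat-source, mem-channel, hdfs-sink) and the scratch location /tmp/flume.conf are placeholders, and the real file on the sandbox will differ:

cat <<'EOF' > /tmp/flume.conf
# one agent called "agent" with a netcat source, a memory channel and an HDFS sink
agent.sources  = netcat-source
agent.channels = mem-channel
agent.sinks    = hdfs-sink

# netcat source: listen on localhost:44444
agent.sources.netcat-source.type     = netcat
agent.sources.netcat-source.bind     = localhost
agent.sources.netcat-source.port     = 44444
agent.sources.netcat-source.channels = mem-channel

# buffer events in memory
agent.channels.mem-channel.type = memory

# write the events as plain text to HDFS under /user/flume
agent.sinks.hdfs-sink.type          = hdfs
agent.sinks.hdfs-sink.hdfs.path     = /user/flume
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.channel       = mem-channel
EOF

Started with something like flume-ng agent --conf-file /tmp/flume.conf --name agent --conf /etc/gphd/flume/conf, this should behave much like the setup described above.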