
AWK to investigate files on Unix

Today, I worked with the Unix awk utility. This is an extremely powerful tool for investigating text files on a Unix platform. It can be invoked from the terminal command line, and the command must start with awk.

The keyword awk is followed by a script that is placed between quotes. After the quotes, the text file is given (say ww-ii-data.txt).
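For instance, the simplest possible script just prints every line of the file unchanged:

awk '{ print }' ww-ii-data.txt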

When some items need to be initialised, we have the BEGIN clause; its actions are placed between braces {}.
After that, a selection of lines can be made with a pattern between slashes, and the actions on those lines are again placed between braces. Finally, after the keyword END, an end clause may be included. We then have:

awk 'BEGIN {} /selection/ {} END {}' file

As an example:

awk '
BEGIN { count = 0; max = 0 }              # initialise before any line is read
{
    temp = substr($0, 37, 3) + 0;         # columns 37-39 as a number (Fahrenheit)
    count++;
    if (max < temp)
        max = temp
}
END { print "lines: ", count, " max in Celsius ", (5/9)*(max-32); }
' ww-ii-data.txt

I noticed that variables can simply be used; no declaration is needed. Nice.
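As a small illustration of both points (the file name and the pattern ERROR are made up here), a pattern between slashes restricts the action to matching lines, and the counter springs into existence on first use:

awk '
/ERROR/ { hits++ }                    # hits needs no declaration; unset counts as 0
END { print "error lines: ", hits+0 }
' logfile.txt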

An alternative program works on a file whose columns are separated by commas. In that case, the separator must be set in the BEGIN clause with FS="separator code". Once that is done, the columns are labelled $1, $2, etc., so a column can be accessed directly through such a variable.

awk '
BEGIN { count = 0; max = 0; FS = "," }    # FS="," splits each line on commas
{
    temp = $3 + 0;                        # third column as a number
    count++;
    if (max < temp)
        max = temp
}
END { print "lines: ", count, " max ", max; }
' /home/hadoop/a.csv
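By the way, the separator can also be passed on the command line with the standard -F option, which saves the assignment in the BEGIN clause:

awk -F',' '{ print $3 }' /home/hadoop/a.csv     # print the third column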

Finally, a statement to remove stray end-of-line (carriage-return) characters from a file on Unix. One way to do this with awk (the file names are just placeholders):

awk '{ sub(/\r$/, ""); print }' input.txt > output.txt
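For completeness, the standard tr utility offers an even shorter equivalent for the same clean-up:

tr -d '\r' < input.txt > output.txt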

Hadoop

Everyone talks about big data and Hadoop. Someone even compared it to teenage sex: everyone talks about it, everyone knows someone who does it, but no one actually does it yet. I just tried Hadoop to see what it is all about.

(Image: a Yahoo Hadoop cluster, OSCON 2007)
I made two attempts to install Hadoop.
The first attempt was installing Hadoop 1.0.3, for which I relied on a tutorial by Michael Noll ( http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ ). I noticed it was really important to use the correct version of the JDK: my first try with a recent JDK failed, whereas a subsequent attempt with the right version was successful. This could be verified via the web interface ( localhost:50070 ).
The second attempt was installing Hadoop 2.4, for which I used a blog post by Matthew Sharpe ( http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/ ) to get the necessary information. After the installation, I ran a small example to check whether it worked. It did.
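As an indication of the kind of smoke test involved: both distributions ship a wordcount example that can be run against a directory in HDFS. The jar names and paths below depend on the version and installation, so treat them as assumptions rather than exact commands:

# Hadoop 1.x (the examples jar sits in the distribution root):
hadoop jar hadoop-examples-1.0.3.jar wordcount input/ output/

# Hadoop 2.x (the examples jar sits under share/hadoop/mapreduce):
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /in /out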

Hence, I ended up with two working Hadoop installations.

So Hadoop is now alive and kicking; I verified this by executing a small example, which worked out fine. The problem is how to continue: working with Hadoop isn't trivial. My next step will be finding a practical means to work with it.