Monthly Archives: July 2015

How to find a file on Linux

There are zillions of small Linux commands that make life easy. These tiny commands may save you hours of time and tons of frustration. Whenever I look at the terminal of one of my colleagues, I discover yet another trick that they use to make life easy. May I present a little command from my side:

find / -name "post*.*" 2>/dev/null

This tiny command helps you to find a file. It starts with the command find. Its first parameter is the root from where to start. If you have no clue where to look, “/” might be a good idea as it is the root of the file system. Then we have the parameter -name "post*.*". This indicates that we search by name, followed by the name itself. Wildcard characters such as * may be used. In this case, we are looking for a file whose name starts with post. Finally, “2>/dev/null” indicates that error messages are discarded. Otherwise, every directory to which you have no access generates an error message, which precludes rapid analysis of the results.
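
A small variation I sometimes use (a sketch, assuming GNU find) restricts the search to regular files and ignores case:

find /home -type f -iname "post*.*" 2>/dev/null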

Add data on Big Data in Hive and Impala

This post provides info on how data may be added to a Big Data platform with the help of Hive and Impala. We start with a dataset that is stored on a Linux platform. We will show how these data can be stored on an HDFS/Hadoop platform. After that, we will show how these data can be included in a Hive database. Subsequently, it is shown how these data can be accessed from a client (like SQL Developer).
But the first step is to have data on a Linux platform. Such data can be inserted via FTP or via a text editor like nano or vi. Let us assume we create a file with nano, inserting several comma-separated records. I created a five-record file with two fields:

[cloudera@quickstart ~]$ nano invoer3.txt
[cloudera@quickstart ~]$ cat invoer3.txt
1,tom
2,ine
3,paula
4,stella
5,bart
[cloudera@quickstart ~]$

The data are now on a Linux platform. They have to be moved to the HDFS platform. This can be done with:

cat invoer3.txt|hdfs dfs -put - /user/cloudera/invoer/invoer3.txt

This command streams the data (using cat) and puts that stream into a file on the HDFS platform. To keep the data lineage easy to follow, I use the same name for the file as on Linux. In both cases, it is invoer3.txt. However, these are two physically different files. Once they are stored on the HDFS platform, I would like to set the privileges such that all users may access them. I use these commands to accomplish this:

sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/invoer
sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/invoer/invoer3.txt
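
To verify that the file really landed on HDFS with the intended permissions, a quick check could look like this (a sketch, using the same paths as above):

hdfs dfs -ls /user/cloudera/invoer
hdfs dfs -cat /user/cloudera/invoer/invoer3.txt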

In subsequent steps, the files will be read by Hive/Impala, which technically is another user. I therefore set the privileges so that other users can read the data and create additional files if necessary.
I will subsequently create a table via Impala, using the HDFS file as an external file:

[cloudera@quickstart ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.2.0-cdh5 RELEASE (build 2ffd73a4255cefd521362ffe1cfb37463f67f75c)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v2.2.0-cdh5 (2ffd73a) built on Tue Apr 21 12:09:21 PDT 2015)
[quickstart.cloudera:21000] > create EXTERNAL TABLE invoer3
(
id INT,
naam STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/invoer3.txt';
...
Fetched 0 row(s) in 0.08s
[quickstart.cloudera:21000] > 
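
A quick sanity check from the same Impala shell could be a simple select (a sketch, assuming the LOCATION indeed points at the data that was put on HDFS):

[quickstart.cloudera:21000] > select * from invoer3;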

In the Impala statement above, the LOCATION points to the file that provides the data for the table. In Hive, a similar exercise is possible, see:

[cloudera@quickstart ~]$ hive

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.4.0.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> CREATE EXTERNAL TABLE invoer3
    > (
    >    id INT,
    >    naam STRING
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/cloudera/invoer';
OK
Time taken: 0.772 seconds
hive> 

However, in that case, a directory is used to indicate where the data come from.
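
To check which HDFS location a Hive external table actually points at, the table can be described (a sketch; the output lists, among other things, the location):

hive> describe formatted invoer3;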

A slightly different way is to use the load command, which loads data directly into the Hive default database. Let us assume we have a file on Linux called invoer4.txt, which reads:

[cloudera@quickstart ~]$ cat invoer4.txt
2	tom
3	ine
[cloudera@quickstart ~]$ 

This file can be loaded into a Hive database in two steps:
Step 1: start Hive and create a table.
Step 2: load the data into the Hive table.

hive> create table invoer4(k string, v string)row format delimited fields terminated by '\t'  stored as textfile;
OK
Time taken: 0.785 seconds
hive> load data local inpath '/home/cloudera/invoer4.txt'  into table invoer4;
Loading data to table default.invoer4
Table default.invoer4 stats: [numFiles=1, totalSize=12]
OK
Time taken: 0.603 seconds

In that case, the data are stored inside the Hive warehouse, not referenced as an external file as above.
This can be verified by the presence of a file on the HDFS platform like /user/hive/warehouse/invoer4/invoer4.txt. This is a copy that results from the load command issued in Hive.
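
A sketch of such a check, assuming the default warehouse location on this quickstart VM:

hdfs dfs -ls /user/hive/warehouse/invoer4
hdfs dfs -cat /user/hive/warehouse/invoer4/invoer4.txt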

As the data are now defined in Hive/Impala, it is possible to access them with a client that has a Hive connector. An example is Oracle SQL Developer, which offers such a connector. Once installed, one may access this table, see:
[screenshot: the invoer3 table queried from SQL Developer]
In this case, I used ordinary SQL to access the table and to fetch the records.
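
The query itself is nothing special; something along these lines (a sketch, matching the table created above) suffices:

select id, naam from invoer3 order by id;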

I realise I use Hive and Impala as synonyms, whereas in reality they are two different tools. Hive uses MapReduce logic to derive its results, while Impala circumvents this bottleneck with its own daemons. As a result, Impala returns results faster than Hive. Impala has been designed from the start as a better and faster tool than Hive. Hive (like Pig) is built on MapReduce; Impala is completely new and is not built on MapReduce. More information can be found in a wonderful article from Cloudera on Hive and Impala.

Dropping a table in Oracle

Dropping a table is straightforward in Oracle. One simply issues a drop table statement. Let us assume we have table HH. When “drop table HH” is fired, the table is removed. However, if table HH doesn’t exist and “drop table HH” is fired, an error is returned: ORA-00942. In itself, this is understandable, as an attempt is made to drop a non-existent table, and the error warns us that something weird happened. In other DBMSs, one might use a statement like “drop table if exists HH”. However, Oracle doesn’t provide us with this facility.
In Oracle, one needs to write a small PL/SQL procedure to circumvent the error. The code reads like:

create or replace PROCEDURE DROP_TABLE(NAAM IN STRING) AS
BEGIN
   EXECUTE IMMEDIATE 'DROP TABLE '||NAAM;
EXCEPTION
   WHEN OTHERS THEN
      IF SQLCODE != -942 THEN
         RAISE;
      END IF;
END DROP_TABLE;
/

This procedure can be called with “EXEC DROP_TABLE(‘HH’);”.
The procedure executes the statement DROP TABLE HH. However, if HH doesn’t exist, an error is raised. That error is handled as an exception, so the procedure itself doesn’t fail.
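
A small demonstration of the idea (a sketch):

CREATE TABLE HH (ID NUMBER);

-- first call: HH exists and is dropped
EXEC DROP_TABLE('HH');

-- second call: HH no longer exists; ORA-00942 is caught inside the procedure, so no error is raised
EXEC DROP_TABLE('HH');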

Transpose a record in Oracle

Transposing a record in Oracle isn’t easy. I had a small table with several records, each holding one value. I wanted to transpose that table into one record with the values adjacent to each other. The question: how to accomplish this? Recently, Oracle introduced the pivot facility that allows exactly this. The code is as follows:

select * from
    (select peri_omsch, 
        ROW_NUMBER ()  
        over (PARTITION BY een order by peri_omsch) as rn 
        from (select distinct '1' as een, peri_omsch from salesall order by peri_omsch))
       pivot
        (
          min(peri_omsch)
          for rn in (1 as a1,2 as a2,3 as a3,4 as a4,5 as a5,6 as a6,7 as a7,8 as a8,9 as a9,10 as a10,11 as a11,12 as a12)
        );

Let us start with this table:
[screenshot: the original table with peri_omsch values]
The SQL fetches the values from the original table (peri_omsch) and adds an additional field that contains the row number. The code that adds this field uses a ranking function, where the whole table is seen as one set with a fixed value (‘een’) as the identifier of that set. Upon this set a row number is added.
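
In isolation, that inner step (taken directly from the query above) is just:

select peri_omsch,
       row_number() over (partition by een order by peri_omsch) as rn
from (select distinct '1' as een, peri_omsch from salesall order by peri_omsch);
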
We get

[screenshot: the intermediate result with row numbers]

I use this row number to create the columns. Each value of the row number (1, 2, 3 etc.) is translated into a column: 1 becomes column a1, 2 becomes column a2, and so on. I store peri_omsch in that column. To enforce a single value per column, I used the min function, which ensures one unique value is stored.
The final result looks like:
[screenshot: the final pivoted result]

SQL Loader

The SQL Loader is a facility that allows you to load data files blazingly fast. It is able to do so because the data are written directly to disk without much overhead.
It needs two files: a control file and a data file. The process generates a log file that tells you whether the load succeeded or not. It also generates a bad file with the records that could not be loaded and a discard file with the records that were not selected for loading.
The SQL Loader is invoked from the terminal on the server. The command may look like:

SQLLDR CONTROL=C:\Users\Administrator\Dropbox\Uitwisseling\gg_DATA_TABLE.ctl, LOG=sample.log, BAD=baz.bad, DATA=C:\Users\Administrator\Dropbox\Uitwisseling\gg_DATA_TABLE.ldr USERID=scott/binvegni, ERRORS=999, LOAD=500000, DISCARD=toss.dsc, DISCARDMAX=5

In this command, one may distinguish the control file, which ends in .ctl. Likewise, the data file can be seen, ending in .ldr. Another feature is the limit on the number of records: setting this number to a low value means that only a small number of records is loaded.
Luckily for us, the common Oracle clients, like SQL Developer and TOAD, are able to generate these files automatically. Hence only some trivial actions must be undertaken, such as copying the files to the correct location (accessible from the server) and adjusting the SQL Loader command above in accordance with the directory and file names.
Playing with this facility showed me that varchar fields were easy to load, but with numerical fields and date fields one may easily run into issues, as the layout may differ between instances. For me, it took quite some time to adjust the scripts so that date fields and numeric fields were loaded without issues. I then discovered the value of the log file. Although the exact reason for a rejected record was sometimes difficult to detect, it gave ample information on the records that were rejected and the fields that triggered the rejection.

Just as an example, let me add the final code:

OPTIONS (SKIP=1)
LOAD DATA 
TRUNCATE
INTO TABLE "SCOTT"."ONLINEFORECAST"
FIELDS TERMINATED BY '|'
OPTIONALLY ENCLOSED BY '"' AND '"'
TRAILING NULLCOLS ( 
"NASA_NUM" ,
"DATUM" DATE "DD-MM-YY" ,
"SUM_CE_AANTAL" ,
"AANTAL_ORDERS" ,
"AANTAL_MET_ACTIE_IND" "to_number(:AANTAL_MET_ACTIE_IND,'99999.99')" )

Calculate distances in Oracle

Oracle allows you to calculate the distance between two points. Such a calculation is not trivial, as one must take into account that distances are calculated over the globe and that points are indicated on a longitude/latitude basis. If we calculate the distance between longitude, latitude = (0,0) and (1,1), we get about 156 kilometers. In that case, we are calculating a distance between two points in the Atlantic Ocean, somewhere off the coast of Ghana. However, if we take the same one-degree step in both longitude and latitude near the North Pole, we only get about 111 kilometers, because a degree of longitude shrinks towards the poles.
Oracle allows you to calculate such distances. The code to do so is given below as an example:

SELECT sdo_geom.sdo_distance(sdo_geometry(2001, 8307, sdo_point_type(4.7172715, 52.030588, NULL), NULL, NULL),
                             sdo_geometry(2001, 8307, sdo_point_type(-0.1223527, 51.49065, NULL), NULL, NULL),
                             0.0001, 'unit=KM') AS distance_in_km
FROM dual;

The points are given as:
sdo_geometry([gtype], [SRID], sdo_point_type([longitude], [latitude], NULL), NULL, NULL). Here 2001 stands for a two-dimensional point and 8307 is the SRID for longitude/latitude coordinates (WGS 84).
The function that calculates the distance is sdo_geom.sdo_distance(point A, point B, 0.0001, 'unit=KM').

It is also possible to store such points in a database. Oracle has an SDO_GEOMETRY type that allows points to be stored on a longitude/latitude basis. An example:

CREATE TABLE stores (
  plek_id NUMBER,
  description VARCHAR2(100),
  store_geo_location SDO_GEOMETRY);
INSERT INTO stores VALUES
  (1,'Tom Gouda',
   SDO_GEOMETRY(2001, 8307, 
     SDO_POINT_TYPE (4.7172715,52.030588,NULL),NULL,NULL));
INSERT INTO stores VALUES
  (2,'Tom Londen', 
   SDO_GEOMETRY(2001, 8307, 
     SDO_POINT_TYPE (-0.1223527,51.49065,NULL),NULL,NULL));
 

A table stores is defined with a column store_geo_location of type SDO_GEOMETRY. Values are added as SDO_GEOMETRY([gtype], [SRID], SDO_POINT_TYPE([longitude], [latitude], NULL), NULL, NULL).
The distance can then be calculated as:

SELECT a.description,b.description,SDO_GEOM.SDO_DISTANCE(a.store_geo_location, b.store_geo_location, 0.0001, 'unit=KM') as dist_in_KM
   FROM stores a, stores b
   where a.plek_id>b.plek_id;

This query showed a direct flying distance of 339 kilometers between Tom Gouda and Tom Londen.

A final word.
The question is where such latitude, longitude data can be found. As always, Google helps. If a certain point is located in Google Maps, the latitude and longitude can be read from the URL. One sees several such pairs in the URL, corresponding to different points on the map. However, if one zooms in, the pairs hardly deviate, as the map only covers a small area. The first pair can be read directly after the @ sign. For example, my home can be found under this URL: https://www.google.nl/maps/place/Waterlelie+60,+2804+PW+Gouda/@52.0306982,4.717726. After the @, the latitude and longitude are displayed.
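
Note that the pair in the URL is latitude first, whereas SDO_POINT_TYPE expects the longitude first, so the values have to be swapped. As a sketch (the id and description are made up), the point from the URL could be inserted into the stores table above as:

INSERT INTO stores VALUES
  (3, 'Home Gouda',
   SDO_GEOMETRY(2001, 8307,
     SDO_POINT_TYPE(4.717726, 52.0306982, NULL), NULL, NULL));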