Monthly Archives: June 2015

Transpose in Oracle

Transposing data means turning row values into columns. Starting from version 11g, Oracle supports this as well: values that appear in rows can be translated into columns, so that a new table is created with an additional set of columns whose names are derived from those row values. An example might clarify this:

create table ff
as
select * from (
   select 
   times_purchased as "Purchase Frequency", 
   state_code
   from customers t
)
pivot 
(
   count(state_code)
   for state_code in ('NY' as "New York", 'CT' as "Connecticut", 'NJ' as "New Jersey", 'FL' as "Florida", 'MO' as "Missouri")
)
order by 1

In this example a table (ff) is created from another table, customers. From that table two columns are retrieved: the field “times_purchased” and the field “state_code”. The latter field is translated into columns: the state_code is turned into five columns, New York, Connecticut and so on. Whenever the value ‘NY’ is encountered, it contributes to the column ‘New York’.
The aggregation is defined under the keyword “pivot”. For every value of the purchase frequency, count(state_code) counts how often each state code occurs: the first occurrence starts the count at 1 and every further occurrence increments it by 1.

Let us look at the customers table. It looks like:

"STATE_CODE"	"TIMES_PURCHASED"
"NY"	100
"NY"	150
"NY"	100
"NY"	75
"CT"	75

Here, the state code values will be used to create columns: we will have a New York column and a Connecticut column.
Every record contributes to a count: an increment by one if a row with that purchase frequency already exists, otherwise a new row starting at 1. The first record generates a row with 100 as times_purchased value and 1 in the New York column. The second record generates a row with 150 as times_purchased and 1 in the New York column. The third record updates the first row from (100, 1, 0) to (100, 2, 0). The fourth record generates another row. The fifth record updates an existing row from (75, 1, 0) to (75, 1, 1). The end result is:

"Puchase Frequency"	"New York"	"Connecticut"	
75      1       1	
100     2       0	
150     1       0	

Reading and writing in Java

Reading and writing from and to files is not easy in Java. This can already be seen if one simply googles “Java FileReader problem”: that search generated 477,000 hits. Apparently, reading (and writing) is not trivial. I wrote a small program that reads a file and copies its contents to another file. It has been tested on both Windows and Linux and it seems to work.

A minor note on using the programme on Windows. To circumvent problems with the backslash, it is preceded by another backslash. Hence the filename is written as “D:\\bla bla”, where one might expect “D:\bla bla”. The cause is that a backslash in a Java string literal is interpreted as an escape character, not as a directory separator; hence the necessity to use the double backslash “\\” instead of a single one. Then there is the line separator: on Windows a line ends with “\r\n”, whereas on Linux it ends with “\n”.
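Both issues can also be side-stepped with the standard library: Paths.get composes a path without spelling out the separator, and System.lineSeparator() returns “\r\n” on Windows and “\n” on Linux. A minimal sketch, with placeholder file names, not part of the programme below:

import java.nio.file.Path;
import java.nio.file.Paths;

public class SeparatorDemo {
    public static void main(String[] args) {
        // In a Java string literal the backslash itself must be escaped, hence "\\".
        String windowsStyle = "D:\\bla bla\\invoer.txt";

        // Paths.get builds the same path without spelling out the separator.
        Path portable = Paths.get("D:", "bla bla", "invoer.txt");

        // The platform line separator: "\r\n" on Windows, "\n" on Linux.
        String newline = System.lineSeparator();

        System.out.println(windowsStyle);
        System.out.println(portable);
        System.out.println("the line separator consists of " + newline.length() + " character(s)");
    }
}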

I use a buffered reader to handle the input. This enables the programme to be used with large input files.

import java.io.*;
 
public class CopyCharacters {
    public static void main(String[] args) throws IOException {
 
        BufferedReader bufferedReader = null;
        FileWriter outputStream = null;
        String filename = "D:\\Users\\tmaanen\\prive\\java\\invoer.txt";
 
        try {
            // Read the input file through a buffer, so large files pose no problem.
            bufferedReader = new BufferedReader(new FileReader(filename));
            File file = new File("D:\\Users\\tmaanen\\prive\\java\\characteroutput.txt");
            file.createNewFile();
            // Open the output file in append mode.
            outputStream = new FileWriter(file, true);
            int c;
            // Copy the input character by character until end of file (-1).
            while ((c = bufferedReader.read()) != -1) {
                outputStream.write(c);
            }
        }
        catch (Exception e) {
            System.err.format("Exception occurred trying to read '%s'.", filename);
            e.printStackTrace();
        }
        finally {
            if (bufferedReader != null) {
                bufferedReader.close();
            }
            if (outputStream != null) {
                // Terminate the copy with a Windows line ending and close the file.
                outputStream.write("\r\n");
                outputStream.close();
            }
        }
    }
}

Another set of keys and values

Another example of how mappers and reducers are used in a Hadoop context is given below. This programme is created as three classes: an overall class that configures the job, a mapper class and a reducer class.
The mapper class reads a file and emits every word with the value one. The reduce class is used twice: first as a combiner, where the words are grouped and the frequency of each word is calculated, and then as the final reducer, where these results are combined. The programme reads as:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.Date;


@SuppressWarnings("unused")
public class WordCount {

   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
       String line = value.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       File file = new File("/home/hduser/example-mapper.txt");
       if (!file.exists()) {
       	file.createNewFile();
       };
       FileWriter fw = new FileWriter(file.getAbsoluteFile(),true);
       BufferedWriter output1 = new BufferedWriter(fw);
       Date date = new Date();
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         output.collect(word, one);
         output1.append("mappert is jaar " + date.toString() +">"+ word + "    " + one + "\n");
       };
       output1.close();
     }
   }

   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
   	   File file = new File("/home/hduser/example-reducer.txt");
       if (!file.exists()) {
          	file.createNewFile();
          };
       FileWriter fw = new FileWriter(file.getAbsoluteFile(),true);
       BufferedWriter output1 = new BufferedWriter(fw);
       int sum = 0;
       while (values.hasNext()) {
         sum += values.next().get();
         output1.write("mappert is gelezen waarde " + " key "+ key + " sum " + sum + "\n");
       }
       output.collect(key, new IntWritable(sum));
       output1.close();
     }
   }

   public static void main(String[] args) throws Exception {
     JobConf conf = new JobConf(WordCount.class);
     conf.setJobName("wordcount");

     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(IntWritable.class);

     conf.setMapperClass(Map.class);
     conf.setCombinerClass(Reduce.class);
     conf.setReducerClass(Reduce.class);

     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);

     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));

     JobClient.runJob(conf);
   }
}

The input file looks like:

dit is van tom tom

In the mapper, this text is split into separate words:

mappert is jaar Tue Jun 16 05:31:22 PDT 2015>dit    1
mappert is jaar Tue Jun 16 05:31:22 PDT 2015>is    1
mappert is jaar Tue Jun 16 05:31:22 PDT 2015>van    1
mappert is jaar Tue Jun 16 05:31:22 PDT 2015>tom    1
mappert is jaar Tue Jun 16 05:31:22 PDT 2015>tom    1

The keys are the separate words (like “dit”, “is” etc.). The value is always one.
In the first reduce pass, which is executed by the combiner, the frequency of each word is calculated:

mappert is gelezen waarde  key dit sum 1
mappert is gelezen waarde  key is sum 1
mappert is gelezen waarde  key tom sum 1
mappert is gelezen waarde  key tom sum 2
mappert is gelezen waarde  key van sum 1

One sees the calculation when the key “tom” is processed. In the first round the running sum is 1, the value of the first (tom, 1) pair. In the second round another (tom, 1) pair is read and the sum is incremented by 1, giving 2.
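Outside Hadoop, this summing per key can be sketched with a plain Java map. This is only an illustration of what the reduce step does with the (word, 1) pairs; it is not part of the job:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    public static void main(String[] args) {
        // The words emitted by the mapper, each standing for a (word, 1) pair.
        List<String> words = Arrays.asList("dit", "is", "van", "tom", "tom");

        // Group by key and sum the ones, just as the reduce step does.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String word : words) {
            Integer sum = counts.get(word);
            counts.put(word, sum == null ? 1 : sum + 1);
        }

        // Prints {dit=1, is=1, van=1, tom=2}: the same counts as the job output.
        System.out.println(counts);
    }
}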
In the second reduce pass, executed by the final reducer, the results are merged:

mappert is gelezen waarde  key dit sum 1
mappert is gelezen waarde  key is sum 1
mappert is gelezen waarde  key tom sum 2
mappert is gelezen waarde  key van sum 1

This can be found in the output that reads as:

hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop dfs -cat /user/output84/part-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/06/16 06:21:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
dit	1
is	1
tom	2
van	1

Map and reduce – what happens?

In Big Data, the concept of mapping and reducing plays a huge role. The idea is that a massive dataset is split over several servers. On each server, a part of the data is processed; this is the map part. In a subsequent step, these partial results are merged into one outcome; this is the reduce part. The communication between these two parts goes via key-value pairs.
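To make that flow of key-value pairs concrete, here is a toy sketch without Hadoop. It mimics the three steps (map, group by key, reduce) on a few made-up (year, temperature) records; the real example below does the same thing, but distributed:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {
    public static void main(String[] args) {
        // Input records; in the real example every line is a weather observation.
        List<String> lines = Arrays.asList("1975,120", "1975,340", "1976,80");

        // Map phase: turn each line into a (year, temperature) pair and
        // group the pairs by key (this grouping is the "shuffle").
        Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String year = parts[0];
            int temperature = Integer.parseInt(parts[1]);
            List<Integer> temperatures = grouped.get(year);
            if (temperatures == null) {
                temperatures = new ArrayList<Integer>();
                grouped.put(year, temperatures);
            }
            temperatures.add(temperature);
        }

        // Reduce phase: combine all values of one key into a single result.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int temperature : entry.getValue()) {
                max = Math.max(max, temperature);
            }
            System.out.println(entry.getKey() + "\t" + max);
        }
    }
}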
In a well-known example (MaxTemperature), this mechanism is demonstrated in a Java programme. This programme consists of three classes. The first is a supervisory (driver) programme, which is shown below.

// cc MaxTemperature Application to find the maximum temperature in the weather dataset
// vv MaxTemperature
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature [input path] [output path]");
      System.exit(-1);
    }
    @SuppressWarnings("deprecation")
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.out.println("invoer is " + args[0]);   // log the input path ("invoer" = input)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.out.println("uitvoer is " + args[1]);  // log the output path ("uitvoer" = output)
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
// ^^ MaxTemperature

This programme refers to two other classes. The mapper is registered via job.setMapperClass; its code is shown below:

// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.Date;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    // Log the key-value pairs that this mapper emits to a local file.
    File file = new File("/home/hduser/example-mapper.txt");
    if (!file.exists()) {
      file.createNewFile();
    }
    FileWriter fw = new FileWriter(file.getAbsoluteFile(), true);
    BufferedWriter output = new BufferedWriter(fw);
    Date date = new Date();
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    output.append("mappert is jaar " + date.toString() + ">" + year + " temp  " + airTemperature + "\n");
    output.close();
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
// ^^ MaxTemperatureMapper

In this class, the input is read as a key-value pair. In turn, the output is written as a new key-value pair, consisting of a year and a temperature measurement. To know exactly what values are communicated, the key-value pairs are also written to a file. The file (“/home/hduser/example-mapper.txt”) contains these lines:

mappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12341
mappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12342
mappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12343
mappert is jaar Mon Jun 15 05:58:29 PDT 2015>1975 temp  12345

The key-value pairs that are communicated are (1975, 12341), (1975, 12342) and so on. These pairs are processed in the subsequent reducer part, which has this code:


// cc MaxTemperatureReducer Reducer for maximum temperature example
// vv MaxTemperatureReducer
import java.io.IOException;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    // Log every value that is processed, together with the running maximum.
    File file = new File("/home/hduser/example-reducer.txt");
    BufferedWriter output = new BufferedWriter(new FileWriter(file));
    for (IntWritable value : values) {
      int waarde = value.get();
      maxValue = Math.max(maxValue, waarde);
      output.write("mappert is gelezen waarde " + waarde + " max  " + maxValue + "\n");
    }
    context.write(key, new IntWritable(maxValue));
    output.close();
  }
}
// ^^ MaxTemperatureReducer

In this part, too, a file is written that contains the values as they are processed. The values are 12345, 12343 and so on:

mappert is gelezen waarde 12345 max  12345
mappert is gelezen waarde 12343 max  12345
mappert is gelezen waarde 12342 max  12345
mappert is gelezen waarde 12341 max  12345

From these values the maximum is calculated.
The final result (key and maximum) can be read from the HDFS output file with:
/usr/local/hadoop/bin/hadoop dfs -cat /user/output51/part-r-00000. This shows: 1975 12345, which is the final outcome of this exercise.