Monthly Archives: August 2015

Serialise

I encountered the term “serialise”. But what does it mean?
It became clear to me when I read a comment explaining that data structures, such as an object or an array, are created inside a language, say PHP. In that form they exist only inside PHP and cannot be transported outside it. To transport such a structure, one needs to translate it into strings and numbers. PHP has a very convenient function, serialize, that does this for you. Many other languages offer something comparable; where they do not, you have to write your own serialise function that exports the strings and numbers from an object.

$a= array( 'piet', 'jan', 'klaas');
print_r($a);
$b=serialize($a);
print_r($b);
$c=unserialize($b);
print_r($c);

The output looks like:

Array ( [0] => piet [1] => jan [2] => klaas ) 

a:3:{i:0;s:4:"piet";i:1;s:3:"jan";i:2;s:5:"klaas";}

Array ( [0] => piet [1] => jan [2] => klaas )

The first line shows the PHP representation of an array. The second line shows a string that can be understood by an outside application. After that, the operation is reversed and the original, PHP-internal, representation is provided.
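
To illustrate that the serialised string really is language-neutral, the same format can be produced and read back outside PHP. The following Java sketch handles only the shape shown above (a list of strings with integer keys). The class name PhpArrayCodec and its methods are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhpArrayCodec {

    // Encode a list of strings the way PHP's serialize() encodes a list-style array.
    public static String serialize(List<String> items) {
        StringBuilder out = new StringBuilder();
        out.append("a:").append(items.size()).append(":{");
        for (int i = 0; i < items.size(); i++) {
            String s = items.get(i);
            // PHP records the string length in bytes, not in characters.
            int byteLength = s.getBytes(StandardCharsets.UTF_8).length;
            out.append("i:").append(i).append(';');
            out.append("s:").append(byteLength).append(":\"").append(s).append("\";");
        }
        return out.append('}').toString();
    }

    // Decode only the shape produced above: integer keys and string values.
    public static List<String> unserialize(String data) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        List<String> items = new ArrayList<>();
        int pos = indexOf(bytes, '{', 0) + 1;
        while (bytes[pos] != '}') {
            pos = indexOf(bytes, ';', pos) + 1;        // skip the key, e.g. i:0;
            int lenStart = pos + 2;                    // skip the s: prefix
            int lenEnd = indexOf(bytes, ':', lenStart);
            int len = Integer.parseInt(
                new String(bytes, lenStart, lenEnd - lenStart, StandardCharsets.US_ASCII));
            int valueStart = lenEnd + 2;               // skip the colon and the opening quote
            items.add(new String(bytes, valueStart, len, StandardCharsets.UTF_8));
            pos = valueStart + len + 2;                // skip the closing quote and the semicolon
        }
        return items;
    }

    private static int indexOf(byte[] bytes, char c, int from) {
        for (int i = from; i < bytes.length; i++) {
            if (bytes[i] == c) {
                return i;
            }
        }
        throw new IllegalArgumentException("Expected '" + c + "' after position " + from);
    }

    public static void main(String[] args) {
        String b = serialize(Arrays.asList("piet", "jan", "klaas"));
        System.out.println(b);
        System.out.println(unserialize(b));
    }
}
```

The string printed by the first println is the same as the second line of the PHP output above, so a Java programme and a PHP script could exchange such an array through a file or a socket.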

Avro in Java

Another example shows a similar idea. In this example a stream is created that consists of three objects, each containing a name and a number. Once the stream is created, it is serialised. In other words: the stream is prepared to be stored. It is stored in a file called “test.avro”.
Before continuing, one remark on serialisation.
The idea of serialisation is to create a format that is understandable outside the original language. An object that is created in Java can only be handled inside Java. To communicate its content, one needs a format that other languages understand, such as strings and integers. The translation from an object into strings and integers is called serialisation. In this case, everything is translated into strings and integers, which are written to a file. That file can be read by, say, PHP or Oracle, as they will only encounter strings and integers.
In that file, we may detect both the schema along which the data are stored and the actual data. Looking at it takes a bit of courage, as it is a binary file. Subsequently, the programme reads the data back from that file and shows the contents.
The programme is written in Java. It reads like:

package avro;

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;

class EmployeeTom
{
	public static Schema SCHEMA;
	
	static {
		// Load the schema from EmployeeTom.avsc, which sits next to the class file.
		try {
			SCHEMA = new Schema.Parser().parse(EmployeeTom.class.getResourceAsStream("EmployeeTom.avsc"));
		}
		catch (IOException e)
		{
			System.out.println("Couldn't load a schema: "+e.getMessage());
		}
	}
	
	private String name;
	private int age;

	public EmployeeTom(String name, int age){
		this.name = name;
		this.age = age;
	}
	// Copy the fields of this object into a generic Avro record.
	public GenericData.Record serialize() {
		GenericData.Record record = new GenericData.Record(SCHEMA);
		record.put("name", this.name);
		record.put("age", this.age);
		return record;
	}

	// Write the schema once, followed by one record per employee.
	public static void testWrite(File file, EmployeeTom[] people) throws IOException {
		GenericDatumWriter<GenericData.Record> datum = new GenericDatumWriter<>(EmployeeTom.SCHEMA);
		try (DataFileWriter<GenericData.Record> writer = new DataFileWriter<>(datum)) {
			writer.create(EmployeeTom.SCHEMA, file);
			for (EmployeeTom p : people) {
				writer.append(p.serialize());
			}
		}
	}

	// Read the records back; the schema is taken from the file itself.
	public static void testRead(File file) throws IOException {
		GenericDatumReader<GenericData.Record> datum = new GenericDatumReader<>();
		try (DataFileReader<GenericData.Record> reader = new DataFileReader<>(file, datum)) {
			GenericData.Record record = null;
			while (reader.hasNext()) {
				// Pass the previous record to next() so that it can be reused.
				record = reader.next(record);
				System.out.println("Name " + record.get("name") +
				                    " Age " + record.get("age"));
			}
		}
	}
	public static void main(String[] args) {
		EmployeeTom e1 = new EmployeeTom("Joe",31);
		EmployeeTom e2 = new EmployeeTom("Jane",30);
		EmployeeTom e3 = new EmployeeTom("Zoe",21);
		EmployeeTom[] all = new EmployeeTom[] {e1,e2,e3};

		File bf = new File("test.avro");
		
		try {
			testWrite(bf,all);
			testRead(bf);
		}
		catch (IOException e) {
			System.out.println("Main: "+e.getMessage());			
		}
	}
	
}

A final remark. I stored the schema in the same directory as the class files. This allowed the class EmployeeTom to find the schema file. The schema looked like:

{
  "type": "record", 
  "name": "Employee", 
  "fields": [
      {"name": "name", "type": "string"},
      {"name": "age", "type": "int"}
  ]
}
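
What ends up in the binary file for each record can be worked out by hand. According to the Avro specification, a record is written as its field values in schema order, without the field names; a string is written as its length followed by its UTF-8 bytes, and ints and lengths use a zig-zag variable-length encoding. The sketch below applies these rules to the first employee using only the Java standard library. The class and method names are illustrative, and the file header (which holds the schema) and the block framing of the container file are left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroRecordBytes {

    // Avro writes ints and longs zig-zag encoded, seven bits per byte,
    // with the high bit set on every byte except the last.
    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
    }

    // A string is its byte length (encoded as above) followed by the UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Field order follows the schema: first "name", then "age".
    public static byte[] encodeEmployee(String name, int age) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, age);
        return out.toByteArray();
    }

    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "06 4a 6f 65 3e"
        System.out.println(hex(encodeEmployee("Joe", 31)));
    }
}
```

The five bytes are the length 3 (zig-zag encoded as 06), the letters J, o and e, and the age 31 (zig-zag encoded as 3e). The field names occur nowhere in the record itself; they live in the schema only.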

Sending data via AVRO

I got a better understanding when I used AVRO to write data via PHP and to read them via Java. It demonstrated to me how data can be written in one language and subsequently be read in another.
I use a file: PHP writes the data to it and Java subsequently reads them.
The question then is: what is the advantage of using AVRO to write the data to a file, compared to an ordinary CSV file or a more advanced XML format?
Let us first write the data via PHP, using this script.
We have now written some data to a file in PHP. The nice thing about it is that the data are written together with their description, as provided by the schema. The schema shows that we write records with two elements: a number and a name. The schema is written only once and the data follow it.
Compared to a normal CSV file, the difference is that the schema is added to the file. Hence a programme that reads the data can verify that they are in the correct format.
One may argue that this is similar to XML, but in XML the element names are repeated with every data element. Hence XML files tend to be very large compared to CSV. An Avro file avoids this by providing the schema just once.
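
The difference in repetition can be made concrete with a small Java sketch. It writes two example member records once as XML and once as a header line followed by plain rows, and counts how often a field name appears. The header-plus-rows text is only a stand-in for the schema-once idea, not the actual Avro format, and the field names and types are assumptions, since the schema file is not shown here.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaOnceDemo {

    // Assumed field names, for illustration only.
    static final String[] FIELDS = {"member_id", "member_name"};

    // XML repeats every field name twice per record: opening and closing tag.
    public static String toXml(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("<members>");
        for (String[] row : rows) {
            sb.append("<member>");
            for (int i = 0; i < FIELDS.length; i++) {
                sb.append('<').append(FIELDS[i]).append('>')
                  .append(row[i])
                  .append("</").append(FIELDS[i]).append('>');
            }
            sb.append("</member>");
        }
        return sb.append("</members>").toString();
    }

    // Stand-in for a schema-once layout: one describing line, then only values.
    public static String toSchemaOnce(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("member_id:int,member_name:string\n");
        for (String[] row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }

    // Count the occurrences of a word in a text.
    public static int count(String text, String word) {
        int n = 0;
        for (int i = text.indexOf(word); i >= 0; i = text.indexOf(word, i + 1)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"1392", "Jose"},
            new String[] {"1642", "Maria"});
        System.out.println("XML:         " + count(toXml(rows), "member_name"));
        System.out.println("Schema once: " + count(toSchemaOnce(rows), "member_name"));
    }
}
```

With two records the field name already appears four times in the XML and once in the schema-once text; the XML count grows with every record, the other does not.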

The file may be read in Java by:

package avro;

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Parser;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class GenericMain {
	public static void main(String[] args) throws IOException {
		Schema schema = new Parser().parse(new File("C:/inetpub/wwwroot/user.avsc"));
		File file = new File("C:/inetpub/wwwroot/data.avr");
		DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
		// try-with-resources closes the file once all records have been read.
		try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader)) {
			GenericRecord member = null;
			while (dataFileReader.hasNext()) {
				// Reuse the member object by passing it to next(). This saves us from
				// allocating and garbage collecting many objects for files with
				// many items.
				member = dataFileReader.next(member);
				System.out.println(member);
			}
		}
	}
}

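This demonstrates that the data can be read in another language. In this case, Java is used to read the file. The Java programme just needs to know where the data are stored (in data.avr) and what the schema looks like (provided in user.avsc). After that, the file can be read and its records can be accessed. Strictly speaking, the file already contains the schema it was written with, which is why the first Java programme could call reader.getSchema(); the schema passed here tells Avro what the reader expects.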

Avro – getting it to work

When you read about Hadoop, you come across AVRO. This is a mechanism to exchange data via streams, and it is named after the famous British aircraft manufacturer that, amongst many other types, delivered the Lancaster that helped to liberate Europe. AVRO is available in many languages, amongst them PHP. Before continuing, let us run a programme with AVRO included.
PHP can be run as a webpage. One could also run it from the command line, but running it as a webpage is probably the easiest way. One needs to install PHP, but this is abundantly described elsewhere. I installed PHP as an executable that is run via CGI under IIS. One can verify that PHP runs smoothly if one is able to run a so-called phpinfo() command. Examples can be found on the internet. This phpinfo() is often used as a “HelloWorld” application: it is easy to program and it allows the user to verify that PHP is working.
One continues by installing AVRO, which is a set of PHP programmes that can be installed in the web root directory.
Other things must be changed as well before AVRO can be used from PHP. One must include a DLL that must also be mentioned in the php.ini file: somewhere in php.ini, one must include a line like extension=php_gmp.dll to have it loaded by PHP.
As one needs to write a file, the user privileges need to be set accordingly.
I think one might now give it a go. Try to run this page. If one sees something like

 from file: array ( 'member_id' => 1392, 'member_name' => 'Jose', ) array ( 'member_id' => 1642, 'member_name' => 'Maria', ) from binary string: array ( 'member_id' => 1392, 'member_name' => 'Jose', ) array ( 'member_id' => 1642, 'member_name' => 'Maria', )

, one is on the right track.
What does this programme do? First, it creates an avro file. This is a binary file that contains the structure and the data. This file looks like:

In a subsequent step, the contents of the file are written to the screen. This example shows that avro files are a nice alternative to ordinary text files, as the files contain the definition of the attributes.
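
That a file really is an Avro file can be checked without any Avro library. Per the Avro specification, an Avro container file starts with the four bytes 'O', 'b', 'j' and 1, followed by metadata that includes the schema as JSON. A minimal Java sketch of that check follows; the class and method names are made up for illustration, and the default file name is only an example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AvroMagicCheck {

    // The four bytes with which an Avro container file starts: 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Return true if the given bytes begin with the Avro magic marker.
    public static boolean hasAvroMagic(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Pass the file to inspect as the first argument; data.avr is only a default.
        String path = args.length > 0 ? args[0] : "data.avr";
        // Reading the whole file is fine for small examples like this one.
        byte[] content = Files.readAllBytes(Paths.get(path));
        System.out.println(path + (hasAvroMagic(content)
            ? " starts with the Avro marker"
            : " does not look like an Avro container file"));
    }
}
```

Opening the same file in a text editor shows the schema as readable JSON near the start, surrounded by binary data.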