Using IDA

IBM launched InfoSphere Data Architect (IDA), a data modelling tool that competes with Erwin and PowerDesigner. I played with the tool a bit to see what its capabilities are and in which areas it differs from Erwin and PowerDesigner. As a general remark, I must say that I…

Create a description of a system

From time to time, I must provide a diagram of how a system works. It is not really easy to find a technique in which all details can be shown. I want to differentiate between processes (such as sending mail), databases and applications. I also want to indicate which techniques are used. Moreover, I would…

ElasticSearch: Restful services

As we have seen in a previous post, we communicate with the ElasticSearch server via messages that the client sends to the server. The server, in turn, responds with messages that are received by the client. This system of messages is called a “RESTful” structure. This RESTful structure is based on messages that…
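The request side of such a message exchange can be sketched in Scala with the JDK's built-in `java.net.http` classes. This is only an illustration under stated assumptions: the node address `http://localhost:9200` (the default HTTP port), the index name `customers`, the document id and the `_doc` path are all hypothetical choices and depend on the ElasticSearch version in use. The requests are built and inspected, not sent, so no running server is needed.

```scala
import java.net.URI
import java.net.http.HttpRequest

object RestSketch {
  // Assumption: a local node listening on the default HTTP port 9200.
  val base = "http://localhost:9200"

  // Build (but do not send) a GET message that asks for one document.
  def getDocument(index: String, id: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$base/$index/_doc/$id"))
      .GET()
      .build()

  // Build (but do not send) a PUT message that carries a JSON document as its body.
  def putDocument(index: String, id: String, json: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$base/$index/_doc/$id"))
      .header("Content-Type", "application/json")
      .PUT(HttpRequest.BodyPublishers.ofString(json))
      .build()

  def main(args: Array[String]): Unit = {
    // "customers" and "1" are hypothetical names used for illustration only.
    val get = getDocument("customers", "1")
    println(s"${get.method()} ${get.uri()}")
    val put = putDocument("customers", "1", """{"name": "Alice"}""")
    println(s"${put.method()} ${put.uri()}")
  }
}
```

Each message is fully described by an HTTP verb, a URL and an optional JSON body, which is what makes the interface RESTful. Sending it would additionally require an `HttpClient` and a reachable server.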

Curl and ElasticSearch

One of the most useful utilities is “curl”. This wonderful tool can be used to transfer data from one platform to another. It is relatively easy to install on Windows, whereas on Linux it is often already installed. It must be run from the terminal in Linux or the command line in Windows. One example…


A new and popular NoSQL database is ElasticSearch. This database is easy to install and easy to run. But is it easy to insert data and extract the results? The principle of inserting data into ElasticSearch looks rather straightforward: one inserts JSON files. On the other hand, with filters, one may…

Scala merging files

In a previous post, I showed how two files can be merged in Scala. The idea was that the RDDs were converted to data frames, which were then joined. In this post, the approach is slightly different. Now each RDD is rewritten as a set of key-value pairs with a unique key. This then allows…
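The key-value idea can be illustrated without a cluster. The sketch below deliberately uses plain Scala collections instead of Spark RDDs, so it is a stand-in for the technique and not the code from the post. The `key,value` line layout and the sample names are assumptions made for the example. The result has the shape `(key, (left, right))`, the same shape that a join on Spark pair RDDs returns.

```scala
object KeyJoinSketch {
  // Turn "key,value" lines into key-value pairs; the key is assumed to be unique.
  def toPairs(lines: Seq[String]): Map[Int, String] =
    lines.map { line =>
      val parts = line.split(",", 2)
      parts(0).trim.toInt -> parts(1).trim
    }.toMap

  // Inner join on the key: only keys present on both sides survive.
  def joinByKey(left: Map[Int, String],
                right: Map[Int, String]): Seq[(Int, (String, String))] =
    left.keys.toSeq.sorted.flatMap { k =>
      right.get(k).map(r => (k, (left(k), r)))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for the two input files.
    val names  = toPairs(Seq("1,Alice", "2,Bob", "3,Carol"))
    val cities = toPairs(Seq("1,Amsterdam", "3,Cairo"))
    joinByKey(names, cities).foreach(println)
  }
}
```

Because the key is unique on each side, every key yields at most one merged record. Key 2 is dropped here because it has no partner in the second data set.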

Scala and RDDs

RDDs are the basic unit in Scala on Spark. The abbreviation stands for Resilient Distributed Dataset. This shows that we are talking about complete data sets that are spread over a distributed cluster and can be rebuilt after a failure. So the unit of work is comparable to a table. We have two different operations on this RDD. These are a…

Merging files in Scala

I understand that Scala may be used in an ETL context. In ETL, an important step is merging two files: we get data from different sources, and they must be merged into a single file. As an example, we may think of two files, one containing a number and a name, another…

Getting a histogram from Big Data with Scala

Scala can be used as a tool to manipulate big data. If it is used in the Spark context, we can combine two strong tools: Spark, with its ability to bypass the MapReduce bottleneck, and Scala, with its short learning curve. The idea that Scala can be closely integrated with Spark is…
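As a rough idea of what computing a histogram involves, the sketch below counts values per fixed-width bucket. It uses plain Scala collections instead of Spark, so it is a small-scale stand-in and not the approach of the post. The bucket width and the sample values are assumptions chosen for illustration.

```scala
object HistogramSketch {
  // Count how many values fall into each bucket [start, start + width).
  // Returns (bucket start, count) pairs sorted by bucket start.
  def histogram(values: Seq[Double], width: Double): Seq[(Double, Int)] =
    values
      .groupBy(v => math.floor(v / width) * width) // map each value to its bucket start
      .map { case (start, vs) => (start, vs.size) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values, bucket width 5.
    histogram(Seq(1.0, 2.5, 3.0, 7.5, 12.0), 5.0).foreach(println)
  }
}
```

Empty buckets do not appear in the result, because only buckets that receive at least one value are created by the grouping step.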


Scala is a general-purpose language. One may use it as a statistical tool, as a tool for pattern matching, and so on, just like any other programming language such as Java, C++ or Fortran. But on top of that, Scala is used as a means to steer Big Data on a Hadoop…