Monthly Archives: October 2012

Skewness in Teradata

Currently, I work on a Teradata machine. Teradata is built on the idea that processes work in parallel: the data are divided over so-called AMPs that each process their share of the data.
If each AMP is to process its share of the data, the data must be divided equally over the AMPs. Only then does each AMP get a fair share of the work. As the slowest AMP determines the total elapsed time, it is advisable to have an equal amount of data allocated to each AMP.
To see to what extent the data are equally divided, one may run this query:

-- per table: the number of AMPs, the skew ratio R and the average
-- and maximum MB that a single AMP holds
SELECT databasename, tablename, creatorname
,dat_created*1 + 19000000 AS dat_created -- internal date format to yyyymmdd
,MAX(amps) AS amps
,MAX(R) AS Ratio
,MAX(AvgMB) AS AvgMB
,MAX(MaxMB) AS MaxMB
FROM (
SELECT t.tablename
, t.databasename
, ts.vproc
, ts.currentperm/1000000.0  AmpMB
, t.creatorname 
,CAST (t.createtimestamp AS DATE) AS dat_created
,AVG(ts.currentperm/1000000.0) OVER 
 (PARTITION BY ts.databasename, ts.tablename) AS AvgMB
,MAX(ts.currentperm/1000000.0) OVER 
 (PARTITION BY ts.databasename, ts.tablename) AS MaxMB
,COUNT(*) OVER (PARTITION BY ts.databasename, ts.tablename) AS AMPS
,CAST ( 10000* (maxmb-avgmb)/maxmb AS INTEGER )  AS R -- 0 = evenly divided
FROM dbc.tablesize ts JOIN  dbc.tables t 
ON ts.databasename = t.databasename AND ts.tablename = t.tablename
WHERE
 t.databasename  IN  ('financial') 
) MY_DISPLACED
GROUP BY 1,2,3,4
ORDER BY Ratio DESC;

The higher the ratio R, the more skewed the division of data over the AMPs.
One influences the division of data by adding “PRIMARY INDEX (attribute name)” to the DDL that creates the table.
Records that have the same value for that attribute are allocated to the same AMP. This implies that the primary index should have as many different values as possible: different values mean that different AMPs get records allocated to them.
An example. Table “slecht” is divided over two AMPs. One AMP contains 0.08 MB and the second contains 0.08 MB as well. This is nicely distributed and SQL that runs on this table will run smoothly. If we look at the values, we notice that the primary index of “slecht” refers to an attribute that has many different values, with no value standing out. If we update table “slecht” by giving the primary index attribute the same value everywhere, we see that one AMP gets 0.16 MB whereas the other AMP gets no records at all. In that case, the SQL will take more time, as one AMP must do all the work.
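
A minimal sketch of such a DDL (the table and column names are made up): choosing a column with many distinct values as the primary index spreads the rows evenly over the AMPs.

create table financial.customer
( customer_id integer not null,
  customer_name varchar(100)
)
primary index (customer_id);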

Get data from Oracle

There exist many ways to get data out of Oracle. After all, this is the core functionality of Oracle: getting data in and out. A nice possibility is to retrieve data by means of a small script. Such a script is straightforward:

-- suppress command echo, page headings, feedback and terminal output,
-- so that only the selected data end up in the spool file
set echo off
set pagesize 0
set feedback off
set linesize 50
set termout off

spool c:/onzin.txt

select '"'||job_id||'";"'||job_id||'";'
||min_salary||';'||max_salary from jobs;

spool off

The two most important lines are “spool”, which designates the file that will receive the data, and the SQL query itself. Such a script can be executed from SQLplus by @“filename”, where “filename” is the file that holds the script. In SQLplus, that looks roughly like this (assuming the script is stored as c:\a.sql):
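
SQL> @c:\a.sql
SQL>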
When I go to the directory c:\, I see the file “onzin.txt”, which contains the data:

"AD_PRES";"AD_PRES";20000;40000                   
"AD_VP";"AD_VP";15000;30000                       
"AD_ASST";"AD_ASST";3000;6000                     
"FI_MGR";"FI_MGR";8200;16000                      

I like this approach: you only need SQLplus as the client application, which is standard in most circumstances. Moreover, it is fast: it depends on the network, but it is possible to retrieve millions of records within a reasonable amount of time. Finally, this approach is repeatable, since the script can be stored in a file.

Final remark

It is possible to execute this approach in a scheduled environment. Suppose we have a script (say a.bat) that looks like:

connect scott/bunvgeni@INTERNET
@C:\a.sql
quit

This script can be executed by a scheduler with the command “sqlplus /nolog @a.bat”. Without human intervention, SQLplus is started and, within SQLplus, a script is executed that retrieves the data.

The foreign key in Oracle XE

Oracle has introduced a small version of its DBMS that is labelled “XE”. This small version can be downloaded and used for free, upon the condition that its usage is for personal study only.
I really like this application: it has a relatively small footprint, it runs on a small machine and it allows you to learn Oracle in your own environment.
It makes it possible to play with the features that Oracle has. Here, I use it to refresh my knowledge of foreign keys.

Let us first start with the creation of two tables:

DROP TABLE "SCOTT"."CHILD";
CREATE TABLE "SCOTT"."CHILD" 
   (	"CHILD_ID" NUMBER NOT NULL ENABLE, 
	"PARENT_ID" NUMBER, 
	"BIJDRAGE" NUMBER, 
	 CONSTRAINT "CHILD_PK" PRIMARY KEY ("CHILD_ID")) ;

DROP TABLE "SCOTT"."PARENT";
CREATE TABLE "SCOTT"."PARENT" 
   (	"PARENT_ID" NUMBER NOT NULL ENABLE, 
	"PARENT_NAAM" VARCHAR2(20 BYTE), 
	 CONSTRAINT "PARENT_PK" 
         PRIMARY KEY ("PARENT_ID")) ;

INSERT INTO "SCOTT"."PARENT" 
(PARENT_ID, PARENT_NAAM) VALUES ('1', 'tom');
INSERT INTO "SCOTT"."PARENT" 
(PARENT_ID, PARENT_NAAM) VALUES ('2', 'ine');

INSERT INTO "SCOTT"."CHILD" 
(CHILD_ID, PARENT_ID, BIJDRAGE) VALUES ('1', '1', '123');
INSERT INTO "SCOTT"."CHILD" 
(CHILD_ID, PARENT_ID, BIJDRAGE) VALUES ('2', '1', '23');
INSERT INTO "SCOTT"."CHILD" 
(CHILD_ID, PARENT_ID, BIJDRAGE) VALUES ('3', '2', '45');
INSERT INTO "SCOTT"."CHILD" 
(CHILD_ID, PARENT_ID, BIJDRAGE) VALUES ('4', '2', '65');
INSERT INTO "SCOTT"."CHILD" 
(CHILD_ID, PARENT_ID, BIJDRAGE) VALUES ('5', '2', '78');

We now have two tables: a child and a parent. With a foreign key relation, I will link the two tables. This is done with:

alter table "CHILD" 
add foreign key("PARENT_ID") 
references "PARENT"("PARENT_ID")
; 

This links the parent to the child. In this case, one record in the “parent” table may be linked to zero, one or more records in the “child” table. It is also true that a record in the child table must always be linked to a record in the parent table. This one-to-many relation can also be seen in a database diagram.


The foreign key relation described above does not allow you to remove a parent record that still has child records. If we want to remove such a record, we must first remove its child records.
To avoid this somewhat cumbersome procedure (first delete the relevant child records, then remove the parent record), we may create the foreign key in a different way:

alter table "CHILD" 
add foreign key("PARENT_ID") 
references "PARENT"("PARENT_ID")
on delete cascade
; 

In that case, removal of a parent record automatically deletes its child records. This can be quite convenient if we have a system where details are stored in a separate table, away from a general table. With the removal of records from the general table, the details are automatically removed as well.
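
A quick check, using the tables from above and assuming the cascading foreign key is in place:

delete from "SCOTT"."PARENT" where PARENT_ID = 2;

-- the three child records with PARENT_ID = 2 are removed as well:
select count(*) from "SCOTT"."CHILD";
-- returns 2: only the children of parent 1 remain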


How many non-nulls per column?

In Oracle, it is quite easy to quickly display how many non-null values we have per column in a table.
The script reads:

set termout off echo off feed off trimspool on head off pages 0

spool C:\countall.tmp

SELECT 'select count('||t.column_name||') from '||t.table_name||';'
FROM    user_tab_columns t
WHERE  t.table_name = 'DATE_D2'
;

spool off

set termout on
@@c:\countall.tmp

set head on feed on

And the results are the counts of non-null values, one count per column.
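
To make this concrete: assuming that DATE_D2 has the columns DATE_ID and MONTH_NAME (made-up names for illustration), the generated file c:\countall.tmp would contain:

select count(DATE_ID) from DATE_D2;
select count(MONTH_NAME) from DATE_D2;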

How many records in a table?

Recently, I encountered a very nice script that gives an overview of the number of records per table in an Oracle schema. First the script:

set termout off echo off feed off trimspool on head off pages 0

spool C:\countall.tmp
select 'SELECT count(*), '''||table_name||''' from '||table_name||';'
from   user_tables
/
spool off

set termout on
@@c:\countall.tmp

set head on feed on

The output then shows one line per table: the record count followed by the table name.

I stored the script in the file c:\a.sql; running it with @c:\a.sql shows the counts in the terminal.
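
With the HR sample schema, for instance, the output would look roughly like this:

       107 EMPLOYEES
        27 DEPARTMENTS
        19 JOBS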

ASCII or EBCDIC: how to recognise it?

Today, I heard someone complain that he had received a file containing “unexpected characters”. Later it was discovered that this file had been provided in EBCDIC and the receiver tried to open it on a UNIX box with an ordinary editor. Such a UNIX box assumes the ASCII layout and, as the file was delivered in EBCDIC, the characters shown were not comprehensible.

What is the difference between ASCII and EBCDIC?

Roughly stated, ASCII and EBCDIC both translate a byte into a character. A byte may have 256 different values: it ranges from 0 to 255, or in hexadecimal format from 00 to FF. The two schemes attach different characters to the same value. A byte with hexadecimal value 20 is interpreted in ASCII as a blank, whereas in EBCDIC the blank is the byte with hexadecimal value 40.
We can see the hexadecimal values when we toggle in a text editor between the normal representation and the hexadecimal representation. On my laptop, I have a text editor (Notepad++) that allows such a toggle. My laptop uses the ASCII representation and toggling shows that “TOM ” is equivalent to the hexadecimal values “54 4F 4D 20”. On Windows and Unix, the ASCII representation is most often used, and text editors on either a Windows box or a UNIX box provide you with the hexadecimal representation.
EBCDIC, on the other hand, is used on IBM mainframes. There, we also have the possibility to show the hexadecimal values.

All right. Assume we get a file that contains strange characters. Do we have an EBCDIC/ASCII mix-up?
If we are working on a Windows or Unix box, we see the ASCII representation. We know that the ASCII character with hexadecimal value 40 is “@”, while that same value is a blank under EBCDIC. Hence: if we see many “@” characters in an editor on a Windows/Unix box, we may suspect that an EBCDIC file has been delivered, whereby the EBCDIC blanks show up as ASCII “@”.
Another possibility is to look at the hexadecimal representation. If we see many “F0” bytes, we may suspect an EBCDIC file, as the EBCDIC hexadecimal value F0 stands for “0”, and almost any file contains many zeroes.
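A few values from the two code charts (the EBCDIC values are from code page 037) summarise this:

character    ASCII (hex)    EBCDIC (hex)
blank        20             40
“0”          30             F0
“@”          40             7C
“A”          41             C1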
Hence: enough clues to find out whether an EBCDIC file has been delivered instead of an ASCII file!

A script to create a sample data warehouse – part 2 the fact

I have created an example data warehouse with just one dimension table and one fact table. We have discussed how the dimension table must be loaded. Once the dimension table is loaded, we may start loading the fact table.

Let us see how that is done.

The fact table looks like:

create table devADW_BAN_MI.factTARGET
(PI DECIMAL(15,0) GENERATED  BY DEFAULT AS IDENTITY
           (START WITH 1
            INCREMENT BY 1
            MINVALUE 1
           NO CYCLE
            ),
 SK decimal(15,0),
 SBLrowid integer,
 getal integer,
 updat  timestamp(0) ,
 VERSION_SDATE timestamp(0)  ,
 version_edate  timestamp(0)   );

Here we have SBLrowid, which provides an indicator of the event. This indicator will be used as the business key of the event. It enables us to differentiate between a renewed recording of an event that we already know of and a completely new event. Whenever a new recording of a known event is received, we would like to label the former recording as obsolete. Likewise, the newest recording should be marked as new, with times attached that indicate from what point in time the recording is valid.
The period in which the data of a record reflect the actual situation is indicated by version_sdate and version_edate. The version_sdate stands for the start of the validity period, whereas the version_edate stands for the end of the validity period.
When a record is received, version_sdate is filled with the current date to indicate the start of the validity period. Likewise, when the record is superseded, the point in time at which this happens is stored as version_edate.
I have included “getal” as an attribute that describes the features of an event. The event is also assumed to be described by attributes that stem from the dimension table. Therefore a reference to that dimension table is included, via the surrogate key of the dimension table. This provides us with all details from the dimension table that are linked to an event.
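
As a sketch of how this reference is used later on, a report query can join the fact table to the dimension table on the surrogate key, taking the current versions only:

select f.SBLrowid, f.getal, d.BK, d.extra
from devADW_BAN_MI.factTARGET f
join devADW_BAN_MI.dimTARGET d
 on f.SK = d.SK
where f.version_edate > current_date
 and d.version_edate > current_date;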

We also include an attribute “updat”. This indicates the time when a record was created in the source. This attribute allows us to select a subset of the input records: records that were processed earlier can be skipped.

Much like the loading of the dimension, a separate table is created that contains the source records. Let us assume that at a certain point in time, we have these records available in the source:

    BK    SBLrowid   getal   updat
1   PIP   3          25      10/3/2012 15:08:29
2   BKO   2          1200    10/3/2012 15:08:29
3   BKO   1          1000    10/3/2012 15:08:29
4   XYZ   4          400     10/3/2012 15:08:30
5   TBO   2          1300    10/3/2012 15:11:13
6   BKO   5          95      10/3/2012 15:11:14
7   XYZ   4          400     10/3/2012 15:11:14

Let us assume that the first 4 records have already been loaded into the target fact table. That leaves 3 records to be loaded. These three records represent three different situations.

  1. The first situation concerns records that are updates of an event that was reported earlier. Here, the event is indicated by SBLrowid. We notice that for SBLrowid=2, we earlier received a record with [BKO, 1200], whereas an update is sent with [TBO, 1300]. To handle this, we first create an intermediate table that only contains updates on events that are not yet processed in the data warehouse.
    --create a temp table
    drop table devADW_BAN_MI.temp;
    create table devADW_BAN_MI.temp
    ( SK decimal(15,0),
     SBLrowid integer,
     getal integer,
     updat  timestamp(0) 
    );
    insert into  devADW_BAN_MI.temp 
      select A.SK,A.SBLrowid,A.getal,A.updat
      from (select SK, SBLrowid, getal, updat  
            from devADW_BAN_MI.factSOURCE
            left join devADW_BAN_MI.dimTARGET 
            on factSOURCE.BK = dimTARGET.BK
            where version_edate > CURRENT_DATE
            and (updat > (select max(updat) 
                 from devADW_BAN_MI.factTARGET)
                 )
            ) A, 
      (select SK, SBLrowid, getal, updat  
       from devADW_BAN_MI.factTARGET 
       where version_edate > CURRENT_DATE) B
    where A.SBLrowid=B.SBLrowid 
       and 
    (A.SK||A.getal != B.SK||B.getal); -- content differs from the current target version

    These records are subsequently used in two different actions. The first action is to update the records in the target table for which an update has been received:

    update devADW_BAN_MI.factTARGET
    set version_edate = CURRENT_TIME 
    where  (SBLrowid  in 
    (select SBLrowid from devADW_BAN_MI.temp))
     and version_edate > CURRENT_DATE;

    The second action is to insert the records with the newest version of the events:

    insert into devADW_BAN_MI.factTARGET
    (SK,SBLrowid, getal,updat,version_sdate,version_edate) 
      select SK,
      SBLrowid, 
      getal,
      updat, 
      CURRENT_time,  
      cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss') 
     from devADW_BAN_MI.temp
     ;
  2. The second situation is that records on events are received from the source that have not yet been processed into the target table. These can be inserted by:
    --new records
    insert into devADW_BAN_MI.factTARGET
    (SK,SBLrowid, getal,updat,version_sdate,version_edate) 
     select 
      B.SK,
      A.SBLrowid, 
      A.getal,
      A.updat, 
      current_time, 
      cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss')
     from 
      (select BK, SBLrowid, getal, updat  
       from devADW_BAN_MI.factSOURCE 
       where  SBLrowid not in 
       (select SBLrowid from devADW_BAN_MI.factTARGET)) A 
       left join 
       (select BK, SK from devADW_BAN_MI.dimTARGET 
       where version_edate > CURRENT_DATE) B
       on A.BK=B.BK
      ;

    This insert statement has two characteristics. The first characteristic is that it selects records on events from the source that are not yet included in the target table. The second characteristic is that it looks up the surrogate key in the dimension table with the business key that is delivered. From the source, we get a business key on items that we have stored in the dimension table. This business key is translated into a surrogate key via the dimension table. We need that surrogate key to identify exactly one record within the dimension table: the record that stores the correct version of the dimension.

  3. The third situation is when records are sent from the source on events that are already stored in the target table, with content that is exactly equal between source and target. In that case, no action needs to be taken; the check below illustrates this.
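
    A count like the following (a sketch that uses the tables above) shows how many source records fall into this third category:

    select count(*)
    from devADW_BAN_MI.factSOURCE s
    join devADW_BAN_MI.dimTARGET d
     on s.BK = d.BK and d.version_edate > CURRENT_DATE
    join devADW_BAN_MI.factTARGET f
     on s.SBLrowid = f.SBLrowid and f.version_edate > CURRENT_DATE
    where d.SK = f.SK
     and s.getal = f.getal;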

A script to create a sample data warehouse – part 1 the dimension

In the last week, I have written a sample script to create a skeleton data warehouse. I will use this as a head start for future work, in which the skeleton will be used to implement an actual data warehouse.

The skeleton is written for a Teradata DBMS. This is the platform that I currently use.

The example data warehouse has two tables: a dimension table and a fact table. Hence we have a small example that can be expanded in the future.

Let us start with the dimension table.
See also the script with ExampleDW
The dimension table is created with:

drop table devADW_BAN_MI.dimTARGET;

create table devADW_BAN_MI.dimTARGET
(SK DECIMAL(15,0) GENERATED  BY DEFAULT AS IDENTITY
           (START WITH 1
            INCREMENT BY 1
            MINVALUE 1
           NO CYCLE
            ),
 BK varchar(3),
 extra varchar(256),
 VERSION_SDATE timestamp(0)  ,
 version_edate  timestamp(0)   );

This dimension has a surrogate key (SK) that is filled by an identity facility of the DBMS. This means that every record gets an attribute that can be used as a primary key. This SK is the primary key: it uniquely identifies a record within the table.
The table also has a business key (BK). This is the value by which the business identifies an object (say a product, a contract or a customer). With the business key, we are able to know whether we already have an earlier version of the object. If we get a new record on an object that is already in the table, we may act accordingly.
In our case, we assume that we want to act in the following way whenever a new record is found on an object that we already have in the data warehouse.

  • The record is added to the table.
  • The record that describes the old state of the object is labelled as ‘old version’.

In the table we have an attribute “extra”, that acts as an example of a descriptive attribute. It can be a colour of the object, the size, the owner etc.

We also encounter an attribute “version_sdate”  which indicates the starting point of validity of the record. We assume that an object generates over time a series of records. Once the record is sent as the newest version of the object, it is stored in the data warehouse. We use the attribute version_sdate as the starting point in time after which the record is seen as the valid version of the object.

Likewise, we have a “version_edate” that indicates the finishing point in time. If a record is stored in the data warehouse and a new version is included, the old record is updated with a version_edate equal to that finishing point in time.
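
These two dates make it easy to select the version of an object that is valid right now. A sketch, using the dimension table above and a business key (‘BKO’) that is loaded further on:

select SK, BK, extra, version_sdate
from devADW_BAN_MI.dimTARGET
where BK = 'BKO'
and version_edate > CURRENT_DATE;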

A source table is created with:

create table devADW_BAN_MI.dimSOURCE
(BK varchar(3),
 extra varchar(256));

This represents the source of data. This source contains two attributes: a business key (BK) that identifies an object and an additional attribute (extra) that contains a description of the object.

Let us then load the source:

insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('BKO','BKO extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('XYZ','XYZ extra');

OK, we now have a source that is loaded and a dimension table that is still empty. The dimension table can be loaded initially with:

insert into devADW_BAN_MI.dimTARGET(BK,extra,version_sdate,version_edate)
select BK,extra, current_time, cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss')  
from devADW_BAN_MI.dimSOURCE;

This gives:

	
    SK       BK   extra      VERSION_SDATE       version_edate
1   100,001  XYZ  XYZ extra  10/2/2012 16:29:15  12/31/9999 00:00:00
2   300,001  BKO  BKO extra  10/2/2012 16:29:15  12/31/9999 00:00:00
3   1        SRM  SRM extra  10/2/2012 16:29:15  12/31/9999 00:00:00
4   200,001  TBO  TBO extra  10/2/2012 16:29:15  12/31/9999 00:00:00

Let us continue with a new data load for the dimension table:

delete  from devADW_BAN_MI.dimSOURCE;
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM nog meer');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('ABC','ABC extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');

We first deleted all records in the source. This is followed by loading three records.

In an incremental load step, we may have three situations:

  • the record represents an object that is not yet included in the dimension table. Whether or not this is the case can be assessed with the business key: if a new business key is encountered, we have this situation. The algorithm is quite simple: do we have a record in the source for which the business key is not yet included in the target dimension table? The SQL reads:
--NEW records in source
insert into devADW_BAN_MI.dimTARGET
 (BK,extra,version_sdate,version_edate)
 select 
  BK,
  extra, 
  CURRENT_time, 
  cast ('99991231000000' as TIMESTAMP(0) 
     FORMAT 'yyyymmddhhmiss') 
 from devADW_BAN_MI.dimSOURCE 
  where BK not in 
  (select BK 
   from devADW_BAN_MI.dimTARGET 
   where version_edate > CURRENT_DATE
  );
  • the record in the source is already included, exactly, in the target table. In our case, this means a combination (BK, extra) that is already present in the target table. In that case we have no reason to apply any change to the target table: the data are already present. We may go on looking for records in the source that do need action.
  • the record in the source table represents an update of an object that is already present in the data warehouse. In that case we have a business key (BK) that is already present in the dimension table, while the content of the additional attributes differs from what we have in the dimension table: “extra” in the source is not equal to “extra” in the target table. We identify these records as:
    create table devADW_BAN_MI.temp
    (BK varchar(3),
     extra varchar(256));
    insert into  devADW_BAN_MI.temp 
    select B.BK, A.extra
    from 
    (select BK, extra from devADW_BAN_MI.dimSOURCE) A,
    (select BK, extra from devADW_BAN_MI.dimTARGET 
      where version_edate > CURRENT_DATE) B
    where A.BK=B.BK and A.extra != B.extra;
  • Once the records that contain an update are identified, we must take two actions:

    1. Update records in the data warehouse that represent an obsolete version of the object. This update consists of setting version_edate to a date that indicates the end of a validity period.
      update devADW_BAN_MI.dimTARGET
      set version_edate = CURRENT_TIME 
      where  
      (BK  in (select BK from devADW_BAN_MI.temp))  
      and 
      version_edate > CURRENT_DATE; 
    2. Insert records that represent the newest update on the objects.
      insert into devADW_BAN_MI.dimTARGET
      (BK,extra,version_sdate,version_edate)
      select BK,extra, CURRENT_time,  
      cast ('99991231000000' as TIMESTAMP(0) 
         FORMAT 'yyyymmddhhmiss')  
      from devADW_BAN_MI.temp;
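
After these two actions, the dimension table can be inspected with a query like the sketch below: SRM should now have a closed old version next to a new current one, ABC should appear as a new object and TBO should be untouched.

select SK, BK, extra, version_sdate, version_edate
from devADW_BAN_MI.dimTARGET
order by BK, version_sdate;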