Monthly Archives: June 2014

Different codepages in SQL Server

It is possible in SQL Server to create columns in a table that have different codepage. This can be shown in the table below. That table has two attributes: char_unicode with a unicode codepage and char_latin that has a Latin codepage with West European characters only.

codepage

The char_latin cannot contain characters like ł, İ or the Φ. Such characters are not part of the Latin codepage as this codepage only contans characters that are in use in Western European languages. To store such characters, we might use the unicode codepage. Let us show how such characters are represented.

Chars

The column char_unicode shows the characters İ, Φ , И and the ł. In the columns char_latin, these characters are inserted in the column but (obviously) couldn’t be represented. It is interesting to see what happens: the İ is replaced by an I; a Φ is replaced by the F, the И is replaced by an unknown and ł is replaced by l.
This allows an easy logical translation from, say İstanbul into Istanbul. This is a very interesting property as we could easy store Polish, Turkish in local characters and store them in a logical understandable way if a Latin codepage is used. Nice.

Codepages

Recently, I encountered the issue of codepages again. The problem was that the name of a Turkish city (İstanbul) was not represented correctly. Again a codepage issue. The character İ was not part of the application.

How did these codepages come around?

In the early days of Windows 95, we had ASCII. This had 127 different characters that were represented with 7 bit. For some network reason an additional bit was necessary which led to the situation that 8 bits (a full byte) was necessary to represent an ASCII character. The set of ASCII characters contains the normal a,b, c etc, both in lower case and upper case. This is added with numerics (0,1,2,3, ..) and signs like the question mark, comma, dot etc. We miss in the ASCII the language specific characters like È, ç characters.
To cover for the Western European languages an additional series of 128 characters was created. The original ASCII plus the West European special characters ( È, ç) could be represented in 8 bits. Hence one byte was sufficient to display all characters in the Western European languages. This list is indicated by extended ASCII, Codepage 1252 or ISO -8859-1 or Latin1. Roughly these codepages allow us to represent West European texts like French, English, Spanish, German and Dutch.

However if one would like to write İstanbul, one encounters the limitations of these codepages. Hence other codepages are created to represent Baltic, Turkish or Czech. In most cases the alphabets that are derived from Latin contain the original ASCII list plus a series of characters (İ, Ď ) that are specific to these languages. The idea is that one byte represents one character.
Likewise, other languages like Russian, Hebrew and Arabic have created a list of characters whereby one byte is used to represent a character.

However, one might run into problems if an attribute should contain both Turkish and French names. What codepage to use? To solve this a new codepage is created: the unicode. Here, roughly two bytes are used which allows to store a nearly infinite set of characters. Standard characters are represented with one byte; more exotic characters use 2, 3 or 4 characters. One might even store Japanese, Korean in unicode.

Unicode certainly has the future. As we speak, one sees many migration documents where a ASCII codepage is migrated to unicode. I am also informed that more and more dbms recommend unicode as their default codepage.