www.destructor.de
Character Sets are an issue every programmer has to deal with one day. This is an overview of the most important character sets.
Name | Bytes per Character | Description | Range | IANA/MIME Code |
---|---|---|---|---|
7-bit ASCII | 1 | The mother of all character sets. Contains 32 invisible control characters (plus DEL at code 127), the Latin letters A-Z and a-z, the Arabic digits 0-9, and a set of punctuation characters. | 0..127 | US-ASCII |
Unicode based character sets | ||||
Unicode, ISO 10646 | N.A. | A universal code for every character anyone can think of. Defines characters and assigns each of them a scalar value, but does not define how characters are rendered graphically or stored in memory. | U+0000..U+10FFFF | N.A. |
UTF-8 | 1..4 | A Unicode transformation format that uses 1 byte for every 7-bit US-ASCII character and sequences of 2 to 4 bytes for all other Unicode characters. (The original definition allowed sequences of up to 6 bytes; RFC 3629 limits UTF-8 to 4.) | All Unicode characters | UTF-8 |
UCS-2 | 2 | A Unicode encoding that uses 2 bytes (16 bits) for every character. It cannot represent Unicode scalar values above U+FFFF and is therefore obsolete. | U+0000..U+FFFF | ISO-10646-UCS-2 |
UTF-16 | 2 or 4 | A Unicode transformation format built on 16-bit words. Characters in the U+0000..U+FFFF range take one word (2 bytes). Using the concept of "Surrogate Pairs", every other character is stored as two contiguous 16-bit words (4 bytes), so this format is able to store all Unicode characters. | All Unicode characters | UTF-16 |
UCS-4, UTF-32 | 4 | Two Unicode encodings that use 4 bytes (32 bits) for every character. UCS-4 and UTF-32 are the only character sets that can represent all Unicode characters in words of equal length. UCS-4 and UTF-32 are technically identical. | All Unicode characters | ISO-10646-UCS-4, UTF-32 |
Single byte character sets | ||||
ISO 8859-x | 1 | An extension of US-ASCII using the eighth bit. | 0..127, 160..255 | ISO-8859-x |
Windows 125x | 1 | Similar to ISO 8859-x, some characters changed, plus additional characters in the 128..159 range. | 0..255 | windows-125x |
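The byte counts in the table can be checked with a few lines of Python. This is a minimal sketch: the helper name `encoded_lengths` and the sample characters are chosen for illustration only, and the `-le` codec variants are used so that no byte order mark is counted.

```python
# Compare how many bytes each Unicode encoding needs for a few sample characters.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, Euro sign, an emoji

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only the character outside the U+0000..U+FFFF range needs 4 bytes in UTF-16 (a surrogate pair), while UTF-32 uses 4 bytes for every character.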
The ISO 8859 character sets are extensions of ASCII where the 8th bit is used. The 0..127 range is identical to US-ASCII.
Name | Short Name | Covered Languages | MS Windows counterpart |
---|---|---|---|
ISO 8859-1 | Latin-1 | Western and West European languages (English, German, French, Spanish, Portuguese, etc.). As these languages are used in large parts of the world (Europe, Americas, Australia, Africa), these are the most widely used character sets. Windows 1252 and ISO 8859-1 are equal in the 160..255 range. | windows-1252 |
ISO 8859-2 | Latin-2 | Central and East European languages (Czech, Polish, etc.) | windows-1250 |
ISO 8859-3 | Latin-3 | South European, Maltese, Esperanto | |
ISO 8859-4 | Latin-4 | North European | |
ISO 8859-5 | Cyrillic | Russian, Ukrainian | windows-1251 |
ISO 8859-6 | Arabic | Arabic | windows-1256 |
ISO 8859-7 | Greek | Modern Greek | windows-1253 |
ISO 8859-8 | Hebrew | Hebrew | windows-1255 |
ISO 8859-9 | Latin-5 | Turkish | windows-1254 |
ISO 8859-10 | Latin-6 | Nordic (Sami, Inuit, Icelandic) | |
ISO 8859-11 | Thai | Thai | windows-874 |
ISO 8859-13 | Latin-7 | Baltic | windows-1257 |
ISO 8859-14 | Latin-8 | Celtic | |
ISO 8859-15 | Latin-9 | Western European languages. Similar to ISO 8859-1, adds Euro sign (€) and a few other characters | |
ISO 8859-16 | Latin-10 | South Eastern European languages (Albanian, Croatian, Hungarian, Italian, Polish, Romanian, Slovenian, but also Finnish, French, German and Irish Gaelic) | |
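To see how two members of the ISO 8859 family differ, one can decode every possible byte value with both and compare the results. The following is only a sketch using Python's built-in codecs; the function name `differing_bytes` is made up for this example.

```python
def differing_bytes(enc_a, enc_b):
    """List (byte, char_a, char_b) for every byte value the two codecs decode differently."""
    diffs = []
    for b in range(256):
        raw = bytes([b])
        # errors="replace" keeps the loop going if a codec leaves a byte undefined
        char_a = raw.decode(enc_a, errors="replace")
        char_b = raw.decode(enc_b, errors="replace")
        if char_a != char_b:
            diffs.append((b, char_a, char_b))
    return diffs

# ISO 8859-1 (Latin-1) versus ISO 8859-15 (Latin-9)
for byte, latin1, latin9 in differing_bytes("iso-8859-1", "iso-8859-15"):
    print(f"0x{byte:02X}: {latin1!r} -> {latin9!r}")
```

Byte 0xA4 is the one that carries the Euro sign in ISO 8859-15; in ISO 8859-1 the same byte is the generic currency sign.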
The Windows 125x character sets are specific to Windows. They are similar, but not equal, to the ISO 8859 character sets. While ISO 8859 character sets do not specify characters in the 128..159 range, the Windows character sets do. Characters in the 0..127 range are identical to US-ASCII. Most but not all of the character assignments in the 160..255 range are the same as in ISO 8859.
Number | Name |
---|---|
1250 | Latin 2 |
1251 | Cyrillic |
1252 | Latin 1 |
1253 | Greek |
1254 | Latin 5 |
1255 | Hebrew |
1256 | Arabic |
1257 | Baltic |
1258 | Vietnamese |
874 | Thai |
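The difference in the 128..159 range is easy to demonstrate for Windows 1252 and ISO 8859-1. This sketch assumes Python's `cp1252` and `iso-8859-1` codecs as stand-ins for the two character sets; `decode_both` is a hypothetical helper name.

```python
def decode_both(raw):
    """Decode raw bytes as Windows 1252 and as ISO 8859-1; return both results."""
    return raw.decode("cp1252"), raw.decode("iso-8859-1")

# Byte 0x80 lies in the 128..159 range: a printable character on Windows,
# but only a control code position when read as ISO 8859-1.
win, iso = decode_both(b"\x80")
print(repr(win), repr(iso))

# In the 160..255 range the two character sets agree byte for byte.
same_upper_range = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("iso-8859-1")
    for b in range(160, 256)
)
print(same_upper_range)
```

Text that was written as Windows 1252 but is read as ISO 8859-1 therefore loses characters such as the Euro sign and the typographic quotes, while everything in the 160..255 range survives.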
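Every XML document, external parsed entity, or external DTD should begin with an XML or text declaration like the following; the encoding declaration is required whenever the encoding is anything other than UTF-8 or UTF-16: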
<?xml version="1.0" encoding="iso-8859-1" ?>
In the encoding attribute, you must declare the character set you will use for the rest of the document. You should use the IANA/MIME-Code from the table above.
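A short experiment shows why the declaration matters. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the sample document and the helper `parses_ok` are invented for illustration. The same ISO 8859-1 bytes are parsed once with and once without the encoding declaration.

```python
import xml.etree.ElementTree as ET

body = "<greeting>Gr\u00fc\u00dfe</greeting>"  # contains non-ASCII letters
declared = ('<?xml version="1.0" encoding="iso-8859-1" ?>' + body).encode("iso-8859-1")
undeclared = body.encode("iso-8859-1")  # same bytes, but no declaration

def parses_ok(data):
    """Return True if the parser accepts the byte string as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(parses_ok(declared))    # the parser decodes according to the declaration
print(parses_ok(undeclared))  # the parser assumes UTF-8 and meets invalid bytes
print(ET.fromstring(declared).text)
```

Without the declaration the parser falls back to UTF-8, and the ISO 8859-1 bytes for the umlaut and the sharp s are not valid UTF-8 sequences.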
In the head of an HTML document you should declare the character set you use for the document:
<head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> ... </head>
Without this declaration (and, BTW, without an additional DOCTYPE declaration), the W3C Validator will not be able to validate your HTML document.
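A program that reads such a page can look for the declaration before decoding the text. The following is a rough sketch built on Python's standard `html.parser`; the class `CharsetSniffer` and the function `declared_charset` are hypothetical names, and real browsers follow a considerably more elaborate detection algorithm.

```python
from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Remember the charset from the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=windows-1252"
        for part in (attributes.get("content") or "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip()

def declared_charset(html_bytes):
    """Return the declared charset of an HTML byte string, or None if there is none."""
    sniffer = CharsetSniffer()
    # Assumption: the charset is ASCII compatible, so the markup survives
    # a provisional decode as ISO 8859-1.
    sniffer.feed(html_bytes.decode("iso-8859-1"))
    return sniffer.charset

page = (b'<head> <meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1252"> </head>')
print(declared_charset(page))
```

Once the charset is known, the page bytes can be decoded with it, for example `page.decode(declared_charset(page))`.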
The Internet Assigned Numbers Authority (IANA) maintains a list of character sets and codes for them. This list is:
IANA-CHARSETS Official Names for Character Sets, http://www.iana.org/assignments/character-sets
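Many libraries accept the IANA names as labels for their own codecs. As an illustration, Python's `codecs.lookup` resolves a label to the codec it uses internally; the helper `python_codec_for` below is a made-up name, and which labels are recognized depends on the Python installation, not on the IANA list.

```python
import codecs

def python_codec_for(name):
    """Return Python's canonical codec name for a charset label, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

for label in ["US-ASCII", "UTF-8", "ISO-8859-1", "windows-1252", "no-such-charset"]:
    print(label, "->", python_codec_for(label))
```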
Stefan Heymann. Last Update 2012-06-04
This documentation is licensed under (choose your favorite): GPL, LGPL, CC, IDPL, GFDL, BSD, (did I forget one?)