The chapter on Character Encoding was probably my favorite chapter to write. I hadn't even included it as a chapter in my initial book proposal. Then I was at a customer site that was having some issues with UTF-8, and they asked me if I knew UTF-8. I said yes, and I was right. However, as I examined their problem I found that I knew less than I thought I did. I realized that while this 3.5-year PHP consultant knew about Unicode, UTF-8, and character encodings such as ISO-8859-1 or ISO-8859-7, I didn't understand them as well as I thought I had. With that I threw this chapter in the book. Many developers know about character encoding. Not as many truly understand it. In this chapter I try to de-mystify character encoding as a whole. In other words, it's not just something that messes up how your web pages look; it is a tool for you to use to make your site available. With this chapter you will learn the history of character encodings and why they're so messed up. Additionally, you will learn how UTF-8 actually works and how it's related to Unicode (it's not the same thing).
So, without further ado, here is your excerpt.
Chapter 1: Networking and Sockets
Chapter 2: Binary Protocols
Chapter 3: Character Encoding
Chapter 4: Streams
Chapter 5: SPL
Chapter 6: Asynchronous Operations with Some Encryption Thrown In
Chapter 7: Structured File Access
Chapter 8: Daemons
Chapter 9: Debugging, Profiling, and Good Development
Chapter 10: Preparing for Success
Computers know nothing about language. If you really get right down to it, they know nothing of numbers either. All they know is on and off. They can change the ons and offs based on other ons and offs. If certain ons are on, the processor will do one thing, and if they are off, it will do another. A computer is perfectly happy with this arrangement.
However, the purpose of a computer is to do work for a human. It is a machine whose purpose is to make a human’s job easier, or for that human to blow up zombies. Because of that, the computer has a problem. It doesn’t know how to talk to the human in its own language. For that reason, the humans needed to teach the computers how to talk with them.
In the early days, this was quite simple. Since computers were basically created in the English-speaking side of the world, English was a natural choice for the interface through which the humans and the computers could talk. Though if you read through some of the mainframe program calls, it is quite apparent that there was give and take in both directions, since the commands that the humans give the mainframe, or midrange, are actually quite cryptic… unless you know the language.
The reason for this is the limitation of resources. In the early days of computing, every bit was sacred, every bit was great. And so we started with what is called the American Standard Code for Information Interchange, or ASCII, as we now know it. It is a 7-bit code, which gave developers more than enough room to handle the 26 letters of the English alphabet plus some punctuation, special characters and control characters. So, 128 options (the values 0 through 127) for managing data.
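To make the size of that 7-bit space concrete, here is a minimal sketch that counts what fits into it. It assumes the conventional ASCII split, where the values 0x00 through 0x1f plus 0x7f (DEL) are control characters and everything else, including the space, is printable. The variable names are only illustrative.

```php
<?php
// A 7-bit code has 2^7 = 128 distinct values, numbered 0 through 127.
$total = 1 << 7;

$control = 0;
$printable = 0;
for ($code = 0; $code < $total; $code++) {
    // 0x00-0x1f and 0x7f (DEL) are control characters; the rest are printable.
    if ($code < 0x20 || $code === 0x7f) {
        $control++;
    } else {
        $printable++;
    }
}

printf("%d codes: %d control, %d printable\n", $total, $control, $printable);
// prints "128 codes: 33 control, 95 printable"
```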
So when you get right down to it, a character set is the translation for the computer. It translates the numbers that the computer stores into graphical characters. A computer cannot store a letter, but it can store a numeric representation of that letter. The character set is the standard of translation between the number and the graphical character.
For example, you cannot store the letter “A” on a computer. A computer doesn’t really know numbers and it REALLY doesn’t know letters. But numbers can have binary representations. You could give letters their own binary representation, but it would conflict with the binary representation of a number. Say that you had the number 12 and you wanted to add 4 to it. Would that bit pattern be the number 4 or the letter “d”? So, a computer’s primary method of communication is via numbers. The character set is the mapping between the numbers that the computer is able to work with and the characters that the human reads.
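As a small illustration of that mapping, PHP's built-in `ord()` and `chr()` functions move between a one-byte character and its number. This is only a sketch, and it assumes the script itself is saved in an ASCII-compatible encoding, which is what gives “A” the value 65 here.

```php
<?php
// The letter is stored as a number; ord() reveals which one.
$code = ord('A');               // 65 in ASCII
$bits = sprintf('%08b', $code); // the same number written as on/off bits
$letter = chr($code);           // and back to the character again

printf("%s is stored as %d (%s)\n", $letter, $code, $bits);
// prints "A is stored as 65 (01000001)"

// The number 4 and the letter "d" are different values to the computer,
// so adding "4" to 12 depends entirely on which one you meant.
$asNumber = 12 + 4;        // 16
$asLetter = 12 + ord('d'); // 112, because "d" is 100 in ASCII
```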
And so, the communication problem between the computer and the human is solved. Right? In the early days of computing, ASCII did not have the same prevalence that it has today. Another competitor was EBCDIC, or Extended Binary Coded Decimal Interchange Code. This was developed for the IBM 360 and, due to the 360’s popularity, EBCDIC became quite popular.
With that (or situations like it), character encoding issues began. Why was that? Remember that a letter is just a certain number. In ASCII, the number for the letter “A” is 65. In EBCDIC, the number for the letter “A” is 193. Why the difference? I have heard from some i5 (the IBM midrange integrated system) people that EBCDIC is good for punch cards in that the way the bits were laid out made for more durable cards. Before the web-based whipper-snappers (I am part of this crowd) go off on this, remember that modern Computer Science is a very new science. The problems that people in the ‘50s and ‘60s had to solve were basically the same, but they had different limitations. The first commercially available microprocessor was the Intel 4004, a 4-bit machine made available in 1971. Both ASCII and EBCDIC were available in 1963, 8 years before the 4004. In other words, punch cards were a necessary step for us to get to the point where we are today.
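The difference goes beyond a simple offset. In ASCII the uppercase letters are one contiguous run starting at 65, while in EBCDIC they sit in three separate runs (A-I, J-R and S-Z). The sketch below shows this side by side. `ebcdicUpper()` is a hypothetical helper written for this illustration, not a PHP built-in, and it only handles the 26 uppercase English letters.

```php
<?php
// Hypothetical helper: the EBCDIC code for an uppercase English letter.
function ebcdicUpper(string $letter): int
{
    $index = ord($letter) - ord('A'); // 0 for A through 25 for Z
    if ($index < 9) {
        return 0xc1 + $index;         // A-I occupy 0xc1-0xc9
    }
    if ($index < 18) {
        return 0xd1 + ($index - 9);   // J-R occupy 0xd1-0xd9
    }
    return 0xe2 + ($index - 18);      // S-Z occupy 0xe2-0xe9
}

foreach (['A', 'I', 'J', 'R', 'S', 'Z'] as $letter) {
    printf("%s: ASCII %d, EBCDIC %d\n", $letter, ord($letter), ebcdicUpper($letter));
}
```

The first line of output is `A: ASCII 65, EBCDIC 193`, matching the numbers above.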
But we still have the problem of how to represent characters. That is where character set conversion comes into play. Suppose you have one system, an IBM mainframe that started its life in the 1960s and has been upgraded through the years to a 390 and then to a Z or an i. You may have some applications on it that are still using EBCDIC. But you run on a computer that uses ASCII as its default character set and you need to read from the older computer.
To show an example of the problem, let’s take a look at what an EBCDIC encoded “hello world” looks like.
// EBCDIC byte values for "hello world", packed as unsigned chars
$str = pack( 'CCCCCCCCCCC',
    0x88, 0x85, 0x93, 0x93, 0x96, 0x40, 0xa6, 0x96,
    0x99, 0x93, 0x84
);
echo $str;
Figure 3.1 EBCDIC encoded character string
Remember that an encoded string is simply a sequence of numbers, where each number represents an individual character. When we run this code we get something a little different from what we intended.
The reason for this is simple. My browser assumes an ASCII-compatible encoding, and in that encoding those numbers stand for entirely different characters, so those are the characters that I asked it to print out. To show this, let’s change the code to its ASCII representation.
// ASCII byte values for "hello world", packed as unsigned chars
$str = pack( 'CCCCCCCCCCC',
    0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f,
    0x72, 0x6c, 0x64
);
echo $str;
Figure 3.2 ASCII encoded character string
It looks the same, except with different numbers. But when we print it out in our browser we get a different result.
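To see why the two strings render so differently, it can help to look at each byte and ask whether it falls inside the printable ASCII range (0x20 through 0x7e). The following is a minimal sketch. `describeBytes()` is a hypothetical helper written for this illustration, and bytes outside the printable range are shown as hex escapes rather than guessing how a given browser would draw them.

```php
<?php
// The same two byte sequences used in Figures 3.1 and 3.2.
$ebcdic = pack('C*', 0x88, 0x85, 0x93, 0x93, 0x96, 0x40, 0xa6, 0x96, 0x99, 0x93, 0x84);
$ascii  = pack('C*', 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x77, 0x6f, 0x72, 0x6c, 0x64);

// Hypothetical helper: one entry per byte, either the printable ASCII
// character or a \xNN escape when the byte is outside 0x20-0x7e.
function describeBytes(string $bytes): array
{
    $out = [];
    foreach (unpack('C*', $bytes) as $byte) {
        $out[] = ($byte >= 0x20 && $byte <= 0x7e)
            ? chr($byte)
            : sprintf('\\x%02x', $byte);
    }
    return $out;
}

echo implode(' ', describeBytes($ebcdic)), PHP_EOL;
echo implode(' ', describeBytes($ascii)), PHP_EOL;
```

Only one byte of the EBCDIC string, 0x40, lands in the printable ASCII range, and there it means “@” rather than a space. Every other byte is above 0x7f, which is outside 7-bit ASCII altogether.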
The question then is how do we get something that was returned to us in EBCDIC to display properly on a browser that does not support it? Generally, this is done with some kind of character set conversion. All that character set conversion does is replace each number in a string from one character set with the number that represents the same letter in the other character set. This is generally done with a conversion table.
A conversion table is a very simple concept. In PHP it’s also very simple to implement. Basically you have a numerical array of all characters that are convertible, starting at zero. Then for each numerical key of the array you have the corresponding value in the other character set that represents the same letter. So, for example, in ASCII the ordinal for the letter “h” is 0x68, or 104. In EBCDIC, the letter “h” is 0x88 or 136. So, basically, the conversion table for EBCDIC to ASCII will have, at key number 136, the number 104.
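The idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not a complete converter: the table covers only the space and the lowercase English letters, the function names are hypothetical, unmapped bytes are replaced with “?”, and the syntax assumes PHP 7.1 or later.

```php
<?php
// Partial EBCDIC-to-ASCII table: key = EBCDIC byte, value = ASCII byte.
function buildPartialEbcdicToAsciiTable(): array
{
    $table = [0x40 => 0x20]; // the space character

    // EBCDIC lowercase letters sit in three separate runs.
    $runs = [
        [0x81, 'a', 'i'],
        [0x91, 'j', 'r'],
        [0xa2, 's', 'z'],
    ];
    foreach ($runs as [$ebcdicStart, $first, $last]) {
        foreach (range(ord($first), ord($last)) as $offset => $asciiCode) {
            $table[$ebcdicStart + $offset] = $asciiCode;
        }
    }
    return $table;
}

// Look up each input byte in the table and emit the mapped byte.
function ebcdicToAscii(string $input, array $table): string
{
    $output = '';
    foreach (unpack('C*', $input) as $byte) {
        // Bytes missing from this partial table become "?" (0x3f).
        $output .= chr($table[$byte] ?? 0x3f);
    }
    return $output;
}

$table = buildPartialEbcdicToAsciiTable();
$ebcdic = pack('C*', 0x88, 0x85, 0x93, 0x93, 0x96, 0x40, 0xa6, 0x96, 0x99, 0x93, 0x84);

echo ebcdicToAscii($ebcdic, $table), PHP_EOL; // prints "hello world"
```

Key 136 of `$table` holds 104, which is exactly the “h” entry described above. A full table would have an entry for every convertible byte rather than just these 27.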