Computer programs are forever transforming data from one format to another, from Zip files, through Rich Text Format (RTF), to GuideML. This process causes no end of problems and, at first glance, it seems to have been dreamt up solely to make life difficult.
Well, maybe it was, but there's more to coding things than making life difficult for users. Coding helps you to store within a structure, store compactly, and store safely. Depending on which of these three things you need most, a different code should be used. Codes for transmission can be a separate thing altogether. Using the wrong code can be counter-productive; increasing space instead of decreasing it, or reducing reliability instead of increasing it.
Reasons for Coding
The first reason for coding is structure, and is something you either need or don't. Structure is what distinguishes plain old text, or long lists of numbers, from essays with fancy headers and footers, and nicely coloured charts.
The second important reason for creating a code is to store things more compactly. It might seem strange at first that you can store things in a smaller space than it takes simply to write them out, but it can be done, and the way it is done is by taking advantage of the fact that the source is unlikely to be totally random, but will have patterns in it. For example, in English a 'q' is almost always followed by a 'u'.
The third reason to create a code is to be able to detect errors that can occur when transmitting or storing a code. Really good codes can even automatically correct errors that have occurred, provided there aren't too many.
Finally, codes can be used to make information secure; so that only those who are meant to be able to read it can do so. This coding process is called encryption, and the task of obtaining the original information is called decryption.
It's worth mentioning that it's possible to apply more than one code to a given piece of data. If the answer to a question is emailed to a lecturer, it might first be given structure as a word file, then compressed as a zip file, then attached to the email (more structure), then compressed and encrypted before being sent over the network.
Structures Made of Bits and Bytes
Structuring data is needed for a variety of reasons. For example, the writing in a book is split into lines, paragraphs and chapters to make it more readable. A computer program might be split up into a number of procedures, each of which contains a number of commands, and each command will contain a number of expressions, and sub-commands.
A common method of structuring files is to simply declare that, for example, the first eight characters are the name of the file, and the final three are its type. Such a code relies on everyone sharing the same concept of a file, and on that concept never changing, so while it is efficient, it is also inflexible.
Sometimes structure can be added using 'control characters'. For example, in morse code the receiver needs to distinguish where one letter ends and the next begins. This is done by having a short pause between each letter. A longer pause occurs between each word, and a longer pause still between each sentence. This works because pauses are not dots and dashes, the symbols which make up normal morse code.
Creating Control Characters
If control characters are not already available, they can be created. In programming, a special character is needed to distinguish some text to be shown to the user from the rest of the program: the " character is normally used for this. This can cause problems if the programmer wishes to place his own " characters inside the string. The common solution is doubling: the programmer writes "" to represent a single literal " character. This frees up the lone " to be used as a control character.
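The doubling trick can be sketched in a few lines of Python (the function names here are invented for the example):

```python
def quote(text):
    # Double every embedded " so that a lone " can act as the string delimiter.
    return '"' + text.replace('"', '""') + '"'

def unquote(quoted):
    # Strip the outer delimiters and collapse each "" back to a single ".
    return quoted[1:-1].replace('""', '"')
```

Decoding simply reverses the process, so no information is lost.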
An alternative method is used by GuideML and HTML, amongst others: in order to free up the < and > symbols for use in tags, a writer who wishes to use them normally has to write &lt; and &gt;. This means that the & symbol must also be written differently - as &amp; - when used normally.
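A minimal sketch of this escaping in Python (the function name is invented):

```python
def escape_markup(text):
    # & must be replaced first, or the & of a freshly produced &lt;
    # would itself get escaped again.
    return (text.replace('&', '&amp;')
                .replace('<', '&lt;')
                .replace('>', '&gt;'))
```

The order of the replacements is the whole trick: escape the escape character first.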
These control characters, however they are obtained, can be used in a variety of ways, but there are really only two good methods: delimiters and tags. Delimiters are less powerful, but simpler, and take up less space. For example, in a shopping list one might write a comma between each item and its successor. In such a list the humble comma would be a control character, and the result would be called a comma-delimited list.
Similarly, morse code could be said to be pause-delimited. If a little more control is required, it's possible to use a number of delimiters - so in a table one delimiter may be used to signify a new column, and another to signify a new row.
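Both ideas take one line each in Python - here using ',' for columns and ';' for rows, purely as an example:

```python
shopping = 'eggs,milk,flour'
items = shopping.split(',')          # the comma is the control character

# Two delimiters give a table: ';' starts a new row, ',' a new column.
table = [row.split(',') for row in '1,2,3;4,5,6'.split(';')]
```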
Tags are used in GuideML, RTF, and a number of other formats. In GuideML, for example, writing <I>some text</I> is the coding for producing 'some text' in italics.
The clever bit is that tags can now be placed within one another, for example, a pair of <FOOTNOTE> tags can be placed within a pair of <HEADER> tags. This means that there is no limit on how complicated a structure can be.
XML is a metaformat which uses tags; it can be specialised into any sort of tagged markup language - GuideML, XHTML and many others - by means of different 'DTD's. Because of this flexibility, it is widely used for all kinds of tasks.
Compression

All this structure can be bulky, and it's often desirable to reduce the size of information, so that it takes less space to store and/or less time to send. So codes have been created that compress data. The common thread to all the methods is to spot patterns and eliminate them - a perfectly compressed file should appear to the casual observer to be completely random.
Morse code is a simple example of this. Because different characters occur with different frequencies, common characters can be sent with fewer bits - or fewer dots and dashes. Mr Morse spent a good deal of time selecting a plausible set of rules for converting letters to dots and dashes based on this observation.
Huffman codes use this good work, but extend it by first sending a table of what bits correspond to what letters, and then sending the encoded bits. The table is selected based on the statistics of the message to be sent, so if you use a lot of 'z's in an entry on Buzily Buzzing Beez, then it will still be nice and compact. In order to be adaptable, the table is normally resent about every 500,000 characters.
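The table-building step can be sketched in Python. This builds the classic Huffman tree by repeatedly merging the two least frequent nodes; the function name is invented for the example:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    tiebreak = count()          # keeps the heap comparable when weights tie
    heap = [(weight, next(tiebreak), symbol)
            for symbol, weight in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: recurse both ways
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                              # leaf: record the finished code
            codes[node] = prefix or '0'
    walk(heap[0][2], '')
    return codes
```

Common characters end up with short codes, rare ones with long codes, and no code is a prefix of any other - which is what lets the receiver tell where one code ends and the next begins.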
Move-to-front coding is the name for a family of variations on Huffman in which the characters with the shortest codes are those which have been used most recently, or most frequently in the particular message, and this changes dynamically, on a character-by-character basis. It is slower, but often compresses better.
Arithmetic coding is a clever idea that can be tricky to get one's head around. The idea starts off with the interval between zero and one, and the assignment of sub-intervals within it. So, 0.0 to 0.5 might be the letter 'e', 0.5 to 0.8 might be the letter 'g' and 0.8 to 1.0 might be the number '4'. These sub-intervals cover the entire character set we're working with, in proportion to how likely those characters are to appear. Once we have looked at the first letter, we take its sub-interval as our new interval, divide that up in the same way, and repeat the process, continually narrowing the interval. Eventually we end up with a highly precise interval, and any number within it encodes the entire message. Then we simply save such a number. Because of the way the spacing has been chosen, equally likely messages will take equal amounts of interval space, or, to put it another way, will take equal amounts of space to store.
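The interval-narrowing can be demonstrated with ordinary floating point, using exactly the example probabilities above. This is a toy: real arithmetic coders use integer arithmetic with renormalisation so that precision never runs out, whereas floats limit this sketch to short messages. The function names are invented:

```python
def arith_encode(message, probs):
    # Narrow [low, high) one letter at a time.
    low, high = 0.0, 1.0
    for letter in message:
        span = high - low
        cum = 0.0
        for symbol, p in probs.items():
            if symbol == letter:
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return (low + high) / 2          # any number inside the interval will do

def arith_decode(value, probs, length):
    # Re-run the same narrowing, picking whichever sub-interval holds the value.
    low, high = 0.0, 1.0
    out = []
    for _ in range(length):
        span = high - low
        cum = 0.0
        for symbol, p in probs.items():
            lo, hi = low + span * cum, low + span * (cum + p)
            if lo <= value < hi:
                out.append(symbol)
                low, high = lo, hi
                break
            cum += p
    return ''.join(out)
```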
Run length encoding is used in PCX files, and is a very simple idea. It is often the case that long runs of the same character occur, so replacing 'aaaaaa' by 'aa4' will save space. This is a very limited form of compression, but it can be useful.
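A minimal run-length sketch in Python - it records one (character, run length) pair per run, whereas a real format such as PCX packs these pairs into bytes with a flag to mark literal runs:

```python
from itertools import groupby

def rle_encode(text):
    # groupby splits the text into runs of identical characters
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(runs):
    return ''.join(ch * length for ch, length in runs)
```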
Lempel-Ziv coding takes advantage of the way that certain combinations of letters are much more common than others. For example, 'the' is far more common than 'eht'. It therefore uses a vastly extended character set - perhaps using 16 bits for each character, rather than eight. These extra characters each stand for a string of 'normal' characters, so a single character could stand for The Hitchhiker's Guide to the Galaxy, if that were a very common phrase in some piece of text.
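One member of this family, the LZW variant, fits in a short Python sketch. Every time a new combination is seen, it is added to the table as a fresh extended character; the decoder rebuilds exactly the same table as it goes, so the table itself never needs to be transmitted:

```python
def lzw_encode(text):
    table = {chr(i): i for i in range(256)}   # start with all single characters
    out, prefix = [], ''
    for ch in text:
        if prefix + ch in table:
            prefix += ch                      # keep growing the match
        else:
            out.append(table[prefix])
            table[prefix + ch] = len(table)   # invent a new extended character
            prefix = ch
    if prefix:
        out.append(table[prefix])
    return out

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # The odd case: a code may refer to the entry we are about to create.
        entry = table[code] if code in table else prev + prev[0]
        out.append(entry)
        table[len(table)] = prev + entry[0]
        prev = entry
    return ''.join(out)
```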
Burrows-Wheeler block compression is possibly the most complicated algorithm here, but it does seem to work well. The idea is that you generate all possible rotations of the message, and then sort them into alphabetical order. The message to be transmitted is the last letter of each rotation, in the sorted order. This message is exactly the same size as the original, but for various rather technical reasons, it is a lot easier to compress.
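The forward transform is only a couple of lines in Python. This sketch appends a unique end marker so the transform can be reversed; a real implementation records an index instead, and sorts far more cleverly than building every rotation in memory:

```python
def bwt(text):
    marked = text + '\0'        # unique end marker, assumed absent from the input
    rotations = sorted(marked[i:] + marked[:i] for i in range(len(marked)))
    return ''.join(rotation[-1] for rotation in rotations)
```

The output is just a reshuffling of the input - same characters, same length - but repeated letters tend to land next to one another, which is what makes it so much easier to compress afterwards.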
Dealing with Errors
Mistakes and problems are a fact of life. Murphy's Law applies to everything, but has an especial affinity for computers and related things. Sometimes we want to know that a problem has happened, sometimes we want to fix problems that have happened, and sometimes we want to make an attempt to stop them occurring in the first place. Naturally, all these jobs require different codes.
The simplest method is to send each piece of information a number of times - the receiver checks each copy and if there are any differences then there has been an error.
A more lightweight approach is called parity - in even parity the number of ones in each part of the message (normally every eight bits) is counted. If the number is even, then the parity bit we send is zero, otherwise the parity bit is one. Odd parity works the opposite way round.
Schemes such as these, where the information is sent 'in the clear', are called systematic. Because they can be decoded by simply looking at the message and stripping out some check bits, they are particularly valuable.
Generally, the more efficient such codes are, the larger the block size that must be used, and the slower they are to code and decode. Since there is a hard limit on how efficient such codes can be, a compromise must be reached somewhere. One exception to this block-based approach is CRC. Though it is described here in terms of division, it can also be performed directly in hardware, which is faster.
In CRC each message is considered as a polynomial, so 1101 becomes 1x^3 + 1x^2 + 0x + 1, and so forth. Then, if we desire three check bits, we shift the message by three places (to become 1101000), and find the remainder when we divide this shifted message by a special polynomial whose degree equals the number of check bits. If we choose the special polynomial well, we can detect quite a lot of errors this way. It is especially good for burst errors, which occur a lot in network traffic.
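The division is ordinary polynomial long division, except that adding and subtracting coefficients both become XOR. A sketch in Python, working on bit strings (the function name is invented):

```python
def crc_remainder(message, poly):
    # message and poly are bit strings; a poly of degree n gives n check bits
    n = len(poly) - 1
    bits = list(message + '0' * n)       # shift the message left by n places
    for i in range(len(message)):
        if bits[i] == '1':               # XOR the divisor in at this position
            for j, p in enumerate(poly):
                bits[i + j] = '1' if bits[i + j] != p else '0'
    return ''.join(bits[-n:])            # what is left over is the remainder
```

Dividing the shifted message 1101000 from the text by the polynomial 1011 (x^3 + x + 1) leaves the three check bits 001.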
The simplest method of correction is like the simplest method of detection - you send each piece of information a number of times, but this time you use some kind of majority voting scheme if the different versions disagree. The most popular version is the one that is accepted as valid. Democracy in action!
Hamming codes are able to correct single-bit errors - so they are rather like an extension to parity. The essence of the code is that for every seven bits sent, four carry data and three are check bits, each computed as the XOR - that is, the parity - of a different subset of three of the data bits. There are also larger versions of the Hamming code which work for blocks of size 15, 31, 63, 127, and so on; extended versions which can additionally detect two-bit errors; and the Golay code, which can correct up to three-bit errors.
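Here is the (7,4) version in Python, using the conventional layout in which the check bits sit at positions 1, 2 and 4 of the codeword (the function names are invented). The clever part is the decoder: the three recomputed parities, read as a binary number, point directly at the position of the flipped bit:

```python
def hamming_encode(d1, d2, d3, d4):
    # each check bit is the parity of a different subset of the data bits
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_correct(received):
    c = received[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit; 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]   # the four data bits
```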
Single step codes attempt to reduce the chance of errors occurring in the first place - or at least one particular type of error. In these codes, adjacent numbers differ in only a single bit. So, in binary (using the Gray code), you might count: 000, 001, 011, 010, 110, 111, 101, 100.
Notice how you only ever change a single zero or one between each number.
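The binary-reflected Gray code has a remarkably short recipe - XOR each number with itself shifted right by one:

```python
def to_gray(n):
    # binary-reflected Gray code: adjacent integers differ in exactly one bit
    return n ^ (n >> 1)

gray_sequence = [format(to_gray(i), '03b') for i in range(8)]
```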
The error they solve can be seen when you look at poorly designed digital clocks - although the same problem occurs when measuring any physical quantity, such as length or angle. Often, such clocks will display a sequence of times something like 12:19, 12:10, 12:20 - or 12:19, 12:29, 12:20 - depending on which digit happens to change first.
In both cases the clock is a minute out for a very short period of time - not a problem normally - but if you were taking a photograph, and were unlucky, you might have the wrong time captured. Now imagine that photo becomes evidence in a criminal prosecution...
Cryptography

Cryptography is the study of what codes should be used to pass messages from Alice to Bob without a third party (Charlie) being able to find out what they say. There are essentially two types of code which can do this: public key and private key.
For all encryption you need two algorithms - one which takes a key and some plain text and turns it into unreadable ciphertext, and one which takes a key and some ciphertext and turns it into plain text. In symmetric key cryptography, you must use the same key at both encryption and decryption (hence its other name - single key encryption). In public key cryptography keys come in pairs, and a message encrypted with one key of the pair must be decrypted with the other key of the pair. Generally, one of the keys will be published, and anyone can access it, while the other will remain private.
There are a number of methods that can be used to do symmetric key encryption, and each has advantages and disadvantages. They also share a common problem: both sender and receiver have to know the same key. This is naturally tricky - the reason we need to encrypt is that we can't transmit anything securely, so transmitting a key runs into exactly the same problems.
In practice some sort of secure channel is set up solely to transmit the first key. The first key is then used to exchange a second, larger, key, and the larger key is used to hold the actual conversation.
The simplest type of algorithm is one that simply converts each letter to a different letter, so A might go to F, B to Z, and C to A. There are 26 x 25 x 24 x ... possible ways to do this, each of which corresponds to one key. This means there are about 4 x 10^26 possible keys - roughly 88 bits' worth. If more bits are required, the same could be done with a map between pairs of letters, so AA maps to CD, while AB maps to FE, etc. This extended version would have a key of several thousand bits.
This might seem more than enough to be utterly secure, but there are weaknesses. The code-breaker can use frequency analysis to see which patterns are the most common, and work from that. In English the letter E is the most common, so if the letter Z is the most common in the coded text, then chances are that the key includes E -> Z.
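Such a cipher is easy to sketch in Python - a key is just a shuffled alphabet (the function names, and the fixed seed used to make the example repeatable, are this sketch's own):

```python
import random
import string

def random_key(seed):
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def substitute(text, key):
    # characters outside the key (spaces, digits) pass through unchanged
    return ''.join(key.get(ch, ch) for ch in text)

def invert(key):
    # swapping each pair round gives the decryption key
    return {v: k for k, v in key.items()}
```

The frequency-analysis attack described above amounts to counting the ciphertext letters (collections.Counter does the job) and matching the most common against E, T, A and so on.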
One Time Pad
One 'ideal' single key algorithm, in the sense that it can never be cracked, is the 'one time pad'. The idea is that both sender and receiver have a very long list of numbers, randomly distributed between 0 and 25. Of course, both these lists must be identical, or it won't work - the lists are effectively the keys. The method is called a one time pad because it is only secure if each list is used only once. As soon as a list is used more than once, the likelihood of the code being cracked, using frequency analysis or suchlike, increases.
Each letter is then encrypted by replacing it with the letter that is a certain number further on in the alphabet, so 'a' and the number 4 would turn into 'e', while 'z' and the number 2 would turn into 'b'. But where do the numbers come from? From the very long list, of course - the first letter is encrypted with the first number, the second letter with the second number, and so on. This means it is necessary to have a key that is as long as the message we want to send.
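The shifting rule above, including the wrap-around from 'z' back to 'a', looks like this in Python (the function name is invented; lower-case letters only, for brevity):

```python
def one_time_pad(text, pad, direction=1):
    # shift each letter forward by its pad number; direction=-1 undoes the shift
    return ''.join(
        chr((ord(ch) - ord('a') + direction * shift) % 26 + ord('a'))
        for ch, shift in zip(text, pad)
    )
```

The receiver decrypts by shifting the same amounts backwards with the identical pad.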
There are many types of encryption methods which attempt to simulate the one time pad, but without requiring the huge list. Typically such methods will generate the random numbers on the fly, and once set up with the correct initial seed will churn out the same sequence wherever they are run.
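The shared-seed idea can be shown with the standard library's random module - though note that this generator is not cryptographically secure, so a real stream cipher would use a much stronger one:

```python
import random

def keystream(seed, length):
    # both ends run the same generator from the same shared seed,
    # so they derive identical pads without ever transmitting one
    rng = random.Random(seed)
    return [rng.randrange(26) for _ in range(length)]
```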
Another clever variation is a code based on both sender and receiver having access to the same book (or some other long list of words, but preferably not one with a lot of structure, like a dictionary or thesaurus). The words to be transmitted are then coded up something like 42, 8, 3 for page 42, line eight, third word along. That word is then crossed out of both the sender and receiver's copy of the book, and that particular word can never be used again.
Public Key Cryptography
Public Key systems veer near to the realm of magic - they work, and there are very good mathematical reasons why they work, but you can't understand these things without having a beard and wearing sandals. The method in most widespread use is PGP.
PGP, and other similar systems, generate pairs of complementary keys. The first can decrypt messages encrypted by the second, and the second can decrypt messages encrypted by the first. And you can't work out one key of the pair based only on the other. To use the scheme you keep the first absolutely secret, and let nobody touch it, while the second you upload all over the place, print on business cards, and so forth.
If somebody wants to send you a message, they encrypt it with the second (public) key, and you can then decrypt it with the first (private) key. If you want to send a message and make sure that nobody can tamper with it, you encrypt it with the private key, and the receiver decrypts it with the public key. It would seem that having the public key, and a variety of messages both in plain text and encrypted by the private key, would make the code easy to break, but this seems not to be the case. Though a lot of people have tried...
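Systems like PGP are built on algorithms such as RSA, and the two-way property can be demonstrated with textbook RSA on tiny hard-coded primes. This is purely an illustration - real keys use primes hundreds of digits long - and the function names are invented (pow with a negative exponent needs Python 3.8 or later):

```python
def make_keys():
    p, q = 61, 53
    n = p * q                    # the modulus, 3233, published with both keys
    phi = (p - 1) * (q - 1)
    e = 17                       # public exponent, chosen coprime to phi
    d = pow(e, -1, phi)          # private exponent, the modular inverse of e
    return (e, n), (d, n)

def apply_key(number, key):
    exponent, n = key
    return pow(number, exponent, n)
```

Encrypting with one key and decrypting with the other works in both directions, which is exactly what makes both secrecy and tamper-proof signatures possible.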