The Caesar Cipher
One of the simplest examples of a substitution cipher is the Caesar cipher, which is said to have been used by Julius Caesar to communicate with his army. Caesar is considered to be one of the first people to have employed encryption to secure messages. He chose to shift each letter in the message as his standard algorithm, informed all of his generals of this decision, and was then able to send them secured messages. Using the Caesar Shift (3 to the right), the message,
would be encrypted as,
In this example, 'R' is shifted to 'U', 'E' is shifted to 'H', and so on. Now, even if the enemy did intercept the message, it would be useless, since only Caesar's generals could read it.
Thus, the Caesar cipher is a shift cipher since the ciphertext alphabet is derived from the plaintext alphabet by shifting each letter a certain number of spaces. For example, if we use a shift of 19, then we get the following pair of ciphertext and plaintext alphabets:
Plaintext: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Ciphertext: T U V W X Y Z A B C D E F G H I J K L M N O P Q R S
To encipher a message, we perform a simple substitution by looking up each of the message's letters in the top row and writing down the corresponding letter from the bottom row. For example, the message
THE FAULT, DEAR BRUTUS, LIES NOT IN OUR STARS BUT IN OURSELVES.
would be enciphered as
MAX YTNEM, WXTK UKNMNL, EBXL GHM BG HNK LMTKL UNM BG HNKLXEOXL.
Essentially, each letter of the alphabet has been shifted nineteen places ahead in the alphabet, wrapping around the end if necessary. Notice that punctuation and blanks are not enciphered but are copied over as themselves.
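The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name caesar_encipher is made up for this example, and it assumes the message uses only the 26 unaccented English letters (lowercase input is converted to uppercase).

```python
import string

ALPHABET = string.ascii_uppercase  # 'A' through 'Z'

def caesar_encipher(message: str, shift: int) -> str:
    """Shift each letter forward by `shift` places, wrapping around past Z."""
    result = []
    for ch in message.upper():
        if ch in ALPHABET:
            # Look up the letter's position and move it ahead, modulo 26.
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            # Punctuation and blanks are copied over as themselves.
            result.append(ch)
    return "".join(result)

print(caesar_encipher("THE FAULT, DEAR BRUTUS, LIES NOT IN OUR STARS BUT IN OURSELVES.", 19))
# MAX YTNEM, WXTK UKNMNL, EBXL GHM BG HNK LMTKL UNM BG HNKLXEOXL.
```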
Breaking a Caesar Cipher by Hand
Can a computer guess what shift was used in creating a Caesar cipher? The answer, of course, is yes. But how does it work?
The unknown shift is one of 26 possible shifts. One technique might be to try each of the 26 possible shifts and check which of these results in readable English text. But this approach has limitations. For one thing, how would the computer recognize "readable English text"? For another, what if multiple Caesar shifts were used, as is the case for a Vigenère cipher, where each letter of the keyword provides the basis for a Caesar shift? That is, if the keyword is bam, then every third letter of the plaintext starting at the first would be shifted by 'b' (=1), every third letter beginning at the second would be shifted by 'a' (=0), and every third letter beginning at the third would be shifted by 'm' (=12). In that case no single shift will produce readable English text.
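To make the keyword example concrete, here is a rough Python sketch of a Vigenère-style encipherment. The function name vigenere_encipher and the sample plaintext ATTACK are chosen only for illustration. The sketch assumes the keyword contains only letters, and that the keyword position advances only when a letter (not a blank or punctuation mark) is enciphered.

```python
def vigenere_encipher(message: str, keyword: str) -> str:
    """Apply a repeating sequence of Caesar shifts taken from the keyword."""
    # 'a' -> 0, 'b' -> 1, ..., 'm' -> 12, and so on.
    shifts = [ord(k) - ord("a") for k in keyword.lower()]
    result = []
    position = 0  # counts letters enciphered so far
    for ch in message.upper():
        if "A" <= ch <= "Z":
            shift = shifts[position % len(shifts)]
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            position += 1
        else:
            result.append(ch)  # non-letters are copied unchanged
    return "".join(result)

# With the keyword "bam" the shifts cycle through 1, 0, 12.
print(vigenere_encipher("ATTACK", "bam"))  # BTFBCW
```

Because neighbouring letters receive different shifts, trying a single shift on the whole ciphertext does not recover the plaintext.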
A better approach makes use of statistical data about English letter frequencies. It is known that in a text of 1000 letters, the letters of the English alphabet occur with approximately the following frequencies:
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
73 | 9 | 30 | 44 | 130 | 28 | 16 | 35 | 74 | 2 | 3 | 35 | 25 | 78 | 74 | 27 | 3 | 77 | 63 | 93 | 27 | 13 | 16 | 5 | 19 | 1 |
Suppose we are given the following enciphered message:
K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ.
We can tally the frequencies of the letters in this enciphered message as follows:
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
0 | 1 | 2 | 4 | 3 | 0 | 0 | 0 | 3 | 0 | 4 | 1 | 0 | 4 | 1 | 4 | 3 | 1 | 6 | 0 | 0 | 4 | 0 | 7 | 5 | 0 |
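A tally like this is tedious to produce by hand. As a small sketch, Python's collections.Counter can do the counting. The helper name tally_letters is hypothetical, and the code assumes only the letters A to Z should be counted.

```python
from collections import Counter
import string

def tally_letters(text: str) -> dict:
    """Count how often each letter A-Z occurs, ignoring everything else."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    # Include letters that never occur, so every letter has an entry.
    return {letter: counts[letter] for letter in string.ascii_uppercase}

ciphertext = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
tally = tally_letters(ciphertext)

# Print the two rows of the table: letters, then counts.
print(" | ".join(string.ascii_uppercase))
print(" | ".join(str(tally[letter]) for letter in string.ascii_uppercase))
```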
Now we can shift the two tallies so that the large and small frequencies from each frequency distribution roughly match up. For example, if we try a shift of ten on the previous example, we get the following correspondence between English language frequencies and the letter frequencies in the message:
English Language Frequencies
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
73 | 9 | 30 | 44 | 130 | 28 | 16 | 35 | 74 | 2 | 3 | 35 | 25 | 78 | 74 | 27 | 3 | 77 | 63 | 93 | 27 | 13 | 16 | 5 | 19 | 1 |
Enciphered Message Frequencies
K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | A | B | C | D | E | F | G | H | I | J |
4 | 1 | 0 | 4 | 1 | 4 | 3 | 1 | 6 | 0 | 0 | 4 | 0 | 7 | 5 | 0 | 0 | 1 | 2 | 4 | 3 | 0 | 0 | 0 | 3 | 0 |
Note that in this case the large frequencies for cipher X and Y correspond to large frequencies for English N and O, and the bare spots for cipher T and U correspond to bare spots for English J and K. Also, an isolated large frequency for cipher S corresponds to a similar one for English I. In view of this evidence we needn't worry too much about the drastic mismatch for English E, which is usually the most frequent letter in a sample of English text.
If we now apply this substitution to the message we get:
A TALE TOLD BY AN IDIOT, FULL OF SOUND AND FURY, SIGNIFYING NOTHING.
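Applying the substitution amounts to shifting every cipher letter back by ten places. A minimal sketch follows. The function name caesar_decipher is an assumption made for this example, and only the letters A to Z are shifted.

```python
def caesar_decipher(ciphertext: str, shift: int) -> str:
    """Undo a Caesar shift by moving each letter back `shift` places."""
    plain = []
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            # Subtract the shift; the modulo wraps letters near A around to Z.
            plain.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            plain.append(ch)  # blanks and punctuation stay as they are
    return "".join(plain)

ciphertext = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
print(caesar_decipher(ciphertext, 10))
# A TALE TOLD BY AN IDIOT, FULL OF SOUND AND FURY, SIGNIFYING NOTHING.
```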
Using the Chi-Square Statistic
The chi-square statistic allows us to compare how closely a shift of the English frequency distribution matches the frequency distribution of the secret message. Here's an algorithm for computing the chi-square statistic:
- Let ef(c) stand for the English frequency of letter c of the alphabet
- Let mf(c) stand for the frequency of letter c in the message
- For each possible shift s between 0 and 25:
  - For each letter c of the alphabet:
    - Compute the square of mf((c + s) mod 26) divided by ef(c)
  - Add up these 26 values to get the score for shift s
- The shift with the smallest score gives the closest match between the two distributions
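The steps above can be sketched in Python as follows. This is an illustrative sketch under a few assumptions: the names shift_score and guess_shift are invented here, the English frequencies are the per-1000 values from the table earlier in this section, and the shift with the smallest sum is taken as the best guess. The sum of mf squared over ef is not Pearson's chi-square itself, but it ranks the shifts in the same order, since the two differ only by a positive scale factor and a constant once the message length is fixed.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase

# English letter frequencies per 1000 letters, taken from the table in the text.
ENGLISH_FREQ = dict(zip(ALPHABET, [
    73, 9, 30, 44, 130, 28, 16, 35, 74, 2, 3, 35, 25,
    78, 74, 27, 3, 77, 63, 93, 27, 13, 16, 5, 19, 1,
]))

def shift_score(message_counts, shift: int) -> float:
    """Sum of mf((c + shift) mod 26)^2 / ef(c) over all letters c."""
    total = 0.0
    for c, letter in enumerate(ALPHABET):
        mf = message_counts[ALPHABET[(c + shift) % 26]]
        total += mf * mf / ENGLISH_FREQ[letter]
    return total

def guess_shift(ciphertext: str) -> int:
    """Return the shift whose score is smallest (the closest match)."""
    counts = Counter(ch for ch in ciphertext.upper() if ch in ALPHABET)
    return min(range(26), key=lambda s: shift_score(counts, s))

ciphertext = "K DKVO DYVN LI KX SNSYD, PEVV YP CYEXN KXN PEBI, CSQXSPISXQ XYDRSXQ."
print(guess_shift(ciphertext))  # 10
```

For very short messages the letter counts are too sparse to be reliable, so the lowest-scoring shift is best treated as a candidate to check rather than a guaranteed answer.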