Caesar Cipher Breaking with Least Squares
The Caesar Shift Cipher encrypts the text message in such a way that each letter is substituted by another, next in the alphabet.
For example, if the alphabet shift value, or key is
A is substituted with
G etc. So the
CROW will be encrypted as
Though simple, this method could be easily extended to create more complicated systems. For example
Vigenere cipher is really just a set of several
Caesar's ciphers with different keys - so that with set of
4 ciphers the first one is applied, for example to
1-st, 5-th, 9-th, ... letters of the original message, another cipher is applied to
2-nd, 6-th, 10-th, ... letters,
third is applied to
3-rd, 7-th... and the last is applied to
This article discusses breaking of such encryption system. Thought cipher itself is not practical nowadays due to its simplicity, understanding of its automated decryption provides several ideas important in wide range of code-breaking tasks and other problems.
Suppose we intercept an encrypted message, like the following:
IBBIKS IB BPM LIEV EQBP IUJCAP QV BPM KIVGWV
But of course no one told us the key which should be used for deciphering.
Surely, since only
25 values for key are possible (from
25), we can try each of them and check whether
the result looks like proper phrase.
However, it becomes too complicated if we need to process more messages. For example, if each sentence of the message is encrypted with different key, or with Vigenere cipher described above - the task will require too much efforts.
The other problem is that we are going to check result manually to see whether it is normal text or not. Of course using computers we need some more clever idea.
If you read The Gold-Bug short story by Edgar Allan Poe, you remember the "single-substitution" cipher was broken
starting with the fact that
E is the most frequent letter in English. Searching for the most frequent symbol in the
code gives the hint about which one was used for
However, naive implementation of this approach encounters the significant trouble, especially with short messages - letter frequency distribution may not be idea. For example encrypted message above was created with phrase:
ATTACK AT THE DAWN WITH AMBUSH IN THE CANYON
Where we have six
A-s, and six
H-s and only two
E-s. Obviously trying different letters to be
not better than simply trying all
25 keys one by one.
Luckily, letter-frequency based approach could be significantly improved for Caesar's Cipher, since we can use not only most frequent letters in our calculations.
How it works?
Suppose we write down an array of average letter frequencies in English text (in percents):
# average letter frequencies in English [8.1, 1.5, 2.8, 4.3, 13.0, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77, 7.0, 2.4, 6.8, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98, 2.4, 0.15, 2.0, 0.074]
Here the first value is frequency of letter
A, the fifth one - of the letter
E and so on.
Of course in a given message (plain-text, not encrypted) letter distribution will not be exactly the same, but it will
anyway be similar. For example let us calculate frequencies in the phrase above (
Attack at dawn...):
# letter frequencies in real message [16.7, 2.8, 5.6, 2.8, 5.6, 0, 0, 11.1, 5.6, 0, 2.8, 0, 2.8, 8.3, 2.8, 0, 0, 0, 2.8, 16.7, 2.8, 0, 2.8, 0, 2.8, 0]
Wee see, that for some letters difference is significant - about
E, however, for most of letters
discrepancy is not that large. Let us subtract ideal frequencies from real ones to see this difference:
# difference between real and ideal frequencies: [8.6, 1.3, 2.8, -1.5, -7.4, -2.2, -2.0, 5.0, -1.4, -0.15, 2.0, -7.0, 0.4, 1.5, -4.7, -1.9, -0.095, -6.0, -3.5, 7.6, 0.0, -0.98, 0.4, -0.15, 0.8, -0.074]
Wee see that for many letters difference is small enough (by absolute value):
0 - 1 %for
1 - 2 %for
2 - 5 %for
21 letters of
26 the difference is less or equal to
Now, what happens if we encode some message with Caesar's Cipher and calculate letter frequencies?
Obviously, the array of frequencies will be alike one mentioned above, but shifted by several positions (according to
cipher key). For example, taking encrypted phrase above (
Ibbiks ib bpm...) we get the following distribution:
# letters distribution in the encrypted message [2.8, 16.7, 2.8, 0, 2.8, 0, 2.8, 0, 16.7, 2.8, 5.6, 2.8, 5.6, 0, 0, 11.1, 5.6, 0, 2.8, 0, 2.8, 8.3, 2.8, 0, 0, 0]
You see, now most frequent letters are
P etc... What useful info can we get out of this distribution?
Let us calculate difference with ideal frequency distribution, like we did it before for plain-text message:
# difference of letter frequencies between encrypted message and ideal distribution [-5.3, 15.2, 0.0, -4.3, -10.2, -2.2, 0.8, -6.1, 9.7, 2.7, 4.8, -4.2, 3.2, -6.8, -7.5, 9.2, 5.5, -6.0, -3.5, -9.1, 0.0, 7.32, 0.4, -0.15, -2.0, -0.074]
Look, there are very big discrepancies (up to
15%). And if we summarize we get the following:
0 - 1 %for
1 - 2 %for
2 - 5 %for
And now in total only
14 letters hit the difference of
5% or lower.
Now we can invent the straightforward algorithm:
1position (remove value from end and add to beginning)
The only question is - how to calculate the difference between arrays?
We can use very popular and simple method, similar to root mean square:
For example, calculating such metric for letter distribution in plain-text message above we get
365, while with
the encrypted message above we get much higher value of
982. And if we will rotate the array of frequencies for
the encrypted message we'll find that minimal value of
365 is achieved after
8 steps, which means that encryption
was performed with shifting each letter by
Raising to power
2 on second step has two goals:
225points to resulting difference), so larger errors are "fined severely".
Obviously the discussed algorithm has
O(N) time complexity - the time is mosty spent on calculating frequencies,
while analysis time does not depend on message size.
If you want to have a practice on this algorithm, see the task Caesar Cipher Cracker there you will find that the method allows to decipher even very short messages.
Of course this approach could be extended to work with Vigenere Cipher and also could be used to check the type of encryption.