Contents
## Caesar Cipher Breaking with Least Squares |

The Caesar Shift Cipher encrypts the text message in such a way that each letter is substituted by another, next in the alphabet.

For example, if the alphabet shift value, or **key** is `2`

then `A`

is substituted with `C`

, `E`

with `G`

etc. So the
word `CROW`

will be encrypted as `ETQY`

.

Though simple, this method could be easily extended to create more complicated systems. For example
Vigenere cipher is really just a set of several
**Caesar's ciphers** with different keys - so that with set of `4`

ciphers the first one is applied, for example to
`1-st, 5-th, 9-th, ...`

letters of the original message, another cipher is applied to `2-nd, 6-th, 10-th, ...`

letters,
third is applied to `3-rd, 7-th...`

and the last is applied to `4-th, 8-th...`

.

This article discusses breaking of such encryption system. Thought cipher itself is not practical nowadays due to its simplicity, understanding of its automated decryption provides several ideas important in wide range of code-breaking tasks and other problems.

Suppose we intercept an encrypted message, like the following:

```
IBBIKS IB BPM LIEV EQBP IUJCAP QV BPM KIVGWV
```

But of course no one told us the **key** which should be used for deciphering.

Surely, since only `25`

values for key are possible (from `1`

to `25`

), we can try each of them and check whether
the result looks like proper phrase.

However, it becomes too complicated if we need to process more messages. For example, if each sentence of the message is encrypted with different key, or with Vigenere cipher described above - the task will require too much efforts.

The other problem is that we are going to check result manually to see whether it is normal text or not. Of course using computers we need some more clever idea.

If you read **The Gold-Bug** short story by Edgar Allan Poe, you remember the "single-substitution" cipher was broken
starting with the fact that `E`

is the most frequent letter in English. Searching for the most frequent symbol in the
code gives the hint about which one was used for `E`

.

However, naive implementation of this approach encounters the significant trouble, especially with short messages - letter frequency distribution may not be idea. For example encrypted message above was created with phrase:

```
ATTACK AT THE DAWN WITH AMBUSH IN THE CANYON
```

Where we have six `A`

-s, and six `T`

-s, four `H`

-s and only two `E`

-s. Obviously trying different letters to be `E`

is
not better than simply trying all `25`

keys one by one.

Luckily, letter-frequency based approach could be significantly improved for Caesar's Cipher, since we can use not only most frequent letters in our calculations.

**How it works?**

Suppose we write down an array of average letter frequencies in English text (in percents):

```
# average letter frequencies in English
[8.1, 1.5, 2.8, 4.3, 13.0, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77, 7.0, 2.4,
6.8, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98, 2.4, 0.15, 2.0, 0.074]
```

Here the first value is frequency of letter `A`

, the fifth one - of the letter `E`

and so on.

Of course in a given message (plain-text, not encrypted) letter distribution will not be exactly the same, but it will
anyway be similar. For example let us calculate frequencies in the phrase above (`Attack at dawn...`

):

```
# letter frequencies in real message
[16.7, 2.8, 5.6, 2.8, 5.6, 0, 0, 11.1, 5.6, 0, 2.8, 0, 2.8,
8.3, 2.8, 0, 0, 0, 2.8, 16.7, 2.8, 0, 2.8, 0, 2.8, 0]
```

Wee see, that for some letters difference is significant - about `8%`

for `A`

and `E`

, however, for most of letters
discrepancy is not that large. Let us subtract **ideal** frequencies from **real** ones to see this difference:

```
# difference between real and ideal frequencies:
[8.6, 1.3, 2.8, -1.5, -7.4, -2.2, -2.0, 5.0, -1.4, -0.15, 2.0, -7.0, 0.4, 1.5,
-4.7, -1.9, -0.095, -6.0, -3.5, 7.6, 0.0, -0.98, 0.4, -0.15, 0.8, -0.074]
```

Wee see that for many letters difference is small enough (by absolute value):

`0 - 1 %`

for`9`

letters;`1 - 2 %`

for`7`

letters;`2 - 5 %`

for`5`

letters.

I.e. for `21`

letters of `26`

the difference is less or equal to `5%`

.

Now, what happens if we encode some message with Caesar's Cipher and calculate letter frequencies?

Obviously, the array of frequencies will be alike one mentioned above, but shifted by several positions (according to
cipher **key**). For example, taking encrypted phrase above (`Ibbiks ib bpm...`

) we get the following distribution:

```
# letters distribution in the encrypted message
[2.8, 16.7, 2.8, 0, 2.8, 0, 2.8, 0, 16.7, 2.8, 5.6, 2.8, 5.6,
0, 0, 11.1, 5.6, 0, 2.8, 0, 2.8, 8.3, 2.8, 0, 0, 0]
```

You see, now most frequent letters are `B`

, `I`

, `P`

etc... What useful info can we get out of this distribution?

Let us calculate difference with ideal frequency distribution, like we did it before for plain-text message:

```
# difference of letter frequencies between encrypted message and ideal distribution
[-5.3, 15.2, 0.0, -4.3, -10.2, -2.2, 0.8, -6.1, 9.7, 2.7, 4.8, -4.2, 3.2,
-6.8, -7.5, 9.2, 5.5, -6.0, -3.5, -9.1, 0.0, 7.32, 0.4, -0.15, -2.0, -0.074]
```

Look, there are very big discrepancies (up to `15%`

). And if we summarize we get the following:

`0 - 1 %`

for`6`

letters;`1 - 2 %`

for`1`

letters;`2 - 5 %`

for`7`

letters.

And now in total only `14`

letters hit the difference of `5%`

or lower.

**Core idea**

Now we can invent the straightforward algorithm:

- Calculate array of letter frequencies for encrypted message
- Check the
**difference**with ideal letter frequency distribution **Rotate**the array by`1`

position (remove value from end and add to beginning)- Repeat from
`2-nd`

step for`25`

times - Of all
**rotations**of an array choose one for which**difference**is minimal - The amount of steps by which original array was rotated is, really, the value of the cipher
**key**

The only question is - how to calculate the **difference** between arrays?

We can use very popular and simple method, similar to root mean square:

- calculate differences for each value of array (as we done before);
- raise each value to power
`2`

; - get average of all resulting values.

For example, calculating such metric for letter distribution in plain-text message above we get `365`

, while with
the encrypted message above we get much higher value of `982`

. And if we will rotate the array of frequencies for
the encrypted message we'll find that minimal value of `365`

is achieved after `8`

steps, which means that encryption
was performed with shifting each letter by `8`

positions.

Raising to power `2`

on second step has two goals:

- it allows to treat negative and positive differences in the same way;
- larger discrepances add extremely big values to resulting difference (i.e. discrepancy of
`15`

for letter`B`

will add`225`

points to resulting difference), so larger errors are "fined severely".

Obviously the discussed algorithm has `O(N)`

time complexity - the time is mosty spent on calculating frequencies,
while analysis time does not depend on message size.

If you want to have a practice on this algorithm, see the task Caesar Cipher Cracker there you will find that the method allows to decipher even very short messages.

Of course this approach could be extended to work with **Vigenere Cipher** and also could be used to check the type
of encryption.