You see, the truth is I am not a cryptography geek at all. But it is happening that from time to time some question suddenly arises right in front of my eyes. And I just literally can't stop thinking until I would find an answer. You know this feeling, aren't you? It happened that one of things which recently did not give me a chance to sleep was closely related to cryptography.
I was quite curious about the data encryption algorithms. I asked myself: ok, text or data are basically complex patterns with some "regularities". So what about the data after encryption? Maybe there are some regularities too? If yes - what are they and how they may be discovered? Ciphered data we are often dealing with - are just bunch of letters and/or numbers. I don't know about you, but A4 page full with strange characters is not specially informative for me. ;) I would be good to find a way to make data well... more "visual". Did I say visual... So maybe we may use colours to emphasize some pieces of information? Hold on, what if we may play with images as data patterns and see what may happen with them after being encrypted? Human eye is a great tool for detecting some regularities. Maybe they can be found somewhere there? In addition we may also use some statistical tools to look at data from another angle. Couple of evenings and voila - I made a little research in this area and would like to share the results and my way of thinking with you. So if you don't afraid find yourself reading long text - come with me.
- Check if images may act as easy-recognizable data patterns to test the strengths of encryption algorithms: by "visual analysis" and also from statistical analysis's point of view.
- Assure the right understanding of what "well encrypted" data really is.
- Compare different encryption algorithms (polyalphabetic substitution, XOR, RC5, Serpent).
Introduction. Polyalphabetic cipher - what the hell is it?
As we will be dealing with the polyalphabetic substitution cipher - I would like to make a short review what is it and how it works. (FYI: for your convenience in all further illustrations you'd see - every unique character in a table has a unique colour.)
So in its simplest form the polyalphabetic substitution cipher rule looks like this:
Here the alphabet contains only letters [A..H] with 7 dictionaries (just to make it easier to understand). Btw, this thing is also well known as Vigenere's Square. The version of the Vigenere's cipher for the full alphabet may looks like this:
But now let's come back to our first tiny [A..H] cipher and define either a repetitive or running key and try to encrypt our data. :) See the example with the repetitive key: (1234) which is used to pick up appropriate alphabet for substitutions.
Source data: A A A B B B C C C D D D E E E F F F Repetitive key: 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 Encrypted data: A B C E B C E F C E F G E F G A F G
You see how it works, right? Decryption is basically a reversing of the operation described above using the same key (that's why it's called "symmetric").
Now we may do the same but with the runnig key, which is much better. Why? Because it's a sequence of random numbers! Promises promises..., but where to get those numbers? Well you can roll a dice ;) but we are dealing with computers, so we may use a pseudorandom generator algorithms available with virtually every programming language. The trick is to set exactly the same initial random seed for a generator, so the sequence of numbers (which looks pretty random) will be exactly the same for encryption and decryption. Every source letter then would have a corresponding random number which will be used for selecting an appropriate alphabet. So a random seed number is the key. Ok, I am writing sample function which will generate a sequence I need. See the following pseudocode:
randomize(12345); //--- this is my key used to initialize generator for i=0 to 100 do begin print rand(1, 7); //--- generate random number from 1 to 7 end;
This is the sequence I may get:
Now let's encrypt the message with this technique:
Source data: A A A B B B C C C D D D E E E F F F Running key: 1 5 4 4 2 7 4 4 7 1 6 3 6 2 4 4 4 1 Encrypted data: A E D E C H F F A D A F B F H A A F
Looks nice isn't it! The encrypted string looks very different from the original one, so this would not be an easy peanut to crack, oh no. But hold your seatbelts my friend - now we may complicate life for cryptanalyst even more! Let us use slightly modified version of the encryption table:
See the difference? Previously every row (a dictionary) was kind of "shifted" one cell left and a first character was "wrapped" to the end. Here every row still contains all characters we need [A..H] but their positions in the alphabet are random. So we may combine this technique with the one described above (running key) which gives us really strong cipher which basically would require two things for the encryption/decryption: 1) the RND generator's initial seed and the 2) dictionaries table. (The table also may be generated by some algorithm, of course). So let's encrypt our simple message with the running key and modified substitution table:
Source data: A A A B B B C C C D D D E E E F F F Running key: 1 5 4 4 2 7 4 4 7 1 6 3 6 2 4 4 4 1 Encrypted data: B A C H D G D D D G E G C G F A A D
Try to crack this one, dude. ;) Just FYI: this is how the encryption table may looks like for the full A...Z charset and sample 22 different alphabets:
"The owls are not what they seem"* Encrypting images.
Every raster image can be treated as the sequence of bytes (or data stream) where every pixel is represented by three components: R, G, B. Every component has a value 0 - 255. There is no need to go deep into image data formats, we would use this one (24 bit RGB bitmap) as a simplest to understand and easiest to handle programmatically. See the illustration below:
If the data in selected pixels will be read as a sequence - this is what we would have for the first row of pixels:
R: 175 R: 144 R: 157 R: 167 R: 169 G: 137 G: 96 G: 108 G: 118 G: 112 B: 3 B: 3 B: 8 B: 6 B: 15
Sample data stream:
...175, 137, 3, 144, 96, 3, 157, 108, 3, 167, 118, 6, 169, 112, 15, ...
Each value is actually a one byte. Ok, so now we may use same approach as described in the previous chapter to encode/decode images. The only difference now is that the "alphabet" we are dealing with not [A...Z] - 26 values, but [0..255] - 256 values. The rest remains exactly the same. So let's define sample table which will be used for a binary data encryption (in this case - our images):
The length of every dictionary is fixed and equal to 256 elements. We assume that every "alphabet" (row) in this case is an array of bytes with values [0..255] placed in random order. The method of accessing appropriate value is also the same. See the example: how to get the substitution number for the source = 4, where the dictionary ID = 6.
We may manipulate numbers of such dictionaries in the table (we may use the only one substitution dictionary or thousands of them) and see how it helps us to encrypt our data. Cool. Now we have to answer one more question before starting our experiments. And the question is:...
What does it mean: "proper encryption"?
Ok, this is a source image:
Sample image 1
...and this is a kind of "encrypted" version. Here every byte is XOR'ed (or inverted). It is not the same as the original image, but can you recognize it?
Sample image 2
Of course you do! To say "it is encrypted" you would expect to have a result of our encryption function similar to random noise. So every RGB component of each pixel is a random number in between 0 and 255.
Sample image 3
Of course this, what may looks like random, can actually contains the hidden data. And this is exactly what we want.
So what is the difference image 3 and image 2? Obviously you see something on the first one and see nothing on the second. But let's try to count numbers of the same bytes in the image 2 and image 3 and build a frequency distribution curves. Yes, this is about Gaussian curve and I am telling you, don't be afraid, we will draw it right now:
The green one - is the distribution curve of random noise. The red one - is our image. Feel the difference? ;) Let's take a look at another example:
The image is different and the distribution (red line) is changed correspondingly. Now let's see something very simple:
Here the distribution is completely screwed, does not look as bell curve at all (note the red line again).
So now we know that:
- Statistical analysis may give us some interesting information about the image.
- The closer encrypted data to random noise (visually and statistically) - the better (the less chance some hidden patterns may be found there).
And now, ladies and gentleman...
The small program was written, which naturally may be downloaded from [here] to test everything described above (you can get the binary and the source code in Delphi). User interface is rather self-descriptive, so let's start experimenting.
First image and one substitution dictionary is used.
Each time you click [encrypt] button - the program does the following:
- New random seed number is generated.
- Based on this number - the substitution tables (dicitionaries) are generated.
- Based on this number - the running key is generated.
- Program takes selected original image, encrypts it using the selected algorithm and its parameters and renders the result.
- Program calculates statistics and draws the distribution curve for the encrypted image. Program also draws reference curves for a "virtual image" of the same size 300x200 pixels: 1) green one - for pseudo-random numbers, generated by a function 2) blue one - curve for true random numbers (I borrowed some data from here).
- Program also takes encrypted image and decrypts it, aiming to test if everything works correctly.
The function used for the reference curve is a combination of a Fibonacci sequence (with lags of 97 and 33, and operation "subtraction plus one, modulo one") and an "arithmetic sequence" (using subtraction).
Ok, do you see the red curve? Looks familiar to the one you've seen before isn't it? Do you understand why? We are using the one substitution table only (one dictionary), so for every source byte there is a one possible substituted byte. So bytes are different, but the frequencies remains the same! Now look at the encrypted image - you still see something there, especially if you would press [Encrypt] button couple of times (or press [AutoRun]). So definitely the encryption method is far from ideal.
Now move the slider [Nr of dictionaries] to the right and see what happened.
The more dictionaries is in use - the more variations are for each of the source bytes. Now leave the program configured for 100 substitution tables and switch to another image:
Interesting, isn't it. The image is very simple, so even by having 100 different substitution tables (!!!) you can still recognize source pattern in encrypted data (press [Encrypt] button couple of times). Now let's try to increase the number of dictionaries.
Notice how the red curve becomes more and more close to the reference one. But even having 1000 dictionaries you can still find the combination where you can see a shadow of something in encrypted data. Keep attention to the right part of the encrypted image - it's slightly more dark then the left one.
There is another interesting way to catch "regularities" in encrypted image. You can play with the width and height of the image, changing it correspondingly. Because the data stream you have - may not contain the information about the width and height of the image hidden in (just a pure stream of bytes).
Once again - a human eye is our best detector. Move the slider to 100 dictionaries and select image 3. Then uncheck the checkbox [Decrypted], so the encrypted image will be copied here directly.
Move the cursor to the right side of the bottom image, then click and drag it to the right. You would notice that the image width is increasing and height is decreasing automatically. In effect the same number of bytes is rendered in the rectangle of different (continuously changing) proportions now. You immediately notice the diagonal lines - similar to the old broken TV with a wrong synchronization. You see - there is kind of pattern there! Play a little with altering image size. By changing width/height relation you may surprisingly quickly find the right combination.
Ok, now let's try to play the same game with one of the industry-standard algorithms, like RC5. Now you might see what the really good encryption is.
See the red curve now? Almost ideally overlaps the referenced ones. And don't expect to find any patterns in encrypted data this time (well, you may try of course...).
Ok, this is it. You may play with the program by yourself or modify the source code and add some new cool features. Hope it helped you a little bit better understand some ideas about data encryption combined with image processing. Don't forget to drop me a line in case you like the text, or even (or especially) if you think it's lame. Anyway - enjoy! :)
Ok, here is a slightly more detailed review of the steps I made while solving the Cyber Security Challenge UK. I believe the post is not too long so there is a chance you would have some fun anyway. :) Ok, here we go.
It was found that the challenge set comprised of 3 stages and the complexity level of every next stage was gradually increasing.
This one was rather simple. The starting page contained the code which looked like this:
Here is the [full code].
It was quite clear that this is the base64-encoded "something". :) Quick review of the first couple of decoded bytes clearly suggested that it seem to be an image. Even more: a JPEG!
A small tool for conversion had to be rapidly crafted, and in a couple of minutes I've got such nice picture:
Looks quite cool and funny, and at the first moment I thought the game is over, but ach..., obviously I was very wrong...
It was not so difficult to notice that the border of the image looks kind of strange and familiar at the same time. There was nothing to consider and the first thought was right: yes, this is kind of a binary code.
Immediately, there was written another small script to convert values of pixels into their binary representation (pixel_value = (r+g+b) div 3. If pixel_value < 127 the output = 0 otherwise output = 1). Here is a converted sample fragment of the data from the top row:
See the [full code].
Now the next logical question is: what kind of data encoding technique is used? Let's check the size of the image: it's 350 x 175 pixels. Interesting... Strangely I first noticed that the width and height nicely divides by 5:
350 /5 = 70
175/5 = 35
By the way, do you know any weird code which operates on 5-bit values? Of course you do: it's Baudot code. I wrote the small script which converted 5-bit binary packets to the target letters according to this scheme:
I was really disappointed after being testing this hypothesis. The time is spent and there was no a shadow of any sensible text. :( Why I stuck here? Are there any other encodings? Could it be a 8-bit code? LoL, this looks interesting! Now we have to find out the correct direction while reading the data. Experimentally it was found that the right direction was: clock-wise, starting from the top-left corner (I discovered that the same sequence of characters was repeated twice).
This is what was found:
cyrnfr sbyybj guvf yvax uggcf://plorefrphevglpunyyratr.bet.hx/834wgc.ugzy
And this uggcf:// was somehow veeeeeery familiar to me. ;) Ok, if this uggcf:// becomes https:// and if I am following right direction - the cipher would looks like this:
Do we know this cipher? Of course we do: this is so-called ROT13 substitution cipher.
So after decoding our secret message revealed the following:
please follow this link https://cybersecuritychallenge.org.uk/834jtp.html
Cool. So this is not the end of the game yet...
After navigating to the provided URL, I found another encrypted text:
Here is the [full code].
1000 characters of [0..F] - well, it looks like 500 hex-encoded bytes to me. And obviously it is not a plain text. So let's test the following hypothesis that maybe we are dealing with:
- text which is written in English (quite obvious)
- text may have spaces as word dividers (why not?)
- some kind of monoalphabetic substitution cipher (let's start from something simple)
- full charset is used (all characters from 0 to 255)
To play with ciphers of such kind I wrote a small application called: Solver, which literally allows to "solve" or discover substitution cipher rules much quicker then with pen and paper. You can do it pretty much visually. See this:
Input: a plain text file with the list of decimal values of all bytes. Each value is in a new row.
104 237 205 236 78 44 142 ...
Once data loaded to the program, initially this is what we may see:
Left table (Source data): the source data itself. Each source byte has its corresponding value (currently empty) and unique colour code (based on the value). Colours are helping to locate similar letters quicker and also faster discover (potentially existing) hidden patterns.
Middle table (Statistics): characters frequency table. Calculation are based on the source table. We can see that, e.g. most frequent byte is . We can edit cells values in "Value" column and changes automatically propagated everywhere.
Result /text field/: This is where our output text would appear. Now seems to be clean.
Selection /text filed/: Here we may see more information about a characters which are selected (highlited) in the "Result" table. Selection is made with mouse - click and drag.
Ok, let's assume that the most frequently used character is a space. We also remember that the text is written in English, so characters frequency should match some well-known standards [Classical Cryptography Course, by Lanaki, September 27, 1995]. Let's play with it. Double-click the cell in the "Value" column in the row, right to the byte , press space button and <ENTER>. You will see nothing in this cell, but you entered space character and it is still there. You will see that the row is highlighted now (let's call it: "solved"), and the corresponding rows of all instances of byte  in the "Source data" table are also immediately highlighted. Spaces are also automatically inserted into a corresponding places in "Result" text field. Looks much more like a text now.
You've got the idea how it works, right? Now we may enter some values for the other characters, and the changes will be immediately reflected in "Source" table and in "Result" field. The methodology you would use for picking up the right characters is solely up to you (dependent how advanced you are in cryptography and/or how lazy).
Anothe nice feature: it is also possible to highlight a fragment of the text in "Result" filed. Then you may see the containing characters and missing values in the "Selection" field. See this example:
The word "__rson" is selected. It is clear that this rather would be a "person". In the "Selection" field we see the corresponding characters in our "Source" table:
14(0E)=? 172(AC)=? 78(4E)=r 110(6E)=s 237(ED)=o 205(CD)=n
Characters with question marks are not solved yet. So now we know that
 = p
 = e
Let's enter those chars in the "Statistics" table and see what would happen.
This way you may play with the tool again and again until you would find the right combination for as many characters as you can. :)
When you have collected enough data and want to take a broader look (yea yea, a "big picture") - you may export the tables in CSV (tab separated) file and open it for ex. in Excel (see sample file). Now you can build the reference table between the real char code and the encrypted one. This is important step, because we may check: are dealing with cipher or code? And it's a good time to look for some rules and hidden patterns (also correct some inevitable mistakes you made during your manual processing). Look at this example (fragment):
104 C 67 -37 237 o 111 -126 205 n 110 -95 236 g 103 -133 78 r 114 36 44 a 97 53 142 t 116 -26 174 u 117 -57 141 l 108 -33 44 a 97 53 142 t 116 -26 45 i 105 60 237 o 111 -126 205 n 110 -95 ...
Take a look at the first char: "C".
Encrypted char (104) - char (C) - real ASCII value (67) - difference (-37).
Now sort the entire table for all chars we discovered by the third column (ascendenting) and build the graph for the "difference" column (we are still in Excell).
There must be some kind of rule in it, you can see it now!
And this the time when we may try to play with bits. Look at this:
Character "C" encrypted: 104 -> 01101000 Character "C" decrypted: 67 -> 01000011 Character "o" encrypted: 237 -> 11101101 Character "o" decrypted: 111 -> 01101111
So the rule actually quite simple. Once it is known, we may write another small script which will shift bits for every source character. So now we have a final message revealed:
Congratulations – you’ve found and completed the REAL challenge. Your win code is cyb3r=s3cur1ty*ch@ll3nge+26-07-2010.
Please email this code to our team at email@example.com. If you’re the first person to do so, and can prove you meet the eligibility criteria (British citizen currently resident in the UK) we will be in touch to advise how to claim your prize. Well done and good luck in the Cyber Security Challenge competitions taking place throughout the rest of the year.
It was a great fun for me indeed, and well - I am looking for more new challenges now! :)