A Network World article led me to a code-breaking challenge from the FBI. They put out a similar challenge last year, and this year they are back with a harder puzzle. They have encrypted some text and encourage you to try to break the code. Here is the ciphertext from the FBI web site:
VFWTDLCSWV. YDNSLMIJFWEJFD GSW SLNIJNQBLM FOBV EJFDVFDLNIGTFBSL. KBVBFYYY.AHB.MSK/NSCDC.OFZFS EDF WV QLSY SAGSWI VWNNDVV.
The FBI also provides some background on ciphers. The first one they describe is the old Caesar cipher, where each letter is replaced with the letter a fixed number of positions further along the alphabet. You would think this would be too simple a cipher for the FBI to use. But I thought I would try to write a program that is smart enough to tell whether some encrypted text is using a Caesar cipher.
My plan was for the program to try every possible Caesar shift (there are only 25 of them) and decrypt the text by brute force with each one. The output would be checked against a dictionary to see if the result looked like English words. Each candidate would get a score based on how many of its words are actually recognizable. The program would then choose the top-scoring shift, and declare victory if the score clears some threshold, such as 90% of the words being good.
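Here is a minimal sketch of that plan in Python. It is not the finished program, just an outline of the idea. The function names (`caesar_shift`, `score`, `best_caesar_guess`) and the way words are split on whitespace and stripped of punctuation are my own assumptions, and `words` is assumed to be a set of lowercase English words loaded from somewhere else.

```python
import string


def caesar_shift(text, shift):
    """Undo a Caesar shift: move each ASCII letter back by `shift` positions."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            # Spaces, digits and punctuation pass through unchanged.
            result.append(ch)
    return "".join(result)


def score(text, words):
    """Fraction of whitespace-separated tokens found in the word set."""
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def best_caesar_guess(ciphertext, words, threshold=0.9):
    """Try all 26 shifts; return (best shift, its score, whether it passes)."""
    candidates = [(score(caesar_shift(ciphertext, s), words), s)
                  for s in range(26)]
    best_score, best_shift = max(candidates, key=lambda c: c[0])
    return best_shift, best_score, best_score >= threshold
```

If the best score stays well under the threshold for every shift, that is a strong hint the text is not a Caesar cipher at all.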
The problem with this plan was that I needed a dictionary for my program to rely on. More specifically, I needed a list of valid English words. You would think this would be an easy thing to find. However, the lists I found were either split up among multiple files or contained a whole lot more than just words. So the very first task was to write a processing program that would accept some input I found on the Internet, and output a single file full of unique words in the English language.
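A rough sketch of what such a processing program might look like is below. The cleanup rule is an assumption on my part: it keeps only tokens made up entirely of the letters A to Z once surrounding punctuation is stripped, and lowercases them. Real downloads may need different filtering. The names `extract_words` and `build_word_list` are made up for illustration.

```python
import re
import string

# Assumption: a "word" is a token made up only of ASCII letters.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def extract_words(lines):
    """Collect the unique lowercase words found in an iterable of lines."""
    words = set()
    for line in lines:
        for token in line.split():
            token = token.strip(string.punctuation)
            if LETTERS_ONLY.fullmatch(token):
                words.add(token.lower())
    return words


def build_word_list(input_paths, output_path):
    """Merge several input files into one sorted file, one word per line."""
    words = set()
    for path in input_paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            words |= extract_words(f)
    with open(output_path, "w", encoding="utf-8") as out:
        for word in sorted(words):
            out.write(word + "\n")
    return len(words)
```

Using a set takes care of duplicates across files, and sorting the output makes the final list easy to eyeball.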
When I get this done, I will share the code and list with you. Who knew that this was going to take such a long time? And I am only on the first simple Caesar cipher try.