|
Hi there, After finishing the final I was left hanging, wanting more! So I decided to tackle the two optional programming exercises. The first was quite straightforward and deeply satisfying to see my program spit out the decoded sentence in seconds. A great project. Thanks to the professors. I'd love to see more of these in future versions of this course. They'd be good earlier on in the course as well. However the second part of the question seems tricky. I built a 3-gram histogram over a project Gutenberg text, and am now trying to algorithmically find a probable permutation of the input columns. With 18 columns there are 18! (6402373705728000) possible permutations. I can't help but feel that I'm on the wrong track with this one. Maybe brute force isn't the greatest approach here. Even split out over 8 cores, this looks like it's going to take some time to complete :) Anyone care to throw some careful hints around? I hope this isn't in violation of any honour code. The exercise is optional and doesn't seem to be addressed in any official discussion. EDIT Firstly I was mistaken, there are 19 columns, which is even worse :) Secondly, to those who solved it using human reasoning, I personally would like to solve this using a repeatable procedure. Such a system could work with these ciphers in any language, for example. Now that I re-think the second part, I see it as some kind of search problem. I will revisit it in that light. Thanks to all who posted ideas. |
|
I used breadth-first search with pruning (using vocabulary), it solved a problem in a few seconds (on ThinkVantage T61 laptop, implemented in Java). Quite surprising for that many possible nodes, but pruning does a really good job. Every node of the search tree was a composition of stripes (left side of the sheet). At every iteration the program: 1. Chose the node and remove it from the frontier. I used breadth first search, but I guess DFS or more sophisticated ways can work as well. 2. Looked in the vocabulary whether the words on the left side of paper are really words (and whether the beginnings of the truncated words can become words, if there are more stripes to add). If any of the the encountered letter sequences was not a word (or had no chance to become one), the branch was pruned and abandoned. 3a. If the node passed the test and all stripes are placed - it is the solution. 3b. If the node passed the test, but there are more stripes to use, the program expanded the node and added its descendants to the frontier. Descendants were produced by adding different yet unused stripes to the right of the current partial paper. The first node in the frontier was empty (no stripes added). Perhaps, the one can think of many enhancements to that approach, but it solved the puzzle and I did not have time to experiment further. Linux users can find a comprehensive dictionary in /usr/share/dict/words . Before that approach there were many attempts that did not work for me. In particular, nothing based on digrams, as well as nothing greedy (find several stripes that stick good together). Perhaps, someone could make that approaches work either. [Update] My programs for both NLP tasks are available at: https://rapidshare.com/files/1410873281/AIClassPack.zip |
|
I approached the second problem by just looking at the last line. This is a subset of the problem and has only seven columns with characters in them, one of which is a full stop. These columns also only have 2 capitals on the first line in the first character position. This makes this subset easy to work with and figure out. After that I looked at columns that had no characters on the first or last line but only in the middle, given we know what printed text should look like you can make a belief state for them. Now the problem is actually more manageable. |
|
Basic track here, so I worked on NLP instead of final. As noted, the combinations even in this simple problem are too many to brute force. On the other hand, you could do it by hand, but that won't work with a larger problem. The goal is to get the program to do it all on its own, not with any hints that it didn't come up with itself. Greedy algorithm avoids trying all combinations, but suffers from two problems. First, it may combine the wrong strips. Second, it may get into situations where it can't combine any more. That's why you want to use a backtracking algorithm, such as A*, which has good heuristics to both pursue possible solutions, but also explore the unexplored solution space for better ones. I've done A* before, so I wanted just a greedy algorithm. :-) You can see my code here: https://github.com/mlepage/ai-class Note that (single) letter frequencies don't help, because it's just shredded. Double letter frequencies (digraphs) are pretty good, my program gets most of the solution just using those. I wanted to find better ones (e.g. all possible ones) but instead I just have the top ones (so I can't penalize bad joins very well). I did get some digraph frequencies for the spaces, which helps I think. On the whole I want to find better digraph data, then add trigraphs. I think with good enough data that should work reasonably on problems of this size even with only greedy algorithm. If not, I'll add backtracking. On my github account I have some code that does backtracking (brute force and Algorithm X) for solving various puzzles, if anyone is interested just watch my space. |
|
Well, I managed to write a program that considers all 18! strips' arrangements. One core, 3.5GHz, F#, 20 minutes. Source code, if you are interested. However, I was required to make strong assumptions about the data to allow program to complete in reasonable time, such as: probability of unseen trigrams is e^-100 (the less the faster), punctuation is irrelevant, letters' case is irrelevant. Pseudo code:
It turned out that assumptions were too strong or abundance of spaces skewed probabilities. Program failed to find correct answer. Later I did another experiment, using 4-grams and less strong assumptions. Answer was found in less than a second, but estimated running time of a program increased to hundreds of years. |
|
My algorithm is order n squared. I pick a strip, then using trigrams find the most likely matching strip. I keep on adding strips until I add the last one on. Done, no backtracking. |
|
Here's how I did it in Haskell http://pceccato.blogspot.com/2012/01/decoding-shredded-messages-in-haskell.html Used a greedy approach, scored only the first 6 lines using trigrams and penalised strings starting with spaces which gave me the correct answer. heres the code: |
|
What I did was to start evaluating at partial permutations. When you have concatenated the first few columns, you'd have already some complete (separated by commas) words to check. If you get a very low probability, you can just discard that permutation. Just like pruning the three at the 5-6th level instead of going deeper and expanding the 18 levels. Myself, I started with a very naive attempt with 1-grams, aggressively discarding permutations that contain words that do no appear at the dictionary I was using. I omitted words starting in capital case, so proper names not in the corpus didn't screw up everything. I wasn't sure that this naive approach would work, but it did. In about 3-4 minutes it found 30 possible solutions it couldn't rule out; including the correct one. |
|
I haven't completed it yet so this will probably change in implementation, but here is my idea. take a strip at random and calculate the most likely next strip, then the most likely strip from there, untill you've included them all. Do this 19 times using each strip as the starting strip. Then calculate which solution has the highest total probability. |
|
I did solve it this way: Made a function that computes the probability that the left letter of right strip goes with the right of left strip. Got the data from wikipedia. Added some more of my own for double spaces, spaces after . Upper case after, quotes , etc. Added Laplacian smoothing too so we don't have P(x)=0. Compute the value of the whole set by multiplying the probabilities vertically then add the result horizontally. That gives you the probability that the current set it correct. Then you need to change it somehow and keep the highest score. I did change them by trying to move each band to each position and keep a move if it increases the probability and reject it if it did not. Then move pairs, then move triplets, etc until you move 1/2 of the set. Then do it again until the algo converges and nothing changes, it's pretty fast, I don't remember how many checks and moves it did but my pc solved the mazes in a blink of an eye. |
|
HUGE SPOILER ALERT!!! DO NOT READ IF YOU DO NOT WANT THE ANSWER TO OPTIONAL-2 First line gives you exclusive possibility for first ordered column. Last line gives you first words alignment after that, and the first line continues to about 2/3rds of the string. Line 2 confirms results of line 1, and finishes remaining all columns except 2. Columns 2 and 10 end up being those last 2 columns. This is what the end results looks like (using Wolfram Mathematica):
output: |
|
I used ngrams approach, and about 500 most common words dictionary as training dataset. Using that dictionary I can figure out ngramm distribution. I can define value function depends on strips order. Thus I have to find function extremum. It gives me solution for strips searched order. Value function is mostly unnormalised probability of given order based on ngramm dictionary. Also I have to add assumption that empty cells should be at the end of the strings more likely. I used some kind of memoization, by storing probability matrix for strip pairs. After that I iterated throughout possible states, and have found the result by maximizing value function. Single iteration(action) moves set of strips to new position. I iterated all possible set of strips and moves. Here is my results: 1)Bigrams does not converge 2)Trigrams needs about 200ms for converging and gives exact solution. It took about 6300 iterations (Value function evaluations) and 28 actions(route to goal) to get the solution. 3)Quadrograms also converges and gives exact solution. But it more slowly and needs more memory. |
|
I've already posted my own solution in another thread, so I shan't repost it. Instead, have a look through some of these: |
import re, collections, math, numpy
def words(text): return re.findall('[a-z][a-z]+', text.lower())
def bispace(text):
# penality for double space between words
return len(re.findall(' ', text))
def train(features):
model = collections.Counter()
for f in features:
model[f] += 1
return model
# For 'big.txt', google peter norvig spell check
NWORDS = train(words(file('big.txt').read()))
# "twocolumn.txt" is the input file to be recovered
myenc = file("twocolumn.txt").readlines()
x = numpy.empty((8,19), '|S2')
for i in range(8): x[i, :] = re.findall('\|(..)', myenc[i])
# column 4 should be first column #
x[:,(0,3)] = x[:,(3,0)]
def score(text):
return sum([math.log(1.0 + (NWORDS[w]>20)) for w in words(text)]) - 10*bispace(text)
def totext(xd):
(I,J) = xd.shape
q = ''
for i in range(I): q = q + '\n' + "".join(xd[i, :])
return q
s1 = (0,)
s2 = set(range(19)[1:])
for iter in range(20):
if len(s2) <= 0 : break
z = [(i,j,k, score(totext(x[:,s1+(i,j,k)]))) for i in s2 for j in s2 for k in s2 if i != j != k != i ]
zval = [e[3] for e in z]
mval = max(zval)
ix = zval.index(mval)
s1 = s1 + z[ix][0:3]
s2 = s2.difference(s1)
print s1, mval
print totext(x[:,s1])
link
This answer is marked "community wiki".
Users with 100 karma points or above may edit this wiki. |
There's another good discussion on the topic here: http://www.aiqus.com/questions/26371/nlp-programming-2-what-language-modelsearch-technique-you-chose