Members don't see the ad below. Register now!
2
1

Hi there,

After finishing the final I was left hanging, wanting more!

So I decided to tackle the two optional programming exercises. The first was quite straightforward and deeply satisfying to see my program spit out the decoded sentence in seconds. A great project. Thanks to the professors. I'd love to see more of these in future versions of this course. They'd be good earlier on in the course as well.

However the second part of the question seems tricky. I built a 3-gram histogram over a project Gutenberg text, and am now trying to algorithmically find a probable permutation of the input columns.

With 18 columns there are 18! (6402373705728000) possible permutations. I can't help but feel that I'm on the wrong track with this one. Maybe brute force isn't the greatest approach here. Even split out over 8 cores, this looks like it's going to take some time to complete :)

Anyone care to throw some careful hints around?

I hope this isn't in violation of any honour code. The exercise is optional and doesn't seem to be addressed in any official discussion.

EDIT

Firstly I was mistaken, there are 19 columns, which is even worse :)

Secondly, to those who solved it using human reasoning, I personally would like to solve this using a repeatable procedure. Such a system could work with these ciphers in any language, for example.

Now that I re-think the second part, I see it as some kind of search problem. I will revisit it in that light. Thanks to all who posted ideas.

asked 19 Dec '11, 18:58

Drew%20Noakes's gravatar image

Drew Noakes
1189

edited 20 Dec '11, 10:49

(20 Dec '11, 13:37) Drew Noakes Drew%20Noakes's gravatar image

I used breadth-first search with pruning (using vocabulary), it solved a problem in a few seconds (on ThinkVantage T61 laptop, implemented in Java). Quite surprising for that many possible nodes, but pruning does a really good job.

Every node of the search tree was a composition of stripes (left side of the sheet). At every iteration the program:

1. Chose the node and remove it from the frontier. I used breadth first search, but I guess DFS or more sophisticated ways can work as well.

2. Looked in the vocabulary whether the words on the left side of paper are really words (and whether the beginnings of the truncated words can become words, if there are more stripes to add). If any of the the encountered letter sequences was not a word (or had no chance to become one), the branch was pruned and abandoned.

3a. If the node passed the test and all stripes are placed - it is the solution.

3b. If the node passed the test, but there are more stripes to use, the program expanded the node and added its descendants to the frontier. Descendants were produced by adding different yet unused stripes to the right of the current partial paper.

The first node in the frontier was empty (no stripes added).

Perhaps, the one can think of many enhancements to that approach, but it solved the puzzle and I did not have time to experiment further.

Linux users can find a comprehensive dictionary in /usr/share/dict/words .

Before that approach there were many attempts that did not work for me. In particular, nothing based on digrams, as well as nothing greedy (find several stripes that stick good together). Perhaps, someone could make that approaches work either.

[Update] My programs for both NLP tasks are available at: https://rapidshare.com/files/1410873281/AIClassPack.zip

link

answered 20 Dec '11, 09:03

Andrey%20Boytsov's gravatar image

Andrey Boytsov
9721717

edited 27 Dec '11, 09:59

I approached the second problem by just looking at the last line. This is a subset of the problem and has only seven columns with characters in them, one of which is a full stop. These columns also only have 2 capitals on the first line in the first character position.

This makes this subset easy to work with and figure out. After that I looked at columns that had no characters on the first or last line but only in the middle, given we know what printed text should look like you can make a belief state for them.

Now the problem is actually more manageable.

link

answered 19 Dec '11, 19:16

BCFX2011's gravatar image

BCFX2011
3.3k219

edited 20 Dec '11, 01:09

Basic track here, so I worked on NLP instead of final. As noted, the combinations even in this simple problem are too many to brute force. On the other hand, you could do it by hand, but that won't work with a larger problem. The goal is to get the program to do it all on its own, not with any hints that it didn't come up with itself.

Greedy algorithm avoids trying all combinations, but suffers from two problems. First, it may combine the wrong strips. Second, it may get into situations where it can't combine any more. That's why you want to use a backtracking algorithm, such as A*, which has good heuristics to both pursue possible solutions, but also explore the unexplored solution space for better ones.

I've done A* before, so I wanted just a greedy algorithm. :-) You can see my code here: https://github.com/mlepage/ai-class

Note that (single) letter frequencies don't help, because it's just shredded. Double letter frequencies (digraphs) are pretty good, my program gets most of the solution just using those. I wanted to find better ones (e.g. all possible ones) but instead I just have the top ones (so I can't penalize bad joins very well). I did get some digraph frequencies for the spaces, which helps I think. On the whole I want to find better digraph data, then add trigraphs. I think with good enough data that should work reasonably on problems of this size even with only greedy algorithm. If not, I'll add backtracking.

On my github account I have some code that does backtracking (brute force and Algorithm X) for solving various puzzles, if anyone is interested just watch my space.

link

answered 19 Dec '11, 21:54

mlepage's gravatar image

mlepage
9407

Those were interesting to solve by hand.

link

answered 20 Dec '11, 01:48

washbdan's gravatar image

washbdan
1243

As Norvig pointed out, trying all possible combinations is not practical: there are too many possible combinations of 19 strips. It's also not AI. The point is to use what we've learned about NLP and probabilities and such. If you're counting permutations, you're driving on the wrong side of the road.

I solved both problems with a combination of Delphi programming and HI (human intelligence). I didn't really see the point of using probabilities in the first part. I clicked a button 15 times, and voila! a sentence.

The part that interested me was the second. I did it fairly quick-and-dirty (for me), and I got as far parsing the text column-wise (a column per strip) into the data structure of a TStringGrid. By then I was stressing about the Final, so instead of spending time making a computer figure it out (the whole point, really), I spent probably 10 minutes moving the columns of the grid around until I had it right. Kind of a fun mental game, actually. Doing it with with gray matter did help to reinforce my ideas and intuition for solving the problem.

If anyone wants either of my little Windows (Win32, not .Net) GUI utilities to play with, let me know, and I'll email one or both to you.

link

answered 20 Dec '11, 02:31

headsmoke's gravatar image

headsmoke
6268

edited 20 Dec '11, 03:46

I used probabilities, but for the HI. First letter of Optional-1 was "E" in a thre letter word. I assumed it had to be "The" and looked at the distance T-E=15. Rest is history. Second one used probabilities VERY heavily, but more within HI, not programmatically.

(20 Dec '11, 03:11) Gregory Klopper Gregory%20Klopper's gravatar image

Well, I managed to write a program that considers all 18! strips' arrangements. One core, 3.5GHz, F#, 20 minutes. Source code, if you are interested.

However, I was required to make strong assumptions about the data to allow program to complete in reasonable time, such as: probability of unseen trigrams is e^-100 (the less the faster), punctuation is irrelevant, letters' case is irrelevant.

Pseudo code:

bestLogProbability=-infinity
bestSequence=empty list
function findBestSequence(sequence, currentLogProbability, unassignedStrips)
  if unassignedStrips is empty then
    if currentLogProbability>bestLogProbability then
      bestLogProbability=currentLogProbability
      bestSequence=sequence
    return
  for each strip in unassignedStrips
    logProb=currentLogProbability + probability that strip follows last strip in sequence
    if  logProb < bestLogProbability then
      return // Prune subtree. We have better match already
    else
      // process the rest recursively 
      findBestSequence(sequence+strip, logProb, unassignedStrips-strip)
  return

It turned out that assumptions were too strong or abundance of spaces skewed probabilities. Program failed to find correct answer.

Later I did another experiment, using 4-grams and less strong assumptions. Answer was found in less than a second, but estimated running time of a program increased to hundreds of years.

link

answered 20 Dec '11, 04:47

red75prim's gravatar image

red75prim
1.3k10

edited 20 Dec '11, 04:54

I used trigrams of the Brown corpus and it required only (19+18+17+...+1)*7 = 1330 steps, the first 12 strips were ordered correctly, it was easy to order the last 7 strips manually.

link

answered 20 Dec '11, 05:19

alukanin's gravatar image

alukanin
556

My algorithm is order n squared. I pick a strip, then using trigrams find the most likely matching strip. I keep on adding strips until I add the last one on. Done, no backtracking.

link

answered 27 Dec '11, 10:59

Anne%20Paulson's gravatar image

Anne Paulson
4.2k16

Here's how I did it in Haskell

http://pceccato.blogspot.com/2012/01/decoding-shredded-messages-in-haskell.html

Used a greedy approach, scored only the first 6 lines using trigrams and penalised strings starting with spaces which gave me the correct answer.

heres the code:

http://ideone.com/XJVvf

link

answered 28 Jan, 06:55

PaulC's gravatar image

PaulC
448

edited 28 Jan, 07:58

I must confess I was too lazy to program and just solved it on a piece of paper... They are rather easy as the first word of the first task is obviously "The" which makes it easy, and for the second task, as BCFX2011 correctly pointed out, the last sentence is "in this year" which is obvious, this gives us the left half of the text and in a couple of minutes you just figure out the second part (for example by recognizing the word "information" or recognizing the fact that the next word starts either with " f" or with " i".

anyways, i was too tired to program an AI, so i used my natural I :)

hope this still counts as a solution and would not be counted as a cheat:)

gosh I just realized that in most other courses USING the computer would be considered cheating during the exam, but in the AI course this is vice versa - using your brain to solve it by hand is CHEATING whereas writing a program is INTENDED action:)

funny, huh?

link

answered 19 Dec '11, 19:24

schapkin's gravatar image

schapkin
1566

edited 19 Dec '11, 19:25

Although I did program, in the end, I sorted out #2 by hand... Is that kind of like Thrun driving his robot car in the DARPA Grand challenge? Yes, but it worked, and I was still proud (although not INSANELY proud).

(20 Dec '11, 02:34) headsmoke headsmoke's gravatar image

What I did was to start evaluating at partial permutations. When you have concatenated the first few columns, you'd have already some complete (separated by commas) words to check. If you get a very low probability, you can just discard that permutation. Just like pruning the three at the 5-6th level instead of going deeper and expanding the 18 levels.

Myself, I started with a very naive attempt with 1-grams, aggressively discarding permutations that contain words that do no appear at the dictionary I was using. I omitted words starting in capital case, so proper names not in the corpus didn't screw up everything.

I wasn't sure that this naive approach would work, but it did. In about 3-4 minutes it found 30 possible solutions it couldn't rule out; including the correct one.

link

answered 19 Dec '11, 19:29

glueball's gravatar image

glueball
3694

I haven't completed it yet so this will probably change in implementation, but here is my idea. take a strip at random and calculate the most likely next strip, then the most likely strip from there, untill you've included them all. Do this 19 times using each strip as the starting strip. Then calculate which solution has the highest total probability.

link

answered 19 Dec '11, 19:30

Mike%20H's gravatar image

Mike H
2.8k112

I did solve it this way: Made a function that computes the probability that the left letter of right strip goes with the right of left strip. Got the data from wikipedia. Added some more of my own for double spaces, spaces after . Upper case after, quotes , etc. Added Laplacian smoothing too so we don't have P(x)=0. Compute the value of the whole set by multiplying the probabilities vertically then add the result horizontally. That gives you the probability that the current set it correct. Then you need to change it somehow and keep the highest score. I did change them by trying to move each band to each position and keep a move if it increases the probability and reject it if it did not. Then move pairs, then move triplets, etc until you move 1/2 of the set. Then do it again until the algo converges and nothing changes, it's pretty fast, I don't remember how many checks and moves it did but my pc solved the mazes in a blink of an eye.

link

answered 20 Dec '11, 02:00

Benoit%20Ambry's gravatar image

Benoit Ambry
1566

HUGE SPOILER ALERT!!! DO NOT READ IF YOU DO NOT WANT THE ANSWER TO OPTIONAL-2

First line gives you exclusive possibility for first ordered column. Last line gives you first words alignment after that, and the first line continues to about 2/3rds of the string. Line 2 confirms results of line 1, and finishes remaining all columns except 2. Columns 2 and 10 end up being those last 2 columns.

This is what the end results looks like (using Wolfram Mathematica):

text=Grid[StringJoin@#[[{4,7,1,16,12,14,19,3,15,18,6,8,5,13,11,9,17,2,10}]]&/@StringSplit[#,"|"]&/@ { {"de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on"}, {"ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h "}, {"ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c "}, {"he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br"}, {"wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e "}, {" t|ca| t|wi| M|d |th|\"A|ma|l |he| p|at|ap|it|he|ti|le|er"}, {"ry|d |un|Th|\" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm"}, {"hi| | |in| | | t| | | | |ye| |ar| |s | | |. "} }]

output:
Claude Shannon founded information
theory, which is the basis of
probabilistic language models and
of the code breaking methods that
you would use to solve this problem,
with the paper titled "A Mathematical
Theory of Communication," published
in this year.

link

answered 20 Dec '11, 03:08

Gregory%20Klopper's gravatar image

Gregory Klopper
935

edited 20 Dec '11, 03:09

I used ngrams approach, and about 500 most common words dictionary as training dataset. Using that dictionary I can figure out ngramm distribution.

I can define value function depends on strips order. Thus I have to find function extremum. It gives me solution for strips searched order. Value function is mostly unnormalised probability of given order based on ngramm dictionary. Also I have to add assumption that empty cells should be at the end of the strings more likely. I used some kind of memoization, by storing probability matrix for strip pairs.

After that I iterated throughout possible states, and have found the result by maximizing value function. Single iteration(action) moves set of strips to new position. I iterated all possible set of strips and moves.

Here is my results:

1)Bigrams does not converge

2)Trigrams needs about 200ms for converging and gives exact solution. It took about 6300 iterations (Value function evaluations) and 28 actions(route to goal) to get the solution.

3)Quadrograms also converges and gives exact solution. But it more slowly and needs more memory.

link

answered 20 Dec '11, 07:31

Weerel's gravatar image

Weerel
103

I took a greedy approach starting with each strip. Needed to add penalty for starting spaces to get the correct answer, but it turned out that the corpus was quite bad.

link

answered 29 Dec '11, 08:37

maacruz's gravatar image

maacruz
2178

I've already posted my own solution in another thread, so I shan't repost it. Instead, have a look through some of these:

http://www.aiqus.com/questions/25796/updated-this-ai-class-is-broken-down-into-2-groups-made-abundantly-clear-by-nlp

link

answered 29 Dec '11, 09:32

GlennS's gravatar image

GlennS
10616

import re, collections, math, numpy

def words(text): return re.findall('[a-z][a-z]+', text.lower())

def bispace(text):
  # penality for double space between words
  return len(re.findall('  ', text))

def train(features):
  model = collections.Counter()
  for f in features:
    model[f] += 1
  return model

# For 'big.txt', google peter norvig spell check
NWORDS = train(words(file('big.txt').read()))

# "twocolumn.txt" is the input file to be recovered
myenc = file("twocolumn.txt").readlines()

x = numpy.empty((8,19), '|S2')

for i in range(8): x[i, :] = re.findall('\|(..)', myenc[i])

# column 4 should be first column #
x[:,(0,3)] = x[:,(3,0)]

def score(text):
  return sum([math.log(1.0 + (NWORDS[w]>20)) for w in words(text)]) - 10*bispace(text)

def totext(xd):
  (I,J) = xd.shape
  q = ''
  for i in range(I): q = q + '\n' + "".join(xd[i, :])
  return q

s1 = (0,)
s2 = set(range(19)[1:])

for iter in range(20):
  if len(s2) <= 0 : break
  z = [(i,j,k, score(totext(x[:,s1+(i,j,k)]))) for i in s2 for j in s2 for k in s2 if i != j != k != i ]
  zval = [e[3] for e in z]
  mval = max(zval)
  ix = zval.index(mval)
  s1 = s1 + z[ix][0:3]
  s2 = s2.difference(s1)
  print s1, mval

print totext(x[:,s1])

link
This answer is marked "community wiki".
Users with 100 karma points or above may edit this wiki.

answered 21 Feb, 05:25

dcchen's gravatar image

dcchen
112

edited 21 Feb, 05:30

Members don't see the ad below. Register now!
Your answer
toggle preview

Follow this Question via Email

Once you sign in you will be able to subscribe for any updates here

Q&A Editor Basics

  • to upload an image into your question or answer hit
  • to create bulleted or numbered lists hit or
  • to add a title or header hit
  • to section your text hit
  • to make a link clickable, surround it with <a> and </a> (for example, <a>www.google.com</a>)
  • basic HTML tags are also supported (for those who know a bit of HTML)
  • To insert an EQUATION you can use LaTeX. (backslash \ has to be escaped, so in your LaTeX code you have to replace \ with \\). You can see more examples and info here

powered by OSQA