CS153 lab: Shortest Superstring Problem

Today we solve the Genome Assembly as Shortest Superstring problem (long) on Rosalind. This is a warmup for Chapter 3, where we study genome assembly problems. For more background, you might also like the wikipedia article on shotgun sequencing.

We use a greedy approach: we merge two of our strings, so that the length of their overlap is as long as possible. We repeat this merging until we get down to just one string, which is our solution.

Remark: This greedy algorithm usually finds a good solution, but it is not guaranteed to find the best solution. Finding the shortest superstring is in fact "NP-hard" (a CS theory concept), so there is probably no algorithm which is fast and optimal on all inputs.

SCRIPT (your steps)

  1. First find this web page.
    If you omit 'SCRIPT.html' in the address bar, you should see a listing of today's files.

  2. Open a terminal (start 'bash' if necessary). To make your initial copy of the lab files, copy-and-paste these commands: You should see a listing of your lab files. File long.py is the main program for you to edit. File rosalind_long.txt is a sample input, which you are welcome to overwrite when you download a larger test input from Rosalind.

  3. Edit your copy of long.py, and look for "TODO" items. We'll do at least the first TODO item during the lab meeting, and we'll discuss the others.

  4. When you think you are done with long.py, you should try to solve the problem on Rosalind. When Rosalind gives you a test input to download, just overwrite your copy of rosalind_long.txt.

    REMARK: when solving problems for CS153 on Rosalind, make sure you see ?class=435 at end of the address bar. That means you are solving the problem in the context of our class, not just as an independent learner.

  5. Finally, be sure to leave a copy of your working program in your lab5 directory. That is, I will look for your ~/cs153/lab5/long.py file. This should be automatic if you finish this in the CS lab, but it may require an extra step if you do your Rosalind submission from outside the lab (like from a laptop). Let me know if you need advice on copying files into the CS lab.

Extra Credit: Speed

As we have designed it, this program is slow. It takes about a minute to solve the Rosalind challenge input. If you know some CS, then you might be able to devise a much faster solver. In particular, try using a python dict (a kind of hash table) to quickly find equal suffix/prefix pairs. If you manage to submit such a fast solver for this lab, let me know by email.