CS153 lab: Shortest Superstring Problem
Today we solve
Assembly as Shortest Superstring problem (long) on Rosalind. This
is a warmup for Chapter 3, where we study genome assembly problems.
For more background, you might also like the wikipedia article on
We use a greedy approach: we merge two of our strings, so that the
length of their overlap is as long as possible. We repeat this
merging until we get down to just one string, which is our solution.
Remark: This greedy algorithm usually finds a good solution,
but it is not guaranteed to find the best solution. Finding the
shortest superstring is in fact "NP-hard" (a CS theory concept), so
there is probably no algorithm which is fast and optimal on all
SCRIPT (your steps)
- First find this web page.
If you omit 'SCRIPT.html' in the address bar, you should see a listing
of today's files.
Open a terminal (start 'bash' if necessary).
To make your initial copy of the lab files, copy-and-paste these commands:
cp /home/cs153001/share/lab5/* .
You should see a listing of your lab files.
File long.py is the main program for you to
edit. File rosalind_long.txt is a
sample input, which you are welcome to overwrite when you download a
larger test input from Rosalind.
Edit your copy of long.py, and look for "TODO" items.
We'll do at least the first TODO item during the lab meeting,
and we'll discuss the others.
When you think you are done with long.py, you should try to solve
the problem on Rosalind. When Rosalind gives you a test input to
download, just overwrite your copy of rosalind_long.txt.
REMARK: when solving problems for CS153 on Rosalind, make sure
you see ?class=435 at end of the address bar. That means you
are solving the problem in the context of our class, not just as an
Finally, be sure to leave a copy of your working program in your lab5
directory. That is, I will look for
your ~/cs153/lab5/long.py file. This should be automatic if
you finish this in the CS lab, but it may require an extra step if you
do your Rosalind submission from outside the lab (like from a laptop).
Let me know if you need advice on copying files into the CS lab.
Extra Credit: Speed
As we have designed it, this program is slow. It takes about a minute
to solve the Rosalind challenge input. If you know some CS, then you
might be able to devise a much faster solver. In particular, try
using a python dict (a kind of hash table) to quickly find
equal suffix/prefix pairs. If you manage to submit such a fast solver
for this lab, let me know by email.