CS153 lab: Median String Problem

Today we solve the "Median String Problem", ba2b on Rosalind (if that link does not work, you may need to login first). It is discussed in Chapter 2, pages 80 to 83. We will use the approach on page 82 of your book, which works well when k is not too large. I'll give you some python functions to help put together a solution. As usual this is due in a week (start of next lab), but my hope is that many of you can do it today. If you get stuck, please ask for help.

SCRIPT (your steps)

  1. First find this web page.
    If you omit 'SCRIPT.html' in the address bar, you should see a listing of today's files.

  2. As before, you should open a terminal. To make your initial copy of the lab files, copy-and-paste these commands: You should see a listing of your lab files. File ba2b.py is the main program for you to edit. File rosalind_ba2b.txt is a sample input, which you are welcome to overwrite when you download a test input from Rosalind.

  3. Edit your copy of ba2b.py, and look for "TODO" items. We will finish some of these items during the lab meeting. To finish the "allDNA" function, I want you adapt the method in program allBinary.py.

  4. When you think you are done with ba2b.py, you should solve the problem on Rosalind. When Rosalind gives you a test input to download, just overwrite your copy of the rosalind_ba2b.txt file.

    REMARK: when solving problems for CS153 on Rosalind, make sure you see ?class=435 at end of the address bar. That means you are solving the problem in the context of our class, not just as an independent learner.

  5. Finally, be sure to leave a copy of your working program in your lab3 directory. That is, I will look for your ~/cs153/lab3/ba2b.py file. This should be automatic if you finish this in the CS lab, but it may require an extra step if you do your Rosalind submission from outside the lab (like from a laptop). Let me know if you need advice on copying files into the CS lab.

Big-Oh Addendum

As argued in Chapter 2, this problem is equivalent to the "Motif Finding Problem" on page 77, because our output (the median string) equals the consensus string of an optimal motif. As discussed on page 82, the running time of this approach is O(4k·k·n·t), assuming our input has t DNA strings of length n. The other exact solver (BruteForceMotifSearch sketched on page 77) would have running time O(nt·k·t). Neither of these is fast enough for problems that biologists want to solve, where both k and t can be large. That is why Chapter 2 explores various inexact but faster methods to find motifs.