4_DNA: Motif Finder | EngageCSEdu

Course Level

CS1

Knowledge Unit

Fundamental Programming Concepts

Collection Item Type

Assignment

Synopsis

This is the fourth of five programming assignments in a semester-long, CS1-like course named DNA, which introduces students to programming within the context of genomics: the analysis of DNA within a single cell of an organism. Originally, the course targeted students in the life sciences, but it now attracts students from across the academy. The goal of these assignments is to give students enough confidence with scripting and the associated scientific write-ups to conduct a small computational experiment in a final project.

This programming assignment assumes that you have already located a specific gene (perhaps using some of the software written in the previous assignment) and now want to investigate the regulatory DNA sequences "upstream" of that gene (just prior to it, or to its left). Regulatory (or promoter) sequences in intergenic regions (between the genes) are vitally important in the process of protein production. Promoter motifs (DNA "words") are often repetitive and/or "fuzzy" (variable) DNA sequences upstream of genes. This assignment applies regular expressions to locate two categories of repetition: direct repeats and mirror repeats.
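As a rough illustration of the kind of search described above, the following minimal sketch uses Python's re module with backreferences. The sequence, the names DIRECT, MIRROR, and find_repeats, the minimum unit length of three bases, and the arm length of four bases are all assumptions chosen for illustration; they are not taken from the assignment materials.

```python
import re

# Hypothetical upstream sequence, invented for illustration only.
upstream = "TTCAGCAGTTGATCCTAGAA"

# Direct repeat (assumed definition): a unit of 3 or more bases
# immediately followed by an exact copy of itself, e.g. CAG CAG.
DIRECT = re.compile(r"([ACGT]{3,})\1")

# Mirror repeat (assumed arm length of 4): four bases followed by the
# same four bases in reverse order, e.g. GATC CTAG.
MIRROR = re.compile(r"([ACGT])([ACGT])([ACGT])([ACGT])\4\3\2\1")


def find_repeats(pattern, seq):
    """Return (start_index, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(seq)]


print(find_repeats(DIRECT, upstream))  # [(2, 'CAGCAG')]
print(find_repeats(MIRROR, upstream))  # [(10, 'GATCCTAG')]
```

The backreferences (\1 through \4) are what let a single pattern describe "whatever was just matched, again" or "whatever was just matched, reversed" without listing every possible repeat explicitly. Because finditer reports non-overlapping matches, a fuller search might need a lookahead or a sliding start position to catch overlapping repeats.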

Recommendations

For recommendations about this specific assignment as well as general comments for the entire set of DNA-focused programming assignments, please see the attached recommendations document.

Engagement Highlights

We live in a post-genomic world where strings of sequenced DNA are the starting point for discovery from basic research to personalized medicine. In addition to the human genome, exciting interdisciplinary areas such as the computational explorations of the thousands of genomes in the microbial communities within us are leading to new definitions of personalized medical diagnosis and treatment. "If Charles Darwin had taken a couple of undergraduate interns with him on 'The Beagle', those students would have discovered, described and catalogued their share of new species ... therefore it is perhaps ironic that we are experiencing once again an age of exploration and discovery via the old fashion activities of collecting and cataloguing. This time it is not only organisms but DNA sequences ... That enticing, exhilarating idea of being on an expedition is (or could be) an aspect of DNA sequence analysis. The balance is tipped heavily toward vast, unknown territories of undeciphered data waiting to be explored" (LeBlanc and Dyer, 2007).

In this particular assignment, students see the power of fuzzy pattern matching by applying regular expressions. I remember how excited I was when I first learned that mirror repeats (a.k.a. DNA palindromes) had biological significance. While my biology colleague could locate short, specific motifs (e.g., TATA) by hand on a printout, finding fuzzy patterns is best handled computationally. In short, we were thrilled to apply our knowledge of regular expressions to help her search through all the intergenic regions of an entire genome!
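To make the contrast between an exact motif and a fuzzy one concrete, here is a small sketch. The intergenic fragment and the fuzzy pattern TATA[AT]A (a fixed TATA core followed by one variable position) are hypothetical and serve only to show how a character class lets one pattern match several variants.

```python
import re

# Hypothetical intergenic fragment, invented for illustration only.
intergenic = "CCGTATAAAGGCTTATATAGCGTATCAGG"

# Exact motif: the literal four-base word TATA.
exact_hits = [m.start() for m in re.finditer("TATA", intergenic)]

# Fuzzy motif (illustrative): TATA, then A or T, then A.
# The character class [AT] accepts either base at that position.
FUZZY = re.compile(r"TATA[AT]A")
fuzzy_hits = [(m.start(), m.group(0)) for m in FUZZY.finditer(intergenic)]

print(exact_hits)  # [3, 13]
print(fuzzy_hits)  # [(3, 'TATAAA'), (13, 'TATATA')]
```

The same fuzzy pattern reports both TATAAA and TATATA, which is the sort of variation that is tedious to track by eye across the intergenic regions of a whole genome.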

Note: I originally taught this course in Perl, using the text that my biology colleague Betsey Dyer and I wrote, Perl for Exploring DNA (Oxford University Press, 2007). I now use Python in this course, so I no longer use my own book, but the text includes some wonderful interdisciplinary insights (most of which were written by Betsey Dyer). Relevant to this assignment, the text also includes a chapter in a facing-page translation format, where the reader first learns regular expression syntax as applied to pattern matching on English words (e.g., finding palindromes) and then applies that same regex syntax to sequences of DNA (e.g., finding mirror repeats).

Engagement Practices Employed

Encourage Student Interaction

Make Interdisciplinary Connections to CS

Use Meaningful and Relevant Content

Materials and Links

Materials

Materials.zip

Recommendations 4_DNA Motif Finder.docx

Computer Science Details

Programming Language

Python

Material Format and Licensing Information

Creative Commons License

CC BY-NC