Welcome to the international (English-speaking) blog of the Department of Computer Science at the University of Helsinki. Our blog invites views on research, education, student life, and other societal themes connected to our computer science fields. The intention is to build a forum as an open window for readers from inside and outside computer science. If you have any good ideas or articles to share on this blog, please feel free to contact us: cs-blog [ät] cs.helsinki.fi.
Assembling the Genomic Puzzle of a Butterfly
by Leena Salmela
Since 2009, when I came to work at the department, I have been involved with the Glanville fritillary genome project. In this blog post I will describe my experiences in the project.
The genome of an organism consists of chromosomes, long molecules of DNA, which can be represented as strings of A, C, G, and T. Current technology cannot read a whole genome in a single piece. Instead, the DNA molecule is broken into small pieces, which can then be read by the sequencing machines. Typically in a whole genome project, billions of pieces are generated and the genome is then reconstructed from these pieces, much like putting together a giant jigsaw puzzle. If the genome were a random string, this would be a rather straightforward process, as the sequenced pieces are long enough to uniquely characterize a position in a random string. However, genomes consist largely of repetitions, so for many of the pieces it is impossible to say which part of the genome they come from. This is like a jigsaw puzzle consisting largely of grass or clear blue sky.
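The difference between a random string and a repetitive one can be illustrated with a small sketch. It is a toy model only: the genome sizes, the read length of 30, and the function name `ambiguous_fraction` are assumptions made for this example and have nothing to do with the real butterfly data.

```python
import random
from collections import Counter


def ambiguous_fraction(genome: str, read_len: int) -> float:
    """Fraction of read positions whose read also occurs elsewhere in the genome."""
    counts = Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Toy "random genome": 20,000 independent, uniformly chosen bases.
random_genome = "".join(rng.choice("ACGT") for _ in range(20_000))

# Toy "repetitive genome": a single 500-base unit copied 40 times.
unit = "".join(rng.choice("ACGT") for _ in range(500))
repetitive_genome = unit * 40

print(ambiguous_fraction(random_genome, 30))
print(ambiguous_fraction(repetitive_genome, 30))
```

In the random string practically every 30-base read pins down a single position, while in the repetitive string every read matches several positions, which is the grass-and-sky situation described above.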
In 2003, when the human genome was published, DNA sequencing was still too expensive for a single biological group to investigate the genome of a higher organism. But the technology evolved rapidly, and by the end of the last decade the new, cheap high-throughput technology was accessible to many biological researchers. However, the characteristics of this data were quite different from the data used in early genome projects such as the human genome project. The pieces were much shorter and contained more errors, rendering the old methods for reconstructing a genome impractical. It was at this time that Prof. Ilkka Hanski's group in the Faculty of Biosciences in Viikki decided to sequence the genome of the Glanville fritillary butterfly, which you can see in the jigsaw puzzle below (at least if you assemble it first).
In a sense this was a pioneering project, as no big genomes had been sequenced in Finland before. It was also known that insect genomes are especially repetitive, and the plan was to use a mix of different technologies, which was a new strategy at the time. Some trouble was therefore anticipated, and we, as computer scientists of the Algodan CoE, were asked to join in. However, probably none of us understood exactly how challenging the project would be.
In a couple of months the first pieces of the puzzle, or reads as they are called, arrived at my desktop, and the journey into the fascinating world of genome assembly began in earnest. The field turned out to be rich in both computational and molecular biology challenges. The raw data coming from the sequencing machines first needs to be filtered, and then the sequencing errors are corrected to make further processing easier. The reads are then joined into larger contiguous chunks called contigs. After this, special reads, called paired ends or mate pairs, are utilized. These reads come in pairs such that the approximate distance in the genome between the two reads of a pair is known, which makes it possible to organize the contigs into longer linear sequences containing gaps. Finally, the gaps are filled by reusing the reads and utilizing the knowledge of which contigs are consecutive.
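The step of joining reads into contigs can be sketched with a greedy overlap merge. This is a minimal illustration, not the method used in the project: the function names, the 20-base toy genome, and the error-free reads are all assumptions, and real assemblers work with far larger data and more robust algorithms.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_contigs(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest suffix-prefix overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    seqs = [s for s in seqs if not any(s != t and s in t for t in seqs)]
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return seqs  # nothing overlaps any more: these are the contigs
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


# Hypothetical toy genome and four error-free reads of length 8 taken every 4 bases.
genome = "ACGTTGCAAGGCTTACCGAT"
reads = [genome[i:i + 8] for i in range(0, 13, 4)]

contigs = greedy_contigs(reads)
print(contigs)
```

On this toy input the four reads merge back into the original string. With repeats or sequencing errors a greedy merge like this can join the wrong pieces, which is why filtering, error correction, and paired reads matter in the pipeline described above.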
While we were struggling with the errors in the reads and the repeats of the genome, the molecular biologists had a different problem. The DNA of the butterfly tended to fragment on its own into small pieces. This caused problems in producing paired reads, which are crucial for handling the repeats while reconstructing a genome. All these issues were discussed in regular project meetings, along with topics more exotic for a computer scientist, like gathering the butterflies in the Åland Islands and rearing and breeding them in Lammi.
Finally, after successful production of the data and overcoming the last computational problems in early 2012, the first draft of the genome was frozen. In the end the draft genome contained over 8,000 pieces whose combined length totalled almost 390 million base pairs. Then the analysis of the sequence began, with the aim of understanding the biological meaning of the sequence of A's, C's, G's and T's. For me it was exciting to see what kind of biological research and new findings could be made based on our reconstruction of the genome. For example, it turned out that the genes of the butterfly had stayed in the same place in the genome for a very long time. This is very different from, for example, mammalian genomes such as those of the human and the mouse, where blocks of genes have moved from one chromosome to another over the course of evolution. After the analysis was complete, the genome was released in 2014 to be publicly available as a resource for biological research.