CPM SUMMER SCHOOL

July 3-4, 1995
Hanasaari Cultural Centre
Espoo, Finland

Themes: String matching, computational biology, data mining

The program of this mini-school consists of four tutorials. Each tutorial contains four hours of lectures. The registration for the school is on Sunday, July 2, at 18-20 at Hanasaari Centre. The program starts on Monday morning, July 3. The first half of each tutorial is given on Monday and the second half on Tuesday, July 4. Immediately after the school there will be on July 5-7 the Sixth International Symposium on Combinatorial Pattern Matching ( CPM '95) in the same place.

School venue: Hanasaari Cultural Centre, Espoo, Finland, tel.: +358-0-461566

Registration: Advance registration is required. Dead-line is June 5, 1995. A joint registration form of the summer school and the symposium is available in WWW.

Fee: FIM 300. Includes materials, lunches and coffee breaks. Payable upon arrival to the Hanasaari Centre.

Accommodation: Accommodation in the Hanasaari Hotel or in near-by Helsinki can be reserved through the organizers (see the registration form)

Internet contacts:
Email: cpm95@cs.helsinki.fi
URL: http://www.cs.helsinki.fi/events/cpm95

Contact person:
Erkki Sutinen
tel.: +358-0-708 44163
fax: +359-0-708 44441

Tentative schedule

Sunday, July 2
18:00-20:00 Registration

Monday, July 3
8:30-10:30 Gusfield
10:45-12:45 Kececioglu
14:00-16:00 Mannila
16:15-18:15 Vingron

Tuesday, July 4:
8:15-10:15 Mannila
10:30-12:30 Vingron
13:45-15:45 Gusfield
16:00-18:00 Kececioglu

TUTORIALS

Dan Gusfield (University of California at Davis):
String algorithms and sequence comparisons

ABSTRACT

We will begin with a quick review of some basic string algorithms for exact matching. Then we will discuss suffix trees and non-trivial applications of them. Next we move to inexact matching and alignment with emphasis on applications that come from computational molecular biology. We will look at various definitions of inexact matching and associated algorithms that have proven effective for biological problems. We will consider in detail choices for gap models, and parameter settings, and the biological and algorithmic effects of those choices. Multiple alignment will then be particularly emphasized. Finally, we will discuss the interplay between string algorithms and algorithms for reconstructing evolutionary history, both in the form of evolutionary trees and in non-tree representations. Rather than being a "how to" course, the talks will emphasize precise statements of problems, algorithms and analyses.

BIOGRAPHY

Dan Gusfield received his Ph.D. from U.C. Berkeley in 1980 and has concentrated on problems of combinatorial optimization on graphs, including several problems of reconstructing evolutionary trees. This interest in algorithms for evolutionary trees lead to a general interest in molecular biology and genetics. Starting in 1988, Dan was part of a small group, centered at the Human Genome Center of Lawrence Berkeley Labs, studying computational issues in molecular biology. Due to the nature of the human genome project, the group focussed more on algorithms for strings than on algorithms for trees. That lead to an interest in string algorithms generally and their application in computational biology. Dan is presently finishing a book on this topic entitled "Algorithms on strings: A dual perspective from computer science and computational molecular biology". Dan is professor of computer science at U.C. Davis.

John Kececioglu (University of Georgia):
Combinatorial algorithms for DNA sequencing and mapping

ABSTRACT

DNA sequencing and mapping, the tasks of determining the location of landmarks along a DNA molecule, as well as the DNA sequence itself, are basic tasks in molecular biology, and the design of algorithms to support these tasks is an area of active research. While the field is still relatively new, a few fundamental approaches are emerging, many based on combinatorial optimization. This tutorial will survey what's known about combinatorial algorithms for DNA sequence assembly and physical mapping, and point to some of the directions of current research. Specific topics include approximation of shortest superstrings, generalization to error and reverse complementarity, local search for mapping by unique and non-unique probes, and guarantees in the presence of chimerism.

OUTLINE

DNA sequence assembly
- Basic biology
- Formulations and alternate approaches
  - Shortest common superstring with error
  - Depth of coverage distribution
  - Sequencing by hybridization
- Three-phase decomposition
  - Overlap by suffix-prefix alignment
  - Layout from shortest tours
  - Consensus from multiple-sequence alignment
- New directions
  - Disambiguating repeats
  - Detecting chimeric clones
  - Constraining placement and orientation
DNA physical mapping
- Basic biology
- Variations
  - Restriction-fragment mapping
  - Clone-probe-hybridization mapping
  - Radiation-hybrid mapping
- Approaches
  - Non-unique probes
  - Unique probes
  - Guarantees in the presence of chimeric clones

LITERATURE

Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. STOC 23 (1991) 328-336.

J.D. Kececioglu and E.W. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13 (1995) 7-51.

F. Alizadeh, R. Karp, D. Wisser, and J. Zweig. Physical mapping of chromosomes using unique probes. SODA 5 (1994) 489-500.

BIOGRAPHY

John Kececioglu (Ph.D. University of Arizona) is an Assistant Professor of Computer Science at the University of Georgia. His research interests are in the design and analysis of algorithms, computational biology, and applied algorithms and optimization. His current research is in DNA sequence assembly, genome rearrangements, and multiple sequence alignment.

Heikki Mannila (University of Helsinki):
Foundations of data mining

ABSTRACT

Data mining, or knowledge discovery in databases, aims at formation of interesting knowledge from large masses of data. Data mining combines methods from machine learning, statistics, and databases. Data mining has recently attracted a lot of attention from the research community and from the industrial sector.

In these lectures we discuss briefly the motivation and the application areas of the research in data mining. We concentrate on the basic algorithmic problems of data mining and study how the problems and methods of data mining differ from those used in machine learning and in statistics.

BIOGRAPHY

Heikki Mannila is a professor of computer science at the University of Helsinki, Finland. He obtained his Ph.D. in 1985 from the University of Helsinki. He has since then worked at the University of Tampere, at the National Public Health Institute of Finland, at the Technical University of Vienna, and at the University of Helsinki, as well as a consultant in industry. His research interests include data mining, machine learning, algorithms, and databases.

Martin Vingron (GMD, Bonn):
Reconstructing evolutionary trees from molecular sequences

ABSTRACT

DNA and protein sequences have become a major source of information for the reconstruction of the course of evolution. This tutorial will survey methods to derive phylogenetic trees from molecular sequences. Over time many fields have contributed to this problem. Therefore some of the methods are mainly algorithmic while others are statistical or mathematical. However, no special background will be required for the lectures.

TOPICS

Character based methods: Perfect phylogeny, parsimony.
Distance based methods:
- Sequence comparison and evolutionary distance,
- ultrametric and additive trees, clustering methods
Maximum Likelihood
Comparing different trees
[and possibly some non-orthodox newer development].

LITERATURE

David L. Swofford and Gary J. Olsen: Phylogeny Reconstruction. In: Molecular Systematics, eds. David M. Hillis and Craig Moritz, Sinauer Associates, 1990.

BIOGRAPHY

Martin Vingron earned a PhD in Mathematics from Heidelberg University doing his graduate work at the European Molecular Biology Laboratory in Heidelberg. There and during his time as a postdoc at the University of Southern California he worked mainly on algorithmic and statistical question in the comparison of molecular sequences. He is currently at the German National Research Center for Information Technology (GMD) and his research interest has focussed on the study of molecular evolution.

CPM SUMMER SCHOOL

Tentative schedule

TUTORIALS

Dan Gusfield (University of California at Davis): String algorithms and sequence comparisons

John Kececioglu (University of Georgia): Combinatorial algorithms for DNA sequencing and mapping

Heikki Mannila (University of Helsinki): Foundations of data mining

Martin Vingron (GMD, Bonn): Reconstructing evolutionary trees from molecular sequences

Dan Gusfield (University of California at Davis):
String algorithms and sequence comparisons

John Kececioglu (University of Georgia):
Combinatorial algorithms for DNA sequencing and mapping

Heikki Mannila (University of Helsinki):
Foundations of data mining

Martin Vingron (GMD, Bonn):
Reconstructing evolutionary trees from molecular sequences