Ph.D. Theses

Machine Learning Approaches For Designing DNA Sequence Assembly Algorithms

By Darren Lim
Advisor: Mark K. Goldberg
April 14, 2003

We present two separate algorithms for solving the DNA sequence assembly problem: the reconstruction of a large DNA sequence from a set of subsequences called fragments. Fragments are created by breaking several copies of the original DNA sequence at various intervals. Because the copies break at different positions, many of the resulting fragments originate from overlapping regions. Identifying overlapping fragments is the key to reconstructing the original strand.
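The fragmentation process described above can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used in the thesis: the function names (shotgun_fragments, suffix_prefix_overlap), the uniform random choice of cut points, and the exact-match overlap test are all assumptions made for illustration.

```python
import random


def shotgun_fragments(sequence, copies, cuts_per_copy, seed=0):
    """Break several copies of `sequence` at random interior positions.

    Hypothetical helper: real fragmentation is not uniformly random.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(copies):
        # Pick distinct cut points strictly inside the sequence.
        cuts = sorted(rng.sample(range(1, len(sequence)), cuts_per_copy))
        bounds = [0] + cuts + [len(sequence)]
        fragments.extend(sequence[a:b] for a, b in zip(bounds, bounds[1:]))
    return fragments


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0
```

Since each copy is cut at different positions, a fragment from one copy will often end inside a region that a fragment from another copy begins with; suffix_prefix_overlap measures that shared region for the error-free case only.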

The first algorithm first identifies a "correct" series of fragment merges, one that reproduces the original sequence from which the fragments were obtained. Each such series is entered into a database of solutions, which is then used to sequence DNA different from that used to create the database.

The second algorithm uses a k-mer based approach to identify overlapping regions between fragments. It improves on the first algorithm in two ways: (1) it is designed to sequence real fragments, which differ in composition from simulated fragments; (2) it can be used to sequence much longer strands of DNA.
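As a rough illustration of how k-mers can flag likely overlaps, the sketch below indexes every length-k substring and reports fragment pairs that share at least one. It is a generic technique under stated assumptions, not the thesis algorithm: the names kmer_index and candidate_overlaps and the min_shared threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(fragments, k):
    """Map each k-mer to the set of fragment ids that contain it."""
    index = defaultdict(set)
    for frag_id, frag in enumerate(fragments):
        for i in range(len(frag) - k + 1):
            index[frag[i:i + k]].add(frag_id)
    return index


def candidate_overlaps(fragments, k, min_shared=1):
    """Count distinct k-mers shared by each pair of fragments.

    Returns {(i, j): count} for pairs with i < j and count >= min_shared.
    """
    shared = defaultdict(int)
    for ids in kmer_index(fragments, k).values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

Only pairs that pass this filter would need a more expensive alignment, which is one reason k-mer indexing scales to larger fragment sets than all-against-all comparison.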

For both algorithms, the parameters of computation are learned through experimentation with previously assembled DNA sequences. Our experiments show that parameters learned on one set of DNA sequences can be used to successfully sequence a separate set of DNA sequences.
