|Title: ||Machine Learning in Computational Biology: Models of Alternative Splicing|
|Authors: ||Shai, Ofer|
|Advisor: ||Frey, Brendan J.|
|Department: ||Electrical and Computer Engineering|
|Keywords: ||Machine Learning|
|Issue Date: ||3-Mar-2010|
|Abstract: ||Alternative splicing, the process by which a single gene may code for similar but different proteins, is an important process in biology, linked to development, cellular differentiation, genetic diseases, and more. Genome-wide analysis of alternative splicing patterns and regulation has recently been made possible by new high-throughput techniques for monitoring gene expression and genomic sequencing. This thesis introduces two algorithms for alternative splicing analysis based on large microarray and genomic sequence data. The algorithms, based on generative probabilistic models that capture structure and patterns in the data, are used to study global properties of alternative splicing.
In the first part of the thesis, a microarray platform for monitoring alternative splicing is introduced. A spatial noise removal algorithm that removes artifacts and improves data fidelity is presented. The GenASAP algorithm (generative model for alternative splicing array platform) models the non-linear process in which targeted molecules bind to a microarray’s probes and is used to predict patterns of alternative splicing. Two versions of GenASAP have been developed. The first uses variational approximation to infer the relative amounts of the targeted molecules, while the second incorporates a more accurate noise and generative model and utilizes Markov chain Monte Carlo (MCMC) sampling.
GenASAP, the first method to provide quantitative predictions of alternative splicing patterns on large scale data sets, is shown to generate useful and precise predictions based on independent RT-PCR validation (a slow but more accurate approach to measuring cellular expression patterns).
In the second part of the thesis, the results obtained by GenASAP are analysed to reveal jointly regulated genes. The sequences of the genes are examined for potential regulatory factor binding sites using a new motif finding algorithm designed for this purpose. The motif finding algorithm, called GenBITES (generative model for binding sites), uses a fully Bayesian generative model for sequences. The MCMC approach used for inference in the model includes moves that can efficiently create or delete motifs and extend or contract the width of existing motifs.
GenBITES has been applied to several synthetic and real data sets, and is shown to be highly competitive at a task for which many algorithms already exist. Although developed to analyze alternative splicing data, GenBITES outperforms most reported results on a benchmark data set based on transcription data.|
|Appears in Collections:||Doctoral|
The Edward S. Rogers Sr. Department of Electrical & Computer Engineering - Doctoral theses
Items in T-Space are protected by copyright, with all rights reserved, unless otherwise indicated.