Modelling Allelic and DNA Copy Number Variations using Continuous-index Hidden Markov Models

Susann Stjernqvist

Centre for Mathematical Sciences
Mathematical Statistics
Lund University

ISBN 978-91-7473025-8

In human cells there are usually two copies of each chromosome, but in cancer cells abnormalities can occur. These abnormalities consist of chromosome segments with an altered number of copies. There can be deletions as well as amplifications, and the segments vary in length. Localising the deviant regions is of great importance for increasing the knowledge of the disease. In this thesis the copy numbers are modelled using Hidden Markov Models (HMMs). A hidden Markov process can be described as a Markov process observed in noise; it thus consists of two different processes, one being an unobservable Markov process and the other the observed process.
In paper A we present a method suitable for CGH data from tiling BAC arrays, where the probes are rather long and may overlap. In addition, the probes are of unequal lengths and unevenly spread over the genome, which makes a continuous-index process a natural choice. We assume the Markov process to have a discrete state space, and the parameters are estimated with an MCEM algorithm. The model in paper B is a modification of the model in paper A, such that the Markov process takes values in a continuous state space. This makes the method more realistic, since it can handle larger differences in the data, including systematic errors. In addition, we assume some of the transition rates to be common in order to obtain a parsimonious model. We take a Bayesian approach and use reversible jump MCMC to simulate the Markov process.
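To illustrate why a continuous index suits unevenly spaced probes, the following minimal sketch computes transition probabilities from a rate matrix Q over a genomic distance d using the standard relation P(d) = exp(Qd). It is not the model from the papers: the three states, the rate values, the probe positions and all function names are assumptions chosen only for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(rates, distance):
    """Return P(d) = exp(Q d) for a rate matrix Q and a distance d >= 0."""
    return mat_exp([[q * distance for q in row] for row in rates])


# Hypothetical three-state chain (loss, normal, gain).
# The rates (per megabase) are made up; each row of Q sums to zero.
Q = [[-0.10, 0.10, 0.00],
     [0.02, -0.04, 0.02],
     [0.00, 0.10, -0.10]]

# Hypothetical, unevenly spaced probe positions in megabases.
positions = [0.0, 0.1, 0.15, 2.0, 2.05, 10.0]

# One transition matrix per gap between neighbouring probes.
matrices = [transition_matrix(Q, b - a) for a, b in zip(positions, positions[1:])]
```

Each gap gets its own transition matrix, so closely spaced probes are strongly dependent while distant probes are nearly independent, without having to pretend that the probes lie on a regular grid.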
In paper C we present a model designed for SNP data, which consist of allelic intensities for the two alleles at each SNP. We assume a discrete number of states, but keep the parsimonious approach from paper B such that some of the transition rates are common. The SNPs are point measurements but are unevenly spread over the genome, which motivates a continuous-index process. In paper D we present an MCMC sampler suitable for hidden Markov models when taking a Bayesian approach. We alternate between updating the parameters and the trajectory, and for the latter update we present a sequential Monte Carlo method based on forward filtering-backward simulation. The method is applied to oligonucleotide copy number data with the same model as in paper B.
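The idea behind forward filtering-backward simulation can be sketched for the simplest case, a two-state chain with Gaussian noise, where the filter is exact. This is a deliberate simplification and not the sequential Monte Carlo method of paper D, which targets the continuous-state model of paper B. The state means, noise level, rates, positions, observations and all function names below are assumptions made for illustration.

```python
import math
import random


def two_state_transition(a, b, d):
    """Closed-form P(d) for a two-state chain with rates a (0 -> 1) and b (1 -> 0)."""
    s = a + b
    e = math.exp(-s * d)
    return [[(b + a * e) / s, (a - a * e) / s],
            [(b - b * e) / s, (a + b * e) / s]]


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def forward_filter(obs, init, trans, means, sd):
    """Return the filtering distributions p(x_t | y_1..y_t) for every t.

    trans[t] is the transition matrix from position t to position t + 1.
    """
    k = len(init)
    filtered = []
    pred = init
    for t, y in enumerate(obs):
        weights = [p * normal_pdf(y, m, sd) for p, m in zip(pred, means)]
        total = sum(weights)
        filt = [w / total for w in weights]
        filtered.append(filt)
        if t < len(obs) - 1:
            pred = [sum(filt[i] * trans[t][i][j] for i in range(k)) for j in range(k)]
    return filtered


def backward_sample(filtered, trans, rng):
    """Draw one trajectory from p(x_1..x_n | y_1..y_n), last position first."""
    n, k = len(filtered), len(filtered[0])
    path = [0] * n
    path[-1] = rng.choices(range(k), weights=filtered[-1])[0]
    for t in range(n - 2, -1, -1):
        nxt = path[t + 1]
        weights = [filtered[t][i] * trans[t][i][nxt] for i in range(k)]
        path[t] = rng.choices(range(k), weights=weights)[0]
    return path


# Hypothetical data: state 0 = normal (mean 0.0), state 1 = gain (mean 0.5).
means, sd = [0.0, 0.5], 0.1
positions = [0.0, 0.5, 3.0, 3.2, 3.3, 9.0]      # megabases, unevenly spaced
obs = [0.02, -0.05, 0.48, 0.55, 0.51, 0.01]     # made-up log ratios
trans = [two_state_transition(0.05, 0.05, b - a)
         for a, b in zip(positions, positions[1:])]

filtered = forward_filter(obs, [0.5, 0.5], trans, means, sd)
rng = random.Random(0)
path = backward_sample(filtered, trans, rng)
```

In a sampler of the kind described above, a trajectory update like `backward_sample` would alternate with an update of the parameters given the sampled trajectory.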
Key words:
Hidden Markov models, DNA copy number, allelic copy number, Markov chain Monte Carlo