Improving the calculation of statistical significance in genome-wide scans
Lars Ängquist and Ola Hössjer
Centre for Mathematical Sciences
Mathematical Statistics
Lund Institute of Technology,
Lund University,
2003
ISSN 1403-9338

Abstract:

This article addresses the calculation of statistical significance in linkage
analysis. Suppose that a maximum NPL score has been found in a (complete or
partial) genome scan. The next step is to calculate the significance
($p$-value) of the result in a way that is both simple and reliable. This
calculation may be performed by simulation or by theoretical approximation,
with or without the assumption of perfect marker information. Here we
concentrate on theoretical approximation under the further assumption of fully
informative data (perfect marker information). Our starting point is the
asymptotic approximation formula presented by Lander and Kruglyak (1995),
which is based on extreme value theory for Gaussian processes (cf. e.g. Lander
and Botstein 1989). The main contribution of this article is to suggest two
distinct improvements to this formula.
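
As a rough illustration of the kind of extreme value approximation referred to
above, the following sketch evaluates a formula of the commonly quoted form
mu(T) = [C + 2 rho G T^2] (1 - Phi(T)), with genome-wide $p$-value
approximately 1 - exp(-mu(T)). The exact form, the function names and the
default values (C = 23 chromosomes, G = 33 Morgans) are assumptions made for
this example and are not taken from the text above.

```python
import math


def normal_tail(t):
    """Upper tail probability 1 - Phi(t) of a standard normal variable."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def lk_genomewide_pvalue(t, rho, n_chromosomes=23, genome_length_morgans=33.0):
    """Approximate genome-wide p-value for an observed maximum score t.

    Assumed form (not quoted from the article):
        mu(t) = [C + 2 * rho * G * t^2] * (1 - Phi(t)),   p = 1 - exp(-mu(t)).
    rho is the crossover rate; C and G default to assumed human-genome values.
    """
    mu = (n_chromosomes + 2.0 * rho * genome_length_morgans * t * t) * normal_tail(t)
    return 1.0 - math.exp(-mu)


if __name__ == "__main__":
    # rho = 2.0 is a hypothetical crossover rate chosen only for illustration.
    for threshold in (3.0, 3.5, 4.0):
        print(threshold, lk_genomewide_pvalue(threshold, rho=2.0))
```

The approximation is intended for high thresholds; for small values of t the
expected number of exceedances is large and the result is close to one.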


Firstly, we present a formula for calculating the crossover rate $\rho$ for
a pedigree of general family structure. The pedigree-specific rates may then
be weighted into an overall crossover rate, which may finally be used in the
significance calculations with the original approximation formula.


Secondly, the existing $p$-value formulas are based on the assumption
of a normally distributed NPL score, and the consequence of this assumption
(conservative or anti-conservative $p$-values) depends on the pedigree
structure. We use the following approach to adjust for non-normality. The
first step is to calculate the marginal distribution of the NPL score under
the null hypothesis of no linkage with an arbitrarily small error. Then the
NPL score is transformed to have a marginal standard normal distribution. The
transformed maximal NPL score may, together with a slightly corrected value of
the overall crossover rate, be inserted into the Lander and Kruglyak formula
when performing $p$-value calculations.
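
The transformation step can be illustrated with a minimal sketch. It assumes
that the null distribution of the NPL score is available as a finite list of
values with probabilities, and it maps an observed score to the standard
normal quantile with the same upper tail probability. This is a plain
probability transform chosen for illustration, not necessarily the procedure
used in the article; the function name and the toy distribution are
hypothetical.

```python
from statistics import NormalDist


def normalize_score(z, values, probs):
    """Map score z to the standard normal quantile with the same upper tail.

    values, probs: atoms and probabilities of the (assumed known) null
    distribution of the score. Uses the tail P(Z >= z), so the transformed
    score z* satisfies 1 - Phi(z*) = P(Z >= z).
    """
    tail = sum(p for v, p in zip(values, probs) if v >= z)
    if not 0.0 < tail < 1.0:
        raise ValueError("tail probability must lie strictly between 0 and 1")
    # By symmetry of the normal distribution, Phi^{-1}(1 - tail) = -Phi^{-1}(tail).
    return -NormalDist().inv_cdf(tail)


if __name__ == "__main__":
    # Hypothetical skewed null distribution, for illustration only.
    toy_values = [-1.0, 0.0, 1.0, 2.0]
    toy_probs = [0.4, 0.3, 0.2, 0.1]
    print(normalize_score(2.0, toy_values, toy_probs))
```

The transformed score would then take the place of the raw maximal score in
the significance formula. For a discrete distribution the transform is a step
function, so any practical version needs a convention for ties and for scores
below the smallest atom.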


We have used pedigrees of seven different structures to compare the performance
of the adjusted approximation formula and the traditional approximation formula
against results found by simulation. We have also performed the same
comparisons on two real data sets, including the BOTNIA study data set
(cf. Parker et al. 2001; Lindgren et al. 2002). The result is that our suggested
improvements, in general, seem to strongly improve the accuracy of the
$p$-value calculations, especially for pedigree sets whose NPL score
distributions are clearly non-normal.



Key words:

nonparametric linkage analysis, genome-wide significance, extreme value formulas,
crossover rate, adjusted approximation formula, deviation from normality,
approximation of distributions, Hermite polynomials
