Improving the calculation of statistical significance in genome-wide scans

Lars Ängquist and Ola Hössjer

Centre for Mathematical Sciences
Mathematical Statistics
Lund Institute of Technology,
Lund University,

ISSN 1403-9338
This article addresses the calculation of statistical significance in linkage analysis. Once a maximum NPL score has been found in a complete or partial genome scan, the next step is to calculate the significance ($p$-value) of the result in a way that is both simple and reliable. This calculation may be performed by simulation or by theoretical approximation, with or without the assumption of perfect marker information. Here we concentrate on theoretical approximation under the further assumption of fully informative data (perfect marker information). Our starting point is the asymptotic approximation formula presented by Lander and Kruglyak (1995), which is based on extreme value theory for Gaussian processes (cf. e.g. Lander and Botstein 1989). The main contribution of this article is to suggest two distinct improvements to this formula.
Firstly, we present a formula for calculating the crossover rate $\rho$ for a pedigree of general structure. The pedigree-specific values may then be weighted into an overall crossover rate, which may finally be used in the significance calculations with the original approximation formula.
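As a rough illustration of how an overall crossover rate enters the significance calculation, the following minimal sketch evaluates a Lander and Kruglyak type approximation, with expected number of exceedances $\mu(T) = [C + 2\rho G T^2]\,(1 - \Phi(T))$. Several details are assumptions made only for this example: the function names, the plain weighted average used to combine pedigree-specific rates, the Poisson-style form $1 - e^{-\mu}$ of the $p$-value, and the default values $C = 23$ chromosomes and $G = 33$ Morgans.

```python
import math


def normal_tail(z):
    """Upper tail probability 1 - Phi(z) of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def overall_crossover_rate(rates, weights):
    """Weighted average of pedigree-specific crossover rates.

    The weighting scheme is a placeholder (assumption); any set of
    non-negative weights with a positive sum can be supplied.
    """
    total = sum(weights)
    return sum(r * w for r, w in zip(rates, weights)) / total


def lk_pvalue(z_max, rho, n_chromosomes=23, genome_length_morgans=33.0):
    """Approximate genome-wide p-value of a maximal score z_max.

    mu is the expected number of regions where the score exceeds z_max;
    the p-value is taken as 1 - exp(-mu) (assumed Poisson-style form).
    """
    mu = (n_chromosomes
          + 2.0 * rho * genome_length_morgans * z_max ** 2) * normal_tail(z_max)
    return 1.0 - math.exp(-mu)


# Illustrative values only: two pedigree types with assumed rates and weights.
rho = overall_crossover_rate(rates=[2.0, 1.5], weights=[3.0, 1.0])
p = lk_pvalue(z_max=4.0, rho=rho)
```

The crossover rate scales the term that grows with $T^2$, so an inaccurate $\rho$ changes the approximate $p$-value almost proportionally for large scores.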
Secondly, the existing $p$-value formulas are based on the assumption of a normally distributed NPL score, and the consequence of this assumption (conservative or anticonservative $p$-values) depends on the pedigree structure. We use the following approach to adjust for non-normality. First, the marginal distribution of the NPL score under the null hypothesis of no linkage is calculated with an arbitrarily small error. The NPL score is then transformed to have a marginal standard normal distribution. Finally, the transformed maximal NPL score, together with a slightly corrected value of the overall crossover rate, may be inserted into the Lander and Kruglyak formula when performing $p$-value calculations.
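The transformation step can be sketched as a normal-score (quantile) transform: an observed score is mapped to the standard normal quantile that has the same upper tail probability under the null marginal distribution. The sketch below is only an illustration under stated assumptions. The function name, the use of a small discrete null distribution given as support points and probabilities, and the choice of the tail $P(Z \ge z)$ are all hypothetical, and the correction of the crossover rate mentioned above is not included.

```python
from statistics import NormalDist


def normal_score_transform(z_obs, support, probs):
    """Map z_obs to a standard normal score with the same upper tail.

    support, probs: a discrete null marginal distribution of the score
    (toy input here; in practice it would be computed for the pedigree set).
    """
    # Upper tail probability P(Z >= z_obs) under the null distribution.
    tail = sum(p for s, p in zip(support, probs) if s >= z_obs)
    if not 0.0 < tail < 1.0:
        raise ValueError("tail probability must lie strictly between 0 and 1")
    # By symmetry of the normal law, the quantile at 1 - tail is -inv_cdf(tail);
    # this form keeps precision when the tail probability is very small.
    return -NormalDist().inv_cdf(tail)


# Toy null distribution (assumption): three equally spaced score values.
support = [-1.0, 0.0, 1.0]
probs = [0.25, 0.5, 0.25]
z_transformed = normal_score_transform(1.0, support, probs)
```

When the null distribution has heavier or lighter tails than the normal, the transformed score differs from the raw score, which is what moves the approximate $p$-value away from the unadjusted one.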
We have used pedigrees of seven different structures to compare the performance of the adjusted approximation formula and the traditional approximation formula against results found by simulation. We have also performed the same comparisons on two real data sets, including the BOTNIA study data set (cf. Parker et al. 2001; Lindgren et al. 2002). Our suggested improvements seem, in general, to strongly improve the accuracy of the $p$-value calculations, especially for pedigree sets whose score distributions are clearly non-normal.
Key words:
nonparametric linkage analysis, genome-wide significance, extreme value formulas, crossover rate, adjusted approximation formula, deviation from normality, approximation of distributions, Hermite polynomials