Lecture 03: Fitting Likelihoods, Linkage, and Association

PUBH 8878, Statistical Genetics

Review: Likelihood Function Basics

  • Likelihood function: \(L(\boldsymbol{\theta} \mid \text{data}) = P(\text{data} \mid \boldsymbol{\theta})\)
  • Maximum Likelihood Estimation (MLE)
  • Simple one-parameter examples

However often direct MLE is not possible because

  • Closed-form solution does not exist
  • Missing or latent data
  • High-dimensional or constrained parameter space leads to non-convex surface with multiple local maxima

Agenda

  • Optimization for MLE: Newton–Raphson, Gradient Descent, SGD
  • EM algorithm: gene frequencies and incomplete data
  • Linkage analysis: two-point, unknown phase EM, multipoint HMM
  • Linkage disequilibrium (LD) and its measures (D, D’, r)
  • Association: single-marker/haplotype tests and basic QC

Newton-Raphson Method for MLE

  • General procedure to find \(x\) such that \(g(x) = 0\)
  • Uses Taylor Expansion, for \(g(x)\) about a point \(x = a\):

\[\begin{align*} g(x) = g(a) &+ \frac{g'(a)(x-a)}{1!} + \frac{g''(a)(x-a)^2}{2!} \\ &+ ... + \frac{g^{(n)}(a)(x-a)^n}{n!} \end{align*}\]

Gradient Descent

  • First-order iterative optimization algorithm
  • Does not rely on matrix inversions \(\longrightarrow\) better for high dimensional problems
  • Visualizer from UCLA

The EM Algorithm

  • Expectation-Maximization Algorithm
  • Key idea: transform an incomplete data problem into a complete data problem

\[ p\left(y \mid \theta \right) = \frac{p\left(y, z \mid \theta \right)}{p\left(z \mid y, \theta\right)} \]

EM for Gene Frequencies

Genotype Phenotype Observed Counts Genotype Frequency
\(AA\) \(A\) \(n_A\) \(p_A^2\)
\(AO\) \(A\) \(2 p_A p_O\)
\(AB\) \(AB\) \(n_{AB}\) \(2 p_A p_B\)
\(BB\) \(B\) \(n_B\) \(p_B^2\)
\(BO\) \(B\) \(2 p_B p_O\)
\(OO\) \(O\) \(n_O\) \(p_O^2\)

EM for Gene Frequencies

  • We don’t know if \(n_A\) is from \(AA\) or \(AO\)! (And similarly for \(n_B\))
  • However, given our data model, we can find values of \((\hat{p}_A, \hat{p}_B)\) via EM

EM for Gene Frequencies

  • Start with initial \((p_A^{(0)}, p_B^{(0)})\) and set \(p_O^{(0)} = 1 - p_A^{(0)} - p_B^{(0)}\).

  • E-step: allocate ambiguous phenotype counts:

    • \(A\) phenotype: \(\tilde n_{AA} = n_A \frac{(p_A^{(t)})^2}{(p_A^{(t)})^2 + 2 p_A^{(t)} p_O^{(t)}}\), \(\tilde n_{AO} = n_A \frac{2 p_A^{(t)} p_O^{(t)}}{(p_A^{(t)})^2 + 2 p_A^{(t)} p_O^{(t)}}\).
    • \(B\) phenotype: \(\tilde n_{BB} = n_B \frac{(p_B^{(t)})^2}{(p_B^{(t)})^2 + 2 p_B^{(t)} p_O^{(t)}}\), \(\tilde n_{BO} = n_B \frac{2 p_B^{(t)} p_O^{(t)}}{(p_B^{(t)})^2 + 2 p_B^{(t)} p_O^{(t)}}\).

EM for Gene Frequencies

  • M-step (with \(N\) total individuals):
\[\begin{aligned} p_A^{(t+1)} &= \frac{2\tilde n_{AA} + \tilde n_{AO} + n_{AB}}{2N},\\ p_B^{(t+1)} &= \frac{2\tilde n_{BB} + \tilde n_{BO} + n_{AB}}{2N},\\ p_O^{(t+1)} &= 1 - p_A^{(t+1)} - p_B^{(t+1)} \end{aligned}\]
  • Iterate until parameter changes are below a tolerance.

EM for Gene Frequencies

Genotype Phenotype Observed Counts Genotype Frequency
\(AA\) \(A\) \(725\) \(p_A^2\)
\(AO\) \(A\) \(2 p_A p_O\)
\(AB\) \(AB\) \(72\) \(2 p_A p_B\)
\(BB\) \(B\) \(258\) \(p_B^2\)
\(BO\) \(B\) \(2 p_B p_O\)
\(OO\) \(O\) \(1073\) \(p_O^2\)

EM: Starting Values Matter

  • Different initial \((p_A, p_B, p_O)\) can change speed (not the final MLE)
  • Strategies:
    • Equal: \((1/3, 1/3, 1/3)\)
    • Unbalanced guess: \((0.01, 0.98, 0.01)\)
    • Smart (use \(p_O^{(0)} = \sqrt{n_O/N}\), solve for \((p_A^{(0)}, p_B^{(0)})\) from \(2p_A p_B\))
  • Tolerance of \(1 \times 10^{-6}\)

EM: Starting Values Matter

Strategy Iterations \(\hat{p}_A\) \(\hat{p}_B\) \(\hat{p}_O\)
Equal 7 0.209131 0.080801 0.710068
Smart 4 0.209131 0.080801 0.710068
Unbalanced 8 0.209131 0.080801 0.710068

EM Parameter Trajectories

When does EM fail?

  • Non-identifiability / flat likelihood: Ridges in segregation or linkage models; EM wanders or stalls.
  • Local maxima & starts: Multiple modes (e.g., mixture of penetrance classes); poor initialization traps EM.
  • Boundary degeneracy: Rare allele/component weight driven to 0; variance or frequency estimates collapse.
  • Model misspecification: Violated assumptions (e.g., Hardy-Weinberg, stratification) give misleading “convergence.”

Linkage vs Association Testing

  • Linkage: Tracks co-segregation of markers and traits within families to map disease genes.
    • Null hypothesis: no linkage (independent assortment), \(\theta = 0.5\).
  • Association: Tests for correlation between variants and traits in populations to pinpoint causal loci.
    • Null hypothesis: no association (e.g., \(\beta=0\) or OR=1).

Two-Point Linkage: Terms

  • Independent assortment (recall): when loci are unlinked, transmissions are independent and the chance a crossover separates them is \(\theta=0.5\) (Mendel’s Second Law).
  • Informative transmission/meiosis: a parent \(\to\) child transmission where the transmitting parent is heterozygous at the loci of interest and the genotypes allow us to tell whether a crossover occurred.
    • Example: parent \(Aa/Bb\) transmits \(AB\) or \(ab\) (nonrecombinant) vs \(Ab\) or \(aB\) (recombinant).

Two-Point Linkage: Terms

  • Recombinant fraction \(\theta\): the probability that a crossover occurs between two loci in a single meiosis; \(0 \le \theta \le 0.5\).
  • Nonrecombinant transmission: the child receives an allele combination that matches one of the transmitting parent’s original allele pairs (no crossover between the loci).
  • Recombinant transmission: the child receives a new combination relative to the transmitting parent’s original pair (a crossover occurred between the loci).

Two-Point Linkage: Repulsion vs Coupling

Two-Point Linkage: Direct Counting

  • Count \(R\) recombinants and \(NR\) nonrecombinants; total informative \(I=R+NR\).

  • Point estimate: \(\hat{\theta}=R/I\).

  • LOD from counts (vs. independence at \(\theta=0.5\)):

\[ \mathrm{LOD}(\theta) = \log_{10}\!\left\{ \frac{\theta^{R} (1-\theta)^{NR}}{0.5^{I}} \right\} \]

  • This is a likelihood ratio test!

Worked Example: Direct Counting LOD

  • Suppose \(I=40\) informative meioses with \(R=12\) recombinants.
  • MLE \(\hat{\theta}=R/I=12/40=0.3\).
  • For \(\theta=0.3\)

\[ \mathrm{LOD}(0.3) = \log_{10}\!\left\{ \frac{0.3^{12} \cdot 0.7^{28}}{0.5^{40}} \right\}. \]

Worked Example: Direct Counting LOD

R <- 12
N <- 40
thetahat <- R / N

lod <- log10((thetahat^R * (1 - thetahat)^(N - R)) / (0.5^N))
lod
[1] 1.4294
10^lod
[1] 26.87819
  • Maximize \(\mathrm{LOD}(\theta)\) over \(\theta \in [0,0.5]\) to estimate the recombination fraction; \(LOD \ge 3\) and \(\leq -2\) are classic heuristics (context-dependent).

Worked Example: Direct Counting LOD

  • Under \(H_0:\theta=0.5\) (a boundary point), \(D = 2\ln(10)\cdot \mathrm{LOD}(\hat\theta)\) has the mixture limit \(\tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1\); thus \(p = \tfrac{1}{2}\, \Pr(\chi^2_1 \ge D)\).
D <- 2 * log(10) * lod
pval <- 0.5 * pchisq(D, df = 1, lower.tail = FALSE)
pval
[1] 0.005148931

Two-Point Linkage: EM with Unknown Phase

  • Transmitting parent: double het \(Aa/Bb\) with unknown phase.
  • Mate: Aa/bb (heterozygous at \(A\), homozygous at \(B\)).
  • Observed per child: two-locus genotypes \((A\text{-genotype}, B\text{-genotype})\).

Two-Point Linkage: EM with Unknown Phase

  • Missing-data view: the parental phase (coupling vs repulsion) is not observed. We make it a complete-data problem by introducing a weight \(w\in[0,1]\) that blends the two phase-specific models.
  • Model (given recombination fraction \(\theta\)):
    • Coupling: \(P(\text{NR})=1-\theta\), \(P(\text{R})=\theta\) (split equally across the two categories).
    • Repulsion: \(P(\text{NR})=\theta\), \(P(\text{R})=1-\theta\).

Two-Point Linkage: EM with Unknown Phase

E-step (update the phase weight)

\[ w = \frac{(1-\theta^{(t)})^{\,n_{\text{NR}}}\,(\theta^{(t)})^{\,n_{\text{R}}}} {(1-\theta^{(t)})^{\,n_{\text{NR}}}\,(\theta^{(t)})^{\,n_{\text{R}}} + (\theta^{(t)})^{\,n_{\text{NR}}}\,(1-\theta^{(t)})^{\,n_{\text{R}}}} \]

and use \(1-w\) as the weight on the repulsion configuration.

M-step (update \(\theta\))

\[\begin{align*} \theta^{(t+1)} &= \frac{\text{(weighted recombinants)}}{N} = \frac{w\,n_{\text{R}} + (1-w)\,n_{\text{NR}}}{N}. \end{align*}\]

From Two-Point EM to Multipoint

  • Same objective: estimate recombination while marginalizing over latent transmissions/phase under current \(\theta\).
  • E-step engine: compute required state weights via Elston–Stewart peeling (pedigrees) or Lander–Green forward–backward (marker HMM) when you have multiple markers and missing genotypes.
  • Quantities needed: \(\mathbb E[\#\,\text{recombinants between adjacent markers}]\) and \(\mathbb E[\#\,\text{informative transmissions}]\) under current map.
  • Use in practice: either (a) plug these expectations into an EM-style update, or (b) more commonly, scan positions to build a LOD curve and report peak and 1-LOD interval.

Multipoint Linkage (HMM)

  • Markers along a chromosome define an HMM over inheritance states; adjacent recombination rates drive transitions.
  • Combine penetrance with the marker HMM to compute \(L(\theta)\) efficiently as you slide along the map (forward–backward yields the needed state probabilities even with missing phase/genotypes).
  • Pros: more information than two-point; better localization. Cons: requires a genetic map and error modeling.

Missing-Data EM in Linkage

  • Complete data: informative meioses labeled as recombinant/nonrecombinant; incomplete data: untyped genotypes/phase.
  • E-step: compute \(E[\text{\# recombinants}]\) and \(\mathrm E[\text{\# informative meioses}]\) given current \(\theta^{(t)}\) via peeling/HMM.
  • M-step: \(\displaystyle \theta^{(t+1)} = \frac{\mathrm E[R\mid\text{data},\theta^{(t)}]}{\mathrm E[I\mid\text{data},\theta^{(t)}]}\).
  • Iterate across \(\theta\) grid to produce a LOD curve; maximize or report support interval.

Practical Issues in Linkage

  • Model specification: penetrance, phenocopy, allele frequencies; misspecification can distort LOD.
  • Marker selection: highly polymorphic markers increase information; account for sex-specific recombination if needed.
  • Computational trade-offs: many markers and large pedigrees require approximations or pruning; consider parametric vs nonparametric linkage.

Linkage Disequilibrium

  • Let alleles at two markers be denoted \(A, a\) and \(B, b\)
  • Define allele frequencies by \(P_A, P_a, P_B, P_b\)
  • Define frequencies of haplotypes as \(P_{AB}, P_{Ab}, P_{aB}, P_{ab}\)
  • Define linkage equilibrium as independence in the \(2 \times 2\) table
    • For example, \(P_{AB} = P_A \times P_B\)

Linkage Disequilibrium

  • Define the LD coefficient as \(D\)

\[D = p_{AB} - p_A p_B\]

  • What are we looking at here?

Linkage Disequilibrium

  • Define the LD coefficient as \(D\)

\[D = p_{AB} - p_A p_B\]

  • Note that
    • \(\operatorname{E}[A] = p_A, \operatorname{E}[B] = p_B\)
    • \(\operatorname{E}[AB] = p_{AB}\)
    • \(\operatorname{Cov}(A, B) = \operatorname{E}[AB] - \operatorname{E}[A]\operatorname{E}[B]\)

Linkage Disequilibrium

  • \(D\), or \(\operatorname{Cov}(A, B)\), is scale-dependent
  • Normalized measures:
    • \(D' = D / D_{\max}\), where \(D_{\max} = \min(p_A p_b, p_a p_B)\) if \(D > 0\)
    • Pearson correlation:

\[\begin{align} r &= \frac{\operatorname{Cov}(A, B)}{\sqrt{\operatorname{Var}(A) \operatorname{Var}(B)}} \\ &= \frac{D}{\sqrt{p_A(1 - p_A) p_B(1 - p_B)}} \end{align}\]

Single-Marker Association: Core Test (Linear)

  • Linear trait: \(Y=\alpha+\beta G+\gamma^T C+\varepsilon\); test \(H_0{:}\,\beta=0\).
  • Genotype encodings: additive (0/1/2) or genotypic (AA/Aa/aa); include dominance to test non-additivity.
  • Binary outcomes (case–control) and logistic models will be covered later.

Basic QC Pipeline (Variant-level)

  • Call rate/missingness (e.g., variant missingness < 2%).
  • Hardy–Weinberg equilibrium in controls (e.g., P > 1e−6), mindful of true deviations.
  • Minor allele frequency threshold (e.g., MAF ≥ 0.01 unless rare-variant methods used).
  • Differential missingness across case/control; strand/allele checks.

Basic QC Pipeline (Sample-level)

  • Sample call rate; sex checks from X chr; heterozygosity outliers.
  • Relatedness/duplicates via IBD; ancestry via PCA; remove outliers or adjust with PCs.
  • Duplicates/cryptic relatedness: retain one per pair or use mixed models.

Summary & Key Takeaways

  • Likelihood optimization: Newton–Raphson (score/Hessian), Gradient Descent/SGD (scalable), and EM (latent-data) are core tools.
  • EM for ABO gene frequencies: initialization affects speed; log-likelihood increases monotonically until convergence.
  • Linkage: two-point LOD and unknown phase via EM; multipoint mapping framed as an HMM for efficient inference with missing data.
  • LD basics: r and r^2 quantify correlation between markers and explain why nearby markers can show similar association signals when a causal variant is unobserved

Hands-On: Lab 03 (Applied)

  • Implement ABO EM and compare starts; plot log-likelihood gap.
  • Two-point linkage: LOD grid search + EM with unknown phase.
  • Two-SNP haplotype EM; compute LD (D, r^2).
  • Mini association demo: linear regression + HWE QC in controls.

See labs/unsolved/lab-03.R for the scaffold (with TODOs) and labs/solved/lab-03.R for a worked solution.