PUBH 8878, Statistical Genetics
Probability vs Likelihood:
L(\theta\mid y)\propto p(y\mid\theta) (fix y, vary \theta).
Log-likelihood / score / information:
\ell(\theta)=\sum_i \log p(y_i\mid\theta),
S(\theta)=\partial\ell/\partial\theta,
I(\theta)=-\partial^2\ell/\partial\theta\partial\theta^\top.
MLE: \hat\theta=\arg\max_\theta \ell(\theta).
Large-sample:
\hat\theta \approx \mathcal N\!\big(\theta_0,\; i(\theta_0)^{-1}\big),
i(\theta)=\mathbb E[I(\theta\mid Y)].
Invariance: MLE of g(\theta) is g(\hat\theta).
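As a minimal sketch of the quantities above, consider a binomial model for allele counts. The data (30 copies of an allele among 200 chromosomes) and the function names are hypothetical and chosen only for illustration. The example evaluates the log-likelihood, score, and observed information, and applies invariance to obtain the MLE of the odds.

```python
import math

def loglik(p, k, n):
    """Binomial log-likelihood (up to a constant) for k successes in n trials."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def score(p, k, n):
    """First derivative of the log-likelihood with respect to p."""
    return k / p - (n - k) / (1 - p)

def observed_information(p, k, n):
    """Negative second derivative of the log-likelihood with respect to p."""
    return k / p**2 + (n - k) / (1 - p) ** 2

# Hypothetical data: 30 copies of the allele among 200 sampled chromosomes.
k, n = 30, 200

p_hat = k / n                                   # closed-form MLE: score(p_hat) = 0
se_hat = math.sqrt(1.0 / observed_information(p_hat, k, n))  # large-sample SE
odds_hat = p_hat / (1 - p_hat)                  # invariance: MLE of g(p) = p/(1-p)

print(p_hat, se_hat, odds_hat)
```

For this model the inverse observed information at the MLE equals \hat p(1-\hat p)/n, the familiar binomial variance estimate.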
Tip
Regression as a likelihood
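If e\sim\mathcal N(0,\sigma^2 I), then OLS = MLE for \beta;
the MLE of the error variance is \hat\sigma^2=\tfrac{1}{n}\sum \hat e_i^2, which divides by n rather than n-p (p = number of regression coefficients) and is therefore biased downward in small samples.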
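A small sketch of this equivalence for simple linear regression follows. The dosage and trait values are made up for illustration, and the helper names are hypothetical. It computes the closed-form OLS fit, confirms that the Gaussian log-likelihood is not improved by perturbing the coefficients, and contrasts the MLE of \sigma^2 (divide by n) with the unbiased estimator (divide by n-2 here).

```python
import math

def ols_simple(x, y):
    """Closed-form OLS for y = b0 + b1*x; returns (b0, b1, residual sum of squares)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss

def gaussian_loglik(b0, b1, sigma2, x, y):
    """Log-likelihood of the normal linear model with i.i.d. errors."""
    n = len(x)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

# Hypothetical data: genotype dosages and a quantitative trait.
x = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
y = [1.1, 1.9, 3.2, 0.8, 2.1, 2.9]

b0, b1, rss = ols_simple(x, y)
n = len(x)
sigma2_mle = rss / n             # maximum-likelihood estimate
sigma2_unbiased = rss / (n - 2)  # usual unbiased estimate (2 coefficients)

print(b0, b1, sigma2_mle, sigma2_unbiased)
```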
Warning
Pitfalls
Non-unique maxima, flat ridges, boundary solutions, small-sample failures of asymptotics.
Warning
Variance-component caveat
When testing a variance component \sigma^2=0 (boundary), the LRT null is a mixture (e.g., \tfrac{1}{2}\chi^2_0 + \tfrac{1}{2}\chi^2_1); standard \chi^2 reference is invalid. Restricted LRTs or score tests are common alternatives.
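The mixture reference distribution is easy to apply in practice. The sketch below assumes a single variance component tested on the boundary, so the null is the 50:50 mixture of \chi^2_0 and \chi^2_1. The function names are hypothetical. It uses only the standard library, relying on the identity \Pr(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2}).

```python
import math

def chi2_1_sf(x):
    """Upper tail P(chi2_1 > x) for x >= 0, using chi2_1 = Z^2 with Z standard normal."""
    return math.erfc(math.sqrt(x / 2.0))

def boundary_lrt_pvalue(lrt_stat):
    """p-value under the 0.5*chi2_0 + 0.5*chi2_1 mixture null.

    chi2_0 is a point mass at zero, so it contributes nothing to the tail
    probability for any positive statistic.
    """
    if lrt_stat <= 0.0:
        return 1.0
    return 0.5 * chi2_1_sf(lrt_stat)

# Hypothetical LRT statistic for H0: sigma^2 = 0.
stat = 2.9
print("naive chi2_1 p-value:", chi2_1_sf(stat))
print("mixture p-value:     ", boundary_lrt_pvalue(stat))
```

For any positive statistic the mixture p-value is exactly half the naive \chi^2_1 p-value, so the naive reference is conservative.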
Note
These analyses do not always use molecular genetic data… so why should we care?

The author frames the “missing heritability” problem as the discrepancy between estimates from twin studies and those from early GWAS. Based on the text, what is the core assumption of the classical twin study design for estimating heritability? How might a violation of this assumption lead to an overestimation of heritability for certain traits?
The author argues that methods like RDR and Sib-Regression are more robust against certain biases. What specific confounding factor, prevalent in population-based genomic studies, are these family-based designs better at controlling for?
If, as the author suggests, the true narrow-sense heritability of a trait like height is closer to the 60-70% estimated by RDR and Sib-Regression than the 80% from twin studies, what are the primary sources of the remaining “gap”? Does this completely solve the missing heritability problem, or does it redefine it?
Let Y = G + E
If we assume independence of G and E, then \operatorname{Var}(Y) = \operatorname{Var}(G) + \operatorname{Var}(E)
We can decompose \operatorname{Var}(G) into contributions from different genetic effects:
- Additive effects: A
- Dominance effects: D
- Epistatic effects: I
So we can write:
\operatorname{Var}(Y) = \operatorname{Var}(A) + \operatorname{Var}(D) + \operatorname{Var}(I) + \operatorname{Var}(E)
Warning
Familial aggregation \neq heritability. Shared environment and assortment can produce aggregation without genetic causation.
Y_{\text{offspring}} = \alpha + \beta \cdot Y_{\text{mid-parent}} + \varepsilon
By definition, \beta = \frac{\operatorname{Cov}(Y_{\text{offspring}}, Y_{\text{mid-parent}})}{\operatorname{Var}(Y_{\text{mid-parent}})}.
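Given that an offspring inherits half its genes from each parent, the covariance with each single parent is \frac{1}{2}\operatorname{Var}(A) (assuming purely additive transmission and no shared environment), so \operatorname{Cov}(Y_{\text{offspring}}, Y_{\text{mid-parent}}) = \frac{1}{2}\left(\frac{1}{2}\operatorname{Var}(A) + \frac{1}{2}\operatorname{Var}(A)\right) = \frac{1}{2}\operatorname{Var}(A)
Assuming random mating, so that the two parental phenotypes are uncorrelated with common variance \operatorname{Var}(Y):
\begin{align*} \operatorname{Var}(Y_{\text{mid-parent}}) &= \operatorname{Var}\left(\frac{Y_{\text{parent1}} + Y_{\text{parent2}}}{2}\right) \\ &= \frac{1}{4}\left(\operatorname{Var}(Y) + \operatorname{Var}(Y)\right) \\ &= \frac{1}{2}\operatorname{Var}(Y) \end{align*}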
Plugging this back into the expression for \beta:
\beta = \frac{\frac{1}{2}\operatorname{Var}(A)}{\frac{1}{2}\operatorname{Var}(Y)} = \frac{\operatorname{Var}(A)}{\operatorname{Var}(Y)} = h^2
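The result \beta = h^2 can be checked by simulation. The following is a rough sketch under strong assumptions that are mine rather than the source's: a purely additive trait with unit phenotypic variance, random mating, no shared environment, and a Mendelian sampling term with variance \operatorname{Var}(A)/2. The function name and parameter values are hypothetical.

```python
import random

def simulate_midparent_slope(h2, n_families, seed=0):
    """Simulate trios and return the slope of offspring phenotype on mid-parent phenotype."""
    rng = random.Random(seed)
    sd_a = h2 ** 0.5            # SD of additive genetic values, Var(A) = h2
    sd_e = (1.0 - h2) ** 0.5    # SD of environmental deviations, Var(E) = 1 - h2
    sd_mend = (h2 / 2.0) ** 0.5  # SD of Mendelian sampling, variance Var(A)/2

    mid, off = [], []
    for _ in range(n_families):
        a1, a2 = rng.gauss(0.0, sd_a), rng.gauss(0.0, sd_a)   # unrelated parents
        y1 = a1 + rng.gauss(0.0, sd_e)
        y2 = a2 + rng.gauss(0.0, sd_e)
        a_child = 0.5 * (a1 + a2) + rng.gauss(0.0, sd_mend)
        y_child = a_child + rng.gauss(0.0, sd_e)
        mid.append(0.5 * (y1 + y2))
        off.append(y_child)

    n = n_families
    mbar, obar = sum(mid) / n, sum(off) / n
    cov = sum((m - mbar) * (o - obar) for m, o in zip(mid, off)) / n
    var = sum((m - mbar) ** 2 for m in mid) / n
    return cov / var

slope = simulate_midparent_slope(h2=0.6, n_families=20000, seed=1)
print("estimated slope:", slope)
```

With 20,000 simulated families the estimated slope should fall close to the simulated h^2 of 0.6, up to sampling noise.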
Falconer's formula for twin data: h^2 = 2\,(r_{\text{MZ}} - r_{\text{DZ}}), where r_{\text{MZ}} and r_{\text{DZ}} are the correlations between monozygotic and dizygotic twins, respectively.
Model: \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon},\quad \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{A}\sigma_A^2),\quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}\sigma_E^2),
where \mathbf{A} is the pedigree additive relationship matrix with A_{ij}=2\phi_{ij}.
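To make the relationship matrix concrete, the sketch below builds \mathbf{A} = 2\boldsymbol{\phi} for a hypothetical nuclear family (two unrelated, non-inbred parents and two full siblings) and forms the implied phenotypic covariance \mathbf{A}\sigma_A^2 + \mathbf{I}\sigma_E^2. It assumes \mathbf{Z} = \mathbf{I} (one record per individual). The variance values and function names are illustrative only.

```python
# Kinship coefficients phi_ij for a hypothetical nuclear family.
# Index 0 = father, 1 = mother, 2 and 3 = full siblings.
phi = [
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
]

def additive_relationship(kinship):
    """A_ij = 2 * phi_ij."""
    return [[2.0 * v for v in row] for row in kinship]

def phenotypic_covariance(A, sigma2_a, sigma2_e):
    """V = A * sigma2_a + I * sigma2_e, assuming Z = I."""
    n = len(A)
    return [
        [A[i][j] * sigma2_a + (sigma2_e if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]

A = additive_relationship(phi)
V = phenotypic_covariance(A, sigma2_a=0.6, sigma2_e=0.4)

for row in V:
    print(row)
```

Under these assumptions the sibling covariance V[2][3] equals \tfrac{1}{2}\sigma_A^2, and the two parents have zero covariance.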
Warning
Boundary testing: testing \sigma_A^2=0 is on the boundary. The LRT null is the mixture \tfrac{1}{2}\chi_0^2 + \tfrac{1}{2}\chi_1^2 (or use an RLRT).
Idea: a binary phenotype arises when a continuous liability \ell crosses a threshold T.
Y = \begin{cases} 1 & \text{if } \ell > T, \\ 0 & \text{if } \ell \le T, \end{cases}
\ell = G + E,\quad G \sim \mathcal{N}(0,\sigma_G^2),\ E \sim \mathcal{N}(0,\sigma_E^2)
Population prevalence (with liability standardized so that \sigma_G^2 + \sigma_E^2 = 1): K = \Pr(\ell > T) = 1 - \Phi(T) with T = \Phi^{-1}(1-K).
Liability-scale heritability: h_\ell^2 = \sigma_G^2 / (\sigma_G^2 + \sigma_E^2).
Warning
Assumptions: normal liability, single threshold, no G\times E on the liability scale, correct K.
For an unascertained sample with sample prevalence P=K, h_{\text{obs}}^2 \;\approx\; h_\ell^2 \cdot \frac{\phi(T)^2}{K(1-K)},\qquad T=\Phi^{-1}(1-K).
For case–control sampling with sample prevalence P \ne K, h_\ell^2 \;\approx\; h_{\text{obs}}^2 \cdot \frac{K^2(1-K)^2}{\phi(T)^2\, P(1-P)}.
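The two scale conversions can be sketched with the standard library's `statistics.NormalDist`. The function names and the example values of K, P, and h^2 are hypothetical. Liability is assumed to be standard normal, as in the threshold model above.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1

def liability_threshold(K):
    """T = Phi^{-1}(1 - K) for population prevalence K."""
    return _STD_NORMAL.inv_cdf(1.0 - K)

def liability_to_observed(h2_liab, K):
    """Observed-scale (0/1) heritability in an unascertained sample (P = K)."""
    z = _STD_NORMAL.pdf(liability_threshold(K))
    return h2_liab * z**2 / (K * (1.0 - K))

def observed_to_liability(h2_obs, K, P=None):
    """Liability-scale heritability from observed-scale heritability.

    P is the sample prevalence; P=None means no ascertainment (P = K).
    """
    if P is None:
        P = K
    z = _STD_NORMAL.pdf(liability_threshold(K))
    return h2_obs * (K * (1.0 - K)) ** 2 / (z**2 * P * (1.0 - P))

# Hypothetical example: 1% population prevalence, 50% cases in the sample.
K, P = 0.01, 0.5
print("T =", liability_threshold(K))
print("h2_obs for h2_liab = 0.5, unascertained:", liability_to_observed(0.5, K))
print("h2_liab for h2_obs = 0.3, case-control: ", observed_to_liability(0.3, K, P))
```

Setting P = K in the case–control formula recovers the inverse of the unascertained conversion, which is a useful sanity check on any implementation.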
Tip
Low-prevalence traits (K \ll 0.5) often have h_{\text{obs}}^2 \ll h_\ell^2. Always report the scale, K, and (if applicable) P.
If we only observe Y under an event A (e.g., Y recorded only when Y>T or when a proband is affected), the correct likelihood uses the conditional density p(y \mid A, \theta) \;=\; \frac{p(y \mid \theta)\,\mathbf{1}\{y \in A\}}{\Pr(A \mid \theta)}.
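As one concrete instance, suppose Y is normal and recorded only when Y > T. The sketch below writes the conditional log-density for that truncation event. The normal model, the function names, and the data values are assumptions made for illustration, not part of the source.

```python
import math
from statistics import NormalDist

def truncated_normal_logpdf(y, mu, sigma, T):
    """log p(y | Y > T, mu, sigma) for Y ~ N(mu, sigma^2)."""
    if y <= T:
        return -math.inf  # outside the ascertainment event: density is zero
    dist = NormalDist(mu, sigma)
    # log p(y | theta) minus log Pr(Y > T | theta)
    return math.log(dist.pdf(y)) - math.log(1.0 - dist.cdf(T))

def truncated_loglik(ys, mu, sigma, T):
    """Ascertainment-corrected log-likelihood for independent observations."""
    return sum(truncated_normal_logpdf(y, mu, sigma, T) for y in ys)

# Hypothetical data, all recorded because they exceeded T = 1.0.
ys = [1.2, 1.5, 2.1, 1.1, 1.8]
T = 1.0

naive = sum(math.log(NormalDist(0.0, 1.0).pdf(y)) for y in ys)
corrected = truncated_loglik(ys, 0.0, 1.0, T)
print("naive log-likelihood:    ", naive)
print("corrected log-likelihood:", corrected)
```

Dividing by \Pr(A \mid \theta) matters because that probability itself depends on \theta. Maximizing the naive likelihood would, in this setup, pull the estimated mean toward the truncated sample mean.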
Goal: compare inheritance models (major gene vs polygenic vs mixed) using family phenotypes.