
Incorrect SE deflation for sparse variants in subset analyses (Step 2 approximation error) #678

@gushamilton

When running Step 2 on a subset of samples (e.g., males only, selected via --phenoFile), we observe a bimodal distribution of standard errors (SEs) when compared against the full-sample run:

  • Dense variants (cluster 1) show correct SE scaling (subset-to-full-sample SE ratio ~1.4).
  • Sparse variants (cluster 2) show deflated SEs (ratio ~1.0, i.e. matching the full-sample OLS SEs).
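
As a rough sanity check on the two ratios above: if the residual variance and allele frequency are similar in the subset and the full cohort, the SE should scale roughly as 1/sqrt(N). The sketch below is only an illustration under that assumption; the function name and sample sizes are hypothetical and are not taken from regenie.

```python
import math

def expected_se_ratio(n_total, n_subset):
    """Approximate SE(subset) / SE(full), assuming SE scales as 1/sqrt(N)."""
    return math.sqrt(n_total / n_subset)

# Hypothetical cohort sizes: a subset holding half of the samples.
print(round(expected_se_ratio(100_000, 50_000), 2))  # 1.41
```

A ratio near 1.4 is therefore what a roughly 50% subset should give, while a ratio near 1.0 means the SE behaves as if no samples had been dropped.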

The issue appears to be in compute_score_qt in src/Step2_Models.cpp.
When dt_thr->is_sparse is true, the code uses a fast expansion for the variance term:
denum_arr(ph) = Gm.squaredNorm() - 2 * XtGm.dot(XtG) + XtG_ss;

The term XtG_ss represents the variance explained by covariates on the full sample, not the subset. For subset analyses (where N_subset < N_total), this overestimates the residual variance term, artificially inflating the denominator and deflating the SE.

The correct calculation (projecting covariates onto the mask) is currently commented out in the source code immediately below the approximation. Enabling this path fixes the bimodality in subset analyses.
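
The effect described above can be sketched numerically. The toy example below is not regenie code: the data, the helper names, and the choice of projection coefficients are all assumptions made for illustration. It expands a masked residual sum of squares into three terms and compares the third term evaluated on the subset with the same term evaluated on the full sample.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(col):
    """Scale a column to unit Euclidean norm on the full sample."""
    norm = math.sqrt(dot(col, col))
    return [x / norm for x in col]

def denominators(G, X_cols, mask):
    """Return (exact, subset_form, full_form) for a 0/1 sample mask."""
    n = len(G)
    Gm = [g * m for g, m in zip(G, mask)]        # masked genotype
    XtGm = [dot(col, Gm) for col in X_cols]      # covariates' cross-products with Gm
    b = XtGm                                     # coefficients used for the fitted part (toy choice)
    fitted = [sum(bj * col[i] for bj, col in zip(b, X_cols)) for i in range(n)]
    # Direct computation: squared residuals summed over subset samples only.
    exact = sum(m * (g - f) ** 2 for g, f, m in zip(G, fitted, mask))
    # First two terms of the expansion are shared by both variants.
    head = dot(Gm, Gm) - 2 * dot(XtGm, b)
    ss_subset = sum(m * f * f for f, m in zip(fitted, mask))  # third term on the subset
    ss_full = sum(f * f for f in fitted)                      # third term on all samples
    return exact, head + ss_subset, head + ss_full

random.seed(0)
n = 200
X_cols = [unit([1.0] * n), unit([random.gauss(0, 1) for _ in range(n)])]
mask = [1 if i % 2 == 0 else 0 for i in range(n)]   # keep half of the samples
G = [0.0] * n
for i in (3, 10, 42, 77, 150):                      # a handful of carriers (sparse variant)
    G[i] = 1.0

exact, subset_form, full_form = denominators(G, X_cols, mask)
print(exact, subset_form, full_form)
```

In this sketch the subset form reproduces the direct computation up to floating-point error, while the full-sample form is strictly larger, because it also sums squared fitted values over samples outside the mask. A larger denominator translates into a smaller SE, which is the direction of the deflation reported here.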

Run Step 2 on a sparse variant (MAC < 100) using a subset of samples (e.g. 50% of the cohort) and compare the SE to the OLS SE.

  • Regenie Version: v4.1
