Non-local prior based single-step variable selection for high-dimensional data

nlpsLM, nlpsGLM, nlpsAFTM perform variable selection in a single iteration respectively for continuous, binary and survival outcomes, combining the computational efficiency of the 'structured screen-and-select' variable selection strategy based on some association learning and the parsimonious uncertainty quantification provided by the use of non-local priors (see the References).

nlpsLM(y, x, cor_xy, prior = c("mom", "imom", "emom", "zellner", 
  "horseshoe"), tau, priorDelta = modelbbprior(1,1), 
  k0, rxx, niter = 2000, verbose = F, 
  tau.hs.method = "halfCauchy", sigma.hs.method = "Jeffreys" )

nlpsGLM(y, x, mmle_xy, prior = c("mom", "imom", "zellner"), 
  tau, priorDelta = modelbbprior(1,1), 
  k0, rxx, niter = 2000, verbose = F )

nlpsAFTM(y, event, x, mu_xy, prior = c("mom", "imom", "emom", 
  "zellner"), tau, priorDelta = modelbbprior(1,1), 
  k0, rxx, niter = 2000, verbose = F )

Arguments

y: The vector of continuous response (phenotype) for linear models (LM), or binary response (phenotype) for generalized linear models (GLM), or survival times for accelerated failure time models (AFTM). Binary response values must be 0 or 1.
event: Only for AFTM. The vector of status indicator for the survival data.
x: The design matrix with subjects in rows and independent variables (SNPs) in columns. Missing values are not accepted currently.
cor_xy: Only for LM. Vector of (Pearson) correlations of y with the columns of x.
mmle_xy: Only for GLM. Vector of maximum marginal likelihood estimates of the regression parameters for the variables (SNPs) in x. These may be obtained from individual GLM fits of y with the columns of x.
mu_xy: Only for AFTM. Vector of marginal utility estimates of the variables (SNPs) in x. These may be obtained by fitting AFT model to y with individual columns of x using the survreg function of the package survival.
prior: "mom" for pMOM prior, "imom" for piMOM prior, "emom" for peMOM prior, "zellner" for Zellner's g-prior, "horseshoe" for horseshoe prior. For GLM, "zellner" considers group Zellner prior and "emom" and "horseshoe" are not available. For AFTM, "horseshoe" is not available.
tau: the value of the scale parameter tau of the non-local prior.
priorDelta: Prior for model space. Defaults to modelbbprior(1,1), which is beta-binomial(1,1) prior.
k0: GWASinlps tuning parameter denoting the number of leading SNPs (see Details).
rxx: GWASinlps tuning parameter denoting the correlation threshold to determine leading sets (see References).
niter: Number of MCMC iterations for non-local prior based Bayesian variable selection. Defaults to 2000.
verbose: If TRUE, prints result from the iterations progressively. FALSE by default.
tau.hs.method: Necessary only when prior="horseshoe". See horseshoe function reference.
sigma.hs.method: Necessary only when prior="horseshoe". See horseshoe function reference.

Details

The nlpsLM, nlpsGLM and nlpsAFTM functions perform SNP selection in one iteration for continuous data, binary data, and survival data, respectively. The GWASinlps function repeatedly calls these functions. For details of the procedure, see the reference for the GWASinlps function.

Value

A list with elements

hppm: The names of variables (SNPs) appearing in the highest posterior probability model (HPPM) of at least one leading set.
not.selected: The names of variables (SNPs) appearing in at least one leading set but in none of the HPPMs.

References

Sanyal et al. (2019), "GWASinlps: Non-local prior based iterative SNP selection tool for genome-wide association studies". Bioinformatics, 35(1), 1-11.

Sanyal, N. (2022). "Iterative variable selection for high-dimensional data with binary outcomes". arXiv preprint arXiv:2211.03190.

Author

Nilotpal Sanyal <nilotpal.sanyal@gmail.com>

Examples

n = 100
p = 1000
m = 10

# Generate design matrix (genotype matrix)
set.seed(1) 
f = runif( p, .1, .2 ) # simulate minor allele frequency
x = matrix( nrow = n, ncol = p )
colnames(x) = 1:p
for(j in 1:p)
  x[,j] = rbinom( n, 2, f[j] )

# Generate true effect sizes
causal_snps = sample( 1:p, m )
beta = rep( 0, p )
set.seed(1)
beta[causal_snps] = rnorm(m, mean = 0, sd = 2 )

# Generate continuous (phenotype) data
y.cont = x %*% beta + rnorm(n, 0, 1) 

# Generate binary (phenotype) data
prob = exp(x %*% beta)/(1 + exp(x %*% beta))
y.bin = sapply(1:n, function(i)rbinom(1,1,prob[i]) )

# Fix scale parameter tau 
tau = 0.022

# GWASinlps analysis
cor_xy = c(cor(x,y.cont))
names(cor_xy) = colnames(x)
nlps_cont = nlpsLM(y.cont, x, cor_xy=cor_xy, prior="mom", 
  tau=tau, k0=2, rxx=0.3, niter=10000, verbose=TRUE) 
#> j = 1 
#> input : 124 288 347 413 
#> selected : 347 
#> j = 2 
#> input : 83 102 191 553 727 752 778 
#> selected : 102 
nlps_cont
#> $hppm
#> [1] "347" "102"
#> 
#> $not.selected
#> [1] "124" "288" "413" "83"  "191" "553" "727" "752" "778"
#> 

library(fastglm)
#> Loading required package: bigmemory
mode(x) = "double"  #needed for fastglm() function below
mmle_xy = apply( x, 2, function(z) coef( fastglm(y=y.bin, 
x=cbind(1,matrix(z)), family = binomial(link = "logit")) )[2] )
nlps_bin = nlpsGLM(y.bin, x, mmle_xy=mmle_xy, prior="mom", 
  tau=tau, k0=2, rxx=0.3, niter=10000, verbose=TRUE) 
#> j = 1 
#> input : 124 288 347 413 
#> selected : 347 
#> j = 2 
#> input : 83 102 191 553 727 752 778 
#> selected : 102 
nlps_bin
#> $hppm
#> [1] "347" "102"
#> 
#> $not.selected
#> [1] "124" "288" "413" "83"  "191" "553" "727" "752" "778"
#>