get_leadvars selects a subset of predictors as "leading variables" based on predictor-response associations in linear, generalized linear, and survival models.

get_leadvars(y, X, family = c("normal","binomial","survival"), 
  surv_model = c("AFT", "COX"), 
  method = c("topk", "fixedthresh", "percthresh"), param, 
  varsselected = NULL, varsleft = colnames(X), parallel = FALSE)

Arguments

y

Response. If family = "normal", a numeric vector. If family = "binomial", a numeric/integer/logical vector with values in {0,1}. If family = "survival", a list with components time and status (1 = event, 0 = censored).
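As a rough illustration of the three response formats described above, the following sketch builds one response object per family. The values are simulated and the object names (y_normal, y_binomial, y_survival) are hypothetical; only the shapes follow the description.

```r
# Illustrative responses for each family (simulated values, not real data)
set.seed(1)
n <- 20

# family = "normal": a numeric vector
y_normal <- rnorm(n)

# family = "binomial": a vector with values in {0, 1}
y_binomial <- rbinom(n, size = 1, prob = 0.5)

# family = "survival": a list with components time and status
y_survival <- list(
  time   = rexp(n, rate = 0.1),             # follow-up times
  status = rbinom(n, size = 1, prob = 0.7)  # 1 = event, 0 = censored
)
```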

X

Predictor matrix. Can be a base matrix or something as.matrix() can coerce. No missing values are allowed.
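A minimal sketch of preparing X from a data frame and checking the requirements stated above before calling the function. The data frame df is hypothetical, and the checks use base R only; they are not necessarily the checks performed inside get_leadvars.

```r
# A small data frame of numeric columns (hypothetical data)
df <- data.frame(V1 = c(0.2, 1.5, -0.7), V2 = c(1.1, -0.3, 0.8))

# Coerce to a base matrix; column names carry over from the data frame
X <- as.matrix(df)

# Pre-flight checks: numeric entries and no missing values
stopifnot(is.numeric(X), !anyNA(X))
colnames(X)
```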

family

Model family; one of c("normal","binomial","survival"). Determines which engine is called (get_leadvars_LM, get_leadvars_GLM, or get_leadvars_SURV).

surv_model

Character string specifying the survival model; used only when family = "survival". Must be explicitly provided; there is no default. Use "COX" for the Cox proportional hazards model or "AFT" for the accelerated failure time model.

method

Screening rule, one of c("topk", "fixedthresh", "percthresh"). The association measure depends on family (e.g., correlation for "normal", eta-squared for "binomial", or marginal utility for "survival"). "topk" keeps the \(k\) predictors with the largest association values; "fixedthresh" keeps predictors whose association is greater than or equal to a specified threshold; "percthresh" keeps predictors whose association is at least a given percentage of the largest association.

param

Tuning parameter for method. If "topk", supply an integer \(k\) (keep the top \(k\)). If "fixedthresh", supply a numeric threshold (keep predictors with association \(\ge\) threshold). If "percthresh", supply a percentage in \((0,100]\) (keep predictors with association \(\ge\) that percent of the highest association).
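The three rules can be sketched in base R on a named vector of association values. This is an illustration of the rules as described here, not the package's internal code; the vector assoc and the helper names screen_topk, screen_fixedthresh, and screen_percthresh are hypothetical.

```r
# Hypothetical association values for five predictors
assoc <- c(V1 = 0.62, V2 = 0.31, V3 = 0.05, V4 = 0.58, V5 = 0.12)

# "topk": the k predictors with the largest association
screen_topk <- function(a, k) {
  names(sort(a, decreasing = TRUE))[seq_len(min(k, length(a)))]
}

# "fixedthresh": predictors with association >= threshold
screen_fixedthresh <- function(a, thresh) names(a)[a >= thresh]

# "percthresh": predictors with association >= (perc / 100) * max association
screen_percthresh <- function(a, perc) names(a)[a >= (perc / 100) * max(a)]

screen_topk(assoc, 2)            # "V1" "V4"
screen_fixedthresh(assoc, 0.3)   # "V1" "V2" "V4"
screen_percthresh(assoc, 90)     # "V1" "V4"
```

Note that "topk" always returns a fixed number of predictors, whereas the two threshold rules return a number that depends on the data.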

varsselected

Used only when family = "survival". A character vector naming the predictors already selected in previous iterations. The association measure (conditional utility) is computed controlling for these predictors. Defaults to NULL.

varsleft

Used only when family = "survival". A character vector naming the predictors that have been neither selected nor removed from consideration in previous iterations. Leading predictors are chosen from these. Defaults to colnames(X).

parallel

Logical. If TRUE, attempts to run some computations in parallel for the "binomial" and "survival" families, which is strongly recommended for faster execution. Defaults to FALSE.

Value

A character vector containing the names of the leading variables.

Author

Nilotpal Sanyal <nsanyal@utep.edu>, Padmore N. Prempeh <pprempeh@albany.edu>

Examples

# Simulate continuous data
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[,1] + 0.5 * X[,2] + rnorm(n)
# Select leading variables
leadvars <- get_leadvars(y = y, X = X, family = "normal", 
                         method = "topk", param = list(k=2))
leadvars
#> [1] "V1"   "V136"
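
For intuition, the "topk" rule for the "normal" family can be approximated in base R. The sketch below assumes the association measure is the absolute marginal Pearson correlation between each column of X and y; the exact measure used by get_leadvars may differ, so the result need not match the output above.

```r
# Same simulated data as in the example above
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)

# Assumed association measure: absolute marginal Pearson correlation
assoc <- abs(cor(X, y))[, 1]   # named vector, one value per column of X

# Top-k rule with k = 2
top2 <- names(sort(assoc, decreasing = TRUE))[1:2]
top2
```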