S3VS is the main function that performs variable selection based on the structured screen-and-select framework in linear, generalized linear, and survival models.

S3VS(
  y, 
  X, 
  family = c("normal", "binomial", "survival"), 
  cor_xy = NULL, 
  surv_model = c("COX", "AFT"), 
  method_xy = c("topk", "fixedthresh", "percthresh"), 
  param_xy, 
  method_xx = c("topk", "fixedthresh", "percthresh"), 
  param_xx, 
  vsel_method = NULL, 
  alpha = 0.5,
  method_sel = c("conservative", "liberal"), 
  method_rem = c("conservative_begin", "conservative_end", "liberal"), 
  sel_regout = FALSE, 
  rem_regout = FALSE, 
  update_y_thresh = 0.5, 
  m = 100, 
  nskip = 3, 
  verbose = FALSE, 
  seed = NULL, 
  parallel = FALSE
)

Arguments

y

Response. If family = "normal", a numeric vector. If family = "binomial", a numeric/integer/logical vector with values in {0,1}. If family = "survival", a list with components time and status (1 = event, 0 = censored).
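
As a rough illustration of the three response formats described above, the objects below could be built as follows. All values are simulated placeholders, and the variable names are chosen only for this sketch.

```r
# Illustrative response objects for each family (all values are made up).
set.seed(1)
n <- 10

# family = "normal": a numeric vector
y_normal <- rnorm(n)

# family = "binomial": a vector with values in {0, 1}
y_binom <- rbinom(n, size = 1, prob = 0.5)

# family = "survival": a list with components time and status
y_surv <- list(
  time   = rexp(n, rate = 0.1),
  status = rbinom(n, size = 1, prob = 0.7)  # 1 = event, 0 = censored
)
```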

X

Design matrix of predictors. Can be a base matrix or something as.matrix() can coerce. No missing values are allowed.

family

Model family; one of c("normal","binomial","survival"). Determines which engine is called (S3VS_LM, S3VS_GLM, or S3VS_SURV).

cor_xy

Optional numeric vector of precomputed marginal correlations between y and each column of X. Used only when family="normal" to speed up or reproduce screening by \(|cor(y, X_j)|\). If NULL, correlations are computed internally.
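
A minimal sketch of how such a vector might be precomputed with base R is shown below. It returns signed correlations, one per column of X. Whether S3VS expects signed or absolute values, or a named vector, is not stated here, so treat those details as assumptions.

```r
# Precompute marginal correlations between y and each column of X.
set.seed(1)
X <- matrix(rnorm(50 * 8), 50, 8)
colnames(X) <- paste0("V", 1:8)
y <- X[, 1] + rnorm(50)

# cor(X, y) returns a p x 1 matrix; flatten it to a plain numeric vector
cor_xy <- as.vector(cor(X, y))
length(cor_xy) == ncol(X)
```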

surv_model

Character string specifying the survival model (for family="survival" only). Must be explicitly provided; there is no default. Values are "COX" for proportional hazards models and "AFT" for accelerated failure time models.

method_xy

Rule for screening some predictors as "leading variables" based on their association with the response; one of c("topk", "fixedthresh", "percthresh"). The association measure depends on family (e.g., correlation for "normal", eta-squared for "binomial", or marginal utility for "survival").

"topk" keeps the \(k\) predictors with the largest association values; "fixedthresh" keeps predictors whose association is greater than or equal to a specified threshold; "percthresh" keeps predictors whose association is at least a given percentage of the highest association.

param_xy

Tuning parameter for method_xy. If "topk", supply a list with an integer k (keep the top \(k\)). If "fixedthresh", supply a list with a numeric threshold thresh (keep predictors with association \(\ge\) threshold). If "percthresh", supply a list with a numeric percentage thresh in \((0,100]\) (keep predictors with association \(\ge\) that percent of the highest association).
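
The three rules can be sketched in base R as below. The helper screen_by_rule is hypothetical and not part of the package; it only mirrors the descriptions above for a named vector of non-negative association values.

```r
# Hypothetical helper (not part of the package) illustrating the three rules.
screen_by_rule <- function(assoc,
                           method = c("topk", "fixedthresh", "percthresh"),
                           param) {
  method <- match.arg(method)
  ord <- order(assoc, decreasing = TRUE)   # strongest association first
  sorted <- assoc[ord]
  switch(method,
    topk        = names(sorted)[seq_len(min(param$k, length(sorted)))],
    fixedthresh = names(sorted)[sorted >= param$thresh],
    percthresh  = names(sorted)[sorted >= (param$thresh / 100) * max(sorted)]
  )
}

assoc <- c(V1 = 0.80, V2 = 0.50, V3 = 0.45, V4 = 0.10)
screen_by_rule(assoc, "topk",        list(k = 2))          # "V1" "V2"
screen_by_rule(assoc, "fixedthresh", list(thresh = 0.45))  # "V1" "V2" "V3"
screen_by_rule(assoc, "percthresh",  list(thresh = 60))    # "V1" "V2"
```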

method_xx

Rule for constructing, for each leading variable, the set of associated predictors (the "leading set") using inter-predictor association (absolute value of the correlation coefficient); one of c("topk", "fixedthresh", "percthresh"), with the same interpretation as method_xy.

param_xx

Tuning parameter for method_xx; same interpretation as param_xy but applied to inter-predictor association (absolute value of the correlation coefficient).
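
For intuition, a leading set under the "topk" rule with k = 3 might be formed as in the sketch below. The data are simulated, and the sketch assumes the leading variable counts as a member of its own leading set, which is what the verbose output in the Examples suggests.

```r
# Sketch: build a "topk" leading set (k = 3) around the leading variable V1.
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
colnames(X) <- paste0("V", 1:6)
X[, 2] <- X[, 1] + 0.1 * rnorm(100)   # make V2 strongly correlated with V1

leading_var <- "V1"
# absolute correlation of every predictor with the leading variable
assoc_xx <- abs(cor(X, X[, leading_var]))[, 1]
leading_set <- names(sort(assoc_xx, decreasing = TRUE))[1:3]
leading_set[1:2]   # V1 itself, then its near-duplicate V2
```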

vsel_method

Character string specifying the variable selection method to be used within each leading set. Available options depend on the model type:

  • For linear models (S3VS_LM) and generalized linear models (S3VS_GLM): "NLP", "LASSO", "ENET", "SCAD", "MCP".

  • For survival models (S3VS_SURV): "LASSO", "ENET" for surv_model = "COX" and "AFTGEE", "BRIDGE", "PVAFT" for surv_model = "AFT".

alpha

Only used when vsel_method == "ENET". Elastic net mixing parameter, with \(\alpha \in (0,1)\).

method_sel

Policy for aggregating predictors selected across leading sets in an iteration; one of c("conservative","liberal"). "conservative" selects the smallest admissible set of predictors by intersecting the selected sets of predictors across leading sets, beginning with all and gradually reducing from the end until a non-empty intersection is found; this ensures only predictors consistently selected across leading sets are retained. "liberal" selects the largest admissible set of predictors by taking the union of all selected sets of predictors, so any predictor chosen in at least one leading set is included. If no predictor is selected from the first leading set, the iteration does not contribute to final selection and exclusion rules (method_rem) are applied instead.
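
One way to read the two policies is sketched below. The helper aggregate_selected is hypothetical and reflects only the description above: the conservative rule intersects the selections from leading sets 1..K, then 1..(K-1), and so on, until the intersection is non-empty. The package's internal implementation may differ.

```r
# Hypothetical helper (not part of the package) for the two policies.
aggregate_selected <- function(sel_list,
                               method = c("conservative", "liberal")) {
  method <- match.arg(method)
  if (method == "liberal") {
    return(Reduce(union, sel_list))        # anything selected at least once
  }
  # conservative: drop leading sets from the end until the intersection
  # of the remaining selections is non-empty
  for (K in rev(seq_along(sel_list))) {
    common <- Reduce(intersect, sel_list[seq_len(K)])
    if (length(common) > 0) return(common)
  }
  character(0)
}

sel_list <- list(c("V1", "V2"), c("V2", "V3"), "V4")
aggregate_selected(sel_list, "conservative")  # "V2"
aggregate_selected(sel_list, "liberal")       # "V1" "V2" "V3" "V4"
```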

method_rem

Policy for excluding predictors when no selections are made in an iteration; one of c("conservative_begin","conservative_end","liberal"). "conservative_begin" excludes the smallest admissible set of predictors by intersecting the non-selected sets of predictors starting from the first leading set; "conservative_end" does the same but begins from the last leading set and moves backward; "liberal" excludes the largest admissible set of predictors by taking the union of all non-selected sets of predictors. Predictors excluded under this rule are removed from subsequent iterations.

sel_regout

Logical (GLM only). If TRUE, when predictors are selected in an iteration, the working response y is updated using the selected predictors via update_y_GLM. Ignored for other families.

rem_regout

Logical (for LM and GLM only). If TRUE, when no predictors are selected in an iteration and some are removed instead, the working response y is updated using the removed predictors via update_y_LM or update_y_GLM. Ignored for other families.

update_y_thresh

Numeric scalar threshold controlling how the working response y is updated in GLMs when sel_regout=TRUE or rem_regout=TRUE. When \(|y - fitted\_y| > update\_y\_thresh\), y is kept; otherwise y is replaced by the rounded value of fitted_y, where fitted_y is the fitted probability from the logistic model. The default value is 0.5. Ignored for other families.
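
The rule as stated can be written in one line of base R. The sketch below uses made-up values for y and fitted_y and is not the package's update_y_GLM. Read literally, a threshold of 0.5 leaves a binary y unchanged, because any fitted probability within 0.5 of y already rounds to y; a larger threshold lets poorly fitted observations flip.

```r
# Sketch of the stated update rule with made-up fitted probabilities.
y        <- c(1, 0, 1, 0, 1)
fitted_y <- c(0.9, 0.2, 0.3, 0.7, 0.6)

update_y <- function(y, fitted_y, thresh) {
  # keep y where the fit is far from y, otherwise use the rounded fit
  ifelse(abs(y - fitted_y) > thresh, y, round(fitted_y))
}

update_y(y, fitted_y, thresh = 0.5)  # 1 0 1 0 1 (unchanged)
update_y(y, fitted_y, thresh = 0.8)  # 1 0 0 1 1 (observations 3 and 4 flip)
```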

m

Integer. Maximum number of S3VS iterations to perform. Defaults to 100.

nskip

Integer. Maximum number of iterations in which no new predictors are selected before the algorithm stops. Defaults to 3.

verbose

Logical. If TRUE, prints detailed progress information at each iteration (e.g., iteration number, predictors selected or removed). Defaults to FALSE.

seed

If supplied, sets the random seed via set.seed() to ensure reproducibility of stochastic components. If NULL, no seed is set.

parallel

Logical. If TRUE, attempts to run some computations in parallel for the binomial and survival families, which is strongly recommended for faster execution. Defaults to FALSE.

Details

Model

For a continuous response, S3VS considers the linear model (LM) $$ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} $$

For a binary response, S3VS considers the generalized linear model (GLM) $$ g\!\left( E\!\left( \boldsymbol{y} \mid \boldsymbol{X} \right) \right) = \boldsymbol{X}\boldsymbol{\beta} $$

For a survival-type response, S3VS considers two models: the Cox model $$ \lambda(t\mid \boldsymbol{x}_i) = \lambda_0(t) \exp(\boldsymbol{x}_i^T \boldsymbol{\beta}) $$ and the AFT model $$ \log(\boldsymbol{T}) = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} $$
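
To make the AFT form concrete, the sketch below simulates right-censored data with log(T) = Xβ + ε and packs it into the list format expected for family = "survival". The coefficients, the normal error, and the censoring rate are arbitrary choices for illustration.

```r
# Simulate survival data under an AFT model (illustrative settings only).
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)

eta     <- X[, 1] + 0.5 * X[, 2]
T_event <- exp(eta + rnorm(n))        # log(T) = X beta + epsilon
C       <- rexp(n, rate = 0.2)        # independent censoring times

y_surv <- list(time   = pmin(T_event, C),
               status = as.integer(T_event <= C))  # 1 = event, 0 = censored
table(y_surv$status)
```

Such a y_surv could then be passed to S3VS with surv_model = "AFT" and one of the AFT selection methods listed under vsel_method.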

S3VS algorithm

The general form of the S3VS algorithm consists of the following steps, repeated iteratively until convergence:

  1. Determination of leading variables: `Leading variables` are determined based on the association of the predictors with the response, following one of three rules. The rule is fixed by the arguments method_xy and param_xy.

  2. Determination of leading sets: For each leading variable, a group of related predictors, called the `leading set`, is determined based on the association of all candidate predictors with the leading variable, following one of three rules. The rule is fixed by the arguments method_xx and param_xx.

  3. Variable selection: Within each leading set, small to moderate-dimensional variable selection is performed using a method fixed by vsel_method.

  4. Aggregation of selected/not-selected variables: Variables selected/not-selected in different leading sets are aggregated using several possible rules, fixed by method_sel and method_rem.

  5. Update of response and/or set of covariates: At the end of each iteration, the response and the set of predictors may or may not be updated, as controlled by the arguments sel_regout, rem_regout, and update_y_thresh.

The convergence criterion is determined jointly by the arguments m and nskip. For more details on the individual steps, see the manuals of the functions linked below.
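
The interplay of m and nskip can be pictured with the toy loop below. The function run_until_stop and the stand-in toy_step are hypothetical. The sketch assumes the skip counter is never reset after a successful iteration, which the text above does not specify.

```r
# Toy stopping rule: stop after m iterations or nskip empty iterations.
# Assumption: the skip counter accumulates and is not reset.
run_until_stop <- function(select_step, m = 100, nskip = 3) {
  selected <- character(0)
  skipped  <- 0L
  iter     <- 0L
  while (iter < m && skipped < nskip) {
    iter <- iter + 1L
    new_vars <- setdiff(select_step(iter), selected)
    if (length(new_vars) == 0) {
      skipped <- skipped + 1L              # nothing new this iteration
    } else {
      selected <- c(selected, new_vars)
    }
  }
  list(selected = selected, iterations = iter)
}

# Stand-in for steps 1-4: selects one variable in each of the first two
# iterations and nothing afterwards.
toy_step <- function(iter) if (iter <= 2) paste0("V", iter) else character(0)

out <- run_until_stop(toy_step, m = 100, nskip = 3)
out$selected     # "V1" "V2"
out$iterations   # 5
```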

Value

A list with the following components:

selected

A character vector of predictor names that were selected across all iterations.

selected_iterwise

A list recording the predictors selected at each iteration, in the order they were considered.

runtime

Runtime in seconds.

Author

Nilotpal Sanyal <nsanyal@utep.edu>, Padmore N. Prempeh <pprempeh@albany.edu>

Examples

### [1] For linear model
# Simulate continuous data
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[,1] + 0.5 * X[,2] + rnorm(n)
# Run S3VS for LM
res_lm <- S3VS(y = y, X = X, family = "normal",
               method_xy = "topk", param_xy = list(k=1),
               method_xx = "topk", param_xx = list(k=3),
               vsel_method = "LASSO", method_sel = "conservative", 
               method_rem = "conservative_begin", rem_regout = FALSE, 
               m = 100, nskip = 3, verbose = TRUE, seed = 123)
#> -------------
#> Iteration 1
#> -------------
#> input : V1 V119 V70 
#> selected : V1 
#> -------------
#> Iteration 2
#> -------------
#> input : V2 V43 V17 
#> selected : V2 
#> -------------
#> Iteration 3
#> -------------
#> input : V76 V3 V15 
#> selected :  
#> *** nskip= 1 *** 
#> -------------
#> Iteration 4
#> -------------
#> input : V14 V121 V11 
#> selected :  
#> *** nskip= 2 *** 
#> -------------
#> Iteration 5
#> -------------
#> input : V149 V70 V8 
#> selected :  
#> *** nskip= 3 *** 
#> =================================
#> Number of selected variables: 2
#> Time taken: 0.1 sec
#> =================================
# View selected predictors
res_lm$selected
#> [1] "V1" "V2"

### [2] For generalized linear model
# Simulate binary data
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
eta <- X[,1] + 0.5 * X[,2]
prob <- 1 / (1 + exp(-eta))
y <- rbinom(n, size = 1, prob = prob)
# Run S3VS for GLM (logistic)
res_glm <- S3VS(y = y, X = X, family = "binomial",
                method_xy = "topk", param_xy = list(k = 1),
                method_xx = "topk", param_xx = list(k = 3),
                vsel_method = "LASSO", 
                method_sel = "conservative", method_rem = "conservative_begin", 
                sel_regout = FALSE, rem_regout = FALSE,
                m = 100, nskip = 3, verbose = TRUE, seed = 123)
#> -------------
#> Iteration 1
#> -------------
#> [[1]]
#> [1] "V32" "V80" "V49"
#> 
#> input variables: V32 V80 V49 
#> Parallel disabled.
#> Selected variables: 
#> Not selected variables: V32 V80 V49 
#> [[1]]
#> NULL
#> 
#> *** nskip= 1 *** 
#> -------------
#> Iteration 2
#> -------------
#> [[1]]
#> [1] "V1"   "V119" "V70" 
#> 
#> input variables: V1 V119 V70 
#> Parallel disabled.
#> Selected variables: 
#> Not selected variables: V1 V119 V70 
#> [[1]]
#> NULL
#> 
#> *** nskip= 2 *** 
#> -------------
#> Iteration 3
#> -------------
#> [[1]]
#> [1] "V12" "V30" "V56"
#> 
#> input variables: V12 V30 V56 
#> Parallel disabled.
#> Selected variables: 
#> Not selected variables: V12 V30 V56 
#> [[1]]
#> NULL
#> 
#> *** nskip= 3 *** 
#> =================================
#> Number of selected variables: 0
#> Time taken: 0.05 sec
#> =================================
# View selected predictors
res_glm$selected
#> NULL

### [3] For survival model
# Simulate survival data (Cox)
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
eta <- X[,1] + 0.5 * X[,2]
base_rate <- 0.05
T_event <- rexp(n, rate = base_rate * exp(eta))
C <- rexp(n, rate = 0.03)
time <- pmin(T_event, C)
status <- as.integer(T_event <= C)
y_surv <- list(time = time, status = status)
# Run S3VS for survival model (Cox)
res_surv <- S3VS(y = y_surv, X = X, family = "survival", 
                 surv_model = "COX", 
                 method_xy = "topk", param_xy = list(k = 1),
                 method_xx = "topk", param_xx = list(k = 3),
                 vsel_method = "LASSO",
                 method_sel = "conservative", method_rem = "conservative_begin",
                 sel_regout = FALSE, rem_regout = FALSE, 
                 m = 100, nskip = 3, verbose = TRUE, seed = 123)
#> -------------
#> Iteration 1
#> -------------
#> Input Variables: V1 V119 V70 
#> Selected Variables:  
#> *** nskip= 1 *** 
#> -------------
#> Iteration 2
#> -------------
#> Input Variables: V28 V117 V94 
#> Selected Variables:  
#> *** nskip= 2 *** 
#> -------------
#> Iteration 3
#> -------------
#> Input Variables: V5 V26 V77 
#> Selected Variables:  
#> *** nskip= 3 *** 
#> =================================
#> Number of selected variables: 0
#> Time taken: 0.1 sec
#> =================================
# View selected predictors
res_surv$selected
#> NULL