S3VS_LM performs variable selection based on the structured screen-and-select framework in linear models.

S3VS_LM(y, X, cor_xy = NULL, 
  method_xy = c("topk", "fixedcorthresh", "perccorthresh"), param_xy, 
  method_xx = c("topk", "fixedcorthresh", "perccorthresh"), param_xx, 
  vsel_method = c("NLP", "LASSO", "ENET", "SCAD", "MCP"), 
  alpha = 0.5,
  method_sel = c("conservative", "liberal"), 
  method_rem = c("conservative_begin", "conservative_end", "liberal"), 
  rem_regout = FALSE, 
  m = 100, nskip = 3, verbose = FALSE, seed = NULL)

Arguments

y

Response. A numeric vector.

X

Design matrix of predictors. A base matrix, or any object that as.matrix() can coerce to one. Missing values are not allowed.

cor_xy

Optional numeric vector of precomputed marginal correlations between y and each column of X. Used to speed up or reproduce screening by \(|cor(y, X_j)|\). If NULL, correlations are computed internally.
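
As a minimal sketch, the correlations could be precomputed with base R's cor() and reused across calls. The assumption here, not confirmed by this page, is that cor_xy holds one value per column of X, in column order. The simulated data mirror the Examples section.

```r
# Simulated data, as in the Examples section
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)

# cor(y, X) returns a 1 x p matrix; drop() flattens it to a named vector
cor_xy <- drop(cor(y, X))

# Inspect the five columns with the largest absolute correlation
head(sort(abs(cor_xy), decreasing = TRUE), 5)
```

The resulting vector could then be supplied as cor_xy = cor_xy in the S3VS_LM call.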

method_xy

Rule for screening predictors as 'leading variables' based on their association with the response; one of c("topk", "fixedcorthresh", "perccorthresh"). The association measure is the absolute correlation.

"topk" keeps the \(k\) predictors with the largest association values; "fixedcorthresh" keeps predictors whose association is greater than or equal to a specified threshold; "perccorthresh" keeps predictors whose association is at least a given percentage of the largest association.

param_xy

Tuning parameter for method_xy. If "topk", supply a list with an integer k (keep the top \(k\)). If "fixedcorthresh", supply a list with a numeric threshold thresh (keep predictors with association \(\ge\) thresh). If "perccorthresh", supply a list with a numeric percentage thresh in \((0,100]\) (keep predictors with association \(\ge\) that percent of the highest association).
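
The three rules can be illustrated with a small base-R sketch. The helper screen_by_rule() and the toy association values are hypothetical and are not part of S3VS; the rule names follow the usage signature, and taking absolute values is an assumption based on the description of cor_xy.

```r
# Hypothetical helper illustrating the three screening rules; not part of S3VS
screen_by_rule <- function(assoc, method, param) {
  a <- abs(assoc)  # assumption: screening uses absolute association
  keep <- switch(method,
    topk           = order(a, decreasing = TRUE)[seq_len(min(param$k, length(a)))],
    fixedcorthresh = which(a >= param$thresh),
    perccorthresh  = which(a >= (param$thresh / 100) * max(a)),
    stop("unknown method: ", method)
  )
  names(a)[keep]
}

# Toy association values for four predictors
assoc <- c(V1 = 0.80, V2 = -0.60, V3 = 0.41, V4 = 0.10)

top2  <- screen_by_rule(assoc, "topk", list(k = 2))                # two largest |assoc|
fixed <- screen_by_rule(assoc, "fixedcorthresh", list(thresh = 0.5))  # |assoc| >= 0.5
perc  <- screen_by_rule(assoc, "perccorthresh", list(thresh = 50))    # |assoc| >= 50% of max
```

With these toy values, the percentage rule is the most permissive: 50 percent of the largest association (0.80) admits V3 in addition to V1 and V2.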

method_xx

Rule for constructing, for each leading variable, the set of associated predictors (the "leading set") using inter-predictor association (absolute value of the correlation coefficient); one of c("topk", "fixedcorthresh", "perccorthresh"), with the same interpretation as for method_xy.

param_xx

Tuning parameter for method_xx; same interpretation as param_xy but applied to inter-predictor association (absolute value of the correlation coefficient).
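
A rough sketch of how a leading set might be formed under the "topk" rule, using only base R. The data and the names lead, a_xx and leading_set are illustrative assumptions. In the verbose output of the Examples section, each input line lists three variables when param_xx = list(k=3), which is consistent with the leading variable itself counting toward \(k\), although this page does not state that explicitly.

```r
set.seed(1)
X <- matrix(rnorm(50 * 6), 50, 6)
colnames(X) <- paste0("V", 1:6)
X[, 4] <- X[, 1] + 0.1 * rnorm(50)  # make V4 strongly correlated with V1

lead <- "V1"  # hypothetical leading variable
k <- 3

# Absolute correlation of the leading variable with every column of X
a_xx <- abs(drop(cor(X[, lead], X)))

# Top-k rule; the leading variable ranks first because its self-correlation is 1
leading_set <- names(sort(a_xx, decreasing = TRUE))[seq_len(k)]
```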

vsel_method

Character string specifying the variable selection method to be used within each leading set. Available options are "NLP", "LASSO", "ENET", "SCAD", "MCP".

alpha

Only used when vsel_method == "ENET". Elastic net mixing parameter, with \(\alpha \in (0,1)\).

method_sel

Policy for aggregating predictors selected across leading sets in an iteration; one of c("conservative", "liberal"). "conservative" selects the smallest admissible set of predictors by intersecting the selected sets across leading sets: it starts with all leading sets and drops them one at a time from the end until the intersection is non-empty, so that only predictors consistently selected across the retained leading sets are kept. "liberal" selects the largest admissible set of predictors by taking the union of all selected sets, so any predictor chosen in at least one leading set is included. If no predictor is selected from the first leading set, the iteration contributes nothing to the final selection and the exclusion rule (method_rem) is applied instead.
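
The two aggregation policies can be sketched in base R as follows. This is one reading of the description above, not the package's implementation; sel_sets is a hypothetical list of per-leading-set selections.

```r
# Hypothetical selections from three leading sets, in order
sel_sets <- list(c("V1", "V5"), c("V1", "V7"), c("V9"))

# "liberal": union of everything selected in any leading set
liberal <- Reduce(union, sel_sets)

# "conservative": intersect all sets; if the result is empty,
# drop sets from the end one at a time and try again
conservative <- character(0)
for (j in rev(seq_along(sel_sets))) {
  common <- Reduce(intersect, sel_sets[seq_len(j)])
  if (length(common) > 0) {
    conservative <- common
    break
  }
}
```

Here the intersection over all three sets is empty, so the last set is dropped; the first two sets share V1, which becomes the conservative selection, while the liberal selection is V1, V5, V7 and V9.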

method_rem

Policy for excluding predictors when no selections are made in an iteration; one of c("conservative_begin","conservative_end","liberal"). "conservative_begin" excludes the smallest admissible set of predictors by intersecting the non-selected sets of predictors starting from the first leading set; "conservative_end" does the same but begins from the last leading set and moves backward; "liberal" excludes the largest admissible set of predictors by taking the union of all non-selected sets of predictors. Predictors excluded under this rule are removed from subsequent iterations.

rem_regout

Logical. If TRUE, when no predictors are selected in an iteration and some are removed instead, the working response y is updated using the removed predictors via update_y_LM.
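
The internals of update_y_LM are not described on this page. One plausible reading, offered purely as an assumption, is that the working response is replaced by the residuals from regressing it on the removed predictors. A base-R sketch of that idea, with hypothetical names removed and y_working:

```r
set.seed(42)
n <- 60
X <- matrix(rnorm(n * 4), n, 4)
colnames(X) <- paste0("V", 1:4)
y <- X[, 1] + rnorm(n)

removed <- c("V3", "V4")  # hypothetical predictors excluded in an iteration

# Regress the working response on the removed predictors (with an intercept)
# and keep the residuals as the updated working response
fit <- lm.fit(x = cbind(1, X[, removed, drop = FALSE]), y = y)
y_working <- fit$residuals
```

Under this reading, the updated response is orthogonal to the removed predictors, so they can no longer explain variation in later iterations.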

m

Integer. Maximum number of S3VS iterations to perform. Defaults to 100.

nskip

Integer. Maximum number of iterations in which no new predictors are selected before the algorithm stops. Defaults to 3.

verbose

Logical. If TRUE, prints detailed progress information at each iteration (e.g., iteration number, predictors selected or removed). Defaults to FALSE.

seed

If supplied, sets the random seed via set.seed() to ensure reproducibility of stochastic components. If NULL, no seed is set.

Details

For a continuous response, S3VS considers the linear model (LM) $$ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} $$

For the S3VS algorithm, see the manual of the top-level function S3VS.

Value

A list with the following components:

selected

A character vector of predictor names that were selected across all iterations.

selected_iterwise

A list recording the predictors selected at each iteration, in the order they were considered.

runtime

Runtime in seconds.

Author

Nilotpal Sanyal <nsanyal@utep.edu>, Padmore N. Prempeh <pprempeh@albany.edu>

Examples

# Simulate continuous data
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[,1] + 0.5 * X[,2] + rnorm(n)
# Run S3VS for LM
res_lm <- S3VS_LM(y = y, X = X,
               method_xy = "topk", param_xy = list(k=1),
               method_xx = "topk", param_xx = list(k=3),
               vsel_method = "LASSO", method_sel = "conservative", 
               method_rem = "conservative_begin", rem_regout = FALSE, 
               m = 100, nskip = 3, verbose = TRUE, seed = 123)
#> -------------
#> Iteration 1
#> -------------
#> input : V1 V119 V70 
#> selected : V1 
#> -------------
#> Iteration 2
#> -------------
#> input : V2 V43 V17 
#> selected : V2 
#> -------------
#> Iteration 3
#> -------------
#> input : V76 V3 V15 
#> selected :  
#> *** nskip= 1 *** 
#> -------------
#> Iteration 4
#> -------------
#> input : V14 V121 V11 
#> selected :  
#> *** nskip= 2 *** 
#> -------------
#> Iteration 5
#> -------------
#> input : V149 V70 V8 
#> selected :  
#> *** nskip= 3 *** 
#> =================================
#> Number of selected variables: 2
#> Time taken: 0.07 sec
#> =================================
# View selected predictors
res_lm$selected
#> [1] "V1" "V2"