S3VS_LM performs variable selection based on the structured screen-and-select framework in linear models.
S3VS_LM(y, X, cor_xy = NULL,
method_xy = c("topk", "fixedcorthresh", "perccorthresh"), param_xy,
method_xx = c("topk", "fixedcorthresh", "perccorthresh"), param_xx,
vsel_method = c("NLP", "LASSO", "ENET", "SCAD", "MCP"),
alpha = 0.5,
method_sel = c("conservative", "liberal"),
method_rem = c("conservative_begin", "conservative_end", "liberal"),
rem_regout = FALSE,
m = 100, nskip = 3, verbose = FALSE, seed = NULL)
Design matrix of predictors. Can be a base matrix or something as.matrix() can coerce. No missing values are allowed.
Optional numeric vector of precomputed marginal correlations between y and each column of X. Used to speed up or reproduce screening by \(|cor(y, X_j)|\). If NULL, correlations are computed internally.
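A minimal base-R sketch of precomputing such a vector, assuming signed Pearson correlations in column order are what cor_xy expects (the description above does not spell out the exact format). The data and the names n, p, X, y are placeholders for this illustration.

```r
# Illustrative data; n, p, X and y are placeholders for this sketch.
set.seed(1)
n <- 50
p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", seq_len(p))
y <- X[, 1] + rnorm(n)

# cor(X, y) returns a p x 1 matrix; drop() turns it into a named vector
# with one entry per column of X, in column order.
cor_xy <- drop(cor(X, y))

# Screening uses the absolute values, so the sign does not affect ranking.
head(sort(abs(cor_xy), decreasing = TRUE), 3)
```

The resulting vector could then be passed as cor_xy = cor_xy so that repeated runs on the same data reuse the same correlations.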
Rule for screening some predictors as 'leading variables' based on their association with the response; one of c("topk", "fixedcorthresh", "perccorthresh"). The association measure is the absolute value of the correlation coefficient.
"topk" keeps the \(k\) predictors with the largest association values; "fixedcorthresh" keeps predictors whose association is greater than or equal to a specified threshold; "perccorthresh" keeps predictors whose association is within a given percentage of the best.
Tuning parameter for method_xy, supplied as a list. If "topk", supply a list with an integer k (keep the top \(k\)). If "fixedcorthresh", supply a list with a numeric threshold thresh (keep predictors with association \(\ge\) thresh). If "perccorthresh", supply a list with a numeric percentage thresh in \((0,100]\) (keep predictors with association \(\ge\) that percent of the highest association).
Rule for constructing, for each leading variable, the set of associated predictors (the "leading set") using inter-predictor association (absolute value of the correlation coefficient); one of c("topk", "fixedcorthresh", "perccorthresh") with the same interpretation as method_xy.
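The three screening rules can be sketched in base R as follows. This is only an illustration of the rules as described above, not the package's internal code: screen_by_rule and the toy vector assoc are hypothetical, and the rule names are taken from the usage signature.

```r
# Hypothetical helper for illustration only; not part of the S3VS package.
# assoc: named numeric vector of associations; param: list as described above.
screen_by_rule <- function(assoc, method, param) {
  a <- abs(assoc)
  keep <- switch(method,
    topk = head(order(a, decreasing = TRUE), param$k),
    fixedcorthresh = which(a >= param$thresh),
    perccorthresh = which(a >= (param$thresh / 100) * max(a)),
    stop("unknown method: ", method)
  )
  names(assoc)[keep]
}

# Toy associations (made-up numbers for the sketch).
assoc <- c(V1 = 0.80, V2 = -0.60, V3 = 0.10, V4 = 0.45)

top2 <- screen_by_rule(assoc, "topk", list(k = 2))
fixed <- screen_by_rule(assoc, "fixedcorthresh", list(thresh = 0.5))
perc <- screen_by_rule(assoc, "perccorthresh", list(thresh = 50))
```

With these toy values, the top-2 rule and the fixed threshold of 0.5 both keep V1 and V2, while the 50 percent rule also keeps V4, because 0.45 is at least half of the largest association, 0.80.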
Tuning parameter for method_xx; same interpretation as param_xy but applied to inter-predictor association (absolute value of the correlation coefficient).
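As a rough sketch of how a leading set might be formed under "topk", the code below ranks all columns by absolute correlation with one leading variable. Counting the leading variable itself among the k members (its self-correlation is 1) is an assumption of this sketch; the package's internal construction may differ. The data and the names lead, assoc_xx and leading_set are hypothetical.

```r
# Hypothetical sketch; the package's internal construction may differ.
set.seed(2)
n <- 200
X <- matrix(rnorm(n * 6), n, 6)
colnames(X) <- paste0("V", 1:6)
X[, 2] <- X[, 1] + 0.1 * rnorm(n)   # make V2 strongly correlated with V1

lead <- "V1"                          # assumed leading variable
assoc_xx <- abs(cor(X)[lead, ])       # |cor(V1, V_j)| for every column j

# Top-k rule with k = 3; V1 ranks first because |cor(V1, V1)| = 1.
k <- 3
leading_set <- names(sort(assoc_xx, decreasing = TRUE))[seq_len(k)]
leading_set
```

Here V1 and V2 occupy the first two positions by construction; the third member is whichever remaining column happens to be most correlated with V1 in the simulated data.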
Character string specifying the variable selection method to be used within each leading set. Available options are "NLP", "LASSO", "ENET", "SCAD", "MCP".
Only used when vsel_method == "ENET". Elastic net mixing parameter, with \(\alpha \in (0,1)\).
Policy for aggregating predictors selected across leading sets in an iteration; one of c("conservative","liberal"). "conservative" selects the smallest admissible set of predictors by intersecting the selected sets of predictors across leading sets, beginning with all and gradually reducing from the end until a non-empty intersection is found; this ensures only predictors consistently selected across leading sets are retained. "liberal" selects the largest admissible set of predictors by taking the union of all selected sets of predictors, so any predictor chosen in at least one leading set is included. If no predictor is selected from the first leading set, the iteration does not contribute to final selection and exclusion rules (method_rem) are applied instead.
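The two aggregation policies can be illustrated with a small base-R sketch. The helper aggregate_selected and the toy sets are hypothetical, and the code follows one reading of the description above (intersect all sets, then drop sets from the end until the intersection is non-empty); it is not the package's implementation.

```r
# Hypothetical helper for illustration; not the package's internal code.
# sel_sets: list of character vectors, one per leading set, in order.
aggregate_selected <- function(sel_sets,
                               method = c("conservative", "liberal")) {
  method <- match.arg(method)
  # Nothing selected from the first leading set: no contribution.
  if (length(sel_sets) == 0 || length(sel_sets[[1]]) == 0) {
    return(character(0))
  }
  if (method == "liberal") {
    return(Reduce(union, sel_sets))
  }
  # Conservative: intersect all sets, then drop sets from the end
  # until the intersection is non-empty.
  for (n_sets in rev(seq_along(sel_sets))) {
    common <- Reduce(intersect, sel_sets[seq_len(n_sets)])
    if (length(common) > 0) return(common)
  }
  character(0)
}

# Toy selections from three leading sets (made-up for the sketch).
sel_sets <- list(c("V1", "V5"), c("V1", "V7"), c("V9"))
cons <- aggregate_selected(sel_sets, "conservative")
lib <- aggregate_selected(sel_sets, "liberal")
```

In this toy case the intersection over all three sets is empty, so the conservative rule falls back to the first two sets and keeps only V1, while the liberal rule keeps V1, V5, V7 and V9.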
Policy for excluding predictors when no selections are made in an iteration; one of c("conservative_begin","conservative_end","liberal"). "conservative_begin" excludes the smallest admissible set of predictors by intersecting the non-selected sets of predictors starting from the first leading set; "conservative_end" does the same but begins from the last leading set and moves backward; "liberal" excludes the largest admissible set of predictors by taking the union of all non-selected sets of predictors. Predictors excluded under this rule are removed from subsequent iterations.
Logical. If TRUE, when no predictors are selected in an iteration and some are removed instead, the working response y is updated using the removed predictors via update_y_LM.
Integer. Maximum number of iterations in which no new predictors are selected before the algorithm stops. Defaults to 3.
Logical. If TRUE, prints detailed progress information at each iteration (e.g., iteration number, predictors selected or removed). Defaults to FALSE.
If supplied, sets the random seed via set.seed() to ensure reproducibility of stochastic components. If NULL, no seed is set.
For a continuous response, S3VS considers the linear model (LM) $$ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} $$
For the S3VS algorithm, see the manual of the top-level function S3VS.
A list with the following components:
A character vector of predictor names that were selected across all iterations.
A list recording the predictors selected at each iteration, in the order they were considered.
Runtime in seconds.
# Simulate continuous data
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[,1] + 0.5 * X[,2] + rnorm(n)
# Run S3VS for LM
res_lm <- S3VS_LM(y = y, X = X,
method_xy = "topk", param_xy = list(k=1),
method_xx = "topk", param_xx = list(k=3),
vsel_method = "LASSO", method_sel = "conservative",
method_rem = "conservative_begin", rem_regout = FALSE,
m = 100, nskip = 3, verbose = TRUE, seed = 123)
#> -------------
#> Iteration 1
#> -------------
#> input : V1 V119 V70
#> selected : V1
#> -------------
#> Iteration 2
#> -------------
#> input : V2 V43 V17
#> selected : V2
#> -------------
#> Iteration 3
#> -------------
#> input : V76 V3 V15
#> selected :
#> *** nskip= 1 ***
#> -------------
#> Iteration 4
#> -------------
#> input : V14 V121 V11
#> selected :
#> *** nskip= 2 ***
#> -------------
#> Iteration 5
#> -------------
#> input : V149 V70 V8
#> selected :
#> *** nskip= 3 ***
#> =================================
#> Number of selected variables: 2
#> Time taken: 0.07 sec
#> =================================
# View selected predictors
res_lm$selected
#> [1] "V1" "V2"
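A possible follow-up step, sketched here with base R only, is to refit an ordinary least squares model on the selected predictors. To stay self-contained, the sketch regenerates the simulated data and copies the selection from the printed output above instead of calling S3VS_LM; the names selected, refit_data and fit are hypothetical.

```r
# Regenerate the simulated data from the example above.
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)

# Copied from the printed res_lm$selected above (assumption of this sketch).
selected <- c("V1", "V2")

# Refit an unpenalized linear model on the selected columns only.
refit_data <- data.frame(y = y, X[, selected, drop = FALSE])
fit <- lm(y ~ ., data = refit_data)
coef(fit)
```

Refitting after selection is a common convention for reporting coefficient estimates, not something this page prescribes.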