S3VS_GLM performs variable selection based on the structured screen-and-select framework in generalized linear models.

S3VS_GLM(y, X, 
  method_xy = c("topk", "fixedetasqthresh", "percetasqthresh"), param_xy, 
  method_xx = c("topk", "fixedcorthresh", "perccorthresh"), param_xx, 
  vsel_method = c("NLP", "LASSO", "ENET", "SCAD", "MCP"), 
  alpha = 0.5,
  method_sel = c("conservative", "liberal"), 
  method_rem = c("conservative_begin", "conservative_end", "liberal"), 
  sel_regout = FALSE, rem_regout = FALSE, update_y_thresh = NULL, 
  m = 100, nskip = 3, verbose = FALSE, seed = NULL, parallel = FALSE)

Arguments

y

Response. A numeric/integer/logical vector with values in {0,1}.

X

Design matrix of predictors. Can be a base matrix or something as.matrix() can coerce. No missing values are allowed.

method_xy

Rule for screening some predictors as "leading variables" based on their association with the response; one of c("topk", "fixedetasqthresh", "percetasqthresh"). The association measure is eta-squared.

"topk" keeps the \(k\) predictors with the largest association values; "fixedetasqthresh" keeps predictors whose association is greater than or equal to a specified threshold; "percetasqthresh" keeps predictors whose association is at least a given percentage of the highest association.

param_xy

Tuning parameter for method_xy, supplied as a list. If "topk", supply a list with an integer k (keep the top \(k\)). If "fixedetasqthresh", supply a list with a numeric threshold thresh (keep predictors with association \(\ge\) thresh). If "percetasqthresh", supply a list with a numeric percentage thresh in \((0,100]\) (keep predictors with association \(\ge\) that percentage of the highest association).
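The three screening rules can be illustrated with a small base-R sketch. This is not the package's internal code: eta_sq and screen_assoc are hypothetical helpers, the rule labels "topk", "fixed" and "perc" are the sketch's own shorthand, and eta-squared is assumed here to be the between-group sum of squares of a predictor (grouped by the binary response) divided by its total sum of squares.

```r
# Hypothetical helper (not part of S3VS): eta-squared of a numeric predictor x
# with respect to a grouping response y, as SS_between / SS_total.
eta_sq <- function(x, y) {
  grand <- mean(x)
  ss_total <- sum((x - grand)^2)
  ss_between <- sum(tapply(x, y, function(g) length(g) * (mean(g) - grand)^2))
  ss_between / ss_total
}

# Hypothetical helper: apply one of the three screening rules to a named
# vector of association values and return the names that are kept.
screen_assoc <- function(assoc, rule = c("topk", "fixed", "perc"), param) {
  rule <- match.arg(rule)
  switch(rule,
    # keep the k largest association values
    topk = names(sort(assoc, decreasing = TRUE))[seq_len(min(param$k, length(assoc)))],
    # keep values at or above an absolute threshold
    fixed = names(assoc)[assoc >= param$thresh],
    # keep values at or above a percentage of the largest value
    perc = names(assoc)[assoc >= (param$thresh / 100) * max(assoc)]
  )
}

# Toy association values for four predictors
assoc <- c(a = 0.50, b = 0.20, c = 0.05, d = 0.45)
top2  <- screen_assoc(assoc, "topk",  list(k = 2))
fixed <- screen_assoc(assoc, "fixed", list(thresh = 0.2))
perc  <- screen_assoc(assoc, "perc",  list(thresh = 80))
```

For a design matrix X with column names, the association vector could be built with apply(X, 2, eta_sq, y = y) before screening.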

method_xx

Rule for constructing, for each leading variable, the set of associated predictors (the "leading set") using inter-predictor association (absolute value of the correlation coefficient); one of c("topk", "fixedcorthresh", "perccorthresh"), with the same interpretation as the corresponding options of method_xy.

param_xx

Tuning parameter for method_xx; same interpretation as param_xy but applied to inter-predictor association (absolute value of the correlation coefficient).
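A minimal sketch of how a top-\(k\) leading set could be formed from absolute correlations is given below. The helper leading_set is hypothetical, and the sketch assumes the leading variable counts among the \(k\) members (its correlation with itself is the largest), which is an assumption about the implementation rather than documented behaviour.

```r
# Toy design matrix in which V2 is made strongly correlated with V1
set.seed(1)
X <- matrix(rnorm(50 * 6), 50, 6)
colnames(X) <- paste0("V", 1:6)
X[, 2] <- X[, 1] + 0.1 * rnorm(50)

# Hypothetical helper (not part of S3VS): the k predictors with the largest
# absolute correlation with the leading variable `lead`.
leading_set <- function(X, lead, k) {
  r <- abs(cor(X[, lead], X))[1, ]  # named vector of |correlation| with every column
  names(sort(r, decreasing = TRUE))[seq_len(min(k, length(r)))]
}

ls_v1 <- leading_set(X, "V1", k = 2)
```

The threshold-based rules would replace the sort-and-truncate step with a comparison of r against a fixed value or a percentage of max(r).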

vsel_method

Character string specifying the variable selection method to be used within each leading set. Available options are "NLP", "LASSO", "ENET", "SCAD", "MCP".

alpha

Only used when vsel_method == "ENET". Elastic net mixing parameter, with \(\alpha \in (0,1)\).

method_sel

Policy for aggregating predictors selected across leading sets in an iteration; one of c("conservative","liberal"). "conservative" selects the smallest admissible set of predictors by intersecting the selected sets across leading sets, beginning with all leading sets and dropping sets from the end until a non-empty intersection is found; this ensures that only predictors consistently selected across leading sets are retained. "liberal" selects the largest admissible set of predictors by taking the union of all selected sets, so any predictor chosen in at least one leading set is included. If no predictor is selected from the first leading set, the iteration does not contribute to the final selection, and the exclusion rules (method_rem) are applied instead.
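The two aggregation policies can be sketched in base R as follows. The function aggregate_selected is a hypothetical illustration of the description above, not the package's internal routine; it takes a list with one character vector of selected predictors per leading set.

```r
# Hypothetical helper (not part of S3VS): combine per-leading-set selections.
aggregate_selected <- function(sel_sets, method_sel = c("conservative", "liberal")) {
  method_sel <- match.arg(method_sel)
  # Nothing selected from the first leading set: the iteration contributes nothing
  if (length(sel_sets) == 0 || length(sel_sets[[1]]) == 0) return(character(0))
  # Liberal: union of every selected set
  if (method_sel == "liberal") return(Reduce(union, sel_sets))
  # Conservative: intersect sets 1..j, starting with all sets and dropping
  # sets from the end until the intersection is non-empty
  for (j in rev(seq_along(sel_sets))) {
    common <- Reduce(intersect, sel_sets[seq_len(j)])
    if (length(common) > 0) return(common)
  }
  character(0)
}

sets <- list(c("V1", "V2"), c("V2", "V3"), "V4")
cons <- aggregate_selected(sets, "conservative")
lib  <- aggregate_selected(sets, "liberal")
```

In this toy input the intersection of all three sets is empty, so the conservative policy falls back to the first two sets, while the liberal policy keeps every predictor that appears anywhere.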

method_rem

Policy for excluding predictors when no selections are made in an iteration; one of c("conservative_begin","conservative_end","liberal"). "conservative_begin" excludes the smallest admissible set of predictors by intersecting the non-selected sets of predictors starting from the first leading set; "conservative_end" does the same but begins from the last leading set and moves backward; "liberal" excludes the largest admissible set of predictors by taking the union of all non-selected sets of predictors. Predictors excluded under this rule are removed from subsequent iterations.

sel_regout

Logical. If TRUE, when predictors are selected in an iteration, the working response y is updated using the selected predictors via update_y_GLM.

rem_regout

Logical. If TRUE, when no predictors are selected in an iteration and some are removed instead, the working response y is updated using the removed predictors via update_y_GLM.

update_y_thresh

Numeric scalar threshold controlling how the working response y is updated when sel_regout = TRUE or rem_regout = TRUE. Where \(|y - fitted\_y| > update\_y\_thresh\), y is kept; otherwise y is replaced by the rounded value of fitted_y, where fitted_y is the fitted probability from the logistic model. The default, NULL, corresponds to a threshold of 0.5.
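The update rule stated above can be written as a one-line base-R sketch. The function update_y_sketch is hypothetical and only mirrors the rule as described here; it is not the package's update_y_GLM.

```r
# Hypothetical helper (not part of S3VS): keep y where it is far from the
# fitted probability, otherwise replace it by the rounded fitted probability.
update_y_sketch <- function(y, fitted_y, update_y_thresh = 0.5) {
  ifelse(abs(y - fitted_y) > update_y_thresh, y, round(fitted_y))
}

y        <- c(1, 0, 1, 0)
fitted_y <- c(0.9, 0.8, 0.2, 0.3)

y_default <- update_y_sketch(y, fitted_y)                          # threshold 0.5
y_loose   <- update_y_sketch(y, fitted_y, update_y_thresh = 0.85)  # larger threshold
```

With a threshold of 0.5 every observation in this toy input keeps its original value, whereas the larger threshold replaces the second and third responses by their rounded fitted probabilities.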

m

Integer. Maximum number of S3VS iterations to perform. Defaults to 100.

nskip

Integer. Maximum number of iterations in which no new predictors are selected before the algorithm stops. Defaults to 3.

verbose

Logical. If TRUE, prints detailed progress information at each iteration (e.g., iteration number, predictors selected or removed). Defaults to FALSE.

seed

If supplied, sets the random seed via set.seed() to ensure reproducibility of stochastic components. If NULL, no seed is set.

parallel

Logical. If TRUE, attempts to perform some computations in parallel mode, which is strongly recommended for faster execution. Defaults to FALSE.

Details

For a binary response, S3VS considers the generalized linear model (GLM) $$ g\!\left( E\!\left( \boldsymbol{y} \mid \boldsymbol{X} \right) \right) = \boldsymbol{X}\boldsymbol{\beta}, $$ where \(g\) is the link function; for the logistic model, \(g(\mu) = \log\{\mu / (1 - \mu)\}\).

For the S3VS algorithm, see the manual of the top-level function S3VS.

Value

A list with the following components:

selected

A character vector of predictor names that were selected across all iterations.

selected_iterwise

A list recording the predictors selected at each iteration, in the order they were considered.

runtime

Runtime in seconds.

Author

Nilotpal Sanyal <nsanyal@utep.edu>, Padmore N. Prempeh <pprempeh@albany.edu>

Examples

# Simulate binary data
set.seed(123)
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", 1:p)
eta <- X[,1] + 0.5 * X[,2]
prob <- 1 / (1 + exp(-eta))
y <- rbinom(n, size = 1, prob = prob)
# Run S3VS for GLM (logistic)
res_glm <- S3VS_GLM(y = y, X = X,
                method_xy = "topk", param_xy = list(k = 1),
                method_xx = "topk", param_xx = list(k = 3),
                vsel_method = "LASSO", 
                method_sel = "conservative", method_rem = "conservative_begin", 
                sel_regout = FALSE, rem_regout = FALSE,
                m = 100, nskip = 3, verbose = TRUE, seed = 123)
#> -------------
#> Iteration 1
#> -------------
#> [[1]]
#> [1] "V32" "V80" "V49"
#> 
#> input variables: V32 V80 V49 
#> Parallel disabled.
#> Selected variables: 
#> Not selected variables: V32 V80 V49 
#> [[1]]
#> NULL
#> 
#> *** nskip= 1 *** 
#> -------------
#> Iteration 2
#> -------------
#> [[1]]
#> [1] "V1"   "V119" "V70" 
#> 
#> input variables: V1 V119 V70 
#> Parallel disabled.
#> Selected variables: 
#> Not selected variables: V1 V119 V70 
#> [[1]]
#> NULL
#> 
#> *** nskip= 2 *** 
#> -------------
#> Iteration 3
#> -------------
#> [[1]]
#> [1] "V12" "V30" "V56"
#> 
#> input variables: V12 V30 V56 
#> Parallel disabled.
#> Selected variables: 
#> Not selected variables: V12 V30 V56 
#> [[1]]
#> NULL
#> 
#> *** nskip= 3 *** 
#> =================================
#> Number of selected variables: 0
#> Time taken: 0.04 sec
#> =================================
# View selected predictors
res_glm$selected
#> NULL