S3VS performs variable selection using the structured screen-and-select (S3VS) framework in linear models, generalized linear models with binary data, and survival models such as the Cox model and accelerated failure time (AFT) model.

Installation

Install from CRAN

Install from GitHub

# install.packages("devtools")
devtools::install_github("nilotpalsanyal/S3VS")

Description

S3VS() is the main and the top-level function of this package. Depending on the family, S3VS() dispatches to S3VS_LM(), S3VS_GLM(), or S3VS_SURV(), each of which calls the following functions to perform the whole analysis:

  • get_leadvars*() for choosing leading variables,
  • get_leadsets() for construting leading sets,
  • VS_method*() for within-set variable selection,
  • select_vars() for aggregating selected variables across sets,
  • remove_vars() for aggregating not-selected variables across sets,
  • update_y*() for updating the working response, and
  • looprun() to check when the algorithm should be terminated.

Examples

Below, we provide three examples of the usage of S3VS() for linear, generalized linear, and survival models, respectively. For a detailed manual, see the package vignette.

Example 1: Linear model

We simulate a correlated design and a sparse linear signal.

set.seed(1)

n <- 200
p <- 300

rho <- 0.6
Sigma <- rho ^ abs(outer(1:p, 1:p, "-"))
X_lm <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
colnames(X_lm) <- paste0("X", seq_len(p))

beta <- rep(0, p)
active <- c(5, 17, 50, 120, 201)
beta[active] <- c(2.0, -1.5, 1.2, 1.8, -2.2)

y_lm <- as.numeric(X_lm %*% beta + rnorm(n, sd = 2))

Next, we run un S3VS with LASSO within leading sets

fit_lm <- S3VS(
  y = y_lm,
  X = X_lm,
  family = "normal",
  method_xy = "percthresh",
  param_xy = list(thresh = 95),
  method_xx = "topk",
  param_xx = list(k = 10),
  vsel_method = "LASSO",
  method_sel = "conservative",
  method_rem = "conservative_begin",
  verbose = FALSE,
  seed = 1
)
#> =================================
#> Number of selected variables: 6
#> Time taken: 0.2 sec
#> =================================

The selected variables are

fit_lm$selected
#> [1] "X5"   "X201" "X17"  "X120" "X50"  "X270"

Iteration-wise selected variables are

str(fit_lm$selected_iterwise)
#> List of 9
#>  $ : chr "X5"
#>  $ : chr "X201"
#>  $ : chr "X17"
#>  $ : chr "X120"
#>  $ : chr "X50"
#>  $ : chr "X270"
#>  $ : chr ""
#>  $ : chr ""
#>  $ : chr ""

pred_S3VS() refits a model on the selected variables using the method appropriate for the family.

Xsel <- X_lm[, fit_lm$selected, drop = FALSE]
pred_lm <- pred_S3VS(y = y_lm, X = Xsel, family = "normal", method = "LASSO")

head(pred_lm$y.pred)
#> [1]  2.4719666 -3.5834730 -3.9231277  2.7764653 -7.4049709  0.8373651

Example 2: Binary classification model

We simulate a correlated design and a sparse binary signal.

set.seed(2)

n <- 250
p <- 400

rho <- 0.4
Sigma <- rho ^ abs(outer(1:p, 1:p, "-"))
X_glm <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
colnames(X_glm) <- paste0("X", seq_len(p))

beta <- rep(0, p)
active <- c(10, 25, 90, 200)
beta[active] <- c(1.2, -1.0, 0.9, -1.4)

eta <- as.numeric(X_glm %*% beta)
pr  <- 1 / (1 + exp(-eta))
y_glm   <- rbinom(n, 1, pr)

We run un S3VS with SCAD within leading sets

fit_glm <- S3VS(
  y = y_glm,
  X = X_glm,
  family = "binomial",
  method_xy = "_survtopk",
  param_xy = list(k = 2),
  method_xx = "percthresh",
  param_xx = list(thresh = 90),
  vsel_method = "SCAD",
  sel_regout = TRUE,
  rem_regout = TRUE,
  update_y_thresh = 0.5,
  verbose = FALSE,
  seed = 1
)
#> =================================
#> Number of selected variables: 9
#> Time taken: 0.04 sec
#> =================================

fit_glm$selected
#> [1] "X200" "X25"  "X10"  "X277" "X101" "X195" "X90"  "X340" "X116"

Example 3: Survival model

For survival models, y must be a list with components:

  • time: observed times
  • status: event indicator (1 = event, 0 = censored)

We generate a survival data as follows.

set.seed(3)

n <- 300
p <- 200
X_surv <- matrix(rnorm(n*p), n, p)
colnames(X_surv) <- paste0("X", seq_len(p))

beta <- rep(0, p)
active <- c(3, 30, 77, 150)
beta[active] <- c(0.6, -0.5, 0.7, -0.8)

linpred <- as.numeric(X_surv %*% beta)
base_rate <- 0.05

time <- rexp(n, rate = base_rate * exp(linpred))
cens <- rexp(n, rate = 0.02)
status <- as.integer(time <= cens)
time <- pmin(time, cens)

y_surv <- list(time = time, status = status)

First, we run S3VS assuming a Cox model.

fit_cox <- S3VS(
  y = y_surv,
  X = X_surv,
  family = "survival",
  surv_model = "COX",
  method_xy = "topk",
  param_xy = list(k = 2),
  method_xx = "percthresh",
  param_xx = list(thresh = 90),
  vsel_method = "LASSO",
  verbose = FALSE,
  seed = 1
)
#> =================================
#> Number of selected variables: 4
#> Time taken: 0.31 sec
#> =================================

fit_cox$selected
#> [1] "X150" "X77"  "X3"   "X30"

Next, we run S3VS assuming an AFT model. AFT screening is often more expensive because adding a candidate can require repeated model fitting. Start with smaller k values.

fit_aft <- S3VS(
  y = y_surv,
  X = X_surv,
  family = "survival",
  surv_model = "AFT",
  method_xy = "topk",
  param_xy = list(k = 2),
  method_xx = "topk",
  param_xx = list(k = 2),
  vsel_method = "AFTGEE",
  verbose = FALSE,
  seed = 1,
  parallel = FALSE
)
#> =================================
#> Number of selected variables: 8
#> Time taken: 6.02 sec
#> =================================

fit_aft$selected
#> [1] "X150" "X77"  "X3"   "X30"  "X148" "X74"  "X62"  "X116"