| Title: | Stepwise Clustered Ensemble |
|---|---|
| Description: | Implementation of Stepwise Clustered Ensemble (SCE) and Stepwise Cluster Analysis (SCA) for multivariate data analysis. The package provides comprehensive tools for feature selection, model training, prediction, and evaluation in hydrological and environmental modeling applications. Key functionalities include recursive feature elimination (RFE), Wilks feature importance analysis, model validation through out-of-bag (OOB) validation, and ensemble prediction capabilities. The package supports both single and multivariate response variables, making it suitable for complex environmental modeling scenarios. For more details see Li et al. (2021) <doi:10.5194/hess-25-4947-2021>. |
| Authors: | Kailong Li [aut, cre] |
| Maintainer: | Kailong Li <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.4 |
| Built: | 2026-05-11 10:42:30 UTC |
| Source: | https://github.com/cran/SCE |
These datasets contain air quality measurements for training and testing purposes. They include various air pollutant concentrations and meteorological variables measured at different locations and times.

    data("air_quality_training")
    data("air_quality_testing")

Both datasets are data frames with 8760 rows and 12 variables:
Date and time of measurement (POSIXct format)
PM2.5: Particulate matter with diameter less than 2.5 micrometers (µg/m^3)
PM10: Particulate matter with diameter less than 10 micrometers (µg/m^3)
SO2: Sulfur dioxide concentration (µg/m^3)
NO2: Nitrogen dioxide concentration (µg/m^3)
CO: Carbon monoxide concentration (µg/m^3)
O3: Ozone concentration (µg/m^3)
TEMP: Temperature (°C)
PRES: Atmospheric pressure (hPa)
DEWP: Dew point temperature (°C)
RAIN: Precipitation amount (mm)
WSPM: Wind speed (m/s)
Dataset Differences:
air_quality_training: Used for training SCA and SCE models
air_quality_testing: Used for testing trained models
Variable Descriptions:
PM2.5, PM10: Particulate matter concentrations, important indicators of air quality
SO2, NO2, CO, O3: Major air pollutants regulated by environmental agencies
TEMP, PRES, DEWP: Meteorological variables affecting air quality
RAIN, WSPM: Weather conditions that influence pollutant dispersion
Source: Air quality monitoring stations
Evaluate model performance for SCE or SCA models.

    ## S3 method for class 'sce'
    evaluate(object, testing_data, training_data, digits = 3, ...)

    ## S3 method for class 'sca'
    evaluate(object, testing_data, training_data, digits = 3, ...)

| Argument | Description |
|---|---|
| object | An SCE or SCA model object |
| testing_data | Testing dataset |
| training_data | Training dataset |
| digits | Number of decimal places (default: 3) |
| ... | Additional arguments |

Model performance metrics.
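The manual does not list which metrics are reported. As a rough illustration only, the sketch below computes a few metrics that are common in hydrological modeling (RMSE, MAE, Nash-Sutcliffe efficiency, and squared correlation) in base R. The function name `performance_metrics` and the choice of metrics are assumptions made for this example, not the package's implementation.

```r
# Hypothetical helper: common goodness-of-fit metrics for one response variable.
# This is NOT the SCE package's evaluate() method, only an illustration.
performance_metrics <- function(observed, predicted, digits = 3) {
  stopifnot(length(observed) == length(predicted))
  errors <- observed - predicted

  rmse <- sqrt(mean(errors^2))                  # root mean squared error
  mae  <- mean(abs(errors))                     # mean absolute error
  nse  <- 1 - sum(errors^2) /
    sum((observed - mean(observed))^2)          # Nash-Sutcliffe efficiency
  r2   <- cor(observed, predicted)^2            # squared Pearson correlation

  round(c(RMSE = rmse, MAE = mae, NSE = nse, R2 = r2), digits)
}

obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.1, 1.9, 3.2, 3.8, 5.1)
print(performance_metrics(obs, pred))
```

Rounding happens once at the end, mirroring the role of the `digits` argument described above.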
Calculate variable importance for SCE or SCA models.

    ## S3 method for class 'sce'
    importance(object, oob_weight = TRUE, digits = 2, ...)

    ## S3 method for class 'sca'
    importance(object, digits = 2, ...)

| Argument | Description |
|---|---|
| object | An SCE or SCA model object |
| oob_weight | Use out-of-bag weights for importance calculation (SCE only, default: TRUE) |
| digits | Number of decimal places to round the returned relative importance values (default: 2) |
| ... | Additional arguments |

Variable importance rankings. For convenience, relative importance values are rounded to digits decimal places.
Plot Recursive Feature Elimination results.

    plot_rfe(rfe_result,
             main = "OOB Validation and Testing R2 vs Number of Predictors",
             col_validation = "blue", col_testing = "red",
             pch = 16, lwd = 2, cex = 1.2,
             legend_pos = "bottomleft", ...)

| Argument | Description |
|---|---|
| rfe_result | Result object from |
| main | Plot title |
| col_validation | Color for validation line |
| col_testing | Color for testing line |
| pch | Point character |
| lwd | Line width |
| cex | Point size |
| legend_pos | Legend position |
| ... | Additional arguments |

Plot showing validation and testing R2 vs number of predictors.
Make predictions on new data using SCE or SCA models.

    ## S3 method for class 'sce'
    predict(object, newdata, ...)

    ## S3 method for class 'sca'
    predict(object, newdata, ...)

| Argument | Description |
|---|---|
| object | An SCE or SCA model object |
| newdata | New data for prediction |
| ... | Additional arguments |

Predictions for the new data.
Print information about SCE or SCA model objects.

    ## S3 method for class 'sce'
    print(x, ...)

    ## S3 method for class 'sca'
    print(x, ...)

| Argument | Description |
|---|---|
| x | An SCE or SCA model object |
| ... | Additional arguments (not used) |

For SCE objects, prints ensemble information including number of trees, parameters, predictors, predictants, and OOB performance metrics.
For SCA objects, prints tree structure information including total nodes, leaf nodes, cutting/merging actions, and variable names.
Prints model information and returns the object invisibly.
Recursive Feature Elimination for SCE models to identify the most important predictors.

    rfe_sce(training_data, testing_data, predictors, predictant, nmin, ntree,
            alpha = 0.05, resolution = 1000, step = 1,
            verbose = TRUE, parallel = TRUE)

| Argument | Description |
|---|---|
| training_data | Training dataset |
| testing_data | Testing dataset |
| predictors | Character vector of predictor names |
| predictant | Character vector of predictant names |
| nmin | Minimum samples per node |
| ntree | Number of trees |
| alpha | Significance level (default: 0.05) |
| resolution | Resolution for splitting (default: 1000) |
| step | Number of predictors to remove per iteration (default: 1) |
| verbose | Print progress (default: TRUE) |
| parallel | Use parallel processing (default: TRUE) |

RFE results with performance metrics and importance scores.
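The manual gives no example for recursive feature elimination, so here is a minimal sketch of how `rfe_sce()` might be called with the 10-variable streamflow data, using the argument names from the usage above. The values of `nmin` and `ntree` are borrowed from the package's `sce()` example, and passing the result directly to `plot_rfe()` is an assumption based on that function's `rfe_result` argument. The call is guarded so the script does nothing if the SCE package is not installed.

```r
# Variable names taken from the package's streamflow examples
Predictors  <- c("Prcp", "SRad", "Tmax", "Tmin", "VP",
                 "smlt", "swvl1", "swvl2", "swvl3", "swvl4")
Predictants <- c("Flow")

if (requireNamespace("SCE", quietly = TRUE)) {
  library(SCE)
  data(streamflow_training_10var)
  data(streamflow_testing_10var)

  # Remove one predictor per iteration (step = 1); illustrative settings only
  rfe_result <- rfe_sce(
    training_data = streamflow_training_10var,
    testing_data  = streamflow_testing_10var,
    predictors    = Predictors,
    predictant    = Predictants,
    nmin          = 5,
    ntree         = 48,
    alpha         = 0.05,
    resolution    = 1000,
    step          = 1,
    verbose       = TRUE,
    parallel      = FALSE
  )

  # Assumption: plot_rfe() accepts the object returned by rfe_sce()
  plot_rfe(rfe_result)
}
```

Setting `parallel = FALSE` keeps the run reproducible on machines without a parallel backend, at the cost of speed.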
Builds a single Stepwise Cluster Analysis (SCA) tree model that recursively partitions the data space based on Wilks' Lambda statistic.

    sca(training_data, x, y, nmin, alpha = 0.05, resolution = 1000,
        verbose = FALSE)

| Argument | Description |
|---|---|
| training_data | A data.frame containing the training data |
| x | Character vector of predictor variable names |
| y | Character vector of predictant variable names |
| nmin | Minimum number of samples in a leaf node |
| alpha | Significance level for clustering (default: 0.05) |
| resolution | Resolution for splitting (default: 1000) |
| verbose | Print progress information (default: FALSE) |

An S3 object of class "sca" containing the tree model.
See also: sce, predict, importance, evaluate

    # Load example data
    data(streamflow_training_10var)
    data(streamflow_testing_10var)

    # Define variables
    Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
    Predictants <- c("Flow")

    # Build SCA model
    sca_model <- sca(
      training_data = streamflow_training_10var,
      x = Predictors,
      y = Predictants,
      nmin = 5,
      alpha = 0.05,
      resolution = 1000
    )

    # Use S3 methods
    print(sca_model)
    summary(sca_model)
    sca_predictions <- predict(sca_model, streamflow_testing_10var)
    sca_importance <- importance(sca_model)
    sca_evaluation <- evaluate(sca_model, streamflow_testing_10var, streamflow_training_10var)

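To make the Wilks' Lambda criterion mentioned in the description of `sca()` more concrete, the following base-R sketch computes the statistic for a candidate split of a sample into two groups, as the ratio of the determinant of the within-group sum-of-squares matrix to that of the total sum-of-squares matrix. The function name `wilks_lambda` and the use of the built-in `iris` data are assumptions for illustration; the package's internal computation may differ.

```r
# Hypothetical helper: Wilks' Lambda = det(W) / det(T), where W is the pooled
# within-group SSCP matrix and T the total SSCP matrix of the responses.
wilks_lambda <- function(y, group) {
  y <- as.matrix(y)
  group <- as.factor(group)

  # Total sums of squares and cross-products around the overall mean
  total_sscp <- crossprod(scale(y, center = TRUE, scale = FALSE))

  # Within-group SSCP, accumulated group by group around each group mean
  within_sscp <- matrix(0, ncol(y), ncol(y))
  for (g in levels(group)) {
    block <- y[group == g, , drop = FALSE]
    within_sscp <- within_sscp +
      crossprod(scale(block, center = TRUE, scale = FALSE))
  }

  det(within_sscp) / det(total_sscp)
}

# Candidate cut on one predictor, evaluated on two response variables
responses <- iris[, c("Sepal.Length", "Petal.Length")]
left_side <- iris$Petal.Width <= 0.8
print(wilks_lambda(responses, left_side))
```

Values close to 0 indicate that the two groups are well separated in the response space, while values close to 1 indicate that the split explains little of the variation.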
Builds a Stepwise Clustered Ensemble (SCE) model: an ensemble of SCA trees, each built from a bootstrap sample with random feature selection, intended to improve prediction accuracy and robustness over a single SCA tree.

    sce(training_data, x, y, mfeature, nmin, ntree, alpha = 0.05,
        resolution = 1000, verbose = FALSE, parallel = TRUE)

| Argument | Description |
|---|---|
| training_data | A data.frame containing the training data |
| x | Character vector of predictor variable names |
| y | Character vector of predictant variable names |
| mfeature | Number of features to randomly select for each tree |
| nmin | Minimum number of samples in a leaf node |
| ntree | Number of trees in the ensemble |
| alpha | Significance level for clustering (default: 0.05) |
| resolution | Resolution for splitting (default: 1000) |
| verbose | Print progress information (default: FALSE) |
| parallel | Use parallel processing (default: TRUE) |

An S3 object of class "sce" containing the ensemble model.
See also: sca, predict, importance, evaluate

    # Load example data
    data(streamflow_training_10var)
    data(streamflow_testing_10var)

    # Define variables
    Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
    Predictants <- c("Flow")

    # Build SCE model
    sce_model <- sce(
      training_data = streamflow_training_10var,
      x = Predictors,
      y = Predictants,
      mfeature = round(0.5 * length(Predictors)),
      nmin = 5,
      ntree = 48,
      alpha = 0.05,
      resolution = 1000,
      parallel = FALSE
    )

    # Use S3 methods
    print(sce_model)
    summary(sce_model)
    sce_predictions <- predict(sce_model, streamflow_testing_10var)
    sce_importance <- importance(sce_model)
    sce_evaluation <- evaluate(sce_model, streamflow_testing_10var, streamflow_training_10var)

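The description of `sce()` refers to bootstrap samples and random feature selection. The base-R sketch below shows what drawing the inputs for each tree could look like: rows sampled with replacement, the left-out rows kept as the out-of-bag (OOB) set, and `mfeature` predictors sampled without replacement. The helper `draw_tree_sample` and the sample size of 100 are hypothetical and do not reproduce the package's internals.

```r
# Hypothetical helper: draw the rows and features used by one ensemble member
draw_tree_sample <- function(n_samples, predictors, mfeature) {
  in_bag     <- sample(n_samples, size = n_samples, replace = TRUE)  # bootstrap rows
  out_of_bag <- setdiff(seq_len(n_samples), in_bag)                  # rows never drawn
  features   <- sample(predictors, size = mfeature)                  # random feature subset
  list(in_bag = in_bag, out_of_bag = out_of_bag, features = features)
}

set.seed(42)
predictors <- c("Prcp", "SRad", "Tmax", "Tmin", "VP",
                "smlt", "swvl1", "swvl2", "swvl3", "swvl4")
mfeature   <- round(0.5 * length(predictors))   # same rule as the example above

# One draw per tree; three trees are enough to show the idea
draws <- lapply(seq_len(3), function(i) {
  draw_tree_sample(n_samples = 100, predictors = predictors, mfeature = mfeature)
})

for (d in draws) {
  cat("OOB rows:", length(d$out_of_bag),
      "| features:", paste(d$features, collapse = ", "), "\n")
}
```

Because each tree sees a different subset of rows, the rows it never saw can serve as a validation set, which is the idea behind the OOB validation mentioned in the package description.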
These datasets contain streamflow and related environmental variables for training and testing purposes. They are used in examples to demonstrate the SCE package functionality with different levels of complexity.

    data("streamflow_training_10var")
    data("streamflow_training_22var")
    data("streamflow_testing_10var")
    data("streamflow_testing_22var")

streamflow_training_10var: Basic environmental variables (12 columns):
Date and time of measurement
Prcp: Monthly mean daily precipitation (mm)
SRad: Monthly mean daily solar radiation (W/m^2)
Tmax: Monthly mean daily maximum temperature (°C)
Tmin: Monthly mean daily minimum temperature (°C)
VP: Monthly mean daily vapor pressure (Pa)
smlt: Monthly snowmelt (m)
swvl1: Soil water content layer 1 (m^3/m^3)
swvl2: Soil water content layer 2 (m^3/m^3)
swvl3: Soil water content layer 3 (m^3/m^3)
swvl4: Soil water content layer 4 (m^3/m^3)
Flow: Monthly mean daily streamflow (cfs)
streamflow_training_22var: Extended variables with climate indices (24 columns):
Streamflow measurements
Interdecadal Pacific Oscillation
IPO with 1-month lag
IPO with 2-month lag
Nino 3.4 index
Nino 3.4 with 1-month lag
Nino 3.4 with 2-month lag
Pacific Decadal Oscillation
PDO with 1-month lag
PDO with 2-month lag
Pacific North American pattern
PNA with 1-month lag
PNA with 2-month lag
Monthly precipitation
2-month precipitation
Solar radiation
2-month solar radiation
Maximum temperature
2-month maximum temperature
Minimum temperature
2-month minimum temperature
Vapor pressure
2-month vapor pressure
Testing datasets: Same structure as corresponding training datasets.
Dataset Structure:
10var datasets: Basic environmental variables (12 columns)
22var datasets: Extended variables with climate indices (24 columns)
Training datasets: Used for model building
Testing datasets: Used for model evaluation
Climate Indices: IPO (Interdecadal Pacific Oscillation), Nino3.4 (El Nino), PDO (Pacific Decadal Oscillation), PNA (Pacific North American pattern)
Data Sources: ERA5 Land, Daymet, USGS, and climate indices databases
Source: Environmental monitoring stations, climate indices databases, ERA5 Land, Daymet, and USGS
Provide concise summaries of model structure and performance for SCE and SCA objects.

    ## S3 method for class 'sce'
    summary(object, ...)

    ## S3 method for class 'sca'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | An SCE or SCA model object |
| ... | Additional arguments passed to or ignored by methods |

For summary.sce, the method prints ensemble configuration, out-of-bag (OOB) performance statistics, tree structure information, and tree weight distribution.
For summary.sca, the method prints tree structure information and variable summaries for the single SCA tree.
Invisibly returns the input object after printing the summary.
See also: sce, sca, print, importance, evaluate
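The pattern of printing a summary and returning the input invisibly, as described for the summary methods, is a common S3 convention. The sketch below illustrates it on a made-up class, `toy_model`, which stands in for a fitted model; nothing here is taken from the SCE source code.

```r
# Hypothetical object standing in for a fitted model
toy_model <- structure(
  list(n_trees = 3, predictors = c("Prcp", "Tmax")),
  class = "toy_model"
)

# S3 summary method: print a short report, then return the object invisibly
summary.toy_model <- function(object, ...) {
  cat("Toy model with", object$n_trees, "trees\n")
  cat("Predictors:", paste(object$predictors, collapse = ", "), "\n")
  invisible(object)
}

# withVisible() reveals whether the returned value would auto-print
result <- withVisible(summary(toy_model))
cat("Returned visibly:", result$visible, "\n")
```

Returning the object invisibly avoids a second, automatic printout at the console while still allowing the result to be assigned or piped into another call.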