API Reference
Datasets
NovaML.Datasets.load_boston — Method

`load_boston(; return_X_y=false)`

Load and return the Boston house prices dataset (regression).
This function creates a synthetic version of the Boston Housing dataset for demonstration purposes, as the original dataset might not be available.
Arguments
- `return_X_y::Bool`: If true, returns `(X, y)` instead of a dict-like object.
Returns
If `return_X_y` is false, returns a Dict with the following keys:
- "data": Matrix{Float64} of shape (506, 13). The data matrix.
- "target": Vector{Float64} of shape (506,). The regression target.
- "feature_names": Vector{String}. The names of the dataset columns.
- "DESCR": String. The full description of the dataset.
If `return_X_y` is true, returns a tuple `(data, target)`:
- data: Matrix{Float64} of shape (506, 13)
- target: Vector{Float64} of shape (506,)
Description
The Boston Housing dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms.
Note: This function generates synthetic data based on the structure of the original Boston Housing dataset. The actual values and relationships in the data are simulated and do not represent real housing data.
Features
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population

Target

- MEDV: Median value of owner-occupied homes in $1000's

Example
```julia
# Load the Boston Housing dataset
boston = load_boston()

# Access the data and target
X = boston["data"]
y = boston["target"]

# Get feature names
feature_names = boston["feature_names"]

# Alternatively, get data and target directly
X, y = load_boston(return_X_y=true)
```
NovaML.Datasets.load_breast_cancer — Method

`load_breast_cancer(; return_X_y=false)`

Load and return the Wisconsin Breast Cancer dataset (classification).
Arguments
- `return_X_y::Bool`: If true, returns `(X, y)` instead of a dict-like object.
Returns
If `return_X_y` is false, returns a Dict with the following keys:
- "data": Matrix{Float64} of shape (569, 30). The data matrix.
- "target": Vector{Bool} of shape (569,) The classification target.
- "feature_names": Vector{String} The names of the dataset columns.
- "target_names": Vector{String} The names of target classes.
- "DESCR": String The full description of the dataset.
If `return_X_y` is true, returns a tuple `(data, target)`:
- data: Matrix{Float64} of shape (569, 30)
- target: Vector{Bool} of shape (569,)
Description
The Wisconsin Breast Cancer dataset is a classic and very easy binary classification dataset.
Features
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
Ten real-valued features are computed for each cell nucleus: 1) radius (mean of distances from center to points on the perimeter) 2) texture (standard deviation of gray-scale values) 3) perimeter 4) area 5) smoothness (local variation in radius lengths) 6) compactness (perimeter^2 / area - 1.0) 7) concavity (severity of concave portions of the contour) 8) concave points (number of concave portions of the contour) 9) symmetry 10) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
Target
- 0: benign
- 1: malignant

Dataset Characteristics
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information: 10 real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
:Class Distribution: 212 Malignant, 357 Benign

Example
```julia
# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()

# Access the data and target
X = breast_cancer["data"]
y = breast_cancer["target"]

# Get feature names and target names
feature_names = breast_cancer["feature_names"]
target_names = breast_cancer["target_names"]

# Alternatively, get data and target directly
X, y = load_breast_cancer(return_X_y=true)
```
Notes
This function downloads the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository if it's not already present in the local directory.
The dataset was created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin-Madison.
NovaML.Datasets.load_iris — Method

`load_iris(; return_X_y=false)`

Load and return the iris dataset (classification).
Arguments
- `return_X_y::Bool`: If true, returns `(X, y)` instead of a dict-like object.
Returns
If `return_X_y` is false, returns a Dict with the following keys:
- "data": Matrix{Float64} of shape (150, 4). The data matrix.
- "target": Vector{Int} of shape (150,) The classification target.
- "feature_names": Vector{String} The names of the dataset columns.
- "target_names": Vector{String} The names of target classes.
- "DESCR": String The full description of the dataset.
If `return_X_y` is true, returns a tuple `(data, target)`:
- data: Matrix{Float64} of shape (150, 4)
- target: Vector{Int} of shape (150,)
Description
The iris dataset is a classic and very easy multi-class classification dataset.
Features
1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)

Target
- Iris-setosa (1)
- Iris-versicolor (2)
- Iris-virginica (3)

Dataset Characteristics
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
:Class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

Example
```julia
# Load the Iris dataset
iris = load_iris()

# Access the data and target
X = iris["data"]
y = iris["target"]

# Get feature names and target names
feature_names = iris["feature_names"]
target_names = iris["target_names"]

# Alternatively, get data and target directly
X, y = load_iris(return_X_y=true)
```
Notes
This function downloads the Iris dataset from the UCI Machine Learning Repository if it's not already present in the local directory.
NovaML.Datasets.load_wine — Method

`load_wine(; return_X_y=false)`

Load and return the wine dataset (classification).
Arguments
- `return_X_y::Bool`: If true, returns `(X, y)` instead of a dict-like object.
Returns
If `return_X_y` is false, returns a Dict with the following keys:
- "data": Matrix{Float64} of shape (178, 13). The data matrix.
- "target": Vector{Int} of shape (178,) The classification target.
- "feature_names": Vector{String} The names of the dataset columns.
- "target_names": Vector{String} The names of target classes.
- "DESCR": String The full description of the dataset.
If `return_X_y` is true, returns a tuple `(data, target)`:
- data: Matrix{Float64} of shape (178, 13)
- target: Vector{Int} of shape (178,)
Description
This dataset is a classic and very easy multi-class classification dataset.
Features
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

Target
- class 1 (0)
- class 2 (1)
- class 3 (2)

Dataset Characteristics
:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
:Class:
- class 1
- class 2
- class 3

Example
```julia
# Load the Wine dataset
wine = load_wine()

# Access the data and target
X = wine["data"]
y = wine["target"]

# Get feature names and target names
feature_names = wine["feature_names"]
target_names = wine["target_names"]

# Alternatively, get data and target directly
X, y = load_wine(return_X_y=true)
```
Notes
This function downloads the Wine dataset from the UCI Machine Learning Repository if it's not already present in the local directory.
The data set contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample.
The classes are ordered and not balanced (class 1 has 59 samples, class 2 has 71 samples, and class 3 has 48 samples).
This dataset is also excellent for visualization techniques.
NovaML.Datasets.make_blobs — Method

```julia
make_blobs(;
    n_samples::Union{Int, Vector{Int}} = 100,
    n_features::Int = 2,
    centers::Union{Int, Matrix{Float64}, Nothing} = nothing,
    cluster_std::Union{Float64, Vector{Float64}} = 1.0,
    center_box::Tuple{Float64, Float64} = (-10.0, 10.0),
    shuffle::Bool = true,
    random_state::Union{Int, Nothing} = nothing,
    return_centers::Bool = false
)
```

Generate isotropic Gaussian blobs for clustering.
Arguments
- `n_samples::Union{Int, Vector{Int}}`: The total number of points equally divided among clusters, or the number of samples per cluster.
- `n_features::Int`: The number of features for each sample.
- `centers::Union{Int, Matrix{Float64}, Nothing}`: The number of centers to generate, or a matrix of center locations.
- `cluster_std::Union{Float64, Vector{Float64}}`: The standard deviation of the clusters.
- `center_box::Tuple{Float64, Float64}`: The bounding box for each cluster center when centers are generated at random.
- `shuffle::Bool`: Shuffle the samples.
- `random_state::Union{Int, Nothing}`: Determines random number generation for dataset creation.
- `return_centers::Bool`: If true, returns the centers in addition to X and y.
Returns
- If `return_centers` is false:
  - `X::Matrix{Float64}`: Generated samples.
  - `y::Vector{Int}`: The integer labels for cluster membership of each sample.
- If `return_centers` is true:
  - `X::Matrix{Float64}`: Generated samples.
  - `y::Vector{Int}`: The integer labels for cluster membership of each sample.
  - `centers::Matrix{Float64}`: The centers used to generate the data.
Description
This function generates samples from isotropic Gaussian blobs for clustering. It can be used for testing clustering algorithms or as a simple dataset for demonstration purposes.
Example
```julia
Generate a simple dataset with 3 clusters
X, y = makeblobs(nsamples=300, centers=3, nfeatures=2, randomstate=42)
Generate a dataset with specified centers and return the centers
centers = [0 0; 1 1; 2 2] X, y, centers = makeblobs(nsamples=300, centers=centers, clusterstd=0.5, returncenters=true)
Notes
- If centers is an Int, it is interpreted as the number of centers to generate, and they are generated randomly within center_box.
- If centers is a 2-d array, it is interpreted as the actual centers to use, and n_features is ignored in this case.
- If n_samples is an int, it is interpreted as the total number of samples, which are then evenly divided among clusters.
- If n_samples is an array, it is interpreted as the number of samples per cluster.
NovaML.Datasets.make_moons — Method

```julia
make_moons(;
    n_samples::Union{Int, Tuple{Int, Int}}=100,
    shuffle::Bool=true,
    noise::Union{Float64, Nothing}=nothing,
    random_state::Union{Int, Nothing}=nothing
)
```

Generate two interleaving half circles for binary classification.
Arguments
- `n_samples::Union{Int, Tuple{Int, Int}}`: The total number of points generated, or a tuple containing the number of points in each of the two moons.
- `shuffle::Bool`: Whether to shuffle the samples.
- `noise::Union{Float64, Nothing}`: Standard deviation of Gaussian noise added to the data.
- `random_state::Union{Int, Nothing}`: Determines random number generation for dataset creation.
Returns
- `X::Matrix{Float64}`: The generated samples, of shape (n_samples, 2).
- `y::Vector{Int}`: The integer labels (0 or 1) for class membership of each sample.
Description
This function generates a binary classification dataset in the shape of two interleaving half moons. It can be used for testing classification algorithms or as a simple dataset for demonstration purposes.
Example
```julia
# Generate a simple moon dataset
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

# Generate a moon dataset with a different number of samples in each moon
X, y = make_moons(n_samples=(60, 40), noise=0.1, shuffle=false)
```
Notes
- If n_samples is an integer, it generates approximately equal numbers of samples in each moon; if the total is odd, the extra sample is added to the first moon.
- If n_samples is a tuple of two integers, it specifies the number of samples for each moon respectively.
- The two moons are generated on a 2D plane. The first moon is a half circle of radius 1 centered at (0, 0),
while the second moon is a half circle of radius 1 centered at (1, 0.5).
- If noise is specified, Gaussian noise with standard deviation noise is added to the data.
Clustering
NovaML.Cluster.AgglomerativeClustering — Type

`AgglomerativeClustering`

A struct representing Agglomerative Clustering, a hierarchical clustering algorithm.
Fields
- `n_clusters::Union{Int, Nothing}`: The number of clusters to find. If `nothing`, it must be used with `distance_threshold`.
- `metric::Union{String, Function}`: The metric to use for distance computation. Can be "euclidean", "manhattan", or a custom function.
- `memory::Union{String, Nothing}`: Used to cache the distance matrix between iterations.
- `connectivity::Union{AbstractMatrix, Function, Nothing}`: Connectivity matrix or callable to be used.
- `compute_full_tree::Union{Bool, String}`: Whether to compute the full tree or stop early.
- `linkage::String`: The linkage criterion to use. Can be "ward", "complete", "average", or "single".
- `distance_threshold::Union{Float64, Nothing}`: The threshold to stop clustering.
- `compute_distances::Bool`: Whether to compute distances.
Fitted Attributes
- `labels_::Vector{Int}`: Cluster labels for each point.
- `n_leaves_::Int`: Number of leaves in the hierarchical tree.
- `n_connected_components_::Int`: Number of connected components in the graph.
- `children_::Matrix{Int}`: The children of each non-leaf node.
- `distances_::Vector{Float64}`: Distances between nodes in the tree.
Constructor
```julia
AgglomerativeClustering(;
    n_clusters::Union{Int, Nothing}=2,
    metric::Union{String, Function}="euclidean",
    memory::Union{String, Nothing}=nothing,
    connectivity::Union{AbstractMatrix, Function, Nothing}=nothing,
    compute_full_tree::Union{Bool, String}="auto",
    linkage::String="ward",
    distance_threshold::Union{Float64, Nothing}=nothing,
    compute_distances::Bool=false
)
```

Constructs an AgglomerativeClustering object with the specified parameters.
Examples

```julia
# Create an AgglomerativeClustering object with 3 clusters
clustering = AgglomerativeClustering(n_clusters=3)

# Create an AgglomerativeClustering object with a distance threshold
clustering = AgglomerativeClustering(distance_threshold=1.5, linkage="single")
```

NovaML.Cluster.AgglomerativeClustering — Method

`(clustering::AgglomerativeClustering)(X::AbstractMatrix, type::Symbol)`

Fit the clustering model and return the cluster labels.
Arguments
- `X::AbstractMatrix`: The input data matrix.
- `type::Symbol`: Must be `:fit_predict` to fit the model and return labels.
Returns
labels::Vector{Int}: The cluster labels for each input sample.
Examples

```julia
X = rand(100, 5)
clustering = AgglomerativeClustering(n_clusters=3)
labels = clustering(X, :fit_predict)
```

NovaML.Cluster.AgglomerativeClustering — Method

`(clustering::AgglomerativeClustering)(X::AbstractMatrix; y=nothing)`

Perform agglomerative clustering on the input data.
Arguments
- `X::AbstractMatrix`: The input data matrix where each row is a sample and each column is a feature.
- `y=nothing`: Ignored. Present for API consistency.
Returns
clustering::AgglomerativeClustering: The fitted clustering object.
Examples

```julia
X = rand(100, 5)  # 100 samples, 5 features
clustering = AgglomerativeClustering(n_clusters=3)
fitted_clustering = clustering(X)
```

NovaML.Cluster.DBSCAN — Type

`(dbscan::DBSCAN)(X::AbstractMatrix, y=nothing; sample_weight=nothing)`

Perform DBSCAN clustering on the input data.
Arguments
- `X::AbstractMatrix`: The input data matrix where each row is a sample and each column is a feature.
- `y=nothing`: Ignored. Present for API consistency.
- `sample_weight=nothing`: Weight of each sample, used in computing the number of neighbors within eps.
Returns
dbscan::DBSCAN: The fitted DBSCAN object.
Examples

```julia
X = rand(100, 5)  # 100 samples, 5 features
dbscan = DBSCAN(eps=0.5, min_samples=5)
fitted_dbscan = dbscan(X)
```

NovaML.Cluster.DBSCAN — Type

`DBSCAN`

A struct representing the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm.
Fields
- `eps::Float64`: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
- `min_samples::Int`: The number of samples in a neighborhood for a point to be considered as a core point.
- `metric::Union{String, Metric}`: The metric to use when calculating distance between instances.
- `metric_params::Union{Nothing, Dict}`: Additional keyword arguments for the metric function.
- `algorithm::Symbol`: The algorithm to be used by the NearestNeighbors module.
- `leaf_size::Int`: Leaf size passed to BallTree or KDTree.
- `p::Union{Nothing, Float64}`: The power of the Minkowski metric to be used to calculate distance between points.
- `n_jobs::Union{Nothing, Int}`: The number of parallel jobs to run.
Fitted Attributes
- `core_sample_indices_::Vector{Int}`: Indices of core samples.
- `components_::Matrix{Float64}`: Copy of each core sample found by training.
- `labels_::Vector{Int}`: Cluster labels for each point in the dataset given to fit.
- `n_features_in_::Int`: Number of features seen during fit.
- `feature_names_in_::Vector{String}`: Names of features seen during fit.
- `fitted::Bool`: Whether the model has been fitted.
Constructor
```julia
DBSCAN(;
    eps::Float64 = 0.5,
    min_samples::Int = 5,
    metric::Union{String, Metric} = "euclidean",
    metric_params::Union{Nothing, Dict} = nothing,
    algorithm::Symbol = :auto,
    leaf_size::Int = 30,
    p::Union{Nothing, Float64} = nothing,
    n_jobs::Union{Nothing, Int} = nothing
)
```

Constructs a DBSCAN object with the specified parameters.
Examples
```julia
# Create a DBSCAN object with default parameters
dbscan = DBSCAN()

# Create a DBSCAN object with custom parameters
dbscan = DBSCAN(eps=0.7, min_samples=10, metric="manhattan")
```
NovaML.Cluster.KMeans — Type

`KMeans <: AbstractModel`

Represents the K-Means clustering algorithm.
Fields
- `n_clusters::Int`: The number of clusters to form.
- `init::Union{String, Matrix{Float64}, Function}`: Method for initialization.
- `n_init::Union{Int, String}`: Number of times the k-means algorithm will be run with different centroid seeds.
- `max_iter::Int`: Maximum number of iterations of the k-means algorithm for a single run.
- `tol::Float64`: Relative tolerance with regards to inertia to declare convergence.
- `verbose::Int`: Verbosity mode.
- `random_state::Union{Int, Nothing}`: Determines random number generation for centroid initialization.
- `copy_x::Bool`: When pre-computing distances, it is more numerically accurate to center the data first.
- `algorithm::String`: K-means algorithm to use.
Fitted Attributes
- `cluster_centers_::Union{Matrix{Float64}, Nothing}`: Coordinates of cluster centers.
- `labels_::Union{Vector{Int}, Nothing}`: Labels of each point.
- `inertia_::Union{Float64, Nothing}`: Sum of squared distances of samples to their closest cluster center.
- `n_iter_::Union{Int, Nothing}`: Number of iterations run.
NovaML.Cluster.KMeans — Type

`(kmeans::KMeans)(X::AbstractVecOrMat{Float64}, y=nothing; sample_weight=nothing)`

Compute k-means clustering.
Arguments
- `X::AbstractVecOrMat{Float64}`: Training instances to cluster.
- `y`: Ignored. Not used, present for API consistency by convention.
- `sample_weight`: The weights for each observation in X.
Returns
- If the model is not fitted, returns the fitted model.
- If the model is already fitted, returns the predicted labels for X.
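Putting the pieces above together, a typical fit-then-predict round trip might look like the sketch below. The `using` paths are assumptions; the calls follow the signatures documented in this reference.

```julia
using NovaML.Datasets: make_blobs
using NovaML.Cluster: KMeans

# Generate a toy dataset with 3 well-separated clusters
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# First call fits the model and returns the fitted estimator
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans(X)

# Fitted attributes are now populated
println(kmeans.inertia_)                 # sum of squared distances to closest centers
println(size(kmeans.cluster_centers_))

# Calling the already-fitted model returns predicted labels for X
labels = kmeans(X)
```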
Base.show — Method

`Base.show(io::IO, dbscan::DBSCAN)`

Custom show method for DBSCAN objects.
Arguments
- `io::IO`: The I/O stream to which the representation is written.
- `dbscan::DBSCAN`: The DBSCAN object to be displayed.
Examples

```julia
dbscan = DBSCAN(eps=0.7, min_samples=10)
println(dbscan)
```

Base.show — Method

`Base.show(io::IO, kmeans::KMeans)`

Custom show method for KMeans instances.
Arguments
- `io::IO`: The I/O stream.
- `kmeans::KMeans`: The KMeans instance to display.
NovaML.Cluster.assign_labels — Method

`assign_labels(X::AbstractMatrix{Float64}, centroids::Matrix{Float64})`

Assign labels to data points based on the nearest centroid.
Arguments
- `X::AbstractMatrix{Float64}`: The input data.
- `centroids::Matrix{Float64}`: The current centroids.
Returns
Vector{Int}: The assigned labels for each data point.
NovaML.Cluster.compute_distances — Method

`compute_distances(X::AbstractMatrix, metric::Union{String, Function})`

Compute the distance matrix for the input data using the specified metric.
Arguments
- `X::AbstractMatrix`: The input data matrix.
- `metric::Union{String, Function}`: The distance metric to use. Can be "euclidean", "manhattan", or a custom function.
Returns
distances::Matrix: The computed distance matrix.
Examples

```julia
X = rand(10, 3)
distances = compute_distances(X, "euclidean")
```

NovaML.Cluster.compute_inertia — Function

`compute_inertia(X::Matrix{Float64}, centroids::Matrix{Float64}, labels::Vector{Int}, sample_weight=nothing)`

Compute the inertia, the sum of squared distances of samples to their closest cluster center.
Arguments
- `X::Matrix{Float64}`: The input data.
- `centroids::Matrix{Float64}`: The current centroids.
- `labels::Vector{Int}`: The current label assignments.
- `sample_weight`: The weights for each observation in X.
Returns
Float64: The computed inertia.
NovaML.Cluster.fit_predict — Function

`fit_predict(kmeans::KMeans, X::Matrix{Float64}, y=nothing; sample_weight=nothing)`

Compute cluster centers and predict cluster index for each sample.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `X::Matrix{Float64}`: New data to transform.
- `y`: Ignored.
- `sample_weight`: The weights for each observation in X.
Returns
Vector{Int}: Index of the cluster each sample belongs to.
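For comparison with the functor interface, `fit_predict` combines fitting and label assignment in a single call. A minimal sketch, assuming the module path shown:

```julia
using NovaML.Cluster: KMeans, fit_predict

X = rand(100, 2)
kmeans = KMeans(n_clusters=4, random_state=0)

# Fit the model and obtain one cluster index per sample
labels = fit_predict(kmeans, X)
@assert length(labels) == 100
```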
NovaML.Cluster.fit_transform — Function

`fit_transform(kmeans::KMeans, X::Matrix{Float64}, y=nothing; sample_weight=nothing)`

Compute clustering and transform X to cluster-distance space.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `X::Matrix{Float64}`: New data to transform.
- `y`: Ignored.
- `sample_weight`: The weights for each observation in X.
Returns
Matrix{Float64}: X transformed in the new space.
NovaML.Cluster.get_params — Method

`get_params(dbscan::DBSCAN)`

Get parameters for this estimator.

Returns

- `params::Dict`: Parameter names mapped to their values.

NovaML.Cluster.get_params — Method

`get_params(kmeans::KMeans; deep=true)`

Get parameters for this estimator.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `deep::Bool`: If true, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
Dict: Parameter names mapped to their values.
NovaML.Cluster.initialize_centroids — Method

`initialize_centroids(kmeans::KMeans, X::Matrix{Float64})`

Initialize the centroids for K-Means clustering.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `X::Matrix{Float64}`: The input data.
Returns
Matrix{Float64}: The initial centroids.
NovaML.Cluster.kmeans_plus_plus — Method

`kmeans_plus_plus(X::Matrix{Float64}, n_clusters::Int)`

Perform K-Means++ initialization.
Arguments
- `X::Matrix{Float64}`: The input data.
- `n_clusters::Int`: The number of clusters.
Returns
Matrix{Float64}: The initial centroids chosen by K-Means++.
NovaML.Cluster.score — Function

`score(kmeans::KMeans, X::Matrix{Float64}, y=nothing; sample_weight=nothing)`

Opposite of the value of X on the K-means objective.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `X::Matrix{Float64}`: New data.
- `y`: Ignored.
- `sample_weight`: The weights for each observation in X.
Returns
Float64: Opposite of the value of X on the K-means objective.
NovaML.Cluster.set_params! — Method

`set_params!(dbscan::DBSCAN; kwargs...)`

Set the parameters of this estimator.
Arguments
- `kwargs...`: Estimator parameters.
Returns
dbscan::DBSCAN: The DBSCAN object.
Examples

```julia
dbscan = DBSCAN()
set_params!(dbscan, eps=0.8, min_samples=15)
```

NovaML.Cluster.set_params! — Method

`set_params!(kmeans::KMeans; params...)`

Set the parameters of this estimator.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `params...`: Estimator parameters.
Returns
KMeans: The estimator instance.
NovaML.Cluster.transform — Method

`transform(kmeans::KMeans, X::Matrix{Float64})`

Transform X to a cluster-distance space.
Arguments
- `kmeans::KMeans`: The KMeans instance.
- `X::Matrix{Float64}`: New data to transform.
Returns
Matrix{Float64}: X transformed in the new space.
NovaML.Cluster.update_centroids — Function

`update_centroids(X::Matrix{Float64}, labels::Vector{Int}, n_clusters::Int, sample_weight=nothing)`

Update the centroids based on the current label assignments.
Arguments
- `X::Matrix{Float64}`: The input data.
- `labels::Vector{Int}`: The current label assignments.
- `n_clusters::Int`: The number of clusters.
- `sample_weight`: The weights for each observation in X.
Returns
Matrix{Float64}: The updated centroids.
Decomposition
NovaML.Decomposition.LatentDirichletAllocation — Type

`LatentDirichletAllocation`

Latent Dirichlet Allocation (LDA) with online variational Bayes algorithm.
LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a topic model used to discover abstract topics in a collection of documents.
Fields
- `n_components::Int`: Number of topics.
- `doc_topic_prior::Union{Float64, Nothing}`: Prior of document topic distribution.
- `topic_word_prior::Union{Float64, Nothing}`: Prior of topic word distribution.
- `learning_method::Symbol`: Method used to update the model: `:batch` for batch learning, `:online` for online learning.
- `learning_decay::Float64`: Controls the rate at which the learning rate decreases.
- `learning_offset::Float64`: A (positive) parameter that downweights early iterations in online learning.
- `max_iter::Int`: The maximum number of iterations.
- `batch_size::Int`: Number of documents to use in each EM iteration in the online learning method.
- `evaluate_every::Int`: How often to evaluate perplexity.
- `total_samples::Float64`: Total number of documents.
- `perp_tol::Float64`: Perplexity tolerance in batch learning.
- `mean_change_tol::Float64`: Stopping tolerance for updating the document topic distribution in the E-step.
- `max_doc_update_iter::Int`: Max number of iterations for updating the document topic distribution in the E-step.
- `n_jobs::Union{Int, Nothing}`: The number of jobs to use in the E-step.
- `verbose::Int`: Verbosity level.
- `random_state::Union{Int, Nothing}`: Seed for random number generation.
Learned attributes
- `components_::Union{Matrix{Float64}, Nothing}`: Topic word distribution. Shape: (n_components, n_features).
- `exp_dirichlet_component_::Union{Matrix{Float64}, Nothing}`: Exponential value of the expectation of the log topic word distribution. Shape: (n_components, n_features).
- `n_batch_iter_::Int`: Number of iterations of the EM step.
- `n_iter_::Int`: Number of passes over the dataset.
- `bound_::Float64`: Final perplexity score on the training set.
- `n_features_in_::Int`: Number of features seen during fit.
- `feature_names_in_::Union{Vector{String}, Nothing}`: Names of features seen during fit.
Example
```julia
using NovaML

# Create an LDA model
lda = LatentDirichletAllocation(n_components=10, random_state=42)

# Fit the model to data
doc_topic_distr = lda(X)

# Transform new data
new_doc_topic_distr = lda(new_X)
```
NovaML.Decomposition.LatentDirichletAllocation — Method

`(lda::LatentDirichletAllocation)(X::AbstractMatrix{T}; type=nothing) where T <: Real`

Fit the model to X, or transform X if the model is already fitted.
Arguments
- `X::AbstractMatrix{T}`: Document-term matrix.
- `type`: Ignored. Present for API consistency.
Returns
- If the model is not fitted, returns the document-topic distribution after fitting.
- If the model is already fitted, returns the document-topic distribution for X.
NovaML.Decomposition.PCA — Type

`PCA`

Principal Component Analysis (PCA).
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
Fields
- `n_components::Union{Int, Float64, String, Nothing}`: Number of components to keep.
- `whiten::Bool`: When true, the `components_` vectors are multiplied by the square root of n_samples and then divided by the singular values, to ensure uncorrelated outputs with unit component-wise variances.
- `fitted::Bool`: Whether the PCA model has been fitted to data.
Fitted Attributes
- `components_::Union{Matrix{Float64}, Nothing}`: Principal axes in feature space, representing the directions of maximum variance in the data.
- `explained_variance_::Union{Vector{Float64}, Nothing}`: The amount of variance explained by each of the selected components.
- `explained_variance_ratio_::Union{Vector{Float64}, Nothing}`: Percentage of variance explained by each of the selected components.
- `singular_values_::Union{Vector{Float64}, Nothing}`: The singular values corresponding to each of the selected components.
- `mean_::Union{Vector{Float64}, Nothing}`: Per-feature empirical mean, estimated from the training set.
- `n_samples_::Union{Int, Nothing}`: Number of samples in the training data.
- `n_features_::Union{Int, Nothing}`: Number of features in the training data.
- `n_components_::Union{Int, Nothing}`: The estimated number of components.
- `noise_variance_::Union{Float64, Nothing}`: The estimated noise covariance following the Probabilistic PCA model.
Example

```julia
pca = PCA(n_components=2)
X_transformed = pca(X)
X_inverse = pca(X_transformed, :inverse_transform)
```

NovaML.Decomposition.PCA — Method

`(pca::PCA)(X::AbstractMatrix{T}) where T <: Real`

Fit the model with X and apply the dimensionality reduction on X.
Arguments
- `X::AbstractMatrix{T}`: Training data, where n_samples is the number of samples and n_features is the number of features.
Returns
Matrix{Float64}: Transformed values.
NovaML.Decomposition.PCA — Method

`(pca::PCA)(X::AbstractMatrix{T}, mode::Symbol) where T <: Real`

Transform data back to its original space.
Arguments
- `X::AbstractMatrix{T}`: New data, where n_samples is the number of samples and n_components is the number of components.
- `mode::Symbol`: Must be `:inverse_transform`.
Returns
Matrix{Float64}: X_original array.
Throws
- `ErrorException`: If `mode` is not `:inverse_transform`.
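Combining the two call forms documented above, a round trip through PCA space might look like this sketch (the module path is an assumption; the reconstruction is only approximate when `n_components` is smaller than the number of features):

```julia
using NovaML.Decomposition: PCA

X = randn(100, 5)

# Fit and project onto the first two principal components
pca = PCA(n_components=2)
X_reduced = pca(X)                                # size (100, 2)

# Map the reduced data back to the original 5-dimensional space
X_restored = pca(X_reduced, :inverse_transform)   # size (100, 5)

# Fraction of total variance captured by the kept components
println(sum(pca.explained_variance_ratio_))
```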
Base.show — Method

`Base.show(io::IO, lda::LatentDirichletAllocation)`

Custom show method for LatentDirichletAllocation.
Arguments
- `io::IO`: The I/O stream.
- `lda::LatentDirichletAllocation`: The LDA model to display.
Base.show — Method

`Base.show(io::IO, pca::PCA)`

Custom show method for PCA.
Arguments
- `io::IO`: The I/O stream.
- `pca::PCA`: The PCA model to display.
NovaML.Decomposition._e_step — Method

`_e_step(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real`

E-step in EM update.
Arguments
- `lda::LatentDirichletAllocation`: The LDA model.
- `X::AbstractMatrix{T}`: Document-term matrix.
Returns
Matrix{Float64}: Document-topic distribution.
NovaML.Decomposition._fit_batch — Method

`_fit_batch(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real`

Fit the model to X using the batch variational Bayes method.
Arguments
- `lda::LatentDirichletAllocation`: The LDA model.
- `X::AbstractMatrix{T}`: Document-term matrix.
Returns
Matrix{Float64}: Document-topic distribution.
NovaML.Decomposition._fit_online — Method

_fit_online(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Fit the model to X using the online variational Bayes method.

Arguments

- lda::LatentDirichletAllocation: The LDA model.
- X::AbstractMatrix{T}: Document-term matrix.

Returns

Matrix{Float64}: Document-topic distribution.
NovaML.Decomposition._fit_transform — Method

_fit_transform(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Fit the model to X and return the document-topic distribution.

Arguments

- lda::LatentDirichletAllocation: The LDA model.
- X::AbstractMatrix{T}: Document-term matrix.

Returns

Matrix{Float64}: Document-topic distribution.
NovaML.Decomposition._m_step — Method

_m_step(lda::LatentDirichletAllocation, X::AbstractMatrix{T}, doc_topic_distr::Matrix{Float64}, scale::Float64=1.0) where T <: Real

M-step in the EM update.

Arguments

- lda::LatentDirichletAllocation: The LDA model.
- X::AbstractMatrix{T}: Document-term matrix.
- doc_topic_distr::Matrix{Float64}: Document-topic distribution.
- scale::Float64: Scaling factor for the online update.
NovaML.Decomposition._perplexity — Method

_perplexity(lda::LatentDirichletAllocation, X::AbstractMatrix{T}, doc_topic_distr::Matrix{Float64}) where T <: Real

Calculate approximate perplexity for data X.

Arguments

- lda::LatentDirichletAllocation: The LDA model.
- X::AbstractMatrix{T}: Document-term matrix.
- doc_topic_distr::Matrix{Float64}: Document-topic distribution.

Returns

Float64: The calculated bound.
NovaML.Decomposition._transform — Method

_transform(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Transform X to a document-topic distribution.

Arguments

- lda::LatentDirichletAllocation: The LDA model.
- X::AbstractMatrix{T}: Document-term matrix.

Returns

Matrix{Float64}: Document-topic distribution.
Ensemble Methods
NovaML.Ensemble.AdaBoostClassifier — Type

AdaBoostClassifier <: AbstractModel

An AdaBoost classifier.
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
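The reweighting step described above can be sketched in a few lines of plain Julia. This is a toy illustration of the idea, not NovaML's implementation; it uses the standard AdaBoost weight-update formula.

```julia
# Toy illustration of AdaBoost's reweighting step.
y      = [1, 1, -1, -1, 1]               # true labels
y_pred = [1, -1, -1, 1, 1]               # predictions of the current weak learner
w      = fill(1 / length(y), length(y))  # uniform sample weights

miss = y .!= y_pred                      # incorrectly classified instances
err  = sum(w[miss])                      # weighted error rate
α    = 0.5 * log((1 - err) / err)        # weight of this weak learner

# Increase the weight of misclassified samples so the next
# learner focuses on the difficult cases, then renormalise.
w .*= exp.(α .* ifelse.(miss, 1.0, -1.0))
w ./= sum(w)
```

After the update, the misclassified samples carry exactly half of the total weight, which is what forces the next weak learner to attend to them.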
Fields
- base_estimator::Any: The base estimator from which the boosted ensemble is built.
- n_estimators::Int: The maximum number of estimators at which boosting is terminated.
- learning_rate::Float64: Weight applied to each classifier at each boosting iteration.
- algorithm::Symbol: The SAMME algorithm to use when fitting the model.
- random_state::Union{Int, Nothing}: Controls the random seed given to each base_estimator at each boosting iteration.
Fitted Attributes
- estimators_::Vector{Any}: The collection of fitted sub-estimators.
- estimator_weights_::Vector{Float64}: Weights for each estimator in the boosted ensemble.
- estimator_errors_::Vector{Float64}: Classification error for each estimator in the boosted ensemble.
- classes_::Vector{Any}: The class labels.
- n_classes_::Int: The number of classes.
- feature_importances_::Union{Vector{Float64}, Nothing}: The feature importances, if supported by the base_estimator.
- fitted::Bool: Whether the model has been fitted.
Example
```julia
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0)
model(X, y)  # Fit the model
predictions = model(X_test)  # Make predictions
probabilities = model(X_test, type=:probs)  # Get probability estimates
```

NovaML.Ensemble.AdaBoostClassifier — Method

(model::AdaBoostClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the AdaBoost model.
Arguments
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values (class labels).

Returns

AdaBoostClassifier: The fitted model.

NovaML.Ensemble.AdaBoostClassifier — Method

(model::AdaBoostClassifier)(X::AbstractMatrix; type=nothing)

Predict using the AdaBoost model.

Arguments

- X::AbstractMatrix: The input samples.
- type: If set to :probs, return probability estimates for each class.

Returns

- If type is :probs, returns probabilities of each class.
- Otherwise, returns predicted class labels.
NovaML.Ensemble.BaggingClassifier — Type

BaggingClassifier <: AbstractModel

A Bagging classifier.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
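The "random subsets" mechanism reduces to drawing one set of bootstrap indices per base estimator. A minimal sketch in plain Julia (illustrative only, not NovaML's internals):

```julia
using Random

n_samples, n_estimators = 100, 10
rng = MersenneTwister(42)

# One bootstrap sample (drawn with replacement) per base estimator.
indices = [rand(rng, 1:n_samples, n_samples) for _ in 1:n_estimators]

# Each estimator would train on X[indices[i], :], y[indices[i]]; with
# replacement, roughly 1 - 1/e ≈ 63% of distinct samples appear per draw.
```

The samples left out of a given draw are that estimator's out-of-bag set, which is what `oob_score` uses to estimate generalization error.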
Fields
- base_estimator::AbstractModel: The base estimator to fit on random subsets of the dataset.
- n_estimators::Int: The number of base estimators in the ensemble.
- max_samples::Union{Int, Float64}: The number of samples to draw from X to train each base estimator.
- max_features::Union{Int, Float64}: The number of features to draw from X to train each base estimator.
- bootstrap::Bool: Whether samples are drawn with replacement.
- bootstrap_features::Bool: Whether features are drawn with replacement.
- oob_score::Bool: Whether to use out-of-bag samples to estimate the generalization error.
- warm_start::Bool: When set to true, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new ensemble.
- random_state::Union{Int, Nothing}: Controls the random resampling of the original dataset.
- verbose::Int: Controls the verbosity when fitting and predicting.
Fitted Attributes
- estimators_::Vector{AbstractModel}: The collection of fitted base estimators.
- estimators_features_::Vector{Vector{Int}}: The subset of drawn features for each base estimator.
- classes_::Vector: The class labels.
- n_classes_::Int: The number of classes.
- oob_score_::Union{Float64, Nothing}: Score of the training dataset obtained using an out-of-bag estimate.
- oob_decision_function_::Union{Matrix{Float64}, Nothing}: Decision function computed with out-of-bag estimates on the training set.
- fitted::Bool: Whether the model has been fitted.
Example
```julia
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10)
model(X, y)  # Fit the model
predictions = model(X_test)  # Make predictions
probabilities = model(X_test, type=:probs)  # Get probability estimates
```

NovaML.Ensemble.BaggingClassifier — Method

(bc::BaggingClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the Bagging classifier.
Arguments
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values (class labels).

Returns

BaggingClassifier: The fitted model.

NovaML.Ensemble.BaggingClassifier — Method

(bc::BaggingClassifier)(X::AbstractMatrix; type=nothing)

Predict class for X.

Arguments

- X::AbstractMatrix: The input samples.
- type: If set to :probs, return probability estimates for each class.

Returns

- If type is :probs, returns probabilities of each class.
- Otherwise, returns predicted class labels.
NovaML.Ensemble.GradientBoostingClassifier — Type

GradientBoostingClassifier <: AbstractModel

Gradient Boosting for classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
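The stage-wise additive update can be sketched directly. The toy below uses squared loss, for which the negative gradient is just the residual, and replaces the regression tree with a trivial constant learner (the residual mean); it is purely illustrative, not NovaML's implementation.

```julia
using Statistics

# Toy gradient boosting for squared loss: each "tree" is just the mean
# of the current residuals (a depth-0 stump), so F converges to mean(y).
y  = [3.0, 5.0, 7.0, 9.0]
lr = 0.5
F  = zeros(length(y))        # initial prediction F₀ = 0

for m in 1:20
    residual = y .- F        # negative gradient of squared loss
    h = mean(residual)       # fit a (trivial) learner to it
    F .+= lr .* h            # stage-wise additive update F_m = F_{m-1} + lr·h_m
end
```

With a real regression tree in place of `mean`, each stage instead fits a piecewise-constant approximation to the residuals, which is the scheme the description above refers to.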
Fields
- loss::String: The loss function to be optimized.
- learning_rate::Float64: Learning rate shrinks the contribution of each tree by learning_rate.
- n_estimators::Int: The number of boosting stages to perform.
- subsample::Float64: The fraction of samples to be used for fitting the individual base learners.
- criterion::String: The function to measure the quality of a split.
- min_samples_split::Union{Int, Float64}: The minimum number of samples required to split an internal node.
- min_samples_leaf::Union{Int, Float64}: The minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf::Float64: The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_depth::Union{Int, Nothing}: Maximum depth of the individual regression estimators.
- min_impurity_decrease::Float64: A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
- init::Union{AbstractModel, String, Nothing}: An estimator object that is used to compute the initial predictions.
- random_state::Union{Int, Nothing}: Controls the random seed given to each tree estimator at each boosting iteration.
- max_features::Union{Int, Float64, String, Nothing}: The number of features to consider when looking for the best split.
- verbose::Int: Enable verbose output.
- max_leaf_nodes::Union{Int, Nothing}: Grow trees with max_leaf_nodes in best-first fashion.
- warm_start::Bool: When set to true, reuse the solution of the previous call to fit and add more estimators to the ensemble.
- validation_fraction::Float64: The proportion of training data to set aside as a validation set for early stopping.
- n_iter_no_change::Union{Int, Nothing}: Used to decide if early stopping will be used to terminate training when the validation score is not improving.
- tol::Float64: Tolerance for early stopping.
- ccp_alpha::Float64: Complexity parameter used for Minimal Cost-Complexity Pruning.
Fitted Attributes
- estimators_::Vector{Vector{DecisionTreeRegressor}}: The collection of fitted sub-estimators.
- classes_::Vector: The class labels.
- n_classes_::Int: The number of classes.
- feature_importances_::Union{Vector{Float64}, Nothing}: The feature importances.
- oob_improvement_::Union{Vector{Float64}, Nothing}: The improvement in loss on the out-of-bag samples relative to the previous iteration.
- train_score_::Vector{Float64}: The i-th score train_score_[i] is the loss of the model at iteration i on the in-bag sample.
- n_estimators_::Int: The number of estimators as selected by early stopping.
- init_::Union{AbstractModel, Nothing}: The estimator that provides the initial predictions.
- fitted::Bool: Whether the model has been fitted.
Example
```julia
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
model(X, y)  # Fit the model
predictions = model(X_test)  # Make predictions
probabilities = model(X_test, type=:probs)  # Get probability estimates
```

NovaML.Ensemble.GradientBoostingClassifier — Method

(gbm::GradientBoostingClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the gradient boosting model.
Arguments
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values (class labels).

Returns

GradientBoostingClassifier: The fitted model.

NovaML.Ensemble.GradientBoostingClassifier — Method

(gbm::GradientBoostingClassifier)(X::AbstractMatrix; type=nothing)

Predict class for X.

Arguments

- X::AbstractMatrix: The input samples.
- type: If set to :probs, return probability estimates for each class.

Returns

- If type is :probs, returns probabilities of each class.
- Otherwise, returns predicted class labels.
NovaML.Ensemble.InitialEstimator — Type

InitialEstimator <: AbstractModel

An initial estimator that always predicts a constant probability.

Fields

prob::Float64: The constant probability to predict.

NovaML.Ensemble.InitialEstimator — Method

(estimator::InitialEstimator)(X::AbstractMatrix)

Predict using the initial estimator.
Arguments
X::AbstractMatrix: The input samples.
Returns
Vector{Float64}: The predictions.
NovaML.Ensemble.RandomForestClassifier — Type

RandomForestClassifier <: AbstractModel

A random forest classifier.
Random forests are an ensemble learning method for classification that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees.
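The "mode of the classes" aggregation can be sketched in a few lines of plain Julia (illustrative only, not NovaML's internals):

```julia
# Toy illustration of the forest's majority vote: each column holds one
# tree's predictions; the ensemble predicts the most frequent label per row.
tree_preds = [1 1 2;
              2 2 2;
              1 2 1]               # 3 samples × 3 trees

function majority(row)
    cands  = unique(row)
    counts = [count(==(c), row) for c in cands]
    return cands[argmax(counts)]   # most frequent label in the row
end

y_pred = [majority(tree_preds[i, :]) for i in 1:size(tree_preds, 1)]
```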
Fields
- n_estimators::Int: The number of trees in the forest.
- max_depth::Union{Int, Nothing}: The maximum depth of the tree.
- min_samples_split::Int: The minimum number of samples required to split an internal node.
- min_samples_leaf::Int: The minimum number of samples required to be at a leaf node.
- max_features::Union{Int, Float64, String, Nothing}: The number of features to consider when looking for the best split.
- bootstrap::Bool: Whether bootstrap samples are used when building trees.
- random_state::Union{Int, Nothing}: Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
- trees::Vector{DecisionTreeClassifier}: The collection of fitted sub-estimators.
- n_classes::Int: The number of classes.
- classes::Vector: The class labels.
- fitted::Bool: Whether the model has been fitted.
- feature_importances_::Union{Vector{Float64}, Nothing}: The feature importances.
- n_features::Int: The number of features when fitting the model.
Example
```julia
rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf(X, y)  # Fit the model
predictions = rf(X_test)  # Make predictions
```
NovaML.Ensemble.RandomForestClassifier — Method

(forest::RandomForestClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the random forest classifier.

Arguments

- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values (class labels).
Returns
RandomForestClassifier: The fitted model.
NovaML.Ensemble.RandomForestClassifier — Method

(forest::RandomForestClassifier)(X::AbstractMatrix)

Predict class for X.
Arguments
X::AbstractMatrix: The input samples.
Returns
Vector: The predicted class labels.
NovaML.Ensemble.RandomForestRegressor — Type

RandomForestRegressor <: AbstractModel

A random forest regressor.
Random forests are an ensemble learning method for regression that operate by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees.
Fields
- n_estimators::Int: The number of trees in the forest.
- criterion::String: The function to measure the quality of a split.
- max_depth::Union{Int, Nothing}: The maximum depth of the tree.
- min_samples_split::Int: The minimum number of samples required to split an internal node.
- min_samples_leaf::Int: The minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf::Float64: The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_features::Union{Int, Float64, String, Nothing}: The number of features to consider when looking for the best split.
- max_leaf_nodes::Union{Int, Nothing}: Grow trees with max_leaf_nodes in best-first fashion.
- min_impurity_decrease::Float64: A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
- bootstrap::Bool: Whether bootstrap samples are used when building trees.
- oob_score::Bool: Whether to use out-of-bag samples to estimate the generalization score.
- n_jobs::Union{Int, Nothing}: The number of jobs to run in parallel.
- random_state::Union{Int, Nothing}: Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
- verbose::Int: Controls the verbosity when fitting and predicting.
- warm_start::Bool: When set to true, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, just fit a whole new forest.
- ccp_alpha::Float64: Complexity parameter used for Minimal Cost-Complexity Pruning.
- max_samples::Union{Int, Float64, Nothing}: If bootstrap is true, the number of samples to draw from X to train each base estimator.
Example
```julia
rf = RandomForestRegressor(n_estimators=100, max_depth=10)
rf(X, y)  # Fit the model
predictions = rf(X_test)  # Make predictions
```

NovaML.Ensemble.RandomForestRegressor — Method

(forest::RandomForestRegressor)(X::AbstractMatrix, y::AbstractVector)

Fit the random forest regressor.
Arguments
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values.
Returns
RandomForestRegressor: The fitted model.
NovaML.Ensemble.RandomForestRegressor — Method

(forest::RandomForestRegressor)(X::AbstractMatrix)

Predict regression target for X.
Arguments
X::AbstractMatrix: The input samples.
Returns
Vector{Float64}: The predicted values.
NovaML.Ensemble.VotingClassifier — Type

VotingClassifier <: AbstractModel

A Voting Classifier for combining multiple machine learning classifiers.
This classifier combines a number of estimators to create a single classifier that makes predictions based on either hard voting (majority vote) or soft voting (weighted average of predicted probabilities).
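The soft-voting rule described above can be sketched directly: average each estimator's class-probability matrix (optionally weighted) and take the argmax class per sample. A toy illustration in plain Julia, not NovaML's internals:

```julia
# Per-estimator class probabilities: 2 samples × 2 classes.
probs_lr = [0.9 0.1;
            0.4 0.6]
probs_rf = [0.6 0.4;
            0.2 0.8]
weights  = [0.5, 0.5]            # one weight per estimator

# Weighted average of the probability matrices, then argmax per row.
avg = weights[1] .* probs_lr .+ weights[2] .* probs_rf
y_pred = [argmax(avg[i, :]) for i in 1:size(avg, 1)]
```

Hard voting skips the probabilities entirely and takes a (weighted) majority vote over the predicted labels instead.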
Fields
- estimators::Vector{Tuple{String, Any}}: List of (name, estimator) tuples.
- voting::Symbol: The voting strategy, either :hard for majority voting or :soft for probability voting.
- weights::Union{Vector{Float64}, Nothing}: Sequence of weights for each estimator in soft voting.
- flatten_transform::Bool: Affects the shape of the transform output.
- verbose::Bool: If true, prints progress messages during fitting.
Fitted Attributes
- estimators_::Vector{Any}: The fitted estimators.
- classes_::Vector{Any}: The class labels.
- fitted::Bool: Whether the classifier is fitted.
Example
```julia
estimators = [("lr", LogisticRegression()), ("rf", RandomForestClassifier())]
vc = VotingClassifier(estimators=estimators, voting=:soft)
vc(X, y)  # Fit the classifier
predictions = vc(X_test)  # Make predictions
```

NovaML.Ensemble.VotingClassifier — Method

(vc::VotingClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the voting classifier.
Arguments
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values (class labels).

Returns

VotingClassifier: The fitted classifier.

NovaML.Ensemble.VotingClassifier — Method

(vc::VotingClassifier)(X::AbstractMatrix; type=nothing)

Predict class labels for X.

Arguments

- X::AbstractMatrix: The input samples.
- type: If set to :probs, return probability estimates for each class.

Returns

- If type is :probs, returns probabilities of each class.
- Otherwise, returns predicted class labels.
NovaML.Ensemble.ZeroEstimator — Type

ZeroEstimator <: AbstractModel

An estimator that always predicts zero.

NovaML.Ensemble.ZeroEstimator — Method

(::ZeroEstimator)(X::AbstractMatrix)

Predict using the zero estimator.
Arguments
X::AbstractMatrix: The input samples.
Returns
Vector{Float64}: Zero predictions.
Base.show — Method

Base.show(io::IO, model::AdaBoostClassifier)

Custom show method for AdaBoostClassifier.

Arguments

- io::IO: The I/O stream.
- model::AdaBoostClassifier: The AdaBoost model to display.

Base.show — Method

Base.show(io::IO, bc::BaggingClassifier)

Custom show method for BaggingClassifier.

Arguments

- io::IO: The I/O stream.
- bc::BaggingClassifier: The Bagging classifier to display.

Base.show — Method

Base.show(io::IO, gbm::GradientBoostingClassifier)

Custom show method for GradientBoostingClassifier.

Arguments

- io::IO: The I/O stream.
- gbm::GradientBoostingClassifier: The gradient boosting model to display.

Base.show — Method

Base.show(io::IO, forest::RandomForestClassifier)

Custom show method for RandomForestClassifier.

Arguments

- io::IO: The I/O stream.
- forest::RandomForestClassifier: The random forest classifier to display.

Base.show — Method

Base.show(io::IO, forest::RandomForestRegressor)

Custom show method for RandomForestRegressor.

Arguments

- io::IO: The I/O stream.
- forest::RandomForestRegressor: The random forest regressor to display.

Base.show — Method

Base.show(io::IO, vc::VotingClassifier)

Custom show method for VotingClassifier.

Arguments

- io::IO: The I/O stream.
- vc::VotingClassifier: The voting classifier to display.
NovaML.Ensemble._compute_feature_importances — Method

_compute_feature_importances(model::AdaBoostClassifier)

Compute feature importances for the AdaBoost model.
Arguments
model::AdaBoostClassifier: The fitted AdaBoost model.
Returns
Union{Vector{Float64}, Nothing}: The feature importances if available, otherwise nothing.
NovaML.Ensemble._compute_oob_score — Method

_compute_oob_score(bc::BaggingClassifier, X::AbstractMatrix, y::AbstractVector)

Compute the out-of-bag score for the Bagging classifier.

Arguments

- bc::BaggingClassifier: The Bagging classifier.
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values.

NovaML.Ensemble._generate_indices — Method

_generate_indices(bc::BaggingClassifier, n_samples::Int)

Generate sample indices for individual base estimators.

Arguments

- bc::BaggingClassifier: The Bagging classifier.
- n_samples::Int: The number of samples in the dataset.

Returns

Vector{Int}: The generated sample indices.
NovaML.Ensemble.bootstrap_sample — Method

bootstrap_sample(forest::RandomForestClassifier, X::AbstractMatrix, y::AbstractVector)

Create a bootstrap sample of the dataset.

Arguments

- forest::RandomForestClassifier: The random forest classifier.
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values.
Returns
Tuple{AbstractMatrix, AbstractVector}: The bootstrapped samples and targets.
NovaML.Ensemble.calculate_tree_feature_importance — Method

calculate_tree_feature_importance(tree::DecisionTreeClassifier, feature_indices::Vector{Int}, n_features::Int)

Calculate the feature importance for a single decision tree.

Arguments

- tree::DecisionTreeClassifier: The decision tree.
- feature_indices::Vector{Int}: The indices of the features used in this tree.
- n_features::Int: The total number of features.
Returns
Vector{Float64}: The feature importances.
NovaML.Ensemble.compute_feature_importances — Method

compute_feature_importances(gbm::GradientBoostingClassifier)

Compute feature importances for the gradient boosting model.
Arguments
gbm::GradientBoostingClassifier: The fitted gradient boosting model.
Returns
Vector{Float64}: The feature importances.
NovaML.Ensemble.compute_loss — Method

compute_loss(y::AbstractVector, y_pred::AbstractVector, loss::String)

Compute the loss for the given predictions.

Arguments

- y::AbstractVector: The true values.
- y_pred::AbstractVector: The predicted values.
- loss::String: The loss function name.
Returns
Float64: The computed loss.
NovaML.Ensemble.compute_negative_gradient — Method

compute_negative_gradient(y::AbstractVector, y_pred::AbstractVector, loss::String)

Compute the negative gradient for the given loss function.

Arguments

- y::AbstractVector: The true values.
- y_pred::AbstractVector: The predicted values.
- loss::String: The loss function name.
Returns
AbstractVector: The negative gradient.
NovaML.Ensemble.compute_oob_score — Method

compute_oob_score(forest::RandomForestRegressor, X::AbstractMatrix, y::AbstractVector)

Compute the out-of-bag (OOB) score for the random forest regressor.

Arguments

- forest::RandomForestRegressor: The random forest regressor.
- X::AbstractMatrix: The input samples.
- y::AbstractVector: The target values.
Returns
Tuple{Float64, Vector{Float64}}: The OOB score and OOB predictions.
NovaML.Ensemble.decision_function — Method

decision_function(model::AdaBoostClassifier, X::AbstractMatrix)

Compute the decision function of X.

Arguments

- model::AdaBoostClassifier: The fitted AdaBoost model.
- X::AbstractMatrix: The input samples.
Returns
Matrix{Float64}: The decision function of the input samples.
NovaML.Ensemble.fit_initial_estimator — Method

fit_initial_estimator(y::AbstractVector)

Fit an initial estimator based on the mean of y.
Arguments
y::AbstractVector: The target values.
Returns
InitialEstimator: The fitted initial estimator.
NovaML.Ensemble.get_max_features — Method

get_max_features(forest::RandomForestClassifier, n_features::Int)

Get the number of features to consider when looking for the best split.

Arguments

- forest::RandomForestClassifier: The random forest classifier.
- n_features::Int: The total number of features.
Returns
Int: The number of features to consider.
NovaML.Ensemble.get_max_features — Method

get_max_features(forest::RandomForestRegressor, n_features::Int)

Get the number of features to consider when looking for the best split.

Arguments

- forest::RandomForestRegressor: The random forest regressor.
- n_features::Int: The total number of features.
Returns
Int: The number of features to consider.
NovaML.Ensemble.get_params — Method

get_params(model::AdaBoostClassifier; deep=true)

Get parameters for this estimator.

Arguments

- model::AdaBoostClassifier: The AdaBoost model.
- deep::Bool: If true, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
Dict: Parameter names mapped to their values.
NovaML.Ensemble.set_params! — Method

set_params!(model::AdaBoostClassifier; kwargs...)

Set the parameters of this estimator.

Arguments

- model::AdaBoostClassifier: The AdaBoost model.
- kwargs...: Estimator parameters.
Returns
AdaBoostClassifier: The estimator instance.
NovaML.Ensemble.staged_predict — Method

staged_predict(model::AdaBoostClassifier, X::AbstractMatrix)

Return a generator of predictions for each boosting iteration.

Arguments

- model::AdaBoostClassifier: The fitted AdaBoost model.
- X::AbstractMatrix: The input samples.
Returns
Channel: A generator of predictions at each stage.
NovaML.Ensemble.staged_predict_proba — Method

staged_predict_proba(model::AdaBoostClassifier, X::AbstractMatrix)

Return a generator of predicted probabilities for each boosting iteration.

Arguments

- model::AdaBoostClassifier: The fitted AdaBoost model.
- X::AbstractMatrix: The input samples.
Returns
Channel: A generator of predicted probabilities at each stage.
NovaML.Ensemble.transform — Method

transform(vc::VotingClassifier, X::AbstractMatrix)

Return class labels or probabilities for X for each estimator.

Arguments

- vc::VotingClassifier: The fitted voting classifier.
- X::AbstractMatrix: The input samples.

Returns

- If voting is :soft, returns the probabilities for each class for each estimator.
- If voting is :hard, returns the class label predictions of each estimator.

The shape of the return depends on the flatten_transform parameter.