API Reference

Datasets

NovaML.Datasets.load_bostonMethod
load_boston(; return_X_y=false)

Load and return the Boston house prices dataset (regression).

This function creates a synthetic version of the Boston Housing dataset for demonstration purposes, as the original dataset might not be available.

Arguments

  • return_X_y::Bool: If true, returns (X, y) instead of a Dict.

Returns

  • If return_X_y is false, returns a Dict with the following keys:

    • "data": Matrix{Float64} of shape (506, 13) The data matrix.
    • "target": Vector{Float64} of shape (506,) The regression target.
    • "feature_names": Vector{String} The names of the dataset columns.
    • "DESCR": String The full description of the dataset.
  • If return_X_y is true, returns a tuple (data, target):

    • data: Matrix{Float64} of shape (506, 13)
    • target: Vector{Float64} of shape (506,)

Description

The Boston Housing dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms.

Note: This function generates synthetic data based on the structure of the original Boston Housing dataset. The actual values and relationships in the data are simulated and do not represent real housing data.

Features

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population

Target

- MEDV: Median value of owner-occupied homes in $1000's

Example

```julia
# Load the Boston Housing dataset
boston = load_boston()

# Access the data and target
X = boston["data"]
y = boston["target"]

# Get feature names
feature_names = boston["feature_names"]

# Alternatively, get data and target directly
X, y = load_boston(return_X_y=true)
```

source
NovaML.Datasets.load_breast_cancerMethod
load_breast_cancer(; return_X_y=false)

Load and return the Wisconsin Breast Cancer dataset (classification).

Arguments

  • return_X_y::Bool: If true, returns (X, y) instead of a dict-like object.

Returns

  • If return_X_y is false, returns a Dict with the following keys:

    • "data": Matrix{Float64} of shape (569, 30) The data matrix.
    • "target": Vector{Bool} of shape (569,) The classification target.
    • "feature_names": Vector{String} The names of the dataset columns.
    • "target_names": Vector{String} The names of target classes.
    • "DESCR": String The full description of the dataset.
  • If return_X_y is true, returns a tuple (data, target):

    • data: Matrix{Float64} of shape (569, 30)
    • target: Vector{Bool} of shape (569,)

Description

The Wisconsin Breast Cancer dataset is a classic and very easy binary classification dataset.

Features

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Ten real-valued features are computed for each cell nucleus:

1) radius (mean of distances from center to points on the perimeter)
2) texture (standard deviation of gray-scale values)
3) perimeter
4) area
5) smoothness (local variation in radius lengths)
6) compactness (perimeter^2 / area - 1.0)
7) concavity (severity of concave portions of the contour)
8) concave points (number of concave portions of the contour)
9) symmetry
10) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

Target

- 0: benign
- 1: malignant

Dataset Characteristics

:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information: 10 real-valued features are computed for each cell nucleus:
    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry
    j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

:Class Distribution: 212 Malignant, 357 Benign

Example

```julia
# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()

# Access the data and target
X = breast_cancer["data"]
y = breast_cancer["target"]

# Get feature names and target names
feature_names = breast_cancer["feature_names"]
target_names = breast_cancer["target_names"]

# Alternatively, get data and target directly
X, y = load_breast_cancer(return_X_y=true)
```

Notes

This function downloads the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository if it's not already present in the local directory.

The dataset was created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin-Madison.

source
NovaML.Datasets.load_irisMethod
load_iris(; return_X_y=false)

Load and return the iris dataset (classification).

Arguments

  • return_X_y::Bool: If true, returns (X, y) instead of a Dict.

Returns

  • If return_X_y is false, returns a Dict with the following keys:

    • "data": Matrix{Float64} of shape (150, 4) The data matrix.
    • "target": Vector{Int} of shape (150,) The classification target.
    • "feature_names": Vector{String} The names of the dataset columns.
    • "target_names": Vector{String} The names of target classes.
    • "DESCR": String The full description of the dataset.
  • If return_X_y is true, returns a tuple (data, target):

    • data: Matrix{Float64} of shape (150, 4)
    • target: Vector{Int} of shape (150,)

Description

The iris dataset is a classic and very easy multi-class classification dataset.

Features

1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)

Target

- Iris-setosa (1)
- Iris-versicolor (2)
- Iris-virginica (3)

Dataset Characteristics

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
:Class:
    - Iris-Setosa
    - Iris-Versicolour
    - Iris-Virginica

Example

```julia
# Load the Iris dataset
iris = load_iris()

# Access the data and target
X = iris["data"]
y = iris["target"]

# Get feature names and target names
feature_names = iris["feature_names"]
target_names = iris["target_names"]

# Alternatively, get data and target directly
X, y = load_iris(return_X_y=true)
```

Notes

This function downloads the Iris dataset from the UCI Machine Learning Repository if it's not already present in the local directory.

source
NovaML.Datasets.load_wineMethod
load_wine(; return_X_y=false)

Load and return the wine dataset (classification).

Arguments

  • return_X_y::Bool: If true, returns (X, y) instead of a Dict.

Returns

  • If return_X_y is false, returns a Dict with the following keys:

    • "data": Matrix{Float64} of shape (178, 13) The data matrix.
    • "target": Vector{Int} of shape (178,) The classification target.
    • "feature_names": Vector{String} The names of the dataset columns.
    • "target_names": Vector{String} The names of target classes.
    • "DESCR": String The full description of the dataset.
  • If return_X_y is true, returns a tuple (data, target):

    • data: Matrix{Float64} of shape (178, 13)
    • target: Vector{Int} of shape (178,)

Description

This dataset is a classic and very easy multi-class classification dataset.

Features

1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

Target

- class 1 (0)
- class 2 (1)
- class 3 (2)

Dataset Characteristics

:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
    - Alcohol
    - Malic acid
    - Ash
    - Alcalinity of ash
    - Magnesium
    - Total phenols
    - Flavanoids
    - Nonflavanoid phenols
    - Proanthocyanins
    - Color intensity
    - Hue
    - OD280/OD315 of diluted wines
    - Proline

:Class:
    - class 1
    - class 2
    - class 3

Example

```julia
# Load the Wine dataset
wine = load_wine()

# Access the data and target
X = wine["data"]
y = wine["target"]

# Get feature names and target names
feature_names = wine["feature_names"]
target_names = wine["target_names"]

# Alternatively, get data and target directly
X, y = load_wine(return_X_y=true)
```

Notes

This function downloads the Wine dataset from the UCI Machine Learning Repository if it's not already present in the local directory.

The data set contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample.

The classes are ordered and not balanced (class 1 has 59 samples, class 2 has 71 samples, and class 3 has 48 samples).

This dataset is also excellent for visualization techniques.

source
NovaML.Datasets.make_blobsMethod
make_blobs(;
    n_samples::Union{Int, Vector{Int}} = 100,
    n_features::Int = 2,
    centers::Union{Int, Matrix{Float64}, Nothing} = nothing,
    cluster_std::Union{Float64, Vector{Float64}} = 1.0,
    center_box::Tuple{Float64, Float64} = (-10.0, 10.0),
    shuffle::Bool = true,
    random_state::Union{Int, Nothing} = nothing,
    return_centers::Bool = false
)

Generate isotropic Gaussian blobs for clustering.

Arguments

  • n_samples::Union{Int, Vector{Int}}: The total number of points equally divided among clusters, or the number of samples per cluster.
  • n_features::Int: The number of features for each sample.
  • centers::Union{Int, Matrix{Float64}, Nothing}: The number of centers to generate, or a matrix of center locations.
  • cluster_std::Union{Float64, Vector{Float64}}: The standard deviation of the clusters.
  • center_box::Tuple{Float64, Float64}: The bounding box for each cluster center when centers are generated at random.
  • shuffle::Bool: Shuffle the samples.
  • random_state::Union{Int, Nothing}: Determines random number generation for dataset creation.
  • return_centers::Bool: If true, returns the centers in addition to X and y.

Returns

  • If return_centers is false:
    • X::Matrix{Float64}: Generated samples.
    • y::Vector{Int}: The integer labels for cluster membership of each sample.
  • If return_centers is true:
    • X::Matrix{Float64}: Generated samples.
    • y::Vector{Int}: The integer labels for cluster membership of each sample.
    • centers::Matrix{Float64}: The centers used to generate the data.

Description

This function generates samples from isotropic Gaussian blobs for clustering. It can be used for testing clustering algorithms or as a simple dataset for demonstration purposes.

Example

```julia
# Generate a simple dataset with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Generate a dataset with specified centers and return the centers
centers = [0.0 0.0; 1.0 1.0; 2.0 2.0]
X, y, centers = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, return_centers=true)
```

Notes

  • If centers is an int, it is interpreted as the number of centers to generate, and they are generated randomly within center_box.

  • If centers is a 2-d array, it is interpreted as the actual centers to use, and n_features is ignored in this case.
  • If n_samples is an int, it is interpreted as the total number of samples, which are then evenly divided among clusters.
  • If n_samples is an array, it is interpreted as the number of samples per cluster.
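The sampling scheme described above can be sketched in plain Julia. This is an illustrative re-implementation under simplifying assumptions (fixed centers, equal cluster sizes), not NovaML's code; `toy_blobs` is a hypothetical helper name.

```julia
using Random

# Draw n points per cluster from isotropic Gaussians around the given centers.
function toy_blobs(centers::Matrix{Float64}, n_per_cluster::Int, stddev::Float64;
                   rng=Random.default_rng())
    k, d = size(centers)
    X = zeros(k * n_per_cluster, d)
    y = zeros(Int, k * n_per_cluster)
    for c in 1:k, i in 1:n_per_cluster
        row = (c - 1) * n_per_cluster + i
        X[row, :] = centers[c, :] .+ stddev .* randn(rng, d)  # isotropic noise
        y[row] = c
    end
    return X, y
end

centers = [0.0 0.0; 5.0 5.0]
X, y = toy_blobs(centers, 50, 0.5)  # 100 samples in 2 clusters
```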
source
NovaML.Datasets.make_moonsMethod
make_moons(;
    n_samples::Union{Int, Tuple{Int, Int}}=100,
    shuffle::Bool=true,
    noise::Union{Float64, Nothing}=nothing,
    random_state::Union{Int, Nothing}=nothing
)

Generate two interleaving half circles for binary classification.

Arguments

  • n_samples::Union{Int, Tuple{Int, Int}}: The total number of points generated or a tuple containing the number of points in each of the two moons.
  • shuffle::Bool: Whether to shuffle the samples.
  • noise::Union{Float64, Nothing}: Standard deviation of Gaussian noise added to the data.
  • random_state::Union{Int, Nothing}: Determines random number generation for dataset creation.

Returns

  • X::Matrix{Float64}: The generated samples, of shape (n_samples, 2).
  • y::Vector{Int}: The integer labels (0 or 1) for class membership of each sample.

Description

This function generates a binary classification dataset in the shape of two interleaving half moons. It can be used for testing classification algorithms or as a simple dataset for demonstration purposes.

Example

```julia
# Generate a simple moon dataset
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

# Generate a moon dataset with different numbers of samples in each moon
X, y = make_moons(n_samples=(60, 40), noise=0.1, shuffle=false)
```

Notes

  • If n_samples is an integer, it generates approximately equal numbers of samples in each moon.
  • If the number is odd, the extra sample is added to the first moon.
  • If n_samples is a tuple of two integers, it specifies the number of samples for each moon respectively.
  • The two moons are generated on a 2D plane. The first moon is a half circle of radius 1 centered at (0, 0), while the second moon is a half circle of radius 1 centered at (1, 0.5).

  • If noise is specified, Gaussian noise with standard deviation noise is added to the data.
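The geometry in the notes above can be sketched directly. This is an illustrative re-implementation, not NovaML's code; `toy_moons` is a hypothetical helper name.

```julia
# Two half circles of radius 1: one centered at (0, 0), the other,
# flipped, centered at (1, 0.5), with optional Gaussian noise.
function toy_moons(n_per_moon::Int; noise::Float64=0.0)
    t = range(0, π; length=n_per_moon)
    X1 = hcat(cos.(t), sin.(t))              # upper half circle at (0, 0)
    X2 = hcat(1 .- cos.(t), 0.5 .- sin.(t))  # lower half circle at (1, 0.5)
    X = vcat(X1, X2) .+ noise .* randn(2n_per_moon, 2)
    y = vcat(zeros(Int, n_per_moon), ones(Int, n_per_moon))
    return X, y
end

X, y = toy_moons(50)  # 100 samples, labels 0 and 1
```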
source

Clustering

NovaML.Cluster.AgglomerativeClusteringType
AgglomerativeClustering

A struct representing Agglomerative Clustering, a hierarchical clustering algorithm.

Fields

  • n_clusters::Union{Int, Nothing}: The number of clusters to find. If nothing, it must be used with distance_threshold.
  • metric::Union{String, Function}: The metric to use for distance computation. Can be "euclidean", "manhattan", or a custom function.
  • memory::Union{String, Nothing}: Used to cache the distance matrix between iterations.
  • connectivity::Union{AbstractMatrix, Function, Nothing}: Connectivity matrix or callable to be used.
  • compute_full_tree::Union{Bool, String}: Whether to compute the full tree or stop early.
  • linkage::String: The linkage criterion to use. Can be "ward", "complete", "average", or "single".
  • distance_threshold::Union{Float64, Nothing}: The threshold to stop clustering.
  • compute_distances::Bool: Whether to compute distances.

Fitted Attributes

  • labels_::Vector{Int}: Cluster labels for each point.
  • n_leaves_::Int: Number of leaves in the hierarchical tree.
  • n_connected_components_::Int: Number of connected components in the graph.
  • children_::Matrix{Int}: The children of each non-leaf node.
  • distances_::Vector{Float64}: Distances between nodes in the tree.

Constructor

AgglomerativeClustering(;
    n_clusters::Union{Int, Nothing}=2,
    metric::Union{String, Function}="euclidean",
    memory::Union{String, Nothing}=nothing,
    connectivity::Union{AbstractMatrix, Function, Nothing}=nothing,
    compute_full_tree::Union{Bool, String}="auto",
    linkage::String="ward",
    distance_threshold::Union{Float64, Nothing}=nothing,
    compute_distances::Bool=false
)

Constructs an AgglomerativeClustering object with the specified parameters.

Examples

# Create an AgglomerativeClustering object with 3 clusters
clustering = AgglomerativeClustering(n_clusters=3)

# Create an AgglomerativeClustering object with a distance threshold
clustering = AgglomerativeClustering(distance_threshold=1.5, linkage="single")
source
NovaML.Cluster.AgglomerativeClusteringMethod

(clustering::AgglomerativeClustering)(X::AbstractMatrix, type::Symbol)

Fit the clustering model and return the cluster labels.

Arguments

  • X::AbstractMatrix: The input data matrix.
  • type::Symbol: Must be :fit_predict to fit the model and return labels.

Returns

labels::Vector{Int}: The cluster labels for each input sample.

Examples

X = rand(100, 5)
clustering = AgglomerativeClustering(n_clusters=3)
labels = clustering(X, :fit_predict)
source
NovaML.Cluster.AgglomerativeClusteringMethod
(clustering::AgglomerativeClustering)(X::AbstractMatrix; y=nothing)

Perform agglomerative clustering on the input data.

Arguments

  • X::AbstractMatrix: The input data matrix where each row is a sample and each column is a feature.
  • y=nothing: Ignored. Present for API consistency.

Returns

clustering::AgglomerativeClustering: The fitted clustering object.

Examples

X = rand(100, 5)  # 100 samples, 5 features
clustering = AgglomerativeClustering(n_clusters=3)
fitted_clustering = clustering(X)
source
NovaML.Cluster.DBSCANType
(dbscan::DBSCAN)(X::AbstractMatrix, y=nothing; sample_weight=nothing)

Perform DBSCAN clustering on the input data.

Arguments

  • X::AbstractMatrix: The input data matrix where each row is a sample and each column is a feature.
  • y=nothing: Ignored. Present for API consistency.
  • sample_weight=nothing: Weight of each sample, used in computing the number of neighbors within eps.

Returns

dbscan::DBSCAN: The fitted DBSCAN object.

Examples

X = rand(100, 5)  # 100 samples, 5 features
dbscan = DBSCAN(eps=0.5, min_samples=5)
fitted_dbscan = dbscan(X)
source
NovaML.Cluster.DBSCANType
DBSCAN

A struct representing the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm.

Fields

  • eps::Float64: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
  • min_samples::Int: The number of samples in a neighborhood for a point to be considered as a core point.
  • metric::Union{String, Metric}: The metric to use when calculating distance between instances.
  • metric_params::Union{Nothing, Dict}: Additional keyword arguments for the metric function.
  • algorithm::Symbol: The algorithm to be used by the NearestNeighbors module.
  • leaf_size::Int: Leaf size passed to BallTree or KDTree.
  • p::Union{Nothing, Float64}: The power of the Minkowski metric to be used to calculate distance between points.
  • n_jobs::Union{Nothing, Int}: The number of parallel jobs to run.

Fitted Attributes

  • core_sample_indices_::Vector{Int}: Indices of core samples.
  • components_::Matrix{Float64}: Copy of each core sample found by training.
  • labels_::Vector{Int}: Cluster labels for each point in the dataset given to fit().
  • n_features_in_::Int: Number of features seen during fit.
  • feature_names_in_::Vector{String}: Names of features seen during fit.
  • fitted::Bool: Whether the model has been fitted.

Constructor

DBSCAN(;
    eps::Float64 = 0.5,
    min_samples::Int = 5,
    metric::Union{String, Metric} = "euclidean",
    metric_params::Union{Nothing, Dict} = nothing,
    algorithm::Symbol = :auto,
    leaf_size::Int = 30,
    p::Union{Nothing, Float64} = nothing,
    n_jobs::Union{Nothing, Int} = nothing
)

Constructs a DBSCAN object with the specified parameters.

Examples

```julia
# Create a DBSCAN object with default parameters
dbscan = DBSCAN()

# Create a DBSCAN object with custom parameters
dbscan = DBSCAN(eps=0.7, min_samples=10, metric="manhattan")
```

source
NovaML.Cluster.KMeansType
KMeans <: AbstractModel

Represents the K-Means clustering algorithm.

Fields

  • n_clusters::Int: The number of clusters to form.
  • init::Union{String, Matrix{Float64}, Function}: Method for initialization.
  • n_init::Union{Int, String}: Number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter::Int: Maximum number of iterations of the k-means algorithm for a single run.
  • tol::Float64: Relative tolerance with regards to inertia to declare convergence.
  • verbose::Int: Verbosity mode.
  • random_state::Union{Int, Nothing}: Determines random number generation for centroid initialization.
  • copy_x::Bool: When pre-computing distances it is more numerically accurate to center the data first.
  • algorithm::String: K-means algorithm to use.

Fitted Attributes

  • cluster_centers_::Union{Matrix{Float64}, Nothing}: Coordinates of cluster centers.
  • labels_::Union{Vector{Int}, Nothing}: Labels of each point.
  • inertia_::Union{Float64, Nothing}: Sum of squared distances of samples to their closest cluster center.
  • n_iter_::Union{Int, Nothing}: Number of iterations run.
source
NovaML.Cluster.KMeansType
(kmeans::KMeans)(X::AbstractVecOrMat{Float64}, y=nothing; sample_weight=nothing)

Compute k-means clustering.

Arguments

  • X::AbstractVecOrMat{Float64}: Training instances to cluster.
  • y: Ignored. Not used, present for API consistency by convention.
  • sample_weight: The weights for each observation in X.

Returns

  • If the model is not fitted, returns the fitted model.
  • If the model is already fitted, returns the predicted labels for X.
source
Base.showMethod

Base.show(io::IO, dbscan::DBSCAN)

Custom show method for DBSCAN objects.

Arguments

  • io::IO: The I/O stream to which the representation is written.
  • dbscan::DBSCAN: The DBSCAN object to be displayed.

Examples

dbscan = DBSCAN(eps=0.7, min_samples=10)
println(dbscan)
source
Base.showMethod
Base.show(io::IO, kmeans::KMeans)

Custom show method for KMeans instances.

Arguments

  • io::IO: The I/O stream.
  • kmeans::KMeans: The KMeans instance to display.
source
NovaML.Cluster.assign_labelsMethod
assign_labels(X::AbstractMatrix{Float64}, centroids::Matrix{Float64})

Assign labels to data points based on the nearest centroid.

Arguments

  • X::AbstractMatrix{Float64}: The input data.
  • centroids::Matrix{Float64}: The current centroids.

Returns

  • Vector{Int}: The assigned labels for each data point.
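The nearest-centroid assignment can be sketched as follows. This is an illustrative re-implementation of the idea, not NovaML's source; `toy_assign_labels` is a hypothetical helper name.

```julia
# For each row of X, pick the index of the closest centroid
# (squared Euclidean distance; rows of `centroids` are centers).
function toy_assign_labels(X::AbstractMatrix, centroids::AbstractMatrix)
    labels = Vector{Int}(undef, size(X, 1))
    for i in axes(X, 1)
        dists = [sum(abs2, X[i, :] .- centroids[c, :]) for c in axes(centroids, 1)]
        labels[i] = argmin(dists)
    end
    return labels
end

centroids = [0.0 0.0; 10.0 10.0]
toy_assign_labels([0.1 0.2; 9.8 10.1], centroids)  # [1, 2]
```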
source
NovaML.Cluster.compute_distancesMethod
compute_distances(X::AbstractMatrix, metric::Union{String, Function})

Compute the distance matrix for the input data using the specified metric.

Arguments

  • X::AbstractMatrix: The input data matrix.
  • metric::Union{String, Function}: The distance metric to use. Can be "euclidean", "manhattan", or a custom function.

Returns

distances::Matrix: The computed distance matrix.

Examples

X = rand(10, 3)
distances = compute_distances(X, "euclidean")
source
NovaML.Cluster.compute_inertiaFunction
compute_inertia(X::Matrix{Float64}, centroids::Matrix{Float64}, labels::Vector{Int}, sample_weight=nothing)

Compute the inertia, the sum of squared distances of samples to their closest cluster center.

Arguments

  • X::Matrix{Float64}: The input data.
  • centroids::Matrix{Float64}: The current centroids.
  • labels::Vector{Int}: The current label assignments.
  • sample_weight: The weights for each observation in X.

Returns

  • Float64: The computed inertia.
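The computation described here can be sketched in a few lines. This is an illustrative version, not NovaML's source; `toy_inertia` is a hypothetical helper name.

```julia
# Inertia: sum of squared distances from each point to its assigned
# centroid, optionally weighted per sample.
function toy_inertia(X, centroids, labels; sample_weight=ones(size(X, 1)))
    total = 0.0
    for i in axes(X, 1)
        total += sample_weight[i] * sum(abs2, X[i, :] .- centroids[labels[i], :])
    end
    return total
end

X = [0.0 0.0; 2.0 0.0]
toy_inertia(X, [1.0 0.0], [1, 1])  # 1.0 + 1.0 = 2.0
```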
source
NovaML.Cluster.fit_predictFunction
fit_predict(kmeans::KMeans, X::Matrix{Float64}, y=nothing; sample_weight=nothing)

Compute cluster centers and predict cluster index for each sample.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • X::Matrix{Float64}: New data to transform.
  • y: Ignored.
  • sample_weight: The weights for each observation in X.

Returns

  • Vector{Int}: Index of the cluster each sample belongs to.
source
NovaML.Cluster.fit_transformFunction
fit_transform(kmeans::KMeans, X::Matrix{Float64}, y=nothing; sample_weight=nothing)

Compute clustering and transform X to cluster-distance space.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • X::Matrix{Float64}: New data to transform.
  • y: Ignored.
  • sample_weight: The weights for each observation in X.

Returns

  • Matrix{Float64}: X transformed in the new space.
source
NovaML.Cluster.get_paramsMethod

get_params(dbscan::DBSCAN)

Get parameters for this estimator.

Returns

params::Dict: Parameter names mapped to their values.

source
NovaML.Cluster.get_paramsMethod
get_params(kmeans::KMeans; deep=true)

Get parameters for this estimator.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • deep::Bool: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

  • Dict: Parameter names mapped to their values.
source
NovaML.Cluster.initialize_centroidsMethod
initialize_centroids(kmeans::KMeans, X::Matrix{Float64})

Initialize the centroids for K-Means clustering.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • X::Matrix{Float64}: The input data.

Returns

  • Matrix{Float64}: The initial centroids.
source
NovaML.Cluster.kmeans_plus_plusMethod
kmeans_plus_plus(X::Matrix{Float64}, n_clusters::Int)

Perform K-Means++ initialization.

Arguments

  • X::Matrix{Float64}: The input data.
  • n_clusters::Int: The number of clusters.

Returns

  • Matrix{Float64}: The initial centroids chosen by K-Means++.
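K-Means++ picks the first center uniformly at random, then draws each subsequent center with probability proportional to the squared distance to the nearest center already chosen. A plain-Julia sketch of this scheme (illustrative only, not NovaML's source; `toy_kmeans_pp` is a hypothetical helper name):

```julia
using Random

function toy_kmeans_pp(X::Matrix{Float64}, k::Int; rng=Random.default_rng())
    n = size(X, 1)
    centroids = X[rand(rng, 1:n), :]'  # 1×d row: first center, uniform at random
    for _ in 2:k
        # Squared distance of each point to its nearest chosen center
        d2 = [minimum(sum(abs2, X[i, :] .- centroids[c, :]) for c in axes(centroids, 1))
              for i in 1:n]
        # Sample the next center with probability proportional to d2
        p = cumsum(d2 ./ sum(d2))
        next = findfirst(>=(rand(rng)), p)
        centroids = vcat(centroids, X[next, :]')
    end
    return centroids
end
```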
source
NovaML.Cluster.scoreFunction
score(kmeans::KMeans, X::Matrix{Float64}, y=nothing; sample_weight=nothing)

Return the negative of the K-means objective (inertia) evaluated on X.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • X::Matrix{Float64}: New data.
  • y: Ignored.
  • sample_weight: The weights for each observation in X.

Returns

  • Float64: The negative of the K-means objective (inertia) evaluated on X.
source
NovaML.Cluster.set_params!Method

set_params!(dbscan::DBSCAN; kwargs...)

Set the parameters of this estimator.

Arguments

kwargs...: Estimator parameters.

Returns

dbscan::DBSCAN: The DBSCAN object.

Examples

dbscan = DBSCAN()
set_params!(dbscan, eps=0.8, min_samples=15)
source
NovaML.Cluster.set_params!Method
set_params!(kmeans::KMeans; params...)

Set the parameters of this estimator.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • params...: Estimator parameters.

Returns

  • KMeans: The estimator instance.
source
NovaML.Cluster.transformMethod
transform(kmeans::KMeans, X::Matrix{Float64})

Transform X to a cluster-distance space.

Arguments

  • kmeans::KMeans: The KMeans instance.
  • X::Matrix{Float64}: New data to transform.

Returns

  • Matrix{Float64}: X transformed in the new space.
source
NovaML.Cluster.update_centroidsFunction
update_centroids(X::Matrix{Float64}, labels::Vector{Int}, n_clusters::Int, sample_weight=nothing)

Update the centroids based on the current label assignments.

Arguments

  • X::Matrix{Float64}: The input data.
  • labels::Vector{Int}: The current label assignments.
  • n_clusters::Int: The number of clusters.
  • sample_weight: The weights for each observation in X.

Returns

  • Matrix{Float64}: The updated centroids.
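The update step is a (weighted) mean of the points assigned to each cluster. A minimal sketch (illustrative, not NovaML's source; `toy_update_centroids` is a hypothetical helper name, and empty clusters are not handled):

```julia
# New centroid for each cluster = weighted mean of its assigned points.
function toy_update_centroids(X, labels, k; sample_weight=ones(size(X, 1)))
    d = size(X, 2)
    centroids = zeros(k, d)
    for c in 1:k
        idx = findall(==(c), labels)
        w = sample_weight[idx]
        centroids[c, :] = sum(X[idx, :] .* w; dims=1) ./ sum(w)
    end
    return centroids
end

toy_update_centroids([0.0 0.0; 2.0 0.0; 10.0 10.0], [1, 1, 2], 2)
# first centroid is the mean of the first two rows: (1.0, 0.0)
```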
source

Decomposition

NovaML.Decomposition.LatentDirichletAllocationType
LatentDirichletAllocation

Latent Dirichlet Allocation (LDA) with online variational Bayes algorithm.

LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model used for discovering abstract topics from a collection of documents.

Fields

  • n_components::Int: Number of topics.
  • doc_topic_prior::Union{Float64, Nothing}: Prior of document topic distribution.
  • topic_word_prior::Union{Float64, Nothing}: Prior of topic word distribution.
  • learning_method::Symbol: Method used to update the model: :batch for batch learning, :online for online learning.
  • learning_decay::Float64: Controls the rate at which the learning rate decreases.
  • learning_offset::Float64: A (positive) parameter that downweights early iterations in online learning.
  • max_iter::Int: The maximum number of iterations.
  • batch_size::Int: Number of documents to use in each EM iteration in online learning method.
  • evaluate_every::Int: How often to evaluate perplexity.
  • total_samples::Float64: Total number of documents.
  • perp_tol::Float64: Perplexity tolerance in batch learning.
  • mean_change_tol::Float64: Stopping tolerance for updating document topic distribution in E-step.
  • max_doc_update_iter::Int: Max number of iterations for updating document topic distribution in E-step.
  • n_jobs::Union{Int, Nothing}: The number of jobs to use in the E-step.
  • verbose::Int: Verbosity level.
  • random_state::Union{Int, Nothing}: Seed for random number generation.

Learned attributes

  • components_::Union{Matrix{Float64}, Nothing}: Topic word distribution. shape = (n_components, n_features)
  • exp_dirichlet_component_::Union{Matrix{Float64}, Nothing}: Exponential value of expectation of log topic word distribution. shape = (n_components, n_features)
  • n_batch_iter_::Int: Number of iterations of the EM step.
  • n_iter_::Int: Number of passes over the dataset.
  • bound_::Float64: Final perplexity score on training set.
  • n_features_in_::Int: Number of features seen during fit.
  • feature_names_in_::Union{Vector{String}, Nothing}: Names of features seen during fit.

Example

```julia
using NovaML

# Create an LDA model
lda = LatentDirichletAllocation(n_components=10, random_state=42)

# Fit the model to data
doc_topic_distr = lda(X)

# Transform new data
new_doc_topic_distr = lda(new_X)
```

source
NovaML.Decomposition.LatentDirichletAllocationMethod
(lda::LatentDirichletAllocation)(X::AbstractMatrix{T}; type=nothing) where T <: Real

Fit the model to X, or transform X if the model is already fitted.

Arguments

  • X::AbstractMatrix{T}: Document-term matrix.
  • type: Ignored. Present for API consistency.

Returns

  • If the model is not fitted, returns the document-topic distribution after fitting.
  • If the model is already fitted, returns the document-topic distribution for X.
source
NovaML.Decomposition.PCAType
PCA

Principal Component Analysis (PCA).

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

Fields

  • n_components::Union{Int, Float64, String, Nothing}: Number of components to keep.
  • whiten::Bool: When True, the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
  • fitted::Bool: Whether the PCA model has been fitted to data.

Fitted Attributes

  • components_::Union{Matrix{Float64}, Nothing}: Principal axes in feature space, representing the directions of maximum variance in the data.
  • explained_variance_::Union{Vector{Float64}, Nothing}: The amount of variance explained by each of the selected components.
  • explained_variance_ratio_::Union{Vector{Float64}, Nothing}: Percentage of variance explained by each of the selected components.
  • singular_values_::Union{Vector{Float64}, Nothing}: The singular values corresponding to each of the selected components.
  • mean_::Union{Vector{Float64}, Nothing}: Per-feature empirical mean, estimated from the training set.
  • n_samples_::Union{Int, Nothing}: Number of samples in the training data.
  • n_features_::Union{Int, Nothing}: Number of features in the training data.
  • n_components_::Union{Int, Nothing}: The estimated number of components.
  • noise_variance_::Union{Float64, Nothing}: The estimated noise covariance following the Probabilistic PCA model.

Example

pca = PCA(n_components=2)
X_transformed = pca(X)
X_inverse = pca(X_transformed, :inverse_transform)
source
NovaML.Decomposition.PCAMethod
(pca::PCA)(X::AbstractMatrix{T}) where T <: Real

Fit the model with X and apply the dimensionality reduction on X.

Arguments

  • X::AbstractMatrix{T}: Training data, where n_samples is the number of samples and n_features is the number of features.

Returns

  • Matrix{Float64}: Transformed values.
source
NovaML.Decomposition.PCAMethod
(pca::PCA)(X::AbstractMatrix{T}, mode::Symbol) where T <: Real

Transform data back to its original space.

Arguments

  • X::AbstractMatrix{T}: New data, where n_samples is the number of samples and n_components is the number of components.
  • mode::Symbol: Must be :inverse_transform.

Returns

  • Matrix{Float64}: X_original, the data transformed back to the original feature space.

Throws

  • ErrorException: If mode is not :inverse_transform.
source
Base.showMethod
Base.show(io::IO, lda::LatentDirichletAllocation)

Custom show method for LatentDirichletAllocation.

Arguments

  • io::IO: The I/O stream
  • lda::LatentDirichletAllocation: The LDA model to display
source
Base.showMethod
Base.show(io::IO, pca::PCA)

Custom show method for PCA.

Arguments

  • io::IO: The I/O stream.
  • pca::PCA: The PCA model to display.
source
NovaML.Decomposition._e_stepMethod
_e_step(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

E-step in EM update.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.

Returns

  • Matrix{Float64}: Document-topic distribution.
source
NovaML.Decomposition._fit_batchMethod
_fit_batch(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Fit the model to X using batch variational Bayes method.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.

Returns

  • Matrix{Float64}: Document-topic distribution.
source
NovaML.Decomposition._fit_onlineMethod
_fit_online(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Fit the model to X using online variational Bayes method.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.

Returns

  • Matrix{Float64}: Document-topic distribution.
source
NovaML.Decomposition._fit_transformMethod
_fit_transform(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Fit the model to X and return the document-topic distribution.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.

Returns

  • Matrix{Float64}: Document-topic distribution.
source
NovaML.Decomposition._m_stepMethod
_m_step(lda::LatentDirichletAllocation, X::AbstractMatrix{T}, doc_topic_distr::Matrix{Float64}, scale::Float64=1.0) where T <: Real

M-step in EM update.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.
  • doc_topic_distr::Matrix{Float64}: Document-topic distribution.
  • scale::Float64: Scaling factor for online update.
source
NovaML.Decomposition._perplexityMethod
_perplexity(lda::LatentDirichletAllocation, X::AbstractMatrix{T}, doc_topic_distr::Matrix{Float64}) where T <: Real

Calculate approximate perplexity for data X.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.
  • doc_topic_distr::Matrix{Float64}: Document-topic distribution.

Returns

  • Float64: The calculated approximate perplexity.
source
NovaML.Decomposition._transformMethod
_transform(lda::LatentDirichletAllocation, X::AbstractMatrix{T}) where T <: Real

Transform X to document-topic distribution.

Arguments

  • lda::LatentDirichletAllocation: The LDA model.
  • X::AbstractMatrix{T}: Document-term matrix.

Returns

  • Matrix{Float64}: Document-topic distribution.
source

Ensemble Methods

NovaML.Ensemble.AdaBoostClassifierType
AdaBoostClassifier <: AbstractModel

An AdaBoost classifier.

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

Fields

  • base_estimator::Any: The base estimator from which the boosted ensemble is built.
  • n_estimators::Int: The maximum number of estimators at which boosting is terminated.
  • learning_rate::Float64: Weight applied to each classifier at each boosting iteration.
  • algorithm::Symbol: The SAMME algorithm to use when fitting the model.
  • random_state::Union{Int, Nothing}: Controls the random seed given at each base_estimator at each boosting iteration.

Fitted Attributes

  • estimators_::Vector{Any}: The collection of fitted sub-estimators.
  • estimator_weights_::Vector{Float64}: Weights for each estimator in the boosted ensemble.
  • estimator_errors_::Vector{Float64}: Classification error for each estimator in the boosted ensemble.
  • classes_::Vector{Any}: The class labels.
  • n_classes_::Int: The number of classes.
  • feature_importances_::Union{Vector{Float64}, Nothing}: The feature importances if supported by the base_estimator.
  • fitted::Bool: Whether the model has been fitted.

Example

model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0)
model(X, y)  # Fit the model
predictions = model(X_test)  # Make predictions
probabilities = model(X_test, type=:probs)  # Get probability estimates
source
NovaML.Ensemble.AdaBoostClassifierMethod

(model::AdaBoostClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the AdaBoost model.

Arguments

  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values (class labels).

Returns

  • AdaBoostClassifier: The fitted model.
source
NovaML.Ensemble.AdaBoostClassifierMethod
(model::AdaBoostClassifier)(X::AbstractMatrix; type=nothing)

Predict using the AdaBoost model.

Arguments

  • X::AbstractMatrix: The input samples.
  • type: If set to :probs, return probability estimates for each class.

Returns

  • If type is :probs, returns probabilities of each class.
  • Otherwise, returns predicted class labels.
source
NovaML.Ensemble.BaggingClassifierType
BaggingClassifier <: AbstractModel

A Bagging classifier.

A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator is typically used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.

Fields

  • base_estimator::AbstractModel: The base estimator to fit on random subsets of the dataset.
  • n_estimators::Int: The number of base estimators in the ensemble.
  • max_samples::Union{Int, Float64}: The number of samples to draw from X to train each base estimator.
  • max_features::Union{Int, Float64}: The number of features to draw from X to train each base estimator.
  • bootstrap::Bool: Whether samples are drawn with replacement.
  • bootstrap_features::Bool: Whether features are drawn with replacement.
  • oob_score::Bool: Whether to use out-of-bag samples to estimate the generalization error.
  • warm_start::Bool: When set to true, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, fit a whole new ensemble.
  • random_state::Union{Int, Nothing}: Controls the random resampling of the original dataset.
  • verbose::Int: Controls the verbosity when fitting and predicting.

Fitted Attributes

  • estimators_::Vector{AbstractModel}: The collection of fitted base estimators.
  • estimators_features_::Vector{Vector{Int}}: The subset of drawn features for each base estimator.
  • classes_::Vector: The class labels.
  • n_classes_::Int: The number of classes.
  • oob_score_::Union{Float64, Nothing}: Score of the training dataset obtained using an out-of-bag estimate.
  • oob_decision_function_::Union{Matrix{Float64}, Nothing}: Decision function computed with out-of-bag estimate on the training set.
  • fitted::Bool: Whether the model has been fitted.

Example

model = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10)
model(X, y)  # Fit the model
predictions = model(X_test)  # Make predictions
probabilities = model(X_test, type=:probs)  # Get probability estimates
source
NovaML.Ensemble.BaggingClassifierMethod
(bc::BaggingClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the Bagging classifier.

Arguments

  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values (class labels).

Returns

  • BaggingClassifier: The fitted model.
source
NovaML.Ensemble.BaggingClassifierMethod
(bc::BaggingClassifier)(X::AbstractMatrix; type=nothing)

Predict class for X.

Arguments

  • X::AbstractMatrix: The input samples.
  • type: If set to :probs, return probability estimates for each class.

Returns

  • If type is :probs, returns probabilities of each class.
  • Otherwise, returns predicted class labels.
source
NovaML.Ensemble.GradientBoostingClassifierType
GradientBoostingClassifier <: AbstractModel

Gradient Boosting for classification.

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

Fields

  • loss::String: The loss function to be optimized.
  • learning_rate::Float64: Learning rate shrinks the contribution of each tree by learning_rate.
  • n_estimators::Int: The number of boosting stages to perform.
  • subsample::Float64: The fraction of samples to be used for fitting the individual base learners.
  • criterion::String: The function to measure the quality of a split.
  • min_samples_split::Union{Int, Float64}: The minimum number of samples required to split an internal node.
  • min_samples_leaf::Union{Int, Float64}: The minimum number of samples required to be at a leaf node.
  • min_weight_fraction_leaf::Float64: The minimum weighted fraction of the sum total of weights required to be at a leaf node.
  • max_depth::Union{Int, Nothing}: Maximum depth of the individual regression estimators.
  • min_impurity_decrease::Float64: A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
  • init::Union{AbstractModel, String, Nothing}: An estimator object that is used to compute the initial predictions.
  • random_state::Union{Int, Nothing}: Controls the random seed given at each tree_estimator at each boosting iteration.
  • max_features::Union{Int, Float64, String, Nothing}: The number of features to consider when looking for the best split.
  • verbose::Int: Enable verbose output.
  • max_leaf_nodes::Union{Int, Nothing}: Grow trees with max_leaf_nodes in best-first fashion.
  • warm_start::Bool: When set to true, reuse the solution of the previous call to fit and add more estimators to the ensemble.
  • validation_fraction::Float64: The proportion of training data to set aside as validation set for early stopping.
  • n_iter_no_change::Union{Int, Nothing}: Used to decide if early stopping will be used to terminate training when validation score is not improving.
  • tol::Float64: Tolerance for the early stopping.
  • ccp_alpha::Float64: Complexity parameter used for Minimal Cost-Complexity Pruning.

Fitted Attributes

  • estimators_::Vector{Vector{DecisionTreeRegressor}}: The collection of fitted sub-estimators.
  • classes_::Vector: The class labels.
  • n_classes_::Int: The number of classes.
  • feature_importances_::Union{Vector{Float64}, Nothing}: The feature importances.
  • oob_improvement_::Union{Vector{Float64}, Nothing}: The improvement in loss on the out-of-bag samples relative to the previous iteration.
  • train_score_::Vector{Float64}: The i-th score train_score_[i] is the loss of the model at iteration i on the in-bag sample.
  • n_estimators_::Int: The number of estimators as selected by early stopping.
  • init_::Union{AbstractModel, Nothing}: The estimator that provides the initial predictions.
  • fitted::Bool: Whether the model has been fitted.

Example

model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
model(X, y)  # Fit the model
predictions = model(X_test)  # Make predictions
probabilities = model(X_test, type=:probs)  # Get probability estimates
source
NovaML.Ensemble.GradientBoostingClassifierMethod
(gbm::GradientBoostingClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the gradient boosting model.

Arguments

  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values (class labels).

Returns

  • GradientBoostingClassifier: The fitted model.
source
NovaML.Ensemble.GradientBoostingClassifierMethod
(gbm::GradientBoostingClassifier)(X::AbstractMatrix; type=nothing)

Predict class for X.

Arguments

  • X::AbstractMatrix: The input samples.
  • type: If set to :probs, return probability estimates for each class.

Returns

  • If type is :probs, returns probabilities of each class.
  • Otherwise, returns predicted class labels.
source
NovaML.Ensemble.InitialEstimatorType
InitialEstimator <: AbstractModel

An initial estimator that always predicts a constant probability.

Fields

  • prob::Float64: The constant probability to predict.
source
NovaML.Ensemble.InitialEstimatorMethod
(estimator::InitialEstimator)(X::AbstractMatrix)

Predict using the initial estimator.

Arguments

  • X::AbstractMatrix: The input samples.

Returns

  • Vector{Float64}: The predictions.
source
NovaML.Ensemble.RandomForestClassifierType
RandomForestClassifier <: AbstractModel

A random forest classifier.

Random forests are an ensemble learning method for classification that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees.

Fields

  • n_estimators::Int: The number of trees in the forest.
  • max_depth::Union{Int, Nothing}: The maximum depth of the tree.
  • min_samples_split::Int: The minimum number of samples required to split an internal node.
  • min_samples_leaf::Int: The minimum number of samples required to be at a leaf node.
  • max_features::Union{Int, Float64, String, Nothing}: The number of features to consider when looking for the best split.
  • bootstrap::Bool: Whether bootstrap samples are used when building trees.
  • random_state::Union{Int, Nothing}: Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
  • trees::Vector{DecisionTreeClassifier}: The collection of fitted sub-estimators.
  • n_classes::Int: The number of classes.
  • classes::Vector: The class labels.
  • fitted::Bool: Whether the model has been fitted.
  • feature_importances_::Union{Vector{Float64}, Nothing}: The feature importances.
  • n_features::Int: The number of features when fitting the model.

Example

rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf(X, y)  # Fit the model
predictions = rf(X_test)  # Make predictions
source
NovaML.Ensemble.RandomForestClassifierMethod
(forest::RandomForestClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the random forest classifier.

Arguments

  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values (class labels).

Returns

  • RandomForestClassifier: The fitted model.
source
NovaML.Ensemble.RandomForestRegressorType
RandomForestRegressor <: AbstractModel

A random forest regressor.

Random forests are an ensemble learning method for regression that operate by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees.

Fields

  • n_estimators::Int: The number of trees in the forest.
  • criterion::String: The function to measure the quality of a split.
  • max_depth::Union{Int, Nothing}: The maximum depth of the tree.
  • min_samples_split::Int: The minimum number of samples required to split an internal node.
  • min_samples_leaf::Int: The minimum number of samples required to be at a leaf node.
  • min_weight_fraction_leaf::Float64: The minimum weighted fraction of the sum total of weights required to be at a leaf node.
  • max_features::Union{Int, Float64, String, Nothing}: The number of features to consider when looking for the best split.
  • max_leaf_nodes::Union{Int, Nothing}: Grow trees with max_leaf_nodes in best-first fashion.
  • min_impurity_decrease::Float64: A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
  • bootstrap::Bool: Whether bootstrap samples are used when building trees.
  • oob_score::Bool: Whether to use out-of-bag samples to estimate the generalization score.
  • n_jobs::Union{Int, Nothing}: The number of jobs to run in parallel.
  • random_state::Union{Int, Nothing}: Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
  • verbose::Int: Controls the verbosity when fitting and predicting.
  • warm_start::Bool: When set to true, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
  • ccp_alpha::Float64: Complexity parameter used for Minimal Cost-Complexity Pruning.
  • max_samples::Union{Int, Float64, Nothing}: If bootstrap is true, the number of samples to draw from X to train each base estimator.

Example

rf = RandomForestRegressor(n_estimators=100, max_depth=10)
rf(X, y)  # Fit the model
predictions = rf(X_test)  # Make predictions
source
NovaML.Ensemble.RandomForestRegressorMethod
(forest::RandomForestRegressor)(X::AbstractMatrix, y::AbstractVector)

Fit the random forest regressor.

Arguments

  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values.

Returns

  • RandomForestRegressor: The fitted model.
source
NovaML.Ensemble.RandomForestRegressorMethod
(forest::RandomForestRegressor)(X::AbstractMatrix)

Predict regression target for X.

Arguments

  • X::AbstractMatrix: The input samples.

Returns

  • Vector{Float64}: The predicted values.
source
NovaML.Ensemble.VotingClassifierType
VotingClassifier <: AbstractModel

A Voting Classifier for combining multiple machine learning classifiers.

This classifier combines a number of estimators to create a single classifier that makes predictions based on either hard voting (majority vote) or soft voting (weighted average of predicted probabilities).

Fields

  • estimators::Vector{Tuple{String, Any}}: List of (name, estimator) tuples.
  • voting::Symbol: The voting strategy, either :hard for majority voting or :soft for probability voting.
  • weights::Union{Vector{Float64}, Nothing}: Sequence of weights for each estimator in soft voting.
  • flatten_transform::Bool: Affects the shape of transform output.
  • verbose::Bool: If true, prints progress messages during fitting.

Fitted Attributes

  • estimators_::Vector{Any}: The fitted estimators.
  • classes_::Vector{Any}: The class labels.
  • fitted::Bool: Whether the classifier is fitted.

Example

estimators = [("lr", LogisticRegression()), ("rf", RandomForestClassifier())]
vc = VotingClassifier(estimators=estimators, voting=:soft)
vc(X, y)  # Fit the classifier
predictions = vc(X_test)  # Make predictions
source
NovaML.Ensemble.VotingClassifierMethod
(vc::VotingClassifier)(X::AbstractMatrix, y::AbstractVector)

Fit the voting classifier.

Arguments

  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values (class labels).

Returns

  • VotingClassifier: The fitted classifier.
source
NovaML.Ensemble.VotingClassifierMethod
(vc::VotingClassifier)(X::AbstractMatrix; type=nothing)

Predict class labels for X.

Arguments

  • X::AbstractMatrix: The input samples.
  • type: If set to :probs, return probability estimates for each class.

Returns

  • If type is :probs, returns probabilities of each class.
  • Otherwise, returns predicted class labels.
source
NovaML.Ensemble.ZeroEstimatorMethod
(::ZeroEstimator)(X::AbstractMatrix)

Predict using the zero estimator.

Arguments

  • X::AbstractMatrix: The input samples.

Returns

  • Vector{Float64}: Zero predictions.
source
Base.showMethod
Base.show(io::IO, model::AdaBoostClassifier)

Custom show method for AdaBoostClassifier.

Arguments

  • io::IO: The I/O stream.
  • model::AdaBoostClassifier: The AdaBoost model to display.
source
Base.showMethod
Base.show(io::IO, bc::BaggingClassifier)

Custom show method for BaggingClassifier.

Arguments

  • io::IO: The I/O stream.
  • bc::BaggingClassifier: The Bagging classifier to display.
source
Base.showMethod
Base.show(io::IO, gbm::GradientBoostingClassifier)

Custom show method for GradientBoostingClassifier.

Arguments

  • io::IO: The I/O stream.
  • gbm::GradientBoostingClassifier: The gradient boosting model to display.
source
Base.showMethod
Base.show(io::IO, forest::RandomForestClassifier)

Custom show method for RandomForestClassifier.

Arguments

  • io::IO: The I/O stream.
  • forest::RandomForestClassifier: The random forest classifier to display.
source
Base.showMethod
Base.show(io::IO, forest::RandomForestRegressor)

Custom show method for RandomForestRegressor.

Arguments

  • io::IO: The I/O stream.
  • forest::RandomForestRegressor: The random forest regressor to display.
source
Base.showMethod
Base.show(io::IO, vc::VotingClassifier)

Custom show method for VotingClassifier.

Arguments

  • io::IO: The I/O stream.
  • vc::VotingClassifier: The voting classifier to display.
source
NovaML.Ensemble._compute_feature_importancesMethod
_compute_feature_importances(model::AdaBoostClassifier)

Compute feature importances for the AdaBoost model.

Arguments

  • model::AdaBoostClassifier: The fitted AdaBoost model.

Returns

  • Union{Vector{Float64}, Nothing}: The feature importances if available, otherwise nothing.
source
NovaML.Ensemble._compute_oob_scoreMethod
_compute_oob_score(bc::BaggingClassifier, X::AbstractMatrix, y::AbstractVector)

Compute out-of-bag score for the Bagging classifier.

Arguments

  • bc::BaggingClassifier: The Bagging classifier.
  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values.
source
NovaML.Ensemble._generate_indicesMethod
_generate_indices(bc::BaggingClassifier, n_samples::Int)

Generate sample indices for individual base estimators.

Arguments

  • bc::BaggingClassifier: The Bagging classifier.
  • n_samples::Int: The number of samples in the dataset.

Returns

  • Vector{Int}: The generated sample indices.
source
NovaML.Ensemble.bootstrap_sampleMethod
bootstrap_sample(forest::RandomForestClassifier, X::AbstractMatrix, y::AbstractVector)

Create a bootstrap sample of the dataset.

Arguments

  • forest::RandomForestClassifier: The random forest classifier.
  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values.

Returns

  • Tuple{AbstractMatrix, AbstractVector}: The bootstrapped samples and targets.
source
NovaML.Ensemble.calculate_tree_feature_importanceMethod
calculate_tree_feature_importance(tree::DecisionTreeClassifier, feature_indices::Vector{Int}, n_features::Int)

Calculate the feature importance for a single decision tree.

Arguments

  • tree::DecisionTreeClassifier: The decision tree.
  • feature_indices::Vector{Int}: The indices of the features used in this tree.
  • n_features::Int: The total number of features.

Returns

  • Vector{Float64}: The feature importances.
source
NovaML.Ensemble.compute_feature_importancesMethod
compute_feature_importances(gbm::GradientBoostingClassifier)

Compute feature importances for the gradient boosting model.

Arguments

  • gbm::GradientBoostingClassifier: The fitted gradient boosting model.

Returns

  • Vector{Float64}: The feature importances.
source
NovaML.Ensemble.compute_lossMethod
compute_loss(y::AbstractVector, y_pred::AbstractVector, loss::String)

Compute the loss for the given predictions.

Arguments

  • y::AbstractVector: The true values.
  • y_pred::AbstractVector: The predicted values.
  • loss::String: The loss function name.

Returns

  • Float64: The computed loss.
source
NovaML.Ensemble.compute_negative_gradientMethod
compute_negative_gradient(y::AbstractVector, y_pred::AbstractVector, loss::String)

Compute negative gradient for the given loss function.

Arguments

  • y::AbstractVector: The true values.
  • y_pred::AbstractVector: The predicted values.
  • loss::String: The loss function name.

Returns

  • AbstractVector: The negative gradient.
source
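For the binomial deviance (log loss) commonly used in gradient boosting classification, the negative gradient reduces to the residual between the labels and the predicted probabilities. A plain-Julia sketch of that relationship (independent of NovaML's actual implementation, which may differ):

```julia
# Negative gradient of the log loss with respect to the raw score F(x):
# -dL/dF = y - sigmoid(F), i.e. label minus predicted probability.
sigmoid(z) = 1 / (1 + exp(-z))

y      = [1.0, 0.0, 1.0]    # true labels
y_pred = [0.5, -0.2, 1.3]   # raw (pre-sigmoid) model scores
neg_grad = y .- sigmoid.(y_pred)
```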
NovaML.Ensemble.compute_oob_scoreMethod
compute_oob_score(forest::RandomForestRegressor, X::AbstractMatrix, y::AbstractVector)

Compute out-of-bag (OOB) score for the random forest regressor.

Arguments

  • forest::RandomForestRegressor: The random forest regressor.
  • X::AbstractMatrix: The input samples.
  • y::AbstractVector: The target values.

Returns

  • Tuple{Float64, Vector{Float64}}: The OOB score and OOB predictions.
source
NovaML.Ensemble.decision_functionMethod
decision_function(model::AdaBoostClassifier, X::AbstractMatrix)

Compute the decision function of X.

Arguments

  • model::AdaBoostClassifier: The fitted AdaBoost model.
  • X::AbstractMatrix: The input samples.

Returns

  • Matrix{Float64}: The decision function of the input samples.
source
NovaML.Ensemble.fit_initial_estimatorMethod
fit_initial_estimator(y::AbstractVector)

Fit an initial estimator based on the mean of y.

Arguments

  • y::AbstractVector: The target values.

Returns

  • InitialEstimator: The fitted initial estimator.
source
NovaML.Ensemble.get_max_featuresMethod
get_max_features(forest::RandomForestClassifier, n_features::Int)

Get the number of features to consider when looking for the best split.

Arguments

  • forest::RandomForestClassifier: The random forest classifier.
  • n_features::Int: The total number of features.

Returns

  • Int: The number of features to consider.
source
NovaML.Ensemble.get_max_featuresMethod

get_max_features(forest::RandomForestRegressor, n_features::Int)

Get the number of features to consider when looking for the best split.

Arguments

  • forest::RandomForestRegressor: The random forest regressor.
  • n_features::Int: The total number of features.

Returns

  • Int: The number of features to consider.

source
NovaML.Ensemble.get_paramsMethod
get_params(model::AdaBoostClassifier; deep=true)

Get parameters for this estimator.

Arguments

  • model::AdaBoostClassifier: The AdaBoost model.
  • deep::Bool: If true, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

  • Dict: Parameter names mapped to their values.
source
NovaML.Ensemble.set_params!Method
set_params!(model::AdaBoostClassifier; kwargs...)

Set the parameters of this estimator.

Arguments

  • model::AdaBoostClassifier: The AdaBoost model.
  • kwargs...: Estimator parameters.

Returns

  • AdaBoostClassifier: The estimator instance.
source
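get_params and set_params! pair naturally when inspecting or retuning an estimator; a sketch (the keyword names shown are assumptions about the model's fields):

```julia
using NovaML.Ensemble

model = AdaBoostClassifier(n_estimators=50)
params = get_params(model)   # Dict mapping parameter names to current values
set_params!(model; n_estimators=100, learning_rate=0.5)  # update in place
```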
NovaML.Ensemble.staged_predictMethod
staged_predict(model::AdaBoostClassifier, X::AbstractMatrix)

Return a generator of predictions for each boosting iteration.

Arguments

  • model::AdaBoostClassifier: The fitted AdaBoost model.
  • X::AbstractMatrix: The input samples.

Returns

  • Channel: A generator of predictions at each stage.
source
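Since staged_predict returns a Channel, the per-iteration predictions can be consumed lazily in a for loop. A sketch, assuming model has already been fitted and X_test / y_test are a held-out split:

```julia
# Track how accuracy evolves as boosting stages are added
for (i, y_stage) in enumerate(staged_predict(model, X_test))
    acc = sum(y_stage .== y_test) / length(y_test)
    println("stage $i: accuracy = $acc")
end
```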
NovaML.Ensemble.staged_predict_probaMethod
staged_predict_proba(model::AdaBoostClassifier, X::AbstractMatrix)

Return a generator of predicted probabilities for each boosting iteration.

Arguments

  • model::AdaBoostClassifier: The fitted AdaBoost model.
  • X::AbstractMatrix: The input samples.

Returns

  • Channel: A generator of predicted probabilities at each stage.
source
NovaML.Ensemble.transformMethod
transform(vc::VotingClassifier, X::AbstractMatrix)

Return class labels or probabilities for X for each estimator.

Arguments

  • vc::VotingClassifier: The fitted voting classifier.
  • X::AbstractMatrix: The input samples.

Returns

  • If voting is :soft, returns the probabilities for each class for each estimator.
  • If voting is :hard, returns the class label predictions of each estimator.

The shape of the return depends on the flatten_transform parameter.

source
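A sketch of inspecting per-estimator outputs via transform (estimator constructors and their module paths are assumptions for illustration):

```julia
using NovaML.Ensemble

estimators = [("lr", LogisticRegression()), ("rf", RandomForestClassifier())]
vc = VotingClassifier(estimators=estimators, voting=:soft)
vc(X, y)                                # fit the ensemble
per_estimator = transform(vc, X_test)   # soft voting: probabilities from each fitted estimator
```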