API

Models

NearestNeighborModels.KNNClassifier — Type
KNNClassifier

A model type for constructing a K-nearest neighbor classifier, based on NearestNeighborModels.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

KNNClassifier = @load KNNClassifier pkg=NearestNeighborModels

Do model = KNNClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in KNNClassifier(K=...).

KNNClassifier implements the K-Nearest Neighbors classifier, a non-parametric algorithm that predicts a discrete class distribution for a new point by taking a vote over the classes of its k-nearest points. Each neighbor's vote is assigned a weight based on the proximity of that neighbor to the test point, according to a specified distance metric.
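
For illustration, a minimal sketch of the weighted vote for a single test point (hypothetical values; not the package implementation):

classes = ["a", "b", "a"]  # classes of the k = 3 nearest neighbors
weights = [0.5, 0.3, 0.2]  # kernel-generated weights (see the weights hyper-parameter below)
votes = Dict{String,Float64}()
for (c, w) in zip(classes, weights)
    votes[c] = get(votes, c, 0.0) + w
end
probs = Dict(c => v/sum(values(votes)) for (c, v) in votes)  # Dict("a" => 0.7, "b" => 0.3)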

For more information about the weighting kernels, see the paper by Geler et al., "Comparison of different weighting schemes for the kNN classifier on time-series data".

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

OR

mach = machine(model, X, y, w)

Here:

  • X is any table of input features (eg, a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X).

  • y is the target, which can be any AbstractVector whose element scitype is <:Finite (<:Multiclass or <:OrderedFactor will do); check the scitype with scitype(y)

  • w is the observation weights, which can be either nothing (default) or an AbstractVector whose element scitype is Count or Continuous. This is different from the weights kernel, which is a model hyper-parameter; see below.

Train the machine using fit!(mach, rows=...).
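
As a sketch of the second form, one can bind hypothetical, randomly generated observation weights (the w below is illustrative only):

using MLJ
KNNClassifier = @load KNNClassifier pkg=NearestNeighborModels
X, y = @load_crabs
w = rand(length(y))  # hypothetical per-observation weights
mach = machine(KNNClassifier(), X, y, w)
fit!(mach, rows=1:150)  # train on the first 150 rows only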

Hyper-parameters

  • K::Int=5 : number of neighbors
  • algorithm::Symbol = :kdtree : one of (:kdtree, :brutetree, :balltree)
  • metric::Metric = Euclidean() : any Metric from Distances.jl for the distance between points. For algorithm = :kdtree only metrics which are instances of Union{Distances.Chebyshev, Distances.Cityblock, Distances.Euclidean, Distances.Minkowski, Distances.WeightedCityblock, Distances.WeightedEuclidean, Distances.WeightedMinkowski} are supported.
  • leafsize::Int = 10 : determines the number of points at which to stop splitting the tree. This option is ignored and always taken as 0 for algorithm = :brutetree, since brutetree isn't actually a tree.
  • reorder::Bool = true : if true then points which are close in distance are placed close in memory. In this case, a copy of the original data will be made so that the original data is left unmodified. Setting this to true can significantly improve performance of the specified algorithm (except :brutetree). This option is ignored and always taken as false for algorithm = :brutetree.
  • weights::KNNKernel=Uniform() : kernel used in assigning weights to the k-nearest neighbors for each observation. An instance of one of the types in list_kernels(). User-defined weighting functions can be passed by wrapping the function in a UserDefinedKernel kernel (do ?NearestNeighborModels.UserDefinedKernel for more info). If observation weights w are passed during machine construction then the weight assigned to each neighbor vote is the product of the kernel generated weight for that neighbor and the corresponding observation weight.

Operations

  • predict(mach, Xnew): Return predictions of the target given features Xnew, which should have the same scitype as X above. Predictions are probabilistic but uncalibrated.

  • predict_mode(mach, Xnew): Return the modes of the probabilistic predictions returned above.

Fitted parameters

The fields of fitted_params(mach) are:

  • tree: An instance of KDTree, BruteTree or BallTree, depending on the value of the algorithm hyper-parameter (see the hyper-parameters section above). These are data structures that store the training data in a form that speeds up nearest neighbor searches on test points.

Examples

using MLJ
KNNClassifier = @load KNNClassifier pkg=NearestNeighborModels
X, y = @load_crabs; # a table and a vector from the crabs dataset
# view possible kernels
NearestNeighborModels.list_kernels()
# KNNClassifier instantiation
model = KNNClassifier(weights = NearestNeighborModels.Inverse())
mach = machine(model, X, y) |> fit! # wrap model and required data in an MLJ machine and fit
y_hat = predict(mach, X)
labels = predict_mode(mach, X)

See also MultitargetKNNClassifier

NearestNeighborModels.KNNRegressor — Type
KNNRegressor

A model type for constructing a K-nearest neighbor regressor, based on NearestNeighborModels.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels

Do model = KNNRegressor() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in KNNRegressor(K=...).

KNNRegressor implements the K-Nearest Neighbors regressor, a non-parametric algorithm that predicts the response for a new point by taking a weighted average of the responses of its K-nearest points.
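
A minimal sketch of this weighted average for a single test point (hypothetical values; not the package implementation):

responses = [2.0, 3.0, 10.0]  # responses of the K = 3 nearest neighbors
weights   = [0.5, 0.3, 0.2]   # kernel-generated weights (see the weights hyper-parameter below)
y_pred = sum(weights .* responses) / sum(weights)  # 3.9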

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

OR

mach = machine(model, X, y, w)

Here:

  • X is any table of input features (eg, a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X).

  • y is the target, which can be any AbstractVector whose element scitype is Continuous; check the scitype with scitype(y).

  • w is the observation weights, which can be either nothing (default) or an AbstractVector whose element scitype is Count or Continuous. This is different from the weights kernel, which is a model hyper-parameter; see below.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • K::Int=5 : number of neighbors
  • algorithm::Symbol = :kdtree : one of (:kdtree, :brutetree, :balltree)
  • metric::Metric = Euclidean() : any Metric from Distances.jl for the distance between points. For algorithm = :kdtree only metrics which are instances of Union{Distances.Chebyshev, Distances.Cityblock, Distances.Euclidean, Distances.Minkowski, Distances.WeightedCityblock, Distances.WeightedEuclidean, Distances.WeightedMinkowski} are supported.
  • leafsize::Int = 10 : determines the number of points at which to stop splitting the tree. This option is ignored and always taken as 0 for algorithm = :brutetree, since brutetree isn't actually a tree.
  • reorder::Bool = true : if true then points which are close in distance are placed close in memory. In this case, a copy of the original data will be made so that the original data is left unmodified. Setting this to true can significantly improve performance of the specified algorithm (except :brutetree). This option is ignored and always taken as false for algorithm = :brutetree.
  • weights::KNNKernel=Uniform() : kernel used in assigning weights to the k-nearest neighbors for each observation. An instance of one of the types in list_kernels(). User-defined weighting functions can be passed by wrapping the function in a UserDefinedKernel kernel (do ?NearestNeighborModels.UserDefinedKernel for more info). If observation weights w are passed during machine construction then the weight assigned to each neighbor vote is the product of the kernel generated weight for that neighbor and the corresponding observation weight.

Operations

  • predict(mach, Xnew): Return predictions of the target given features Xnew, which should have the same scitype as X above.

Fitted parameters

The fields of fitted_params(mach) are:

  • tree: An instance of KDTree, BruteTree or BallTree, depending on the value of the algorithm hyper-parameter (see the hyper-parameters section above). These are data structures that store the training data in a form that speeds up nearest neighbor searches on test points.

Examples

using MLJ
KNNRegressor = @load KNNRegressor pkg=NearestNeighborModels
X, y = @load_boston; # a table and a vector from the Boston house prices dataset
# view possible kernels
NearestNeighborModels.list_kernels()
model = KNNRegressor(weights = NearestNeighborModels.Inverse()) #KNNRegressor instantiation
mach = machine(model, X, y) |> fit! # wrap model and required data in an MLJ machine and fit
y_hat = predict(mach, X)

See also MultitargetKNNRegressor

NearestNeighborModels.MultitargetKNNRegressor — Type
MultitargetKNNRegressor

A model type for constructing a multitarget K-nearest neighbor regressor, based on NearestNeighborModels.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

MultitargetKNNRegressor = @load MultitargetKNNRegressor pkg=NearestNeighborModels

Do model = MultitargetKNNRegressor() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MultitargetKNNRegressor(K=...).

Multi-target K-Nearest Neighbors regressor (MultitargetKNNRegressor) is a variation of KNNRegressor that assumes the target variable is vector-valued with Continuous components. (Target data must be presented as a table, however.)
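
For example, a NamedTuple of vectors is a valid table of Continuous responses (the column names below are hypothetical):

using MLJ
y = (height = rand(10), weight = rand(10))
scitype(y)  # Table{AbstractVector{Continuous}}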

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

OR

mach = machine(model, X, y, w)

Here:

  • X is any table of input features (eg, a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X).

  • y is the target, which can be any table of responses whose element scitype is Continuous; check column scitypes with schema(y).

  • w is the observation weights, which can be either nothing (default) or an AbstractVector whose element scitype is Count or Continuous. This is different from the weights kernel, which is a model hyper-parameter; see below.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • K::Int=5 : number of neighbors
  • algorithm::Symbol = :kdtree : one of (:kdtree, :brutetree, :balltree)
  • metric::Metric = Euclidean() : any Metric from Distances.jl for the distance between points. For algorithm = :kdtree only metrics which are instances of Union{Distances.Chebyshev, Distances.Cityblock, Distances.Euclidean, Distances.Minkowski, Distances.WeightedCityblock, Distances.WeightedEuclidean, Distances.WeightedMinkowski} are supported.
  • leafsize::Int = 10 : determines the number of points at which to stop splitting the tree. This option is ignored and always taken as 0 for algorithm = :brutetree, since brutetree isn't actually a tree.
  • reorder::Bool = true : if true then points which are close in distance are placed close in memory. In this case, a copy of the original data will be made so that the original data is left unmodified. Setting this to true can significantly improve performance of the specified algorithm (except :brutetree). This option is ignored and always taken as false for algorithm = :brutetree.
  • weights::KNNKernel=Uniform() : kernel used in assigning weights to the k-nearest neighbors for each observation. An instance of one of the types in list_kernels(). User-defined weighting functions can be passed by wrapping the function in a UserDefinedKernel kernel (do ?NearestNeighborModels.UserDefinedKernel for more info). If observation weights w are passed during machine construction then the weight assigned to each neighbor vote is the product of the kernel generated weight for that neighbor and the corresponding observation weight.

Operations

  • predict(mach, Xnew): Return predictions of the target given features Xnew, which should have the same scitype as X above.

Fitted parameters

The fields of fitted_params(mach) are:

  • tree: An instance of KDTree, BruteTree or BallTree, depending on the value of the algorithm hyper-parameter (see the hyper-parameters section above). These are data structures that store the training data in a form that speeds up nearest neighbor searches on test points.

Examples

using MLJ

# Create Data
X, y = make_regression(10, 5, n_targets=2)

# load MultitargetKNNRegressor
MultitargetKNNRegressor = @load MultitargetKNNRegressor pkg=NearestNeighborModels

# view possible kernels
NearestNeighborModels.list_kernels()

# MultitargetKNNRegressor instantiation
model = MultitargetKNNRegressor(weights = NearestNeighborModels.Inverse())

# Wrap model and required data in an MLJ machine and fit.
mach = machine(model, X, y) |> fit! 

# Predict
y_hat = predict(mach, X)

See also KNNRegressor

NearestNeighborModels.MultitargetKNNClassifier — Type
MultitargetKNNClassifier

A model type for constructing a multitarget K-nearest neighbor classifier, based on NearestNeighborModels.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

MultitargetKNNClassifier = @load MultitargetKNNClassifier pkg=NearestNeighborModels

Do model = MultitargetKNNClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in MultitargetKNNClassifier(K=...).

Multi-target K-Nearest Neighbors Classifier (MultitargetKNNClassifier) is a variation of KNNClassifier that assumes the target variable is vector-valued with Multiclass or OrderedFactor components. (Target data must be presented as a table, however.)

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

OR

mach = machine(model, X, y, w)

Here:

  • X is any table of input features (eg, a DataFrame) whose columns are of scitype Continuous; check column scitypes with schema(X).

  • y is the target, which can be any table of responses whose element scitype is either <:Finite (<:Multiclass or <:OrderedFactor will do); check the column scitypes with schema(y). Each column of y is assumed to belong to a common categorical pool.

  • w is the observation weights, which can be either nothing (default) or an AbstractVector whose element scitype is Count or Continuous. This is different from the weights kernel, which is a model hyper-parameter; see below.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • K::Int=5 : number of neighbors
  • algorithm::Symbol = :kdtree : one of (:kdtree, :brutetree, :balltree)
  • metric::Metric = Euclidean() : any Metric from Distances.jl for the distance between points. For algorithm = :kdtree only metrics which are instances of Union{Distances.Chebyshev, Distances.Cityblock, Distances.Euclidean, Distances.Minkowski, Distances.WeightedCityblock, Distances.WeightedEuclidean, Distances.WeightedMinkowski} are supported.
  • leafsize::Int = 10 : determines the number of points at which to stop splitting the tree. This option is ignored and always taken as 0 for algorithm = :brutetree, since brutetree isn't actually a tree.
  • reorder::Bool = true : if true then points which are close in distance are placed close in memory. In this case, a copy of the original data will be made so that the original data is left unmodified. Setting this to true can significantly improve performance of the specified algorithm (except :brutetree). This option is ignored and always taken as false for algorithm = :brutetree.
  • weights::KNNKernel=Uniform() : kernel used in assigning weights to the k-nearest neighbors for each observation. An instance of one of the types in list_kernels(). User-defined weighting functions can be passed by wrapping the function in a UserDefinedKernel kernel (do ?NearestNeighborModels.UserDefinedKernel for more info). If observation weights w are passed during machine construction then the weight assigned to each neighbor vote is the product of the kernel generated weight for that neighbor and the corresponding observation weight.
  • output_type::Type{<:MultiUnivariateFinite}=DictTable : one of (ColumnTable, DictTable). The type of table to use for predictions. Setting this to ColumnTable might improve performance for narrow tables, while DictTable improves performance for wide tables.
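
A sketch of setting this option (after @load, the NearestNeighborModels namespace is available, so ColumnTable can be qualified as below):

using MLJ
MultitargetKNNClassifier = @load MultitargetKNNClassifier pkg=NearestNeighborModels
model = MultitargetKNNClassifier(output_type = NearestNeighborModels.ColumnTable)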

Operations

  • predict(mach, Xnew): Return predictions of the target given features Xnew, which should have the same scitype as X above. Predictions are either a ColumnTable or a DictTable of UnivariateFiniteVector columns, depending on the value of the output_type parameter discussed above. The probabilistic predictions are uncalibrated.

  • predict_mode(mach, Xnew): Return the modes of each column of the table of probabilistic predictions returned above.

Fitted parameters

The fields of fitted_params(mach) are:

  • tree: An instance of KDTree, BruteTree or BallTree, depending on the value of the algorithm hyper-parameter (see the hyper-parameters section above). These are data structures that store the training data in a form that speeds up nearest neighbor searches on test points.

Examples

using MLJ, StableRNGs

# set rng for reproducibility
rng = StableRNG(10)

# Dataset generation
n, p = 10, 3
X = table(randn(rng, n, p)) # feature table
fruit, color = categorical(["apple", "orange"]), categorical(["blue", "green"])
y = [(fruit = rand(rng, fruit), color = rand(rng, color)) for _ in 1:n] # target_table
# Each column in y has a common categorical pool as expected
selectcols(y, :fruit) # categorical array
selectcols(y, :color) # categorical array

# Load MultitargetKNNClassifier
MultitargetKNNClassifier = @load MultitargetKNNClassifier pkg=NearestNeighborModels

# view possible kernels
NearestNeighborModels.list_kernels()

# MultitargetKNNClassifier instantiation
model = MultitargetKNNClassifier(K=3, weights = NearestNeighborModels.Inverse())

# wrap model and required data in an MLJ machine and fit
mach = machine(model, X, y) |> fit!

# predict
y_hat = predict(mach, X)
labels = predict_mode(mach, X)

See also KNNClassifier


Kernels

NearestNeighborModels.UserDefinedKernel — Type
UserDefinedKernel(;func::Function = x->nothing, sort::Bool=false)

Wrap a user-defined nearest neighbors weighting function func as a KNNKernel.

Keywords

  • func : user-defined nearest neighbors weighting function. The function should have the signature func(dists_matrix)::Union{Nothing, <:AbstractMatrix}. Here dists_matrix is an n by K matrix of nearest neighbor distances, where n is the number of samples in the test dataset and K is the number of neighbors. func should output either nothing or an AbstractMatrix of the same shape as dists_matrix. If func(dists_matrix) returns nothing, then all k-nearest neighbors in each row are assigned equal weights.
  • sort : if true, requests that dists_matrix be sorted before being passed to func. The sort puts the k-nearest neighbors in each row of dists_matrix in ascending order.
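
A minimal sketch of a custom kernel, here an inverse-square weighting (the function name invsq is hypothetical):

using NearestNeighborModels

# weight each neighbor by 1/d^2; eps() guards against division by zero for exact matches
invsq(dists) = 1 ./ (dists .^ 2 .+ eps())
kernel = UserDefinedKernel(func = invsq, sort = true)

# pass as the weights hyper-parameter of any model above, e.g.
# model = KNNClassifier(weights = kernel)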