Multi-Class Classification Algorithms

Multi-class classification is a supervised machine learning task in which a model distinguishes between three or more distinct classes or categories. It is commonly applied when each instance in the dataset belongs to one of several known classes, and the goal is to correctly assign new instances to one of these classes. This approach is particularly useful when sufficient labeled data is available for all classes, allowing the model to learn complex patterns that differentiate among them.

Common Parameters

  • Algorithm: The algorithm to be used for multi-class classification. Options include:

    • Multi-Layer Perceptron (MLP): A multi-layer perceptron-based approach that uses neural networks for classification tasks.

    • Random Forest: A random forest-based approach that uses an ensemble of decision trees for classification tasks.

Note

The sample class used for multi-class classification is always the Unclassified class (see Unclassified Class). Any new samples added to the dataset are automatically placed in the Unclassified class; they therefore do not influence the trained models, but they are still taken into account during classification scoring.

The algorithm’s specific parameters are explained in detail below.

Multi-Layer Perceptron (MLP)

A multi-layer perceptron (MLP) is a type of artificial neural network used for supervised learning tasks such as classification. It consists of an input layer, one or more hidden layers with non-linear activation functions (such as ReLU), and an output layer that produces class probabilities. The network is trained using the Adam optimizer[1], which adaptively tunes the learning rate for each weight, enabling efficient training on complex, non-linear data. MLP is well-suited for multi-class classification problems, especially when data relationships are not easily captured by linear models. For reference, see Learning representations by back-propagating errors[2].
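As a minimal illustration, the architecture described above can be expressed with scikit-learn's MLPClassifier. This is only a sketch under the assumption of a scikit-learn-style implementation; the layer sizes and other values shown are placeholders, not the product's actual configuration.

```python
# Sketch of a two-hidden-layer MLP trained with the Adam optimizer.
# Assumes a scikit-learn-style implementation; all values are illustrative only.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # input -> two hidden layers -> output
    activation="relu",            # non-linear activation in the hidden layers
    solver="adam",                # Adam adapts the learning rate per weight
    random_state=0,
)
# After mlp.fit(X_train, y_train), mlp.predict_proba(X_new) returns
# class probabilities (a softmax over the output layer).
```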

In this setup, the input features are first reduced using Principal Component Analysis (PCA), which projects the data into a lower-dimensional space that retains the most meaningful variance. This reduces noise and improves training efficiency. To assess model performance, leave-one-out cross-validation (LOOCV), also known as jackknife resampling, is applied; see Cross-Validatory Choice and Assessment of Statistical Predictions[3]. In this strategy, the model is trained on all samples except one, which is then used for testing, and this is repeated for every data point. LOOCV provides a nearly unbiased and rigorous estimate of the classifier's generalization performance, which is especially beneficial when working with small datasets.
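The sketch below shows how the PCA reduction and LOOCV evaluation could be wired together with scikit-learn. The synthetic dataset, the number of retained components, and the model settings are all illustrative assumptions, not values used by the product.

```python
# Illustrative PCA + MLP pipeline evaluated with leave-one-out cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))   # small, high-dimensional synthetic dataset
y = rng.integers(0, 3, size=30)  # three classes

model = make_pipeline(
    PCA(n_components=10),        # keep the directions with the most variance
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)

# LOOCV: train on all samples but one, test on the held-out sample, repeat.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
```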

For the interpretation of the charts generated by MLP, refer to the section on Multi-Class Classification: Interpreting Training Quality Charts.

Hyperparameters

  • Regularization: also named alpha, this parameter specifies the strength of the L2 regularization term in the multi-layer perceptron. Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. A larger value of alpha increases the regularization strength, which can help reduce overfitting but might also lead to underfitting if set too high.

  • Max. iterations: sets the maximum number of iterations for the solver to converge. It determines how many times the solver will go through the dataset to update the weights. If the number of iterations is too low, the model might not converge to the optimal solution. Conversely, setting it too high can lead to unnecessary computation time, especially if the model converges early. Both parameters are illustrated in the sketch after this list.
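For orientation, the two exposed hyperparameters correspond to the alpha and max_iter arguments of a scikit-learn-style MLP; this mapping is an assumption made for illustration, and the values below are placeholders rather than recommended settings.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    alpha=1e-3,    # "Regularization": strength of the L2 penalty on the weights
    max_iter=300,  # "Max. iterations": upper bound on optimizer passes over the data
)
```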

Note

As raw data is often high-dimensional, a PCA reduction to at most 100 components is applied internally before training.

Note

The MLP implementation uses two hidden layers whose sizes are automatically determined from the number of features remaining after PCA reduction, with a maximum of 32 neurons for the first layer and 16 for the second. This dynamic sizing helps balance model complexity and generalization ability.
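The exact sizing rule is not documented here, but the capping behaviour described in this note can be sketched as follows; treat this purely as a hypothetical illustration, not the product's actual formula.

```python
def hidden_layer_sizes(n_features_after_pca: int) -> tuple[int, int]:
    """Hypothetical sizing rule: grow with the feature count, capped at 32 and 16.
    This is an assumption made for illustration only."""
    first = min(32, max(1, n_features_after_pca))
    second = min(16, max(1, first // 2))
    return (first, second)

print(hidden_layer_sizes(100))  # (32, 16): caps reached for high-dimensional inputs
print(hidden_layer_sizes(8))    # (8, 4): smaller layers for low-dimensional inputs
```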

Note

Some other hyperparameters are fixed; they are not exposed to the user, as they were found to have little impact on the quality of the results. These settings, illustrated in the sketch after this list, include:

  • Hidden layers activation: ReLU

  • Output layer activation: Softmax

  • Early stopping: enabled
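In scikit-learn terms, used here only for illustration, these fixed choices would look roughly as follows; the softmax output is implicit, since MLPClassifier normalizes its multi-class output to probabilities.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    activation="relu",    # fixed hidden-layer activation
    early_stopping=True,  # stop when the validation score no longer improves
)
# predict_proba applies a softmax-style normalization for multi-class output.
```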

Random Forest

Random Forest is an ensemble-based supervised learning algorithm that constructs a large number of decision trees during training. Each tree is trained on a random subset of the data and considers a random subset of features when making splits. During prediction, each tree votes for a class, and the final output is determined by majority voting across all trees. This ensemble approach makes random forests robust against overfitting, capable of capturing both linear and non-linear feature interactions, and effective in handling datasets with many input variables. For reference, see Random Forests[4].
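A minimal sketch of this ensemble idea, assuming a scikit-learn-style RandomForestClassifier, is shown below; the parameter values are illustrative only.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,     # number of decision trees in the ensemble
    max_features="sqrt",  # each split considers a random subset of features
    bootstrap=True,       # each tree is trained on a random bootstrap sample of the data
    random_state=0,
)
# After forest.fit(X_train, y_train), forest.predict(X_new) returns the majority
# vote across trees, and forest.predict_proba(X_new) the averaged vote fractions.
```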

In this setup, the classifier is trained using the full set of original features, without any dimensionality reduction. This allows the model to utilize all available information, while its internal feature selection ensures it focuses on the most informative variables during training. Model performance is evaluated using leave-one-out cross-validation (LOOCV), where the model is repeatedly trained on all data points except one, which is held out for testing; see Cross-Validatory Choice and Assessment of Statistical Predictions[3]. This exhaustive validation strategy provides a nearly unbiased estimate of prediction error, making it especially valuable when working with limited data and aiming for high-confidence performance evaluation.
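A hedged sketch of this evaluation, reusing the synthetic data conventions from the MLP example above, might look as follows; note that no dimensionality reduction is applied before the forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))   # full, unreduced feature set
y = rng.integers(0, 3, size=30)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=LeaveOneOut())  # one fold per sample
print("LOOCV accuracy:", scores.mean())
```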

For the interpretation of the charts generated by Random Forest, refer to the section on Multi-Class Classification: Interpreting Training Quality Charts.

Hyperparameters

  • Nbr. estimators: specifies the number of trees in the forest. Generally, a larger number of trees increases the performance of the model and makes the predictions more stable, but it also increases the computational cost and the time required to train the model.

  • Max. features: determines the maximum number of features considered for splitting a node. Using a subset of features can improve the model’s generalization by reducing variance. Possible values, illustrated in the sketch after this list, include:

    • \(\sqrt{n_{\text{features}}}\): the square root of the number of features.

    • a custom value: a user-defined maximum number of features to consider for each split, allowing for tailored model complexity.
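Assuming these settings map onto a scikit-learn-style max_features argument (an illustrative assumption), the two options would translate roughly as follows.

```python
from sklearn.ensemble import RandomForestClassifier

sqrt_forest = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features) per split
custom_forest = RandomForestClassifier(max_features=20)    # user-defined cap per split
```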