KNN & K-Fold
In this project, cancer type and star type are classified by implementing the k-nearest neighbors (KNN) model together with an evaluation technique called K-Fold cross-validation. KNN is a supervised, instance-based learning method commonly used to classify a sample according to the classes of its nearest neighbors. K-Fold cross-validation is used to estimate the prediction performance of a model.
Datasets
Two datasets were used: the star-type dataset and the cancer-type dataset. Their properties can be viewed and downloaded through their respective links. Only the first dataset is described on this site.
Data preparation
Variable type
Because the data will be analyzed with the PCA technique, every attribute must be of numeric type.
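As a sketch of this step, assuming the data is loaded into a pandas DataFrame (the file name and columns are illustrative, not the real dataset schema):

```python
import pandas as pd

# Hypothetical file name; replace with the actual dataset path.
df = pd.read_csv("star_type.csv")

# Encode any non-numeric (categorical) attribute as integer codes
# so that every column is numeric before running PCA.
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].astype("category").cat.codes
```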
Data normalization
Since the attributes are not measured on comparable scales, a normalization (mean normalization) is applied; this rescaling does not affect the shape (distribution) of the data.
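A minimal sketch of mean normalization, assuming the attributes are the columns of a NumPy array and the usual formula x' = (x - mean) / (max - min):

```python
import numpy as np

def mean_normalize(X):
    """Subtract each column's mean and divide by its range (max - min)."""
    mean = X.mean(axis=0)
    value_range = X.max(axis=0) - X.min(axis=0)
    return (X - mean) / value_range

# Synthetic data standing in for the real attributes, with very different scales.
X = np.random.rand(100, 4) * np.array([1, 10, 100, 1000])
X_norm = mean_normalize(X)
```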
Outliers
We then looked for outliers, whether valid or invalid, to determine whether any preparation step was required. The results showed that the outliers are valid, so no further action was taken and we continued.
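The text does not state which outlier criterion was used; the sketch below assumes a standard IQR (boxplot) rule purely for illustration:

```python
import numpy as np

def iqr_outlier_mask(X, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR], column by column."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    return (X < q1 - factor * iqr) | (X > q3 + factor * iqr)

# Synthetic data standing in for the normalized attributes.
X = np.random.randn(200, 4)
# Rows with at least one flagged value are then inspected by hand to decide
# whether they are valid observations or data-entry errors.
outlier_rows = iqr_outlier_mask(X).any(axis=1)
```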
Data analysis
PCA
A dimensionality reduction method, principal component analysis (PCA), was used to select the components whose cumulative explained variance reached 90, 80, and 70 percent.
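A sketch of how the number of components can be chosen for each threshold, assuming scikit-learn's PCA and interpreting the percentages as cumulative explained variance (the project may implement PCA by hand):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data standing in for the normalized dataset.
X = np.random.randn(200, 10)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance
# reaches each of the thresholds used in the project.
for threshold in (0.90, 0.80, 0.70):
    n_components = int(np.argmax(cumulative >= threshold)) + 1
    print(f"{threshold:.0%}: {n_components} components")
```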
Training
Elbow method
To train the KNN model correctly, it is necessary to find the number of neighbors at which the loss (or the accuracy) begins to stabilize. In this case, the best number of neighbors was 7.
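A minimal sketch of the elbow search over k, assuming a scikit-learn KNN classifier and a simple train/validation split (the project may use its own KNN implementation, and the dataset loader is only a stand-in for the project's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; in the project the prepared cancer/star data is used.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Validation error for increasing k; the "elbow" is the k at which
# the curve flattens out (reported as 7 in this project).
errors = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_val, y_val))

for k, err in enumerate(errors, start=1):
    print(f"k={k}: error={err:.3f}")
```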
Results
The results are shown in the table below, which lists all combinations of the tests carried out using 7 nearest neighbors and three different K-Fold settings (3, 5, and 10). Our results were also compared against the scikit-learn implementation.
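As a reference point for the scikit-learn comparison, a sketch of evaluating a 7-neighbor classifier with 3-, 5-, and 10-fold cross-validation (the dataset loader here is only a stand-in for the project's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; in the project the prepared cancer/star data is used.
X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=7)

# Mean accuracy for each K-Fold setting used in the project (3, 5, and 10 folds).
for folds in (3, 5, 10):
    scores = cross_val_score(knn, X, y, cv=folds)
    print(f"{folds}-fold accuracy: {scores.mean():.3f}")
```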