Synthetic data

This project looked for datasets with a class imbalance in their decision attribute, so that synthetic data generation methods (ADASYN and Roulette) could be applied to obtain a dataset suitable for training a machine learning model. The dataset used was cancer, which can be seen here.

Dataset

The database consists of 699 instances and 11 attributes, the decision attribute being 'Class'. It contains 458 instances of class '2' (65.52%) and 241 instances of class '4' (34.48%). This imbalance is a problem because a model trained on it can become biased toward the majority class.
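As a quick check of the figures above, the class shares can be recomputed from the two class counts. This is only an arithmetic sketch; the variable names are illustrative and not taken from the project code.

```python
# Class counts quoted in the text (class '2' and class '4').
counts = {"2": 458, "4": 241}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
imbalance_ratio = counts["2"] / counts["4"]

for label, share in shares.items():
    print(f"class {label}: {counts[label]} instances ({share:.2f}%)")
print(f"total: {total}, majority/minority ratio: {imbalance_ratio:.2f}")
```

The two counts add up to 699 labelled instances, which is the total the quoted percentages are based on.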

Data preparation

Missing values and data imputation

First, the data was checked for missing values by inspecting the type of each attribute. The "Bare Nuclei" attribute was found to have missing values, encoded as "?": it was read as object type, where integer or float values would be expected. A random (hot-deck) imputation was therefore carried out on this attribute.
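A minimal sketch of random hot-deck imputation, assuming the missing entries appear as the string "?" and that each one is replaced by a value drawn at random from the observed entries of the same column. The function name `hot_deck_impute` and the sample column are hypothetical, not the project's own code or data.

```python
import random

def hot_deck_impute(values, missing="?", seed=0):
    """Replace each missing marker with a randomly chosen observed value (donor)."""
    rng = random.Random(seed)
    donors = [int(v) for v in values if v != missing]
    if not donors:
        raise ValueError("no observed values to draw donors from")
    return [int(v) if v != missing else rng.choice(donors) for v in values]

# Hypothetical sample of the "Bare Nuclei" column as read from the raw file.
bare_nuclei = ["1", "10", "2", "?", "4", "1", "?", "10"]
imputed = hot_deck_impute(bare_nuclei)
print(imputed)
```

Because donors are drawn from the observed values themselves, frequent values are more likely to be chosen, which keeps the imputed column close to the original distribution.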

Mean normalization

So that all attributes lie on a comparable scale, mean normalization was applied. Since this is a linear rescaling, it does not affect the distribution/behavior of the data.
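The text does not give the exact formula, so the sketch below assumes the common definition of mean normalization, x' = (x - mean) / (max - min), applied to one attribute at a time. The function name and the sample column are hypothetical.

```python
def mean_normalize(column):
    """Center a column on its mean and divide by its range (max - min)."""
    lo, hi = min(column), max(column)
    mean = sum(column) / len(column)
    span = hi - lo
    if span == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(x - mean) / span for x in column]

# Hypothetical values for a single attribute on the 1-10 scale.
clump_thickness = [5, 5, 3, 6, 4, 8, 1, 2]
normalized = mean_normalize(clump_thickness)
print([round(v, 3) for v in normalized])
```

After this step every attribute has mean 0 and a range of 1, so no single attribute dominates distance-based methods such as KNN.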

Outliers

Since several of the attributes above were identified as having outliers, it was explored whether those outliers belong to a specific class in order to determine whether they are valid. Because the outliers are associated with a specific class, "4", they can be interpreted as valid behavior of the data in that class.
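The text does not say how the outliers were detected. The sketch below assumes the usual box-plot rule (values beyond 1.5 times the interquartile range from the quartiles) and then counts the flagged values per class. The function name and the sample data are hypothetical.

```python
import statistics
from collections import Counter

def iqr_outlier_flags(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical attribute values with their class labels (2 or 4).
mitoses = [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 8, 10, 7, 1]
labels = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4]

flags = iqr_outlier_flags(mitoses)
outliers_per_class = Counter(lab for lab, flag in zip(labels, flags) if flag)
print(dict(outliers_per_class))
```

If nearly all flagged values fall in one class, as in this toy sample, they are better read as characteristic of that class than as errors to be removed.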

Data analysis

PCA

A dimensionality reduction method, principal component analysis (PCA), was used to select the components whose cumulative information (explained variance) reached 80 and 90 percent. The image on the side shows the components and their contribution of information until a minimum of 90 percent is reached. The minority class was then assigned the value 1 and the majority class 0 for easier data handling.
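The selection step can be sketched as a cumulative sum, assuming "information" means the explained-variance ratio of each principal component. The ratios below are hypothetical placeholders, not the values obtained for the cancer data, and `components_for` is an illustrative name.

```python
from itertools import accumulate

def components_for(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return len(ratios)

# Hypothetical explained-variance ratios, sorted from largest to smallest.
ratios = [0.67, 0.09, 0.06, 0.05, 0.04, 0.04, 0.03, 0.01, 0.01]

k80 = components_for(ratios, 0.80)
k90 = components_for(ratios, 0.90)
print(f"components for 80%: {k80}, for 90%: {k90}")

# Recode the decision attribute: minority class '4' -> 1, majority class '2' -> 0.
classes = [2, 4, 2, 2, 4]
binary = [1 if c == 4 else 0 for c in classes]
```

With these placeholder ratios, three components cover at least 80 percent and five cover at least 90 percent.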

ADASYN

ADASYN (Adaptive Synthetic Sampling) is a method for creating synthetic data. Its essential idea is to use a weighted distribution over the minority class examples according to their level of learning difficulty: more synthetic data is generated for minority examples that are harder to learn than for those that are easier to learn.
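The idea can be illustrated with a simplified, standard-library sketch. It assumes the usual formulation: the difficulty of a minority example is the share of majority examples among its k nearest neighbours, and synthetic points are interpolated between a minority example and one of its minority neighbours. The function names, parameters and toy data are hypothetical; this is not the project's implementation.

```python
import math
import random

def nearest(point, candidates, k):
    """Return the k candidates closest to point (Euclidean), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point[0], c[0]))[:k]

def adasyn(samples, minority=1, k=3, beta=1.0, seed=0):
    """samples: list of (features, label). Returns a list of synthetic feature tuples."""
    rng = random.Random(seed)
    minority_pts = [s for s in samples if s[1] == minority]
    majority_count = len(samples) - len(minority_pts)
    total_needed = (majority_count - len(minority_pts)) * beta

    # Difficulty of each minority point: share of majority points among its k neighbours.
    ratios = [sum(1 for n in nearest(p, samples, k) if n[1] != minority) / k
              for p in minority_pts]
    norm = sum(ratios)
    if norm == 0:
        return []  # no minority point has majority neighbours

    synthetic = []
    for p, r in zip(minority_pts, ratios):
        n_new = round(r / norm * total_needed)  # harder points get more samples
        neighbours = nearest(p, minority_pts, k)
        if not neighbours:
            continue
        for _ in range(n_new):
            q = rng.choice(neighbours)
            lam = rng.random()
            # Interpolate between the minority point and a minority neighbour.
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(p[0], q[0])))
    return synthetic

# Hypothetical 2-D toy data: label 0 is the majority class, label 1 the minority.
majority = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (2, 3), (3, 3), (4, 3)]
minority = [(4, 4), (5, 5), (6, 5), (6, 6)]
samples = [(p, 0) for p in majority] + [(p, 1) for p in minority]

synthetic = adasyn(samples)
print(len(synthetic), "synthetic points")
```

In this toy data only the minority point (4, 4) sits next to majority points, so all new samples are generated around it. Because the per-point counts are rounded, the total can differ slightly from the exact class gap on other data.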

Roulette method

This method is used instead of purely random selection to prevent the distribution of the data from being significantly altered. The roulette method is based on probability and randomness: the probability of each observed value of each attribute is computed, a random number between zero and one is generated, and the attribute value whose probability is closest to that number is selected as the synthetic value.
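A minimal sketch of the selection step for a single attribute. It assumes the standard roulette-wheel formulation, in which "closest probability" is read as the first value whose cumulative probability reaches the random number; the project's own matching rule may differ. The function name and the sample column are hypothetical.

```python
import random
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

def roulette_sampler(column, seed=0):
    """Build a draw() function that samples values in proportion to their frequency."""
    rng = random.Random(seed)
    counts = Counter(column)
    values = sorted(counts)
    total = len(column)
    cumulative = list(accumulate(counts[v] / total for v in values))
    cumulative[-1] = 1.0  # guard against floating-point rounding

    def draw():
        # Spin the wheel: pick the first value whose cumulative probability
        # is at least the random number in [0, 1).
        return values[bisect_left(cumulative, rng.random())]

    return draw

# Hypothetical observed values of one attribute.
column = [1, 1, 1, 1, 2, 3, 3, 10, 10, 10]
draw = roulette_sampler(column)
drawn = [draw() for _ in range(2000)]
share_of_ones = drawn.count(1) / len(drawn)
print(f"share of value 1 among draws: {share_of_ones:.2f}")
```

Each value is drawn with a probability equal to its observed frequency, so the synthetic column follows roughly the same distribution as the original one.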

Training

Decision tree & KNN

The models trained to perform classification were a decision tree and k-nearest neighbors (KNN). The data was split into 80 percent for the training set and the remaining 20 percent for the test set.
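The split and one of the two classifiers can be sketched with the standard library alone. This is an illustration on hypothetical toy data, not the project's training code: the decision tree is omitted for brevity, and k = 3 is an assumed value that the text does not specify.

```python
import math
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def knn_predict(train, features, k=3):
    """Predict the majority label among the k training rows closest to features."""
    neighbours = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical, well-separated toy data: 25 points per class.
rows = [((x, y), 0) for x in range(5) for y in range(5)]
rows += [((x, y), 1) for x in range(10, 15) for y in range(10, 15)]

train, test = train_test_split(rows)
accuracy = sum(knn_predict(train, f) == label for f, label in test) / len(test)
print(f"train: {len(train)}, test: {len(test)}, accuracy: {accuracy:.2f}")
```

Fixing the seed makes the split reproducible, so that models trained on the original and the balanced data can be compared on the same partition.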

Results

Once synthetic data had been created to balance the classes, and it had been confirmed that the distribution of the database was not affected, two different models (DT and KNN) were trained with two different percentages of information (80 and 90%) calculated by PCA. The results can be seen in the following tables, which also include a comparison with the unbalanced (original) database.

Results for the cancer dataset applying only PCA.

Results for the cancer database applying PCA and synthetic data.