Fungus classification

This project focuses on preparing and analyzing a database to train models that are capable of classifying whether a fungus is edible or poisonous.

Dataset

This Mushroom dataset contains 8124 instances, 22 attributes for different funguses, and 2 classes, poisonous and edible with 48.2% and 51.8% of class balance respectively.

Data preparation

Class balance

The class balance is analyzed in order to discern if it is necessary to apply any synthetic data technique to balance classes and not create biases in the results. In this case it is not necessary as can be seen in the bar graph

Data preparation

Missing values

The next step is to know if there are missing values, looking at attribute by attribute, it was found that only one attribute had missing values, symbolized with the "?" sign. In the class-based distribution for the stalk-root attribute, it shows the missing values.

Data preparation

Missing values

Based on the distribution of the data, a data imputation is performed. The specific technique is the imputation of means by class. This technique generates data for each missing value based on its class and the mean of certain neighbors. By applying this technique, the distribution of this attribute was obtained again and it was observed that the data were correctly imputed.

Data analisys

Selecting better attributes

In this section, it was proposed to attack the problem in three different ways, an analysis was made using Pearson's Correlation, PCA and MCA. After extensive experimentation using machine learning methods (KNN and Decision tree) and K-fold. The best approximation was using the Pearson correlation matrix.

Results

After selecting the most relevant attributes using the three approaches, the nearest neighbor (KNN) and decision tree (DT) methods were trained and evaluated using various metrics. The best results were when using Pearson matrix and DT.