Solutions (work in progress) to the exercises on common plots in machine learning.
%matplotlib inline
%load_ext pyensae
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from jyquickhelper import add_notebook_menu
add_notebook_menu()
This module uses data from the Wine Quality Data Set, where the goal is to predict the quality of a wine from its chemical characteristics.
from pyensae.datasource import download_data, DownloadDataException
uci = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"
try:
    download_data("winequality-red.csv", url=uci)
    download_data("winequality-white.csv", url=uci)
except DownloadDataException:
    # Fall back to the mirror if the UCI repository is unreachable.
    print("backup")
    download_data("winequality-red.csv", website="xd")
    download_data("winequality-white.csv", website="xd")
%head winequality-red.csv
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality" 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5 7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5 7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5 11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6 7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5 7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5 7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5 7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7 7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7
import pandas
red_wine = pandas.read_csv("winequality-red.csv", sep=";")
red_wine["red"] = 1
white_wine = pandas.read_csv("winequality-white.csv", sep=";")
white_wine["red"] = 0
wines = pandas.concat([red_wine, white_wine])
wines.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | red |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 1 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 1 |
We split the data into a training set and a test set:
from sklearn.model_selection import train_test_split
X = wines[[c for c in wines.columns if c != "quality"]]
Y = wines["quality"]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
type(x_train), type(y_train)
(pandas.core.frame.DataFrame, pandas.core.series.Series)
wines.shape, x_train.shape, y_train.shape
((6497, 13), (4352, 12), (4352,))
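The shapes above follow from the `test_size=0.33` fraction. As a quick check, the sketch below splits a plain index array of the same length (a stand-in for the wine table, not the real data) and compares the sizes with the rounded-up test count; that scikit-learn rounds the test count up is an assumption about its behaviour that the result is consistent with.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 6497 rows of the combined wine table (indices only).
n_samples = 6497
idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.33, random_state=42
)

# Assumed rule: the test count is the fraction rounded up,
# and the training set receives the remaining rows.
n_test_expected = math.ceil(0.33 * n_samples)
print(len(idx_train), len(idx_test), n_test_expected)
```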
Choose a model and estimate its parameters as well as possible.
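One possible way to approach this step is a cross-validated grid search. The sketch below is only an illustration: the choice of `RandomForestRegressor`, the parameter grid, and the synthetic data (`X_demo`, `y_demo`) are all assumptions made so that the example runs on its own, and none of them comes from the exercise itself. With the real data, `x_train` and `y_train` would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the wine features (hypothetical, not the real dataset).
rng = np.random.RandomState(42)
X_demo = rng.rand(300, 5)
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Small, arbitrary grid chosen for illustration only.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

# 3-fold cross-validation on the training set selects the best combination,
# then the best model is refit on the whole training set.
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(x_tr, y_tr)

best_params = search.best_params_
test_r2 = search.score(x_te, y_te)  # R^2 of the refit model on held-out data
print(best_params, test_r2)
```

The test set is used only once, after the search, so the reported score is not biased by the parameter selection.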