.. _2017session6rst:

====================================
Session 26/6/2017 - machine learning
====================================

.. only:: html

    **Links:** :download:`notebook <2017_session6.ipynb>`, :downloadlink:`html <2017_session62html.html>`, :download:`PDF <2017_session6.pdf>`, :download:`python <2017_session6.py>`, :downloadlink:`slides <2017_session6.slides.html>`, :githublink:`GitHub|_doc/notebooks/sessions/2017_session6.ipynb|*`

A first look at the three machine learning problems described in the article `Machine Learning - session 6 `__.

.. code::

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

Problem 1: random forest versus linear model
--------------------------------------------

This is a regression problem. We compare a random forest with a linear model.

-  Comparison of the coefficient tests of a linear model `OLS `__ with the `features importance `__
-  Result for a single observation with `treeinterpreter `__
-  Data: `Housing `__, `Forest Fire `__

Data
~~~~

.. code::

    import pandas
    # whitespace-separated file without a header row
    df = pandas.read_csv("data/housing.data", delim_whitespace=True, header=None)
    df.head()

.. code::

    cols = "CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV".split()
    df.columns = cols
    df.head()

.. code::

    # MEDV is the target, every other column is a feature
    X = df.drop("MEDV", axis=1)
    y = df["MEDV"]
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Random Forest
~~~~~~~~~~~~~

.. code::

    from sklearn.ensemble import RandomForestRegressor
    clr = RandomForestRegressor()
    clr.fit(X, y)

.. parsed-literal::

    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                          max_features='auto', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_jobs=None, oob_score=False, random_state=None,
                          verbose=0, warm_start=False)

.. code::

    importances = clr.feature_importances_
    importances

.. parsed-literal::

    array([0.04377988, 0.00098713, 0.00669368, 0.00092023, 0.02074126,
           0.4076033 , 0.01283279, 0.06458251, 0.0036521 , 0.01366027,
           0.01576497, 0.01209152, 0.39669034])

We follow the example `Feature importances with forests of trees `__.

.. code::

    %matplotlib inline

.. code::

    import matplotlib.pyplot as plt
    import numpy as np
    plt.figure(figsize=(12,4))
    # features sorted by decreasing importance
    indices = np.argsort(importances)[::-1]
    # spread of each importance across the trees of the forest
    std = np.std([tree.feature_importances_ for tree in clr.estimators_], axis=0)
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    xlabels = list(df.columns[:-1])
    xlabels = [xlabels[i] for i in indices]
    plt.xticks(range(X.shape[1]), xlabels)
    plt.xlim([-1, X.shape[1]])
    plt.show()

.. image:: 2017_session6_12_0.png

.. code::

    from sklearn.metrics import r2_score
    r2_score(y_train, clr.predict(X_train))

.. parsed-literal::

    0.979236135448983

.. code::

    r2_score(y_test, clr.predict(X_test))

.. parsed-literal::

    0.9843720628811461

The forest was fitted on the whole data set ``(X, y)``, which contains the test rows, so the score on the test set is not an out-of-sample estimate.

Linear model
~~~~~~~~~~~~

.. code::

    import statsmodels.api as sm

.. code::

    model = sm.OLS(y_train, X_train)

.. code::

    results = model.fit()

.. code::

    results.params

.. parsed-literal::

    CRIM      -0.109867
    ZN         0.045855
    INDUS     -0.054801
    CHAS       3.758792
    NOX       -5.285538
    RM         6.267853
    AGE       -0.022554
    DIS       -1.169496
    RAD        0.221785
    TAX       -0.010410
    PTRATIO   -0.447963
    B          0.016474
    LSTAT     -0.259210
    dtype: float64

.. code::

    results.summary()

.. raw:: html

    <pre>
    OLS Regression Results

    Dep. Variable:     MEDV               R-squared:           0.958
    Model:             OLS                Adj. R-squared:      0.957
    Method:            Least Squares      F-statistic:         575.3
    Date:              Fri, 07 Jun 2019   Prob (F-statistic):  7.10e-216
    Time:              16:38:39           Log-Likelihood:      -1025.5
    No. Observations:  339                AIC:                 2077.
    Df Residuals:      326                BIC:                 2127.
    Df Model:          13
    Covariance Type:   nonrobust

                  coef   std err         t     P&gt;|t|    [0.025    0.975]
    CRIM       -0.1099     0.039    -2.845     0.005    -0.186    -0.034
    ZN          0.0459     0.018     2.543     0.011     0.010     0.081
    INDUS      -0.0548     0.079    -0.697     0.486    -0.209     0.100
    CHAS        3.7588     1.116     3.370     0.001     1.564     5.953
    NOX        -5.2855     4.398    -1.202     0.230   -13.938     3.367
    RM          6.2679     0.405    15.461     0.000     5.470     7.065
    AGE        -0.0226     0.017    -1.325     0.186    -0.056     0.011
    DIS        -1.1695     0.249    -4.692     0.000    -1.660    -0.679
    RAD         0.2218     0.083     2.672     0.008     0.058     0.385
    TAX        -0.0104     0.005    -2.155     0.032    -0.020    -0.001
    PTRATIO    -0.4480     0.137    -3.261     0.001    -0.718    -0.178
    B           0.0165     0.003     4.767     0.000     0.010     0.023
    LSTAT      -0.2592     0.065    -3.959     0.000    -0.388    -0.130

    Omnibus:           170.276            Durbin-Watson:       1.981
    Prob(Omnibus):     0.000              Jarque-Bera (JB):    1726.398
    Skew:              1.839              Prob(JB):            0.00
    Kurtosis:          13.426             Cond. No.            8.91e+03

    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 8.91e+03. This might indicate that there are
    strong multicollinearity or other numerical problems.
    </pre>

.. code::

    # same regression on the whole data set, without LSTAT
    model = sm.OLS(y, X.drop("LSTAT", axis=1))
    results = model.fit()
    results.summary()

.. raw:: html

    <pre>
    OLS Regression Results

    Dep. Variable:     MEDV               R-squared:           0.954
    Model:             OLS                Adj. R-squared:      0.953
    Method:            Least Squares      F-statistic:         846.6
    Date:              Fri, 07 Jun 2019   Prob (F-statistic):  2.38e-320
    Time:              16:38:39           Log-Likelihood:      -1556.1
    No. Observations:  506                AIC:                 3136.
    Df Residuals:      494                BIC:                 3187.
    Df Model:          12
    Covariance Type:   nonrobust

                  coef   std err         t     P&gt;|t|    [0.025    0.975]
    CRIM       -0.1439     0.036    -3.990     0.000    -0.215    -0.073
    ZN          0.0413     0.015     2.696     0.007     0.011     0.071
    INDUS      -0.0370     0.068    -0.540     0.589    -0.172     0.098
    CHAS        3.2525     0.961     3.384     0.001     1.364     5.141
    NOX       -10.8653     3.422    -3.175     0.002   -17.590    -4.141
    RM          7.1436     0.289    24.734     0.000     6.576     7.711
    AGE        -0.0449     0.014    -3.235     0.001    -0.072    -0.018
    DIS        -1.2292     0.206    -5.980     0.000    -1.633    -0.825
    RAD         0.2008     0.071     2.829     0.005     0.061     0.340
    TAX        -0.0100     0.004    -2.391     0.017    -0.018    -0.002
    PTRATIO    -0.6575     0.112    -5.881     0.000    -0.877    -0.438
    B           0.0165     0.003     5.779     0.000     0.011     0.022

    Omnibus:           277.013            Durbin-Watson:       0.927
    Prob(Omnibus):     0.000              Jarque-Bera (JB):    3084.310
    Skew:              2.148              Prob(JB):            0.00
    Kurtosis:          14.307             Cond. No.            8.13e+03

    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 8.13e+03. This might indicate that there are
    strong multicollinearity or other numerical problems.
    </pre>

TPOT
~~~~

`TPOT `__ is an automated machine learning module.

.. code::

    try:
        from tpot import TPOTRegressor
    except ImportError:
        # for sklearn 0.22
        import sklearn.preprocessing
        from sklearn.impute import SimpleImputer
        sklearn.preprocessing.Imputer = SimpleImputer
        from tpot import TPOTRegressor

    tpot = TPOTRegressor(generations=2, population_size=50, verbosity=2)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))
    tpot.export('tpot_boston_pipeline.py')

.. parsed-literal::

    Generation 1 - Current best internal CV score: -10.19955600172555
    Generation 2 - Current best internal CV score: -10.19955600172555

    Best pipeline: XGBRegressor(StandardScaler(input_matrix), learning_rate=0.1, max_depth=9, min_child_weight=2, n_estimators=100, nthread=1, subsample=0.8)
    -9.01100125963658

The module optimizes the hyperparameters, sometimes a little too much judging by the weaker performance on the test set: the two scores below show an R² of 0.999 on the training set against 0.898 on the test set.

.. code::

    r2_score(y_train, tpot.predict(X_train))

.. parsed-literal::

    0.9994364901579464

.. code::

    r2_score(y_test, tpot.predict(X_test))

.. parsed-literal::

    0.8978805496355459

Feature importance for a single observation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We reuse the first random forest together with the module `treeinterpreter `__.

.. code::

    clr = RandomForestRegressor()
    clr.fit(X, y)

.. parsed-literal::

    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                          max_features='auto', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_jobs=None, oob_score=False, random_state=None,
                          verbose=0, warm_start=False)

.. code::

    from treeinterpreter import treeinterpreter as ti
    prediction, bias, contributions = ti.predict(clr, X_test)

.. code::

    # the contributions were computed on X_test, so iterate over its rows
    for i in range(min(2, X_test.shape[0])):
        print("Instance", i)
        print("Bias (trainset mean)", bias[i])
        print("Feature contributions:")
        # largest absolute contribution first
        for c, feature in sorted(zip(contributions[i], X.columns),
                                 key=lambda x: -abs(x[0])):
            print(feature, round(c, 2))
        print("-" * 20)

.. parsed-literal::

    Instance 0
    Bias (trainset mean) 22.527195652173905
    Feature contributions:
    RM 7.53
    LSTAT 1.58
    TAX -0.64
    PTRATIO 0.6
    DIS -0.4
    INDUS 0.19
    RAD -0.18
    AGE 0.13
    ZN 0.12
    CRIM 0.07
    NOX 0.06
    B -0.01
    CHAS -0.01
    --------------------
    Instance 1
    Bias (trainset mean) 22.527195652173905
    Feature contributions:
    LSTAT -5.04
    RM -1.64
    B -0.9
    NOX -0.89
    CRIM -0.81
    DIS 0.64
    AGE 0.17
    PTRATIO -0.07
    TAX -0.04
    CHAS -0.02
    INDUS -0.0
    ZN 0.0
    RAD 0.0
    --------------------

Problem 2: time series
----------------------

We take a series from `Google Trends `__, here the query *tennis live*, and compare a linear approach with a nonlinear one.

Linear approach
~~~~~~~~~~~~~~~

.. code::

    import pandas
    df = pandas.read_csv("data/multiTimeline.csv", skiprows=1)

.. code::

    df.columns= ["Semaine", "compte"]

.. code::

    df["SemaineDt"] = pandas.to_datetime(df.Semaine)

.. code::

    df=df.set_index("SemaineDt")

.. code::

    df["compte"] = df["compte"].astype(float)

.. code::

    df.head()

.. raw:: html

    <pre>
                   Semaine  compte
    2012-07-01  2012-07-01    70.0
    2012-07-08  2012-07-08    49.0
    2012-07-15  2012-07-15    18.0
    2012-07-22  2012-07-22    22.0
    2012-07-29  2012-07-29    88.0
    </pre>

.. code::

    %matplotlib inline

.. code::

    df.plot()

.. image:: 2017_session6_40_1.png

.. code::

    # Legacy import, valid for the statsmodels release used when this notebook
    # was run; recent releases expose ARIMA in statsmodels.tsa.arima.model.
    from statsmodels.tsa.arima_model import ARIMA
    # order=(p, d, q): 6 autoregressive lags, 1 differencing, 1 moving-average term
    arma_mod = ARIMA(df["compte"].values, order=(6, 1, 1))
    res = arma_mod.fit()
    res.params

.. parsed-literal::

    array([ 0.00418581,  0.59035757, -0.32540695,  0.23286807, -0.03300838,
            0.06434307, -0.07204017, -0.99999983])

.. code::

    res.summary()

.. raw:: html

    <pre>
    ARIMA Model Results

    Dep. Variable:  D.y                No. Observations:    259
    Model:          ARIMA(6, 1, 1)     Log Likelihood       -1055.581
    Method:         css-mle            S.D. of innovations  14.116
    Date:           Fri, 07 Jun 2019   AIC                  2129.161
    Time:           16:45:54           BIC                  2161.173
    Sample:         1                  HQIC                 2142.032

                    coef   std err         z     P&gt;|z|    [0.025    0.975]
    const         0.0042     0.021     0.196     0.845    -0.038     0.046
    ar.L1.D.y     0.5904     0.063     9.431     0.000     0.468     0.713
    ar.L2.D.y    -0.3254     0.072    -4.507     0.000    -0.467    -0.184
    ar.L3.D.y     0.2329     0.075     3.097     0.002     0.085     0.380
    ar.L4.D.y    -0.0330     0.076    -0.433     0.665    -0.182     0.116
    ar.L5.D.y     0.0643     0.076     0.842     0.400    -0.085     0.214
    ar.L6.D.y    -0.0720     0.066    -1.096     0.274    -0.201     0.057
    ma.L1.D.y    -1.0000     0.010   -96.075     0.000    -1.020    -0.980

                Real    Imaginary      Modulus    Frequency
    AR.1     -1.2011     -1.2144j       1.7080      -0.3741
    AR.2     -1.2011     +1.2144j       1.7080       0.3741
    AR.3      0.1840     -1.4018j       1.4138      -0.2292
    AR.4      0.1840     +1.4018j       1.4138       0.2292
    AR.5      1.4636     -0.4882j       1.5429      -0.0512
    AR.6      1.4636     +0.4882j       1.5429       0.0512
    MA.1      1.0000     +0.0000j       1.0000       0.0000
    </pre>

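
The last block of the summary lists the roots of the AR and MA polynomials. As a minimal sketch (plain Python, values copied by hand from the table, ``describe`` being a hypothetical helper and not part of statsmodels), the modulus and frequency columns can be recomputed from the real and imaginary parts. The AR part is stationary when every AR root lies outside the unit circle, and an MA root of modulus 1 sits on the invertibility boundary, which may be a sign that the series was over-differenced.

```python
import cmath
import math

# Roots copied from the "Real" and "Imaginary" columns of the summary.
ar_roots = [
    complex(-1.2011, -1.2144), complex(-1.2011, 1.2144),
    complex(0.1840, -1.4018), complex(0.1840, 1.4018),
    complex(1.4636, -0.4882), complex(1.4636, 0.4882),
]
ma_roots = [complex(1.0, 0.0)]


def describe(root):
    """Return (modulus, frequency in cycles per period) of a complex root."""
    modulus = abs(root)
    frequency = cmath.phase(root) / (2 * math.pi)
    return modulus, frequency


for name, roots in (("AR", ar_roots), ("MA", ma_roots)):
    for i, root in enumerate(roots, start=1):
        modulus, frequency = describe(root)
        print(f"{name}.{i} modulus={modulus:.4f} frequency={frequency:.4f}")

# Stationarity of the AR part: every root strictly outside the unit circle.
ar_stationary = all(abs(r) > 1 for r in ar_roots)
# An MA root of modulus 1 lies on the invertibility boundary.
ma_on_unit_circle = any(math.isclose(abs(r), 1.0, abs_tol=1e-4) for r in ma_roots)
print("AR stationary:", ar_stationary)
print("MA root on the unit circle:", ma_on_unit_circle)
```

Here all six AR roots have a modulus above 1, while the MA root matches the estimated coefficient ``ma.L1.D.y`` of -1.0000.
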
Nonlinear method
~~~~~~~~~~~~~~~~

We build the matrix of lagged series. Once the lags are ordinary feature columns, any regressor can be trained on them, which takes us out of the linear framework and makes it possible to add other variables.

.. code::

    from statsmodels.tsa.tsatools import lagmat
    lag = 8
    # column c-1 of X holds the series shifted by c weeks
    X = lagmat(df["compte"], lag)
    lagged = df.copy()
    for c in range(1,lag+1):
        lagged["lag%d" % c] = X[:, c-1]
    pandas.concat([lagged.head(), lagged.tail()])

.. raw:: html

    <pre>
                   Semaine  compte  lag1  lag2  lag3  lag4  lag5  lag6  lag7  lag8
    2012-07-01  2012-07-01    70.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
    2012-07-08  2012-07-08    49.0  70.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
    2012-07-15  2012-07-15    18.0  49.0  70.0   0.0   0.0   0.0   0.0   0.0   0.0
    2012-07-22  2012-07-22    22.0  18.0  49.0  70.0   0.0   0.0   0.0   0.0   0.0
    2012-07-29  2012-07-29    88.0  22.0  18.0  49.0  70.0   0.0   0.0   0.0   0.0
    2017-05-21  2017-05-21    23.0  40.0  35.0  21.0  29.0  27.0  14.0  23.0  40.0
    2017-05-28  2017-05-28    44.0  23.0  40.0  35.0  21.0  29.0  27.0  14.0  23.0
    2017-06-04  2017-06-04    55.0  44.0  23.0  40.0  35.0  21.0  29.0  27.0  14.0
    2017-06-11  2017-06-11    28.0  55.0  44.0  23.0  40.0  35.0  21.0  29.0  27.0
    2017-06-18  2017-06-18    28.0  28.0  55.0  44.0  23.0  40.0  35.0  21.0  29.0
    </pre>

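
To make the layout of the table above explicit, here is a minimal pure-Python sketch of a lag matrix. ``lag_matrix`` is a hypothetical helper written for illustration, not the statsmodels implementation; it only assumes the behaviour visible in the table, where column ``lagk`` holds the value observed ``k`` rows earlier and missing history is filled with zeros.

```python
def lag_matrix(values, lag, fill=0.0):
    """Return one row per observation: [x[t-1], x[t-2], ..., x[t-lag]].

    Positions that would fall before the start of the series get `fill`.
    """
    rows = []
    for t in range(len(values)):
        row = [values[t - k] if t - k >= 0 else fill for k in range(1, lag + 1)]
        rows.append(row)
    return rows


# First five values of the "compte" column shown above.
series = [70.0, 49.0, 18.0, 22.0, 88.0]
for value, row in zip(series, lag_matrix(series, lag=3)):
    print(value, row)
```

The first rows contain padding rather than real history, which is why such rows are usually discarded before training a model.
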
.. code::

    xc = ["lag%d" % i for i in range(1,lag+1)]
    split = 0.66
    isplit = int(len(lagged) * split)
    # skip the first 10 rows, whose lags are padded with zeros
    xt = lagged[10:][xc]
    yt = lagged[10:]["compte"]
    # chronological split: the past for training, the most recent weeks for testing
    X_train, y_train, X_test, y_test = xt[:isplit], yt[:isplit], xt[isplit:], yt[isplit:]

.. code::

    from sklearn.ensemble import RandomForestRegressor
    clr = RandomForestRegressor()
    clr.fit(X_train, y_train)

.. parsed-literal::

    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                          max_features='auto', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_jobs=None, oob_score=False, random_state=None,
                          verbose=0, warm_start=False)

.. code::

    from sklearn.metrics import r2_score
    r2 = r2_score(y_test.values, clr.predict(X_test))
    r2

.. parsed-literal::

    0.4878325848907805

.. code::

    plt.scatter(y_test.values, clr.predict(X_test));

.. image:: 2017_session6_48_2.png

Text
----

We want to compare an `LDA `__ with `word2vec `__ and `kmeans `__ on the data available at `ensae_teaching_cs/src/ensae_teaching_cs/data/data_web/ `__.

.. code::

    from ensae_teaching_cs.data import twitter_zip
    df = twitter_zip(as_df=True)
    df.head(n=2).T

.. raw:: html

    <pre>
                            0                                                   1
    index                   776066992054861825                                  776067660979245056
    nb_user_mentions        0                                                   0
    nb_extended_entities    0                                                   0
    nb_hashtags             1                                                   1
    geo                     NaN                                                 NaN
    text_hashtags           , SiJétaisPrésident                                 , SiJétaisPrésident
    annee                   2016                                                2016
    delimit_mention         NaN                                                 NaN
    lang                    fr                                                  fr
    id_str                  7.76067e+17                                         7.76068e+17
    text_mention            NaN                                                 NaN
    retweet_count           4                                                   5
    favorite_count          3                                                   8
    type_extended_entities  []                                                  []
    text                    #SiJétaisPrésident se serait la fin du monde.....   #SiJétaisPrésident je donnerai plus de vacance...
    nb_user_photos          0                                                   0
    nb_urls                 0                                                   0
    nb_symbols              0                                                   0
    created_at              Wed Sep 14 14:36:04 +0000 2016                      Wed Sep 14 14:38:43 +0000 2016
    delimit_hash            , 0, 18                                             , 0, 18
    </pre>

From words to coordinates - tf-idf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code::

    keep = df.text.dropna().index

.. code::

    # keep holds index labels, so select by label rather than by position
    dfnonan = df.loc[keep, :]
    dfnonan.shape

.. parsed-literal::

    (5087, 20)

.. code::

    from sklearn.feature_extraction.text import TfidfVectorizer
    # ignore words present in more than 95% of the tweets or in fewer than 2 tweets
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000)
    tfidf = tfidf_vectorizer.fit_transform(dfnonan["text"])

.. code::

    tfidf[:2, :]

.. parsed-literal::

    <2x1000 sparse matrix of type ''
        with 21 stored elements in Compressed Sparse Row format>

.. code::

    tfidf[:2, :].todense()

.. parsed-literal::

    matrix([[0., 0., 0., ..., 0., 0., 0.],
            [0., 0., 0., ..., 0., 0., 0.]])

LDA
~~~

.. code::

    from sklearn.decomposition import LatentDirichletAllocation
    lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)

.. code::

    lda.fit(tfidf)

.. parsed-literal::

    LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                              evaluate_every=-1, learning_decay=0.7,
                              learning_method='online', learning_offset=50.0,
                              max_doc_update_iter=100, max_iter=5,
                              mean_change_tol=0.001, n_components=10, n_jobs=None,
                              perp_tol=0.1, random_state=0, topic_word_prior=None,
                              total_samples=1000000.0, verbose=0)

.. code::

    # get_feature_names() matches the scikit-learn release used for this run;
    # releases from 1.2 onwards only provide get_feature_names_out()
    tf_feature_names = tfidf_vectorizer.get_feature_names()
    tf_feature_names[100:103]

.. parsed-literal::

    ['avoir', 'bac', 'bah']

.. code::

    lda.components_.shape

.. parsed-literal::

    (10, 1000)

We obtain ten vectors, one per topic (cluster). Each of the 1000 dimensions corresponds to one word of the vocabulary, and its value indicates how strongly that word belongs to the topic.

.. code::

    def print_top_words(model, feature_names, n_top_words):
        for topic_idx, topic in enumerate(model.components_):
            print("Topic #%d:" % topic_idx)
            # indices of the n_top_words largest weights, in decreasing order
            print(" ".join([feature_names[i]
                            for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

.. code::

    print_top_words(lda, tf_feature_names, 10)

.. parsed-literal::

    Topic #0:
    gratuit mcdo supprimerai école soir kebab macdo kfc domicile cc
    Topic #1:
    macron co https de la est le il et hollande
    Topic #2:
    sijetaispresident je les de la et le des en pour
    Topic #3:
    notaires eu organiserais mets carte nouveaux journées installation cache créer
    Topic #4:
    sijetaispresident interdirais les je ballerines la serait serais bah de
    Topic #5:
    ministre de sijetaispresident la je premier mort et nommerais président
    Topic #6:
    cours le supprimerais jour sijetaispresident lundi samedi semaine je vendredi
    Topic #7:
    port interdirait démissionnerais promesses heure rendrai ballerine mes changement christineboutin
    Topic #8:
    seraient sijetaispresident gratuits aux les nos putain éducation nationale bonne
    Topic #9:
    bordel seront légaliserai putes gratuites pizza mot virerais vitesse dutreil

Clustering
~~~~~~~~~~

.. code::

    from sklearn.cluster import KMeans
    km = KMeans(n_clusters=10)
    km.fit(tfidf)

.. parsed-literal::

    KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
           n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
           random_state=None, tol=0.0001, verbose=0)

.. code::

    km.cluster_centers_.shape

.. parsed-literal::

    (10, 1000)

.. code::

    # same display as for the LDA, applied to the k-means cluster centers
    def print_top_words(model, feature_names, n_top_words):
        for topic_idx, topic in enumerate(model.cluster_centers_):
            print("Topic #%d:" % topic_idx)
            print(" ".join([feature_names[i]
                            for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

.. code::

    print_top_words(km, tf_feature_names, 10)

.. parsed-literal::

    Topic #0:
    ferais je sijetaispresident les en sorte que de et un
    Topic #1:
    le sijetaispresident je monde et pour de tout des les
    Topic #2:
    de sijetaispresident la en et des une un plus aurait
    Topic #3:
    https co macron de la le les via et sijetaispresident
    Topic #4:
    les sijetaispresident je de et tous des seraient pour ballerines
    Topic #5:
    serait sijetaispresident la le merde de ça et on dans
    Topic #6:
    je sijetaispresident la des me et ferai pas en au
    Topic #7:
    macron est il de pas la hollande le gauche que
    Topic #8:
    ministre premier nommerais sijetaispresident de je la mickey serait nommerai
    Topic #9:
    serais je sijetaispresident un pas ne président que la de

Ah, the stop words... Most of the top words are function words such as *je*, *de*, *la* and *le*, which say little about the content of a cluster.

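
One possible way to make the topics easier to read is to skip stop words when listing the top words. The sketch below is plain Python on toy data: ``top_words``, the short ``french_stop_words`` list and the weights are all hypothetical and chosen for illustration only. In practice the filtering could also happen earlier, since ``TfidfVectorizer`` accepts a ``stop_words`` parameter.

```python
def top_words(weights, feature_names, n_top_words, stop_words=()):
    """Return the n highest-weighted words, skipping any word in stop_words."""
    stop = set(stop_words)
    # sort (weight, word) pairs by decreasing weight
    ranked = sorted(zip(weights, feature_names), key=lambda pair: -pair[0])
    return [word for _, word in ranked if word not in stop][:n_top_words]


# Hypothetical, deliberately short stop list (not a complete French list).
french_stop_words = ["je", "de", "la", "le", "les", "et", "des", "en", "un", "que"]

# Toy vocabulary and toy weights standing in for one row of cluster_centers_.
feature_names = ["ministre", "de", "premier", "je", "nommerais", "la"]
center = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

print(top_words(center, feature_names, 3))
print(top_words(center, feature_names, 3, french_stop_words))
```

Without the stop list the function returns the raw ranking, which reproduces the behaviour of ``print_top_words`` for a single vector.
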