.. _tspredrst: ============================================= Prédiction, COVID en France, série temporelle ============================================= .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/timeseries/ts_pred.ipynb|*` On s’intéresse aux séries temporelles de l’épidémie du COVID en France récupérées depuis data.gouv.fr : `Chiffres-clés concernant l’épidémie de COVID19 en France `__. .. code:: ipython3 from jyquickhelper import add_notebook_menu add_notebook_menu() .. contents:: :local: .. code:: ipython3 %matplotlib inline Données ------- .. code:: ipython3 from pandas import DataFrame, read_csv, to_datetime df = read_csv("https://www.data.gouv.fr/en/datasets/r/0b66ca39-1623-4d9c-83ad-5434b7f9e2a4") df['date'] = to_datetime(df['date'].apply(lambda s: s.replace("_", "-"))) df.tail() .. parsed-literal:: c:\python387_x64\lib\site-packages\IPython\core\interactiveshell.py:3146: DtypeWarning: Columns (17,18) have mixed types.Specify dtype option on import or set low_memory=False. has_raised = await self.run_ast_nodes(code_ast.body, cell_name, .. raw:: html
date granularite maille_code maille_nom cas_confirmes cas_ehpad cas_confirmes_ehpad cas_possibles_ehpad deces deces_ehpad reanimation hospitalises nouvelles_hospitalisations nouvelles_reanimations gueris depistes source_nom source_url source_archive source_type
43080 2021-02-10 region REG-75 Nouvelle-Aquitaine NaN NaN NaN NaN 2510.0 NaN 213.0 1524.0 78.0 20.0 10227.0 NaN OpenCOVID19-fr NaN NaN opencovid19-fr
43081 2021-02-10 region REG-76 Occitanie NaN NaN NaN NaN 2882.0 NaN 276.0 1886.0 121.0 25.0 12801.0 NaN OpenCOVID19-fr NaN NaN opencovid19-fr
43082 2021-02-10 region REG-84 Auvergne-Rhône-Alpes NaN NaN NaN NaN 8453.0 NaN 400.0 3699.0 213.0 42.0 34579.0 NaN OpenCOVID19-fr NaN NaN opencovid19-fr
43083 2021-02-10 region REG-93 Provence-Alpes-Côte d'Azur NaN NaN NaN NaN 5010.0 NaN 438.0 3581.0 240.0 33.0 24080.0 NaN OpenCOVID19-fr NaN NaN opencovid19-fr
43084 2021-02-10 region REG-94 Corse NaN NaN NaN NaN 133.0 NaN 7.0 58.0 4.0 3.0 587.0 NaN OpenCOVID19-fr NaN NaN opencovid19-fr
.. code:: ipython3 from datetime import datetime gr = df[["date", "cas_confirmes"]].groupby("date").sum().sort_index() gr.loc[datetime(2020, 11, 11), 'cas_confirmes'] = ( gr.loc[datetime(2020, 11, 10), 'cas_confirmes'] + gr.loc[datetime(2020, 11, 12), 'cas_confirmes']) / 2 gr.loc[gr.index >= datetime(2020, 9, 1), 'cas_confirmes'].plot(figsize=(14, 4), grid=True, title="Cas positifs"); .. image:: ts_pred_5_0.png .. code:: ipython3 covsept = gr.loc[gr.index >= datetime(2020, 9, 1), 'cas_confirmes'] covsept.diff().plot(figsize=(14, 4), grid=True, title="Cas positifs quotidiens"); .. image:: ts_pred_6_0.png Prédiction : AR --------------- Autocorrélogramme ~~~~~~~~~~~~~~~~~ .. code:: ipython3 import matplotlib.pyplot as plt from statsmodels.graphics.tsaplots import plot_acf, plot_pacf data = covsept.diff()[1:] fig = plt.figure(figsize=(12,8)) ax1 = fig.add_subplot(211) fig = plot_acf(data, lags=40, ax=ax1) ax2 = fig.add_subplot(212) fig = plot_pacf(data, lags=40, ax=ax2); .. image:: ts_pred_9_0.png ARIMA ~~~~~ .. code:: ipython3 from statsmodels.tsa.arima.model import ARIMA mod = ARIMA(data, order=(7, 1, 1)) res = mod.fit() print(res.summary()) .. parsed-literal:: c:\python387_x64\lib\site-packages\statsmodels\tsa\base\tsa_model.py:524: ValueWarning: No frequency information was provided, so inferred frequency D will be used. warnings.warn('No frequency information was' c:\python387_x64\lib\site-packages\statsmodels\tsa\base\tsa_model.py:524: ValueWarning: No frequency information was provided, so inferred frequency D will be used. warnings.warn('No frequency information was' c:\python387_x64\lib\site-packages\statsmodels\tsa\base\tsa_model.py:524: ValueWarning: No frequency information was provided, so inferred frequency D will be used. warnings.warn('No frequency information was' .. parsed-literal:: SARIMAX Results ============================================================================== Dep. Variable: cas_confirmes No. Observations: 162 Model: ARIMA(7, 1, 1) Log Likelihood -1665.458 Date: Thu, 11 Feb 2021 AIC 3348.917 Time: 00:17:40 BIC 3376.649 Sample: 09-02-2020 HQIC 3360.177 - 02-10-2021 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ ar.L1 0.2273 0.162 1.407 0.159 -0.089 0.544 ar.L2 -0.1927 0.077 -2.508 0.012 -0.343 -0.042 ar.L3 -0.0039 0.103 -0.038 0.970 -0.206 0.199 ar.L4 -0.1243 0.102 -1.216 0.224 -0.324 0.076 ar.L5 0.0696 0.078 0.895 0.371 -0.083 0.222 ar.L6 -0.0285 0.076 -0.374 0.708 -0.177 0.121 ar.L7 0.3801 0.056 6.750 0.000 0.270 0.491 ma.L1 -0.6858 0.167 -4.098 0.000 -1.014 -0.358 sigma2 5.996e+07 1.57e-09 3.81e+16 0.000 6e+07 6e+07 =================================================================================== Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 289.45 Prob(Q): 1.00 Prob(JB): 0.00 Heteroskedasticity (H): 1.46 Skew: 0.31 Prob(H) (two-sided): 0.17 Kurtosis: 9.54 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). [2] Covariance matrix is singular or near-singular, with condition number 3.36e+32. Standard errors may be unstable. .. code:: ipython3 import numpy pred = DataFrame({"cas+":data.copy()}) pred['pred'] = numpy.nan pred['pred'] = res.forecasts.ravel() pred.tail() .. raw:: html
cas+ pred
date
2021-02-06 20586.0 20346.037004
2021-02-07 19715.0 19741.530365
2021-02-08 4317.0 14212.567962
2021-02-09 18870.0 14869.262634
2021-02-10 25387.0 23740.270465
.. code:: ipython3 res.forecasts.ravel()[-5:] .. parsed-literal:: array([20346.03700422, 19741.53036487, 14212.56796166, 14869.2626336 , 23740.27046459]) .. code:: ipython3 res.forecast(7) .. parsed-literal:: 2021-02-11 21924.732356 2021-02-12 21223.595082 2021-02-13 18260.072157 2021-02-14 18045.532220 2021-02-15 13187.175698 2021-02-16 17328.572392 2021-02-17 22102.419937 Freq: D, Name: predicted_mean, dtype: float64 .. code:: ipython3 pred.plot(); .. image:: ts_pred_15_0.png Approche machine learning ------------------------- .. code:: ipython3 def lagged_dataframe(data, lag=7): res = data.copy() data = res.iloc[:, 0] for i in range(1, lag): res['l%i' % i] = data.shift(i) return res[lag:].copy() lagged_dataframe(DataFrame({'cas+': data}), 8) .. raw:: html
cas+ l1 l2 l3 l4 l5 l6 l7
date
2020-09-10 9843.0 8577.0 6544.0 4203.0 7071.0 8550.0 8975.0 7157.0
2020-09-11 9406.0 9843.0 8577.0 6544.0 4203.0 7071.0 8550.0 8975.0
2020-09-12 10561.0 9406.0 9843.0 8577.0 6544.0 4203.0 7071.0 8550.0
2020-09-13 7183.0 10561.0 9406.0 9843.0 8577.0 6544.0 4203.0 7071.0
2020-09-14 6158.0 7183.0 10561.0 9406.0 9843.0 8577.0 6544.0 4203.0
... ... ... ... ... ... ... ... ...
2021-02-06 20586.0 22139.0 23448.0 26362.0 23337.0 4347.0 19235.0 24392.0
2021-02-07 19715.0 20586.0 22139.0 23448.0 26362.0 23337.0 4347.0 19235.0
2021-02-08 4317.0 19715.0 20586.0 22139.0 23448.0 26362.0 23337.0 4347.0
2021-02-09 18870.0 4317.0 19715.0 20586.0 22139.0 23448.0 26362.0 23337.0
2021-02-10 25387.0 18870.0 4317.0 19715.0 20586.0 22139.0 23448.0 26362.0

154 rows × 8 columns

.. code:: ipython3 feat = lagged_dataframe(DataFrame({'cas+': data}), 8) X, y = feat.drop('cas+', axis=1), feat['cas+'] X.shape, y.shape .. parsed-literal:: ((154, 7), (154,)) Régression linéaire ~~~~~~~~~~~~~~~~~~~ .. code:: ipython3 from sklearn.linear_model import LinearRegression from sklearn.preprocessing import StandardScaler from sklearn.pipeline import make_pipeline lr = make_pipeline(StandardScaler(), LinearRegression()) lr.fit(X, y) .. parsed-literal:: Pipeline(steps=[('standardscaler', StandardScaler()), ('linearregression', LinearRegression())]) .. code:: ipython3 from sklearn.metrics import r2_score r2_score(y, lr.predict(X)) .. parsed-literal:: 0.639550667481653 .. code:: ipython3 r2_score(y, X.values[:, 0]) .. parsed-literal:: 0.4573617856901514 .. code:: ipython3 fig, ax = plt.subplots(1, 1, figsize=(14, 4)) ax.plot(y.values) ax.plot(lr.predict(X)) ax.set_title("Prédiction avec un modèle linéaire"); .. image:: ts_pred_23_0.png Autres modèles ~~~~~~~~~~~~~~ Overfitting ? .. code:: ipython3 from sklearn.model_selection import TimeSeriesSplit tscv = TimeSeriesSplit(n_splits=5) for train_index, test_index in tscv.split(X, y): X_train, X_test, y_train, y_test = ( X.iloc[train_index, :], X.iloc[test_index, :], y.iloc[train_index], y.iloc[test_index]) print("TRAIN:", X_train.shape, "TEST:", X_test.shape) .. parsed-literal:: TRAIN: (29, 7) TEST: (25, 7) TRAIN: (54, 7) TEST: (25, 7) TRAIN: (79, 7) TEST: (25, 7) TRAIN: (104, 7) TEST: (25, 7) TRAIN: (129, 7) TEST: (25, 7)