.. _survivalrst: ================= Analyse de survie ================= .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/ml/survival.ipynb|*` .. code:: ipython3 from jyquickhelper import add_notebook_menu add_notebook_menu() .. contents:: :local: Quelques données ---------------- On récupère les données disponibles sur *open.data.gouv.fr* `Données hospitalières relatives à l’épidémie de COVID-19 `__. Ces données ne permettent pas de construire la courbe de `Kaplan-Meier `__. On sait combien de personnes rentrent et sortent chaque jour mais on ne sait pas quand une personne qui sort un 1er avril est entrée. .. code:: ipython3 import numpy.random as rnd import pandas df = pandas.read_csv("https://www.data.gouv.fr/en/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", sep=";") gr = df[["jour", "rad", "dc"]].groupby(["jour"]).sum() diff = gr.diff().reset_index(drop=False) diff.head() .. raw:: html
jour rad dc
0 2020-03-18 NaN NaN
1 2020-03-19 695.0 207.0
2 2020-03-20 806.0 248.0
3 2020-03-21 452.0 151.0
4 2020-03-22 608.0 210.0
.. code:: ipython3 def donnees_artificielles(hosp, mu=14, nu=21): dt = pandas.to_datetime(hosp['jour']) res = [] for i in range(hosp.shape[0]): date = dt[i].dayofyear h = hosp.iloc[i, 1] delay = rnd.exponential(mu, int(h)) for j in range(delay.shape[0]): res.append([date - int(delay[j]), date, 1]) h = hosp.iloc[i, 2] delay = rnd.exponential(nu, int(h)) for j in range(delay.shape[0]): res.append([date - int(delay[j]), date , 0]) return pandas.DataFrame(res, columns=["entree", "sortie", "issue"]) data = donnees_artificielles(diff[1:].reset_index(drop=True)).sort_values('entree') data.head() .. raw:: html
entree sortie issue
518488 -200 19 0
541408 -192 27 0
476735 -187 2 0
587013 -185 42 0
476057 -180 1 0
Chaque ligne est une personne, ``entree`` est le jour d’entrée à l’hôpital, ``sortie`` celui de la sortie, ``issue``, 0 pour décès, 1 pour en vie. .. code:: ipython3 data.describe() .. raw:: html
entree sortie issue
count 624130.000000 624130.000000 624130.000000
mean 169.704510 184.532815 0.806729
std 125.420957 124.343186 0.394864
min -200.000000 1.000000 0.000000
25% 53.000000 84.000000 1.000000
50% 133.000000 144.000000 1.000000
75% 301.000000 315.000000 1.000000
max 366.000000 366.000000 1.000000
Il y a environ 80% de survie dans ces données. .. code:: ipython3 import numpy duree = data.sortie - data.entree deces = (data.issue == 0).astype(numpy.int32) .. code:: ipython3 import numpy import matplotlib.pyplot as plt from lifelines import KaplanMeierFitter fig, ax = plt.subplots(1, 1, figsize=(10, 4)) kmf = KaplanMeierFitter() kmf.fit(duree, deces) kmf.plot(ax=ax) ax.legend(); .. image:: survival_9_0.png Régression de Cox ----------------- On reprend les données artificiellement générées et on ajoute une variable identique à la durée plus un bruit mais quasi nul .. code:: ipython3 import pandas data_simple = pandas.DataFrame({'duree': duree, 'deces': deces, 'X1': duree * 0.57 * deces + numpy.random.randn(duree.shape[0]), 'X2': duree * (-0.57) * deces + numpy.random.randn(duree.shape[0])}) data_simple.head() .. raw:: html
duree deces X1 X2
518488 219 1 125.653230 -125.666662
541408 219 1 126.006024 -125.327549
476735 189 1 107.920779 -108.358230
587013 227 1 129.788930 -130.045019
476057 181 1 103.642440 -103.793008
.. code:: ipython3 from sklearn.model_selection import train_test_split data_train, data_test = train_test_split(data_simple, test_size=0.8) .. code:: ipython3 from lifelines.fitters.coxph_fitter import CoxPHFitter cox = CoxPHFitter() cox.fit(data_train[['duree', 'deces', 'X1']], duration_col="duree", event_col="deces", show_progress=True) .. parsed-literal:: Iteration 1: norm_delta = 0.13943, step_size = 0.9000, log_lik = -250658.36250, newton_decrement = 889.93933, seconds_since_start = 0.0 Iteration 2: norm_delta = 0.00660, step_size = 0.9000, log_lik = -249862.37270, newton_decrement = 2.81312, seconds_since_start = 0.0 Iteration 3: norm_delta = 0.00073, step_size = 0.9000, log_lik = -249859.57376, newton_decrement = 0.03357, seconds_since_start = 0.1 Iteration 4: norm_delta = 0.00000, step_size = 1.0000, log_lik = -249859.54017, newton_decrement = 0.00000, seconds_since_start = 0.1 Convergence success after 4 iterations. .. parsed-literal:: .. code:: ipython3 cox.print_summary() .. raw:: html
model lifelines.CoxPHFitter
duration col 'duree'
event col 'deces'
baseline estimation breslow
number of observations 124826
number of events observed 24072
partial log-likelihood -249859.54
time fit was run 2021-02-24 23:48:57 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
X1 0.02 1.02 0.00 0.02 0.02 1.02 1.02 42.23 <0.005 inf

Concordance 0.69
Partial AIC 499721.08
log-likelihood ratio test 1597.64 on 1 df
-log2(p) of ll-ratio test inf
.. code:: ipython3 cox2 = CoxPHFitter() cox2.fit(data_train[['duree', 'deces', 'X2']], duration_col="duree", event_col="deces", show_progress=True) cox2.print_summary() .. parsed-literal:: Iteration 1: norm_delta = 0.13946, step_size = 0.9000, log_lik = -250658.36250, newton_decrement = 888.92089, seconds_since_start = 0.0 Iteration 2: norm_delta = 0.00667, step_size = 0.9000, log_lik = -249863.61089, newton_decrement = 2.86434, seconds_since_start = 0.0 Iteration 3: norm_delta = 0.00074, step_size = 0.9000, log_lik = -249860.76079, newton_decrement = 0.03426, seconds_since_start = 0.1 Iteration 4: norm_delta = 0.00000, step_size = 1.0000, log_lik = -249860.72650, newton_decrement = 0.00000, seconds_since_start = 0.1 Convergence success after 4 iterations. .. raw:: html
model lifelines.CoxPHFitter
duration col 'duree'
event col 'deces'
baseline estimation breslow
number of observations 124826
number of events observed 24072
partial log-likelihood -249860.73
time fit was run 2021-02-24 23:48:59 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
X2 -0.02 0.98 0.00 -0.02 -0.02 0.98 0.98 -42.21 <0.005 inf

Concordance 0.69
Partial AIC 499723.45
log-likelihood ratio test 1595.27 on 1 df
-log2(p) of ll-ratio test inf
.. code:: ipython3 cox.predict_cumulative_hazard(data_test[:5]) .. raw:: html
621725 110352 139986 72623 248121
0.0 0.008806 0.008673 0.008543 0.008717 0.008661
1.0 0.018218 0.017942 0.017673 0.018035 0.017918
2.0 0.026735 0.026330 0.025936 0.026466 0.026295
3.0 0.036002 0.035457 0.034926 0.035640 0.035409
4.0 0.045543 0.044854 0.044182 0.045085 0.044793
... ... ... ... ... ...
184.0 2.055736 2.024623 1.994277 2.035070 2.021872
189.0 2.092002 2.060340 2.029459 2.070971 2.057541
197.0 2.138687 2.106318 2.074747 2.117186 2.103457
201.0 2.205513 2.172133 2.139576 2.183341 2.169182
217.0 2.330629 2.295356 2.260952 2.307199 2.292237

165 rows × 5 columns

.. code:: ipython3 cox.predict_survival_function(data_test[:5]) .. raw:: html
621725 110352 139986 72623 248121
0.0 0.991233 0.991365 0.991494 0.991321 0.991377
1.0 0.981947 0.982218 0.982482 0.982127 0.982242
2.0 0.973619 0.974013 0.974398 0.973881 0.974048
3.0 0.964638 0.965164 0.965677 0.964988 0.965211
4.0 0.955478 0.956137 0.956780 0.955916 0.956196
... ... ... ... ... ...
184.0 0.127999 0.132044 0.136112 0.130671 0.132407
189.0 0.123440 0.127411 0.131407 0.126063 0.127768
197.0 0.117809 0.121685 0.125588 0.120370 0.122034
201.0 0.110194 0.113934 0.117705 0.112664 0.114271
217.0 0.097235 0.100726 0.104251 0.099540 0.101040

165 rows × 5 columns