Survival analysis in practice#

Some data#

We retrieve the data available on data.gouv.fr: Données hospitalières relatives à l’épidémie de COVID-19 (hospital data related to the COVID-19 epidemic). These data are not enough to build the Kaplan-Meier curve directly: we know how many people are admitted and discharged each day, but we do not know when a person discharged on April 1st was admitted. That is why individual stays are simulated below.

[17]:
import numpy.random as rnd

import pandas

df = pandas.read_csv(
    "https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7",
    sep=";",
)
gr = df[["jour", "rad", "dc"]].groupby(["jour"]).sum()
diff = gr.diff().reset_index(drop=False)
diff.head()
[17]:
jour rad dc
0 2020-03-18 NaN NaN
1 2020-03-19 695.0 207.0
2 2020-03-20 806.0 248.0
3 2020-03-21 452.0 151.0
4 2020-03-22 608.0 210.0
[18]:
def donnees_artificielles(hosp, mu=14, nu=21):
    # Simulates individual stays from the daily counts of discharges (rad)
    # and deaths (dc): each person leaving on a given day is given a length
    # of stay drawn from an exponential distribution.
    # Note: nu is currently unused, delays before death are also drawn with mean mu.
    dt = pandas.to_datetime(hosp["jour"])
    res = []
    for i in range(hosp.shape[0]):
        date = dt[i].dayofyear
        h1 = hosp.iloc[i, 1]  # daily discharges alive (rad)
        h2 = hosp.iloc[i, 2]  # daily deaths (dc)
        if h1 < 0 or h2 < 0:
            # corrections in the raw cumulative counts can produce negative diffs
            continue
        delay1 = rnd.exponential(mu, int(h1))
        for j in range(delay1.shape[0]):
            res.append([date - int(delay1[j]), date, 1])  # issue=1: alive
        delay2 = rnd.exponential(mu, int(h2))
        for j in range(delay2.shape[0]):
            res.append([date - int(delay2[j]), date, 0])  # issue=0: death
    return pandas.DataFrame(res, columns=["entree", "sortie", "issue"])


data = donnees_artificielles(diff[1:].reset_index(drop=True)).sort_values("entree")
data.head()
[18]:
entree sortie issue
1905678 -148 3 1
577877 -147 40 1
1126578 -140 6 1
1140232 -140 11 1
1205621 -131 26 1

Each row is one person: entree is the day of admission to the hospital, sortie the day of discharge, and issue the outcome, 0 for death, 1 for alive.

[19]:
data.describe()
[19]:
entree sortie issue
count 1.993886e+06 1.993886e+06 1.993886e+06
mean 1.481621e+02 1.616597e+02 8.642781e-01
std 1.152239e+02 1.143726e+02 3.424931e-01
min -1.480000e+02 1.000000e+00 0.000000e+00
25% 5.100000e+01 6.400000e+01 1.000000e+00
50% 1.130000e+02 1.250000e+02 1.000000e+00
75% 2.600000e+02 2.750000e+02 1.000000e+00
max 3.660000e+02 3.660000e+02 1.000000e+00

The issue column averages about 0.86, so roughly 86% of the simulated patients survive.
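This figure can be checked directly on the issue column; a minimal sketch:

# proportion of simulated patients discharged alive (issue == 1)
print(f"survival rate: {data['issue'].mean():.1%}")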

[20]:
import numpy

duree = data.sortie - data.entree
deces = (data.issue == 0).astype(numpy.int32)
[21]:
import numpy
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

fig, ax = plt.subplots(1, 1, figsize=(10, 4))
kmf = KaplanMeierFitter()
kmf.fit(duree, deces)
kmf.plot(ax=ax)
ax.legend();
../../_images/notebooks_ml_survival_8_0.png
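Beyond the curve itself, the fitted estimator exposes a few summary quantities; a short sketch assuming the usual lifelines attributes:

# median survival time (inf if the curve never drops below 0.5,
# which happens here since most patients are censored alive)
print(kmf.median_survival_time_)
# estimated probability that death has not occurred before day 30
print(kmf.predict(30))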

Cox regression#

We reuse the artificially generated data and add two variables built from the duration: for the deceased patients, X1 and X2 are proportional to the duration (coefficients +0.57 and -0.57) plus standard Gaussian noise; for the survivors they are pure noise.

[22]:
import pandas

data_simple = pandas.DataFrame(
    {
        "duree": duree,
        "deces": deces,
        "X1": duree * 0.57 * deces + numpy.random.randn(duree.shape[0]),
        "X2": duree * (-0.57) * deces + numpy.random.randn(duree.shape[0]),
    }
)
data_simple.head()
[22]:
duree deces X1 X2
1905678 151 0 0.650961 -1.128843
577877 187 0 -1.956525 0.108041
1126578 146 0 0.026987 -0.130392
1140232 151 0 1.149385 0.280224
1205621 157 0 -0.032398 0.400499
[23]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(data_simple, test_size=0.8)
[24]:
from lifelines.fitters.coxph_fitter import CoxPHFitter

cox = CoxPHFitter()
cox.fit(
    data_train[["duree", "deces", "X1"]],
    duration_col="duree",
    event_col="deces",
    show_progress=True,
)
Iteration 1: norm_delta = 5.01e-01, step_size = 0.9500, log_lik = -647954.57157, newton_decrement = 1.86e+04, seconds_since_start = 2.0
Iteration 2: norm_delta = 1.29e-01, step_size = 0.9500, log_lik = -668348.17665, newton_decrement = 2.19e+04, seconds_since_start = 4.2
Iteration 3: norm_delta = 8.30e-02, step_size = 0.9500, log_lik = -642917.75741, newton_decrement = 4.33e+03, seconds_since_start = 6.2
Iteration 4: norm_delta = 3.36e-02, step_size = 1.0000, log_lik = -637879.13496, newton_decrement = 4.09e+02, seconds_since_start = 8.5
Iteration 5: norm_delta = 3.94e-03, step_size = 1.0000, log_lik = -637443.76633, newton_decrement = 4.61e+00, seconds_since_start = 10.7
Iteration 6: norm_delta = 4.65e-05, step_size = 1.0000, log_lik = -637439.12353, newton_decrement = 6.25e-04, seconds_since_start = 13.0
Iteration 7: norm_delta = 6.33e-09, step_size = 1.0000, log_lik = -637439.12291, newton_decrement = 1.16e-11, seconds_since_start = 15.1
Convergence success after 7 iterations.
[24]:
<lifelines.CoxPHFitter: fitted with 398777 total observations, 344439 right-censored observations>
[25]:
cox.print_summary()
model lifelines.CoxPHFitter
duration col 'duree'
event col 'deces'
baseline estimation breslow
number of observations 398777
number of events observed 54338
partial log-likelihood -637439.12
time fit was run 2024-10-07 10:42:15 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% cmp to z p -log2(p)
X1 0.06 1.06 0.00 0.06 0.06 1.06 1.06 0.00 176.66 <0.005 inf

Concordance 0.75
Partial AIC 1274880.25
log-likelihood ratio test 21030.90 on 1 df
-log2(p) of ll-ratio test inf
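The coefficient is easier to read as a hazard ratio: exp(coef) ≈ 1.06 means that each additional unit of X1 multiplies the instantaneous risk of death by about 1.06. A minimal sketch to retrieve these values from the fitted model, assuming the standard lifelines attributes:

# estimated coefficients and the corresponding hazard ratios exp(coef)
print(cox.params_)
print(cox.hazard_ratios_)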
[26]:
cox2 = CoxPHFitter()
cox2.fit(
    data_train[["duree", "deces", "X2"]],
    duration_col="duree",
    event_col="deces",
    show_progress=True,
)
cox2.print_summary()
Iteration 1: norm_delta = 5.01e-01, step_size = 0.9500, log_lik = -647954.57157, newton_decrement = 1.86e+04, seconds_since_start = 2.4
Iteration 2: norm_delta = 1.31e-01, step_size = 0.9500, log_lik = -668036.09368, newton_decrement = 2.18e+04, seconds_since_start = 5.0
Iteration 3: norm_delta = 8.29e-02, step_size = 0.9500, log_lik = -642745.26291, newton_decrement = 4.23e+03, seconds_since_start = 7.2
Iteration 4: norm_delta = 3.27e-02, step_size = 1.0000, log_lik = -637838.96866, newton_decrement = 3.84e+02, seconds_since_start = 9.4
Iteration 5: norm_delta = 3.70e-03, step_size = 1.0000, log_lik = -637430.64477, newton_decrement = 4.03e+00, seconds_since_start = 11.5
Iteration 6: norm_delta = 4.05e-05, step_size = 1.0000, log_lik = -637426.59011, newton_decrement = 4.72e-04, seconds_since_start = 13.6
Iteration 7: norm_delta = 4.77e-09, step_size = 1.0000, log_lik = -637426.58963, newton_decrement = 6.55e-12, seconds_since_start = 15.7
Convergence success after 7 iterations.
model lifelines.CoxPHFitter
duration col 'duree'
event col 'deces'
baseline estimation breslow
number of observations 398777
number of events observed 54338
partial log-likelihood -637426.59
time fit was run 2024-10-07 10:42:35 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% cmp to z p -log2(p)
X2 -0.06 0.94 0.00 -0.06 -0.06 0.94 0.95 0.00 -176.68 <0.005 inf

Concordance 0.75
Partial AIC 1274855.18
log-likelihood ratio test 21055.96 on 1 df
-log2(p) of ll-ratio test inf
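The two models can also be compared on the held-out data; a sketch assuming CoxPHFitter.score with the concordance index (0.5 = random ordering, 1 = perfect ranking of the durations):

# concordance index measured on the test set
c1 = cox.score(data_test[["duree", "deces", "X1"]], scoring_method="concordance_index")
c2 = cox2.score(data_test[["duree", "deces", "X2"]], scoring_method="concordance_index")
print(c1, c2)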
[27]:
cox.predict_cumulative_hazard(data_test[:5])
[27]:
1369970 834048 1217055 1119706 1444869
0.0 0.008909 0.008111 0.020118 0.008601 0.008080
1.0 0.017592 0.016017 0.039727 0.016985 0.015956
2.0 0.026273 0.023921 0.059330 0.025366 0.023830
3.0 0.035119 0.031975 0.079308 0.033908 0.031854
4.0 0.043795 0.039875 0.098901 0.042284 0.039724
... ... ... ... ... ...
156.0 0.532402 0.484742 1.202293 0.514034 0.482903
158.0 0.538375 0.490181 1.215783 0.519801 0.488322
163.0 0.538375 0.490181 1.215783 0.519801 0.488322
170.0 0.538375 0.490181 1.215783 0.519801 0.488322
186.0 0.538375 0.490181 1.215783 0.519801 0.488322

151 rows × 5 columns

[28]:
cox.predict_survival_function(data_test[:5])
[28]:
1369970 834048 1217055 1119706 1444869
0.0 0.991131 0.991922 0.980083 0.991435 0.991952
1.0 0.982562 0.984111 0.961052 0.983159 0.984170
2.0 0.974069 0.976363 0.942396 0.974953 0.976452
3.0 0.965490 0.968530 0.923755 0.966661 0.968648
4.0 0.957150 0.960910 0.905833 0.958597 0.961055
... ... ... ... ... ...
156.0 0.587193 0.615856 0.300504 0.598078 0.616989
158.0 0.583696 0.612516 0.296478 0.594639 0.613656
163.0 0.583696 0.612516 0.296478 0.594639 0.613656
170.0 0.583696 0.612516 0.296478 0.594639 0.613656
186.0 0.583696 0.612516 0.296478 0.594639 0.613656

151 rows × 5 columns
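Both predictions are linked by S(t) = exp(-H(t)), where H is the cumulative hazard; a quick sanity check:

# the survival function should match exp(-cumulative hazard)
surv = numpy.exp(-cox.predict_cumulative_hazard(data_test[:5]))
diff_max = (surv - cox.predict_survival_function(data_test[:5])).abs().max()
print(diff_max)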
