Chapter 1 Monday: Introduction to Survival Analyses and simulation of data

1.1 Key (operative) concepts

  1. Time has asymmetric density and can be censored:
  • not possible to summarize it by the means
  • cannot be normal distributed
  • use exponential family
  1. Plot the log-plot to check the distribution assumptions

  2. Censoring can be:
  • Right: event not (yet) occurred at f-up
    • Fixed (identical f-up for anyone)
    • Sequential (\(min(T_i, C_i)\))
    • Random
  • Left: the event has occurred before the observed period (all population but not all information, e.g. menarche date)
  • Interval: the event can be occurred between two times (but don’t know when)
  • Left truncated: starting point is after the beginning (different from Left, all the information but not complete population)
  1. Models:
  • statistical: non-informative censoring (Kaplan-Meier, Cox model, …)
  • probabilistic: independent censoring (life tables)
  • parametric (survival::survreg(), need to define the distribution) VS non-parametric (survival::survfit() or rms::npsurv(), no need to define distribution)

1.2 Simulated Data

  1. Simulate a sample of \(n = 100\) or \(1000\) exponential survival times, w/ mean \(\theta = 5\).
  • Non censored
set.seed(171002)
n              <- c(thousand = 1000)                                   # samples
t              <- rexp(n, rate = 5)                   # random exponential times
status_no_cens <- rep(1, times = n)         # no censored data --> all are cases
  • Uniform censoring over \([0, a]\), w/ \(a = 1, a = 0.5\) or \(a = 2\)
a              <- c(cens_05 = 0.5)   # upper bound of the uniform censoring dist
cens           <- runif(n, min = 0, max = a)                    # censored times
t_cens         <- pmin(t, cens)    # censored times are earlier than event times
status_cens    <- status_no_cens - (t_cens == cens)      # remove censored cases
  1. Plot the observed survival times
  • Non censored and censored
# NOTE: for the plots to be comparable, xlim and ylim have to be the same range
#       for both the plots. Moreover to drow well adjusted plots, they were set
#       a posteriori.
hist(t,
  main = 'Hystogram of uncensored times',
  col  = 'green',
  xlim = c(0, 1.5),
  ylim = c(0, 400),
  labels = TRUE                        # add the labels over the top of the bars
)
hist(t_cens,
  main = 'Hystogram of censored times (a = 0.5)',
  col  = 'red',
  xlim = c(0, 1.5),
  ylim = c(0, 400),
  labels = TRUE
)

  1. Parametric estimation of survival function
  • Uncensored
# `?survreg` := "Regression for a Parametric Survival Model"
# 
# R formula: y ~ x <--> math formula: y = f(x)
# 
# Here we want to model the response (labelled time) as they are, w/out any
# furter investigation on the effect on them from some other variable
survreg(Surv(t, status_no_cens) ~ 1,
  dist = 'exponential'
) %>%
  summary     # here `summary()` add some more statistics to the standard output
## 
## Call:
## survreg(formula = Surv(t, status_no_cens) ~ 1, dist = "exponential")
##             Value Std. Error     z p
## (Intercept) -1.58     0.0316 -50.1 0
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= 584.3   Loglik(intercept only)= 584.3
## Number of Newton-Raphson Iterations: 4 
## n= 1000
  • Censored
survreg(Surv(t_cens, status_cens) ~ 1,
  dist = 'exponential'
) %>%
  summary
## 
## Call:
## survreg(formula = Surv(t_cens, status_cens) ~ 1, dist = "exponential")
##             Value Std. Error     z p
## (Intercept) -1.57     0.0401 -39.2 0
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= 355.9   Loglik(intercept only)= 355.9
## Number of Newton-Raphson Iterations: 4 
## n= 1000
  1. Non parametric estimation of survival and the distribution functions
  • Uncensored
# `?survfit` := "Create survival curves"
survfit(Surv(t, status_no_cens) ~ 1)
## Call: survfit(formula = Surv(t, status_no_cens) ~ 1)
## 
##        n   events   median  0.95LCL  0.95UCL 
## 1000.000 1000.000    0.140    0.128    0.158
# Here we would like to compare to approach to survival plots:
# 1. Using the packege _survival_, so the standard one
# 2. Uisng the package _rms_, a comprehensive package for regression analyses

# Using survival `plot` provided by the _survival_ package
# (`?survival:::plot.survfit`), we can continue to
# use the `survfit()` function for nonparametric survival estimation from the
# same _survival_ package
survfit(Surv(t, status_no_cens) ~ 1) %>% 
  plot(
    xlim     = c(0, 1.55),
    conf.int  = TRUE,
    mark.time = TRUE,
    col       = 'green',
    main      = 'Uncensored --- survival'
)


# Using the survplot from the _rms_ package (`survplot`), we have to switch to
# the `npsurv()` function for nonparametric survival estimation from the _rms_
# package
npsurv(Surv(t, status_no_cens) ~ 1) %>% 
  survplot(
    xlim     = c(0, 1.5),
    conf.int = TRUE,
    n.risk   = TRUE,
    col      = 'green'
)
title(main = 'Uncensored --- rms') # unfortunally survplot do not have an
                                   # integrated option for the title...

  • censored
survfit(Surv(t_cens, status_cens) ~ 1)
## Call: survfit(formula = Surv(t_cens, status_cens) ~ 1)
## 
##        n   events   median  0.95LCL  0.95UCL 
## 1000.000  623.000    0.141    0.130    0.158
survfit(Surv(t_cens, status_cens) ~ 1) %>%
  plot(
    xlim     = c(0, 0.55),
    conf.int  = TRUE,
    mark.time = TRUE,
    col      = 'red',
    main      = 'Censored (a = 0.5)'
)

npsurv(Surv(t_cens, status_cens) ~ 1) %>% 
  survplot(
    xlim     = c(0, 0.5),
    conf.int = TRUE,
    n.risk   = TRUE,
    col      = 'red'
)
title(main = 'Censored (a = 0.5) --- rms')

1.3 mgus data from survival package

  1. Load and explore data
data(mgus)                                                                # load
head(mgus)                                                       # first 10 rows
##   id age    sex dxyr pcdx pctime futime death alb creat  hgb mspike
## 1  1  78 female   68 <NA>     NA    748     1 2.8   1.2 11.5    2.0
## 2  2  73 female   66   LP   1310   6751     1  NA    NA   NA    1.3
## 3  3  87   male   68 <NA>     NA    277     1 2.2   1.1 11.2    1.3
## 4  4  86   male   69 <NA>     NA   1815     1 2.8   1.3 15.3    1.8
## 5  5  74 female   68 <NA>     NA   2587     1 3.0   0.8  9.8    1.4
## 6  6  81   male   68 <NA>     NA    563     1 2.9   0.9 11.5    1.8
dim(mgus)                                              # number of rows and cols
## [1] 241  12
names(mgus)                                                # name of the columns
##  [1] "id"     "age"    "sex"    "dxyr"   "pcdx"   "pctime" "futime"
##  [8] "death"  "alb"    "creat"  "hgb"    "mspike"
str(mgus)                                   # R internal structure of the object
## 'data.frame':    241 obs. of  12 variables:
##  $ id    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ age   : atomic  78 73 87 86 74 81 72 79 85 58 ...
##   ..- attr(*, "label")= chr "AGE AT date_on"
##  $ sex   : Factor w/ 2 levels "female","male": 1 1 2 2 1 2 1 1 1 2 ...
##   ..- attr(*, "label")= chr "Sex"
##  $ dxyr  : num  68 66 68 69 68 68 68 69 70 65 ...
##  $ pcdx  : Factor w/ 4 levels "AM","LP","MA",..: NA 2 NA NA NA NA NA NA NA NA ...
##  $ pctime: atomic  NA 1310 NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Progression to Group 4 (days)"
##  $ futime: atomic  748 6751 277 1815 2587 ...
##   ..- attr(*, "label")= chr "Follow-Up Time"
##  $ death : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ alb   : atomic  2.8 NA 2.2 2.8 3 2.9 3 3.1 3.2 3.5 ...
##   ..- attr(*, "label")= chr "Serum Albumin"
##  $ creat : atomic  1.2 NA 1.1 1.3 0.8 0.9 0.8 0.8 1 1 ...
##   ..- attr(*, "label")= chr "Serum Creatinine"
##  $ hgb   : atomic  11.5 NA 11.2 15.3 9.8 11.5 13.5 15.5 12.4 14.8 ...
##   ..- attr(*, "label")= chr "Hemoglobin"
##  $ mspike: atomic  2 1.3 1.3 1.8 1.4 1.8 1.3 1.4 1.5 2.2 ...
##   ..- attr(*, "label")= chr "Serum M-Spike"
##  - attr(*, "formats")=List of 1
##   ..$ death:List of 2
##   .. ..$ values: num  0 1
##   .. ..$ labels: chr  "Alive" "Dead"
summary(mgus)                                              # summary from base R
##        id           age            sex           dxyr        pcdx    
##  Min.   :  1   Min.   :34.00   female:104   Min.   :56.0   AM  :  8  
##  1st Qu.: 61   1st Qu.:55.00   male  :137   1st Qu.:66.0   LP  :  5  
##  Median :121   Median :63.00                Median :68.0   MA  :  7  
##  Mean   :121   Mean   :62.87                Mean   :67.4   MM  : 44  
##  3rd Qu.:181   3rd Qu.:72.00                3rd Qu.:70.0   NA's:177  
##  Max.   :241   Max.   :90.00                Max.   :73.0             
##                                                                      
##      pctime          futime          death             alb       
##  Min.   :  365   Min.   :    6   Min.   :0.0000   Min.   :1.800  
##  1st Qu.: 2469   1st Qu.: 2422   1st Qu.:1.0000   1st Qu.:2.900  
##  Median : 3778   Median : 5022   Median :1.0000   Median :3.200  
##  Mean   : 4342   Mean   : 5425   Mean   :0.9336   Mean   :3.204  
##  3rd Qu.: 5750   3rd Qu.: 8264   3rd Qu.:1.0000   3rd Qu.:3.500  
##  Max.   :11685   Max.   :14325   Max.   :1.0000   Max.   :5.100  
##  NA's   :177                                      NA's   :31     
##      creat            hgb            mspike     
##  Min.   :0.600   Min.   : 7.40   Min.   :0.300  
##  1st Qu.:0.900   1st Qu.:12.20   1st Qu.:1.500  
##  Median :1.000   Median :13.20   Median :1.700  
##  Mean   :1.095   Mean   :13.15   Mean   :1.764  
##  3rd Qu.:1.100   3rd Qu.:14.50   3rd Qu.:2.000  
##  Max.   :6.400   Max.   :16.60   Max.   :3.200  
##  NA's   :43      NA's   :1
describe(mgus)   # more comprehensive description from _Hisc_ package, loaded by
## mgus 
## 
##  12  Variables      241  Observations
## ---------------------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      241        0      241        1      121    80.67       13       25 
##      .25      .50      .75      .90      .95 
##       61      121      181      217      229 
## 
## lowest :   1   2   3   4   5, highest: 237 238 239 240 241
## ---------------------------------------------------------------------------
## age : AGE AT date_on 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      241        0       53    0.999    62.87    13.42       44       48 
##      .25      .50      .75      .90      .95 
##       55       63       72       78       81 
## 
## lowest : 34 35 36 37 38, highest: 84 85 86 87 90
## ---------------------------------------------------------------------------
## sex : Sex 
##        n  missing distinct 
##      241        0        2 
##                         
## Value      female   male
## Frequency     104    137
## Proportion  0.432  0.568
## ---------------------------------------------------------------------------
## dxyr 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      241        0       17     0.97     67.4    3.073       61       63 
##      .25      .50      .75      .90      .95 
##       66       68       70       70       70 
##                                                                       
## Value         56    58    59    60    61    62    63    64    65    66
## Frequency      1     1     5     5     2     7     7    10    10    18
## Proportion 0.004 0.004 0.021 0.021 0.008 0.029 0.029 0.041 0.041 0.075
##                                                     
## Value         67    68    69    70    71    72    73
## Frequency     24    40    45    62     2     1     1
## Proportion 0.100 0.166 0.187 0.257 0.008 0.004 0.004
## ---------------------------------------------------------------------------
## pcdx 
##        n  missing distinct 
##       64      177        4 
##                                   
## Value         AM    LP    MA    MM
## Frequency      8     5     7    44
## Proportion 0.125 0.078 0.109 0.688
## ---------------------------------------------------------------------------
## pctime : Progression to Group 4 (days) 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       64      177       63        1     4342     3030     1223     1409 
##      .25      .50      .75      .90      .95 
##     2469     3778     5750     8946    10051 
## 
## lowest :   365   700   954  1218  1249, highest:  9723 10109 10359 11354 11685
## ---------------------------------------------------------------------------
## futime : Follow-Up Time 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      241        0      237        1     5425     4222      283      779 
##      .25      .50      .75      .90      .95 
##     2422     5022     8264    11425    12140 
## 
## lowest :     6     7    31    32    39, highest: 12931 13019 13152 14111 14325
## ---------------------------------------------------------------------------
## death 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      241        0        2    0.186      225   0.9336   0.1245 
## 
## ---------------------------------------------------------------------------
## alb : Serum Albumin 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      210       31       26    0.995    3.204   0.5293      2.3      2.6 
##      .25      .50      .75      .90      .95 
##      2.9      3.2      3.5      3.8      3.9 
## 
## lowest : 1.8 1.9 2.1 2.2 2.3, highest: 4.0 4.1 4.3 4.5 5.1
## ---------------------------------------------------------------------------
## creat : Serum Creatinine 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      198       43       19    0.978    1.095     0.39    0.700    0.800 
##      .25      .50      .75      .90      .95 
##    0.900    1.000    1.100    1.300    1.615 
##                                                                       
## Value        0.6   0.7   0.8   0.9   1.0   1.1   1.2   1.3   1.4   1.5
## Frequency      4    13    26    42    35    29    18    12     4     4
## Proportion 0.020 0.066 0.131 0.212 0.177 0.146 0.091 0.061 0.020 0.020
##                                                                 
## Value        1.6   1.7   2.0   2.5   2.6   3.5   3.6   3.7   6.4
## Frequency      1     3     1     1     1     1     1     1     1
## Proportion 0.005 0.015 0.005 0.005 0.005 0.005 0.005 0.005 0.005
## ---------------------------------------------------------------------------
## hgb : Hemoglobin 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      240        1       66    0.999    13.15    1.865    10.20    11.09 
##      .25      .50      .75      .90      .95 
##    12.20    13.20    14.50    15.11    15.51 
## 
## lowest :  7.4  7.7  8.4  9.5  9.6, highest: 15.9 16.1 16.2 16.5 16.6
## ---------------------------------------------------------------------------
## mspike : Serum M-Spike 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      241        0       23    0.993    1.764   0.4687      1.1      1.3 
##      .25      .50      .75      .90      .95 
##      1.5      1.7      2.0      2.3      2.5 
## 
## lowest : 0.3 0.8 0.9 1.0 1.1, highest: 2.5 2.6 2.7 2.9 3.2
## ---------------------------------------------------------------------------
                 # the _rms_ one

mgus_df <- as_tibble(mgus)         # tidy data frame (important info printed all
                                   # together, and visualization auto-adjusted
                                   # to the consol width)
mgus_df
## # A tibble: 241 x 12
##       id   age    sex  dxyr   pcdx pctime futime death   alb creat   hgb
##  * <dbl> <dbl> <fctr> <dbl> <fctr>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1    78 female    68   <NA>     NA    748     1   2.8   1.2  11.5
##  2     2    73 female    66     LP   1310   6751     1    NA    NA    NA
##  3     3    87   male    68   <NA>     NA    277     1   2.2   1.1  11.2
##  4     4    86   male    69   <NA>     NA   1815     1   2.8   1.3  15.3
##  5     5    74 female    68   <NA>     NA   2587     1   3.0   0.8   9.8
##  6     6    81   male    68   <NA>     NA    563     1   2.9   0.9  11.5
##  7     7    72 female    68   <NA>     NA   1135     1   3.0   0.8  13.5
##  8     8    79 female    69   <NA>     NA   2016     1   3.1   0.8  15.5
##  9     9    85 female    70   <NA>     NA   2422     1   3.2   1.0  12.4
## 10    10    58   male    65   <NA>     NA   6155     1   3.5   1.0  14.8
## # ... with 231 more rows, and 1 more variables: mspike <dbl>
  1. Non parametric Kaplan-Meyer estimation of the survival function
  • Estimate the survival function from randomization overall and according to sex.
survfit(Surv(futime, death) ~ 1,
  data = mgus_df
) %>%
  plot(
    conf.int  = TRUE,
    mark.time = TRUE,
    col       = 'blue',
    main      = 'Survival function for mgus data'
)


survfit(Surv(futime, death) ~ sex,
  data = mgus_df
) %>%
  plot(
    conf.int  = TRUE,
    mark.time = TRUE,
    main      = 'Survival function for mgus data according to sex',
    col       = c('red', 'blue'),
    lty       = c(2, 3)
)
legend(
  x = 10000, y = 1,
  legend = c("Female", "Male"),
  col    = c('red', 'blue'),
  lty    = c(2, 3)
)

# For survival object the package _survminer_ provide ggplot2 plots
# (`?ggsurvplot`) which could be very interesting and quite comprehensive.

survfit(Surv(futime, death) ~ sex,
  data = mgus_df
) %>% 
  ggsurvplot(
    conf.int            = TRUE,                      # draw confidence intervals
    pval                = TRUE,                                    # show pvalue
    pval.method         = TRUE,                            # print the test name
    title               = 'Survival curves for overall death according to sex.',
    xlab                = 'Days',
    legend              = 'right',                             # legend position
    legend.title        = 'Sex',
    legend.labs         = c('Female', 'Male'),
    risk.table          = TRUE,     # admits interesting options other than TRUE
    cumcensor           = TRUE,
    cumevents           = TRUE,
    pval.size           = 3.5,   # from here these are options passed to `ggpar`
    risk.table.fontsize = 3,     # for a better visualization
    fontsize            = 3,     # (auto-explicatives)
    xscale              = 30.44
  )

Note: No female reaches the end of the f-up!

  • Test the effect of sex
# Using __survival__ (no plot method is provided for this solution)
survdiff(Surv(futime, death) ~ sex,
  data = mgus_df
)
## Call:
## survdiff(formula = Surv(futime, death) ~ sex, data = mgus_df)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=female 104       94      113      3.08      6.25
## sex=male   137      131      112      3.08      6.25
## 
##  Chisq= 6.2  on 1 degrees of freedom, p= 0.0124
# using __rms__
dd <- datadist(mgus_df)  # To evaluate cph, _rms_ needs this object which simply
                         # store statistics about the data.
                         # 
                         # Note: the name of the object (i.e. "dd") has to be 
                         #       exactly the same as the one specified into the
                         #       option set just after the `library(rms)` call.
                         #       (See: Chapter settings) 
cox_model <- cph(Surv(futime, death) ~ sex,
  data  = mgus_df
)

summary(cox_model)                             # return effect size and HR w/ CI
##              Effects              Response : Surv(futime, death) 
## 
##  Factor            Low High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
##  sex - female:male 2   1    NA    -0.33853 0.13603 -0.60514   -0.071916 
##   Hazard Ratio     2   1    NA     0.71282      NA  0.54600    0.930610
Predict(cox_model) %>%          # Compute predicted values and confidence limits
                                #
                                # Note: pay attention to Title-case "P"redict
  plot(
    groups = 'sex',
    anova  = anova(cox_model),       # Compute and print the $\chi^2$ statistics
    pval   = TRUE                    # print the pvalue 
)

1.4 Non parametric Kaplan-Meier estimation of the survival function

  1. Let consider a sample of \(n = 500\)
n <- 500
  1. Simulate the dates of entry in the cohort, from January, 2010 to January, 2017
n_days <- 365.25 * 7              # Seven years, taking into account bissextiles
time_start <- runif(n = n,
  min = 0,
  max = n_days
) %>% 
  as.Date(origin = '2010-01-01')
  1. Simulate the data-set of death, assuming exponential death times of mean \(2\) years
mean_death_time <- 365.25 * 2
death_t         <- rexp(n, rate = 1 / mean_death_time)
status_no_cens  <- rep(1, n)
  1. Let fix the reference date of the analyses of June, 2017
end_date     <- as.Date('2017-06-01')           # Fixed date for the end of f-up
death_r_cens <- pmin(death_t, end_date - time_start)
status_cens  <- status_no_cens - (death_t == death_r_cens)
  1. Estimate the survival function from randomization
survfit(Surv(death_r_cens, status_cens) ~ 1) %>% 
  plot(
    conf.int  = TRUE,
    mark.time = TRUE,
    main      = 'Survival curve from randomization (right censored at 2017-06-01)',
    col       = 'blue',
    xlab      = 'Years',
    xscale    = 365.25
)