SamplingDesignTools: Tools for Dealing with Complex Sampling Designs

Author

Yilin Ning

Published

2022-11-14

Getting Started

Installation

Install the SamplingDesignTools package from GitHub (this requires the devtools package):

install.packages("devtools") # if the package is not yet installed
devtools::install_github("nyilin/SamplingDesignTools")

Load the packages used in this document:

library(SamplingDesignTools)
library(survival)
library(Epi) # To draw (non-counter-matched) nested case-control sample
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)

Example Datasets

This package provides two simulated cohort datasets (cohort_1 and cohort_2) for illustrative purposes.

cohort_1

Dataset cohort_1 consists of 10,000 subjects, with age simulated from \(N(55, 10^2)\) and gender simulated with \(P(\text{Male}) = 0.5\). The survival outcome is simulated from the following true hazard: \[\log \{h(t)\} = \log \{h_0\} + \log(1.1) \text{Age} + \log(2) \text{Gender}.\]

Time (\(t\)) is measured in years and censored at 25 years. Censoring is indicated by \(y=0\).
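As a rough illustration of the data-generating model described above, the sketch below simulates a cohort of the same shape. The package's actual simulation code is not shown in this document, so several choices here are assumptions: a constant baseline hazard with a hypothetical value `h0`, ages rounded to whole years, gender coded 1 for male, and administrative censoring at 25 years as the only censoring mechanism (the real cohort_1 also contains subjects censored before 25 years).

```r
set.seed(1)
n <- 10000

# Covariates as described in the text (rounding and coding are assumptions)
age <- round(rnorm(n, mean = 55, sd = 10))
gender <- rbinom(n, size = 1, prob = 0.5)  # assumed coding: 1 = male

# Hypothetical constant baseline hazard; the true h_0 is not given in the text
h0 <- 1e-5

# Linear predictor from the stated true hazard
lp <- log(1.1) * age + log(2) * gender

# With a constant hazard, event times are exponential with rate h0 * exp(lp)
t_event <- rexp(n, rate = h0 * exp(lp))

# Administrative censoring at 25 years; y = 0 marks a censored subject
t_obs <- pmin(t_event, 25)
y <- as.integer(t_event <= 25)

sim_1 <- data.frame(id = seq_len(n), y = y, t = t_obs, age = age, gender = gender)
head(sim_1)
```

Fitting `coxph(Surv(t, y) ~ age + gender, data = sim_1)` from the survival package to such a sample should give coefficients near `log(1.1)` and `log(2)`, up to sampling variability.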

data("cohort_1")
dim(cohort_1)
## [1] 10000     5
table(cohort_1$y)
## 
##    0    1 
## 9418  582
kable(head(cohort_1))

| id|  y|        t| age| gender|
|--:|--:|--------:|---:|------:|
|  1|  0| 25.00000|  47|      1|
|  2|  0| 10.65152|  58|      0|
|  3|  0| 25.00000|  46|      0|
|  4|  0| 15.84131|  52|      0|
|  5|  0| 22.57659|  49|      0|
|  6|  0| 25.00000|  63|      0|

m_cox_cohort_1 <- coxph(Surv(t, y) ~ age + gender, data = cohort_1)
summary(m_cox_cohort_1)
## Call:
## coxph(formula = Surv(t, y) ~ age + gender, data = cohort_1)
## 
##   n= 10000, number of events= 582 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## age    0.100340  1.105546 0.004213 23.820   <2e-16 ***
## gender 0.781068  2.183804 0.087990  8.877   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## age        1.106     0.9045     1.096     1.115
## gender     2.184     0.4579     1.838     2.595
## 
## Concordance= 0.784  (se = 0.009 )
## Likelihood ratio test= 643.4  on 2 df,   p=<2e-16
## Wald test            = 643.8  on 2 df,   p=<2e-16
## Score (logrank) test = 642.4  on 2 df,   p=<2e-16
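
One way to read the output above is to check whether the full-cohort estimates recover the true coefficients \(\log(1.1)\) and \(\log(2)\). The following sketch does this with Wald 95% intervals. To keep it self-contained, the estimates and standard errors are copied by hand from the printed summary rather than extracted from `m_cox_cohort_1`; the object name `check_1` is only illustrative.

```r
# True log-hazard coefficients from the stated simulation model
true_coef <- c(age = log(1.1), gender = log(2))

# Estimates and standard errors copied from the coxph summary above
est_coef <- c(age = 0.100340, gender = 0.781068)
se_coef  <- c(age = 0.004213, gender = 0.087990)

# Wald 95% confidence limits on the log-hazard scale
z_crit <- qnorm(0.975)
check_1 <- data.frame(
  true     = true_coef,
  estimate = est_coef,
  lower    = est_coef - z_crit * se_coef,
  upper    = est_coef + z_crit * se_coef
)

# Flag whether each interval contains the true value
check_1$covers_truth <- check_1$true >= check_1$lower & check_1$true <= check_1$upper
print(check_1)
```

Exponentiating `lower` and `upper` gives the `lower .95` and `upper .95` columns of the summary, up to rounding.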

cohort_2

Dataset cohort_2 consists of 100,000 subjects, with survival outcome simulated from the following true hazard: \[\log \{h(t)\} = \log \{h_0\} + \log(1.5)x + \log(4)z + \log(2)xz + \log(1.01) \text{Gender} + \log(1.01) \text{Age}.\]

Time (\(t\)) is measured in years and censored at 25 years. Censoring is indicated by \(y=0\). Age is also recorded in 6 categories (variable age_cat): ≤35, 36-45, 46-55, 56-65, 66-75 and >75, stored as right-closed intervals such as (45,55].
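
Right-closed age categories of this kind can be produced with base R's `cut()`. The sketch below is an assumption about how such a variable could be built, not the package's own code. The vector `age_years` is hypothetical: the age column printed for cohort_2 in this document takes negative values, so it is evidently not in raw years.

```r
# Hypothetical ages in years, chosen to include values on the category boundaries
age_years <- c(30, 35, 36, 53, 75, 76)

# cut() uses right-closed intervals by default, so 35 falls in the first
# category and 75 in the fifth
age_cat <- cut(age_years, breaks = c(-Inf, 35, 45, 55, 65, 75, Inf))

table(age_cat)
```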

data("cohort_2")
dim(cohort_2)
## [1] 100000      8
table(cohort_2$y)
## 
##     0     1 
## 97227  2773
kable(head(cohort_2))
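
| id|  y|         t|  x| age|age_cat   | gender|  z|
|--:|--:|---------:|--:|---:|:---------|------:|--:|
|  1|  0| 25.000000|  1|  -2|(45,55]   |      0|  0|
|  2|  0| 19.819801|  1|  -4|(45,55]   |      1|  0|
|  3|  0| 25.000000|  1|  -5|(45,55]   |      0|  0|
|  4|  0| 12.414616|  1|  20|(75, Inf] |      1|  0|
|  5|  0| 25.000000|  1|  -2|(45,55]   |      0|  1|
|  6|  0|  1.019023|  0| -15|(35,45]   |      1|  0|
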
m_cox_cohort_2 <- coxph(Surv(t, y) ~ x * z + age + gender, data = cohort_2)
summary(m_cox_cohort_2)
## Call:
## coxph(formula = Surv(t, y) ~ x * z + age + gender, data = cohort_2)
## 
##   n= 100000, number of events= 2773 
## 
##             coef exp(coef)  se(coef)      z Pr(>|z|)    
## x       0.382501  1.465946  0.109906  3.480 0.000501 ***
## z       1.495078  4.459686  0.126331 11.835  < 2e-16 ***
## age     0.007139  1.007165  0.001898  3.762 0.000169 ***
## gender -0.086074  0.917526  0.038010 -2.265 0.023542 *  
## x:z     0.640698  1.897805  0.135081  4.743 2.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## x         1.4659     0.6822    1.1819    1.8183
## z         4.4597     0.2242    3.4815    5.7127
## age       1.0072     0.9929    1.0034    1.0109
## gender    0.9175     1.0899    0.8517    0.9885
## x:z       1.8978     0.5269    1.4564    2.4730
## 
## Concordance= 0.762  (se = 0.005 )
## Likelihood ratio test= 2907  on 5 df,   p=<2e-16
## Wald test            = 2435  on 5 df,   p=<2e-16
## Score (logrank) test = 3528  on 5 df,   p=<2e-16
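
Because the model includes an \(xz\) interaction, the hazard ratio for a given combination of \(x\) and \(z\) is the exponential of a sum of coefficients, not a single `exp(coef)` entry. The sketch below works this out for the four combinations, relative to \(x = 0, z = 0\) with age and gender held fixed, and assumes both variables are coded 0/1. It uses the true coefficients from the stated hazard and the estimates copied by hand from the summary above; the helper `hr()` and the data frame `groups` are illustrative names.

```r
# True coefficients from the stated hazard model
true_b <- c(x = log(1.5), z = log(4), xz = log(2))

# Estimates copied from the coxph summary above
est_b <- c(x = 0.382501, z = 1.495078, xz = 0.640698)

# Hazard ratio versus the reference group (x = 0, z = 0)
hr <- function(b, x, z) {
  exp(b[["x"]] * x + b[["z"]] * z + b[["xz"]] * x * z)
}

# All four combinations of the two binary variables
groups <- expand.grid(x = 0:1, z = 0:1)
groups$true_hr <- mapply(hr, x = groups$x, z = groups$z,
                         MoreArgs = list(b = true_b))
groups$est_hr <- mapply(hr, x = groups$x, z = groups$z,
                        MoreArgs = list(b = est_b))
print(groups)
```

Under the true model, the group with \(x = 1\) and \(z = 1\) has a hazard ratio of \(1.5 \times 4 \times 2 = 12\) relative to the reference group.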