SamplingDesignTools: Tools for Dealing with Complex Sampling Designs

Author

Yilin Ning

Published

2022-11-14

Getting Started

Installation

Install the SamplingDesignTools package from GitHub (this requires the devtools package):

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") # install devtools only if it is missing
devtools::install_github("nyilin/SamplingDesignTools")

Load SamplingDesignTools and the other packages used in the examples:

library(SamplingDesignTools)
library(survival)
library(Epi) # To draw (non-counter-matched) nested case-control sample
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)

Example Datasets

This package uses two simulated cohort datasets (cohort_1 and cohort_2) for illustrative purposes.

`cohort_1`

Dataset cohort_1 consists of 10,000 subjects, with age simulated from \(N(55, 10^2)\) and gender simulated with \(P(\text{Male}) = 0.5\). The survival outcome was simulated from the following true hazard: \[\log \{h(t)\} = \log \{h_0\} + \log(1.1) \text{Age} + \log(2) \text{Gender}.\]

Time (\(t\)) is measured in years and censored at 25 years. Censoring is indicated by \(y=0\).
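The data-generating process described above can be sketched in a few lines of base R. This is a minimal illustration, not the code used to build cohort_1: it assumes a constant baseline hazard and a 0/1 coding for gender, and the value h0 = 1e-5 and the names sim_cohort and t_event are hypothetical choices made only for this example.

```r
set.seed(2022)                              # arbitrary seed for reproducibility
n <- 10000
age <- rnorm(n, mean = 55, sd = 10)         # Age ~ N(55, 10^2)
gender <- rbinom(n, size = 1, prob = 0.5)   # 0/1 indicator with probability 0.5
h0 <- 1e-5                                  # hypothetical constant baseline hazard (assumption)

# Hazard under the stated model: h = h0 * 1.1^Age * 2^Gender
hazard <- h0 * exp(log(1.1) * age + log(2) * gender)
t_event <- rexp(n, rate = hazard)           # constant hazard implies exponential event times

sim_cohort <- data.frame(
  id = seq_len(n),
  y = as.integer(t_event <= 25),            # 1 = event observed, 0 = censored
  t = pmin(t_event, 25),                    # administrative censoring at 25 years
  age = age,
  gender = gender
)
table(sim_cohort$y)
```

The number of events depends on the assumed h0, so the counts from this sketch are not expected to match those of cohort_1.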

data("cohort_1")
dim(cohort_1)
## [1] 10000     5
table(cohort_1$y)
## 
##    0    1 
## 9418  582
kable(head(cohort_1))

id	t	age	gender
1	25.00000	47	1
2	10.65152	58	0
3	25.00000	46	0
4	15.84131	52	0
5	22.57659	49	0
6	25.00000	63	0

m_cox_cohort_1 <- coxph(Surv(t, y) ~ age + gender, data = cohort_1)
summary(m_cox_cohort_1)
## Call:
## coxph(formula = Surv(t, y) ~ age + gender, data = cohort_1)
## 
##   n= 10000, number of events= 582 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## age    0.100340  1.105546 0.004213 23.820   <2e-16 ***
## gender 0.781068  2.183804 0.087990  8.877   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## age        1.106     0.9045     1.096     1.115
## gender     2.184     0.4579     1.838     2.595
## 
## Concordance= 0.784  (se = 0.009 )
## Likelihood ratio test= 643.4  on 2 df,   p=<2e-16
## Wald test            = 643.8  on 2 df,   p=<2e-16
## Score (logrank) test = 642.4  on 2 df,   p=<2e-16
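
As a check on how to read this output, the hazard ratios and their 95% confidence limits can be recomputed from the coef and se(coef) columns alone. The sketch below copies the printed estimates by hand, so it runs without the package, and assumes the usual Wald interval exp(coef ± z × se). The object names are illustrative.

```r
# Estimates copied from the summary output above
coef_hat <- c(age = 0.100340, gender = 0.781068)
se_hat   <- c(age = 0.004213, gender = 0.087990)
true_hr  <- c(age = 1.1, gender = 2)        # hazard ratios used in the simulation

z <- qnorm(0.975)                           # two-sided 95% normal quantile
hr_table <- data.frame(
  hr      = exp(coef_hat),
  lower   = exp(coef_hat - z * se_hat),
  upper   = exp(coef_hat + z * se_hat),
  true_hr = true_hr
)
# Does each interval contain the value used to simulate the data?
hr_table$covers_truth <- hr_table$lower <= hr_table$true_hr &
  hr_table$true_hr <= hr_table$upper
round(hr_table[, c("hr", "lower", "upper")], 3)
```

Both intervals contain the simulated hazard ratios (1.1 for age and 2 for gender), as expected when the fitted model matches the data-generating model.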

`cohort_2`

Dataset cohort_2 consists of 100,000 subjects, with survival outcome simulated from the following true hazard: \[\log \{h(t)\} = \log \{h_0\} + \log(1.5)x + \log(4)z + \log(2)xz + \log(1.01) \text{Gender} + \log(1.01) \text{Age}.\]

Time (\(t\)) is measured in years and censored at 25 years. Censoring is indicated by \(y=0\). Age is also recorded in 6 categories: ≤35, 36-45, 46-55, 56-65, 66-75 and >75.
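
The interval labels stored in the age_cat column of cohort_2, such as (45,55], have the form that base R's cut() produces for right-closed intervals. As a small sketch, assuming break points at 35, 45, 55, 65 and 75 and using made-up ages in years (the vector age_years is hypothetical and not taken from cohort_2):

```r
age_years <- c(30, 35, 36, 50, 55, 70, 80)  # hypothetical ages in years
# Intervals are right-closed by default, so 35 falls in the first category
age_cat <- cut(age_years, breaks = c(-Inf, 35, 45, 55, 65, 75, Inf))
data.frame(age_years, age_cat)
```

With right-closed intervals an age of exactly 55 is grouped with 46-55 rather than 56-65.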

data("cohort_2")
dim(cohort_2)
## [1] 100000      8
table(cohort_2$y)
## 
##     0     1 
## 97227  2773
kable(head(cohort_2))

id	t	x	age	age_cat	gender	z
1	25.000000	1	-2	(45,55]	0	0
2	19.819801	1	-4	(45,55]	1	0
3	25.000000	1	-5	(45,55]	0	0
4	12.414616	1	20	(75, Inf]	1	0
5	25.000000	1	-2	(45,55]	0	1
6	1.019023	0	-15	(35,45]	1	0

m_cox_cohort_2 <- coxph(Surv(t, y) ~ x * z + age + gender, data = cohort_2)
summary(m_cox_cohort_2)
## Call:
## coxph(formula = Surv(t, y) ~ x * z + age + gender, data = cohort_2)
## 
##   n= 100000, number of events= 2773 
## 
##             coef exp(coef)  se(coef)      z Pr(>|z|)    
## x       0.382501  1.465946  0.109906  3.480 0.000501 ***
## z       1.495078  4.459686  0.126331 11.835  < 2e-16 ***
## age     0.007139  1.007165  0.001898  3.762 0.000169 ***
## gender -0.086074  0.917526  0.038010 -2.265 0.023542 *  
## x:z     0.640698  1.897805  0.135081  4.743 2.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## x         1.4659     0.6822    1.1819    1.8183
## z         4.4597     0.2242    3.4815    5.7127
## age       1.0072     0.9929    1.0034    1.0109
## gender    0.9175     1.0899    0.8517    0.9885
## x:z       1.8978     0.5269    1.4564    2.4730
## 
## Concordance= 0.762  (se = 0.005 )
## Likelihood ratio test= 2907  on 5 df,   p=<2e-16
## Wald test            = 2435  on 5 df,   p=<2e-16
## Score (logrank) test = 3528  on 5 df,   p=<2e-16
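
Because the model includes an x:z interaction, the hazard ratio for a subject with both x = 1 and z = 1 (relative to x = 0 and z = 0, with age and gender held fixed) multiplies three terms. The following sketch works through that arithmetic using the coefficients printed above, copied by hand, alongside the true values from the simulation model. It assumes x and z are coded 0/1, and the helper name hr_xz is introduced only for this illustration.

```r
# Log hazard ratios: simulation truth vs. estimates copied from the output above
true_beta <- c(x = log(1.5), z = log(4), xz = log(2))
est_beta  <- c(x = 0.382501, z = 1.495078, xz = 0.640698)

# Hazard ratio for exposure pattern (x, z) relative to x = 0, z = 0
hr_xz <- function(beta, x, z) {
  unname(exp(beta["x"] * x + beta["z"] * z + beta["xz"] * x * z))
}

patterns <- data.frame(x = c(1, 0, 1), z = c(0, 1, 1))
patterns$true_hr <- mapply(function(x, z) hr_xz(true_beta, x, z), patterns$x, patterns$z)
patterns$est_hr  <- mapply(function(x, z) hr_xz(est_beta, x, z), patterns$x, patterns$z)
patterns
```

Under the true model the joint hazard ratio is 1.5 × 4 × 2 = 12, and the estimate from the full cohort is about 12.4.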