Learning objectives

Univariate probability functions

Simulating data is a very powerful way to test your understanding of statistical concepts. We are going to use simulations to learn the basics of probability.

# libraries needed for these examples
library(tidyverse)
library(dplyr)
library(MASS)
library(faux)
library(plotly)
set.seed(8675309) # makes sure random numbers are reproducible

Naming conventions for R's built-in probability functions

r____ generates random values from the distribution ~ runif, rbinom, rnorm, rpois

d____ gives the probability density (PDF) at a given value ~ dunif, dbinom, dnorm, dpois

p____ gives the cumulative probability (CDF) up to a given value ~ punif, pbinom, pnorm, ppois

q____ gives the value (quantile) corresponding to a cumulative probability, i.e. the inverse CDF ~ qunif, qbinom, qnorm, qpois
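
For example, here is a quick sketch of the four prefixes using the standard normal distribution (the random values will differ on your machine):

rnorm(3)      # r___: three random values from N(0, 1)
dnorm(0)      # d___: density at 0, about 0.399
pnorm(1.96)   # p___: cumulative probability below 1.96, about 0.975
qnorm(0.975)  # q___: the quantile with cumulative probability 0.975, about 1.96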

To learn more about probability functions, we recommend Seeing Theory.

Uniform Distribution

The uniform distribution is the simplest distribution. All numbers in the range have an equal probability of being sampled.

Take a minute to think of things in your own research that are uniformly distributed.

Simulating continuous variables

runif(n, min=0, max=1)

Generate 100,000 random numbers from the uniform probability distribution and plot a frequency histogram.

Use runif() to sample from a continuous uniform distribution.

u <- runif(100000, min = 0, max = 1)

# plot to visualise
ggplot() + 
  geom_histogram(aes(u), binwidth = 0.05, boundary = 0,
                 fill = "white", colour = "black")

Simulating discrete variables

sample(x, size, replace = FALSE, prob = NULL)

Use sample() to draw from a set of discrete values (for example, the number of positive results, 1 to 6, when six people take a rapid screening test), repeat the draw 10,000 times, and plot a frequency histogram of the results.

You can use sample() to simulate events like rolling dice or choosing from a deck of cards. The code below simulates rolling a 6-sided die 10000 times. We set replace to TRUE so that each event is independent. See what happens if you set replace to FALSE.

cases <- sample(1:6, 10000, replace = TRUE)

# plot the results
ggplot() + 
  geom_histogram(aes(cases), binwidth = 1, 
                 fill = "white", color = "black")
Distribution of dice rolls.

With sample(), you can simulate drawing cards in all kinds of card games.

pet_types <- c("cat", "dog", "ferret", "bird", "fish")
sample(pet_types, 10, replace = TRUE)
##  [1] "cat"    "cat"    "cat"    "cat"    "ferret" "dog"    "bird"   "cat"   
##  [9] "dog"    "fish"

Ferrets are a much less common pet than cats and dogs, so our sample isn’t very realistic. You can set the probabilities of each item in the list with the prob argument.

pet_types <- c("cat", "dog", "ferret", "bird", "fish")
pet_prob <- c(0.3, 0.4, 0.1, 0.1, 0.1)
sample(pet_types, 10, replace = TRUE, prob = pet_prob)
##  [1] "fish" "dog"  "cat"  "dog"  "cat"  "dog"  "fish" "dog"  "cat"  "fish"

Binomial Distribution

The binomial distribution is useful for modeling binary data, where each observation can have one of two outcomes, like success/failure, yes/no or head/tails.

Simulating the sampling distribution

rbinom(n, size, prob)

rbinom() takes three arguments:

  • n = the number of observations (simulated values)
  • size = the number of trials per observation
  • prob = the probability of a "positive" result (success) on each trial

The following simulates 20 separate flips of a single fair coin; each flip is coded 1 if the coin lands heads and 0 otherwise.

# 20 individual coin flips of a fair coin
rbinom(20, 1, 0.5)
##  [1] 1 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0
# 20 individual coin flips of a biased (0.75) coin
rbinom(20, 1, 0.75)
##  [1] 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 1 1 1

The following simulates a single trial of flipping 20 coins once each and returns the total number of heads in that trial.

rbinom(1, 20, 0.75)
## [1] 13

The following runs 10 such trials: each trial flips 20 coins once each and returns the total number of heads.

rbinom(10, 20, 0.5)
##  [1] 10 14 11  7 11 13  6 10  9  9

Run 1000 trials and show all of the results in a histogram.

flips <- rbinom(1000, 20, 0.5)

ggplot() +
  geom_histogram(
    aes(flips), 
    binwidth = 1, 
    fill = "white", 
    color = "black"
  )

Run the simulation above several times, noting how the histogram changes. Try changing the values of n, size, and prob.
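
For instance, one possible variation (not in the original text): raising prob to 0.75 shifts the whole distribution of head counts to the right.

biased_flips <- rbinom(1000, 20, 0.75)

ggplot() +
  geom_histogram(
    aes(biased_flips), 
    binwidth = 1, 
    fill = "white", 
    color = "black"
  )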

Normal Distribution

rnorm(n, mean, sd)

rnorm() takes three arguments:

  • n = the number of observations
  • mean = the mean of the normal distribution (default 0)
  • sd = the standard deviation of the normal distribution (default 1)

The following generates 100,000 random values from a normal distribution with a mean of 10 and an SD of 2.

dv <- rnorm(1e5, 10, 2)

# proportions of normally-distributed data 
# within 1, 2, or 3 SD of the mean
sd1 <- .6827 ## pnorm(1) - pnorm(-1)
sd2 <- .9545 ## pnorm(2) - pnorm(-2)
sd3 <- .9973 ## pnorm(3) - pnorm(-3)

ggplot() +
  geom_density(aes(dv), fill = "white") +
  geom_vline(xintercept = mean(dv), color = "red") +
  geom_vline(xintercept = quantile(dv, .5 - sd1/2), color = "darkgreen") +
  geom_vline(xintercept = quantile(dv, .5 + sd1/2), color = "darkgreen") +
  geom_vline(xintercept = quantile(dv, .5 - sd2/2), color = "blue") +
  geom_vline(xintercept = quantile(dv, .5 + sd2/2), color = "blue") +
  geom_vline(xintercept = quantile(dv, .5 - sd3/2), color = "purple") +
  geom_vline(xintercept = quantile(dv, .5 + sd3/2), color = "purple") +
  scale_x_continuous(
    limits = c(0,20), 
    breaks = seq(0,20)
  )

Poisson Distribution

rpois(n, lambda)

rpois() takes two arguments:

  • n = the number of observations
  • lambda = the mean number of events per observation

The following simulates one year of daily counts for a country that reports an average of 20 confirmed cases per day during an epidemic, shown as a histogram.

cases <- rpois(n = 365, lambda = 20)

ggplot() +
  geom_histogram(
    aes(cases), 
    binwidth = 1, 
    fill = "white", 
    color = "black"
  )

Multivariate probability functions

Each trial yields measurements on two or more random variables.

Bivariate Normal

The following R code generates two normally distributed random variables with a correlation of 0.5 between them.

n   <- 1000 # number of random samples
# name the mu values to give the resulting columns names
mu     <- c(x = 10, y = 20) # the means of the samples
sd <- c(5, 6)   # the SDs of the samples

rho <- 0.5  # population correlation between the two variables

# correlation matrix
cor_mat <- matrix(c(  1, rho, 
                    rho,   1), 2) 

# create the covariance matrix
sigma <- (sd %*% t(sd)) * cor_mat

# sample from bivariate normal distribution
bvn <- MASS::mvrnorm(n, mu, sigma) 
bvn %>%
  as_tibble() %>%
  ggplot(aes(x, y)) +
    geom_point(alpha = 0.5) + 
    geom_smooth(method = "lm") +
    geom_density2d()
## `geom_smooth()` using formula 'y ~ x'

Multivariate Normal

The following R code generates three normally distributed random variables with pairwise correlations of 0.5 (x and y), 0 (x and z), and 0.7 (y and z).

n      <- 200 # number of random samples
mu     <- c(x = 10, y = 20, z = 30) # the means of the samples
sd <- c(8, 9, 10)   # the SDs of the samples

rho1_2 <- 0.5 # correlation between x and y
rho1_3 <- 0   # correlation between x and z
rho2_3 <- 0.7 # correlation between y and z

# correlation matrix
cor_mat <- matrix(c(     1, rho1_2, rho1_3, 
                    rho1_2,      1, rho2_3,
                    rho1_3, rho2_3,      1), 3) 

sigma <- (sd %*% t(sd)) * cor_mat
bvn3 <- MASS::mvrnorm(n, mu, sigma)

cor(bvn3) # check correlation matrix
##           x         y         z
## x 1.0000000 0.5896674 0.1513108
## y 0.5896674 1.0000000 0.7468737
## z 0.1513108 0.7468737 1.0000000
#set up the marker style
marker_style = list(
    color = "#ff0000", 
    line = list(
      color = "#444", 
      width = 1
    ), 
    opacity = 0.5,
    size = 5
  )

# convert bvn3 to a tibble, plot and add markers
bvn3 %>%
  as_tibble() %>%
  plot_ly(x = ~x, y = ~y, z = ~z, marker = marker_style) %>%
  add_markers()

Faux

The R package faux, developed by Lisa DeBruine, is another tool for generating random variables from multivariate distributions. Its advantage is that you can simulate data that matches your study design.

rnorm_multi() wraps the steps of the bvn3 example above into a single function call.

bvn3 <- rnorm_multi(
  n = n, 
  vars = 3,
  mu = mu, 
  sd = sd,
  r = c(rho1_2, rho1_3, rho2_3),
  varnames = c("x", "y", "z")
)

check_sim_stats(bvn3)
##     n var    x    y    z  mean   sd
## 1 200   x 1.00 0.54 0.10 10.35 7.66
## 2 200   y 0.54 1.00 0.67 20.01 8.77
## 3 200   z 0.10 0.67 1.00 30.37 9.59

sim_design() generates simulated data according to the between- and within-subject conditions of your design.

b <- list(pet = c(cat = "Cat Owners",
                  dog = "Dog Owners"))
w <- list(time = c("morning",
                   "noon",
                   "night"))
mu <- data.frame(
  cat    = c(10, 12, 14),
  dog    = c(10, 15, 20),
  row.names = w$time
)
sd <- c(3, 3, 3, 5, 5, 5)

pet_data <- sim_design(
  within = w, 
  between = b,
  n = 100, 
  mu = mu,
  sd = sd, 
  r = .5)

check_sim_stats(pet_data, between = "pet")
##   pet   n     var morning noon night  mean   sd
## 1 cat 100 morning    1.00 0.57  0.51 10.62 3.48
## 2 cat 100    noon    0.57 1.00  0.59 12.44 3.01
## 3 cat 100   night    0.51 0.59  1.00 14.61 3.14
## 4 dog 100 morning    1.00 0.55  0.50  9.44 4.92
## 5 dog 100    noon    0.55 1.00  0.48 14.18 5.90
## 6 dog 100   night    0.50 0.48  1.00 19.42 5.36

Statistical terms

R packages and other statistical software report key statistical terms that help users interpret the results of their analyses.

The effect is some measure of your data. This will depend on the type of data you have and the type of statistical test you are using. For example, if you flipped a coin 100 times and it landed heads 66 times, the effect would be 66/100. You can then use the exact binomial test to compare this effect to the null effect you would expect from a fair coin (50/100) or to any other effect you choose. The effect size refers to the difference between the effect in your data and the null effect (usually a chance value).

  • null effect: the analysis result contains only random error
  • effect: the analysis result contains random error plus the effect being tested
  • effect size: the analysis result with random error removed; the pure effect being tested
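
As a quick illustration of the coin example above (a minimal sketch, not from the original text), you can test 66 heads out of 100 flips against the null effect of a fair coin:

# exact binomial test of 66/100 heads against the fair-coin probability of 0.5
binom.test(66, 100, p = 0.5)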

Visual explanation: Kristoffer Magnusson's interactive visualization of effect size.

The p-value of a test is the probability of seeing an effect at least as extreme as what you have, if the real effect was the value you are testing against (e.g., a null effect). So if you used a binomial test to test against a chance probability of 1/6 (e.g., the probability of rolling 1 with a 6-sided die), then a p-value of 0.17 means that you could expect to see effects at least as extreme as your data 17% of the time just by chance alone.

In other words, the p-value is the conditional probability of the data, given that the result contains only random error (the null hypothesis): \(p(D \mid \theta)\).
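
For example, a minimal sketch of the die example above (the counts here are made up for illustration): test an observed count against the chance probability of 1/6 and extract the p-value.

# e.g., rolling a 1 on 3 out of 10 rolls, tested against chance (1/6)
binom.test(3, 10, p = 1/6)$p.value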

cases <- sample(1:6, 10000, replace = TRUE)

# plot the results
ggplot() + 
  geom_histogram(aes(cases), binwidth = 1, 
                 fill = c("black",rep("white",5)), color = "black") +
  labs(title = "Possible screening results when the disease prevalence is unknown")

Visual explanation: Kristoffer Magnusson's interactive visualization of p-values for comparing independent groups.

If you are using null hypothesis significance testing (NHST), then you need to decide on a cutoff value (alpha) for making a decision to reject the null hypothesis. We call p-values below the alpha cutoff significant. In psychology, alpha is traditionally set at 0.05, but there are good arguments for setting a different criterion in some circumstances.

If the analysis gives a p-value smaller than the significance level (\(\alpha\)), you can conclude that the result reflects more than just random error.

The probability that a test concludes there is an effect when there is really no effect (e.g., concludes a fair coin is biased) is called the false positive rate (or Type I Error Rate). The alpha is the false positive rate we accept for a test. The probability that a test concludes there is no effect when there really is one (e.g., concludes a biased coin is fair) is called the false negative rate (or Type II Error Rate). The beta is the false negative rate we accept for a test.

False positive rate (Type I error rate): the probability of concluding that the analysis shows an effect when the result actually contains only random error.

False negative rate (Type II error rate): the probability of concluding that the result contains only random error when there actually is an effect.

The false positive rate is not the overall probability of getting a false positive, but the probability of a false positive under the null hypothesis. Similarly, the false negative rate is the probability of a false negative under the alternative hypothesis. Unless we know the probability that we are testing a null effect, we can’t say anything about the overall probability of false positives or negatives. If 100% of the hypotheses we test are false, then all significant effects are false positives, but if all of the hypotheses we test are true, then all of the positives are true positives and the overall false positive rate is 0.

Neither of these rates can be estimated from a single experiment or survey; they require results accumulated over many repeated tests.
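
You can, however, approximate the false positive rate by simulating many studies in which the null hypothesis is true. A minimal sketch (the object name null_ps is just for illustration):

# simulate 1000 studies of a fair coin (so the null is true) and test each against p = 0.5
null_ps <- replicate(1000, binom.test(rbinom(1, 100, 0.5), 100, 0.5)$p.value)

# proportion of significant results; at or slightly below alpha,
# because the exact binomial test is conservative
mean(null_ps < 0.05)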

Power is equal to 1 minus beta (i.e., the true positive rate), and depends on the effect size, how many samples we take (n), and what we set alpha to. For any test, if you specify all but one of these values, you can calculate the last. The effect size you use in power calculations should be the smallest effect size of interest (SESOI). See [@TOSTtutorial](https://doi.org/10.1177/2515245918770963) for a tutorial on methods for choosing an SESOI.

Power: (1) it can be estimated in advance from theoretical predictions or previous data; (2) it can also be computed after the analysis is complete.

Only power estimated in advance (a priori power) is informative.

Smallest effect size of interest (SESOI): the smallest effect you would expect to be able to detect under the planned study conditions.
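
Because power, effect size, n, and alpha are linked, specifying all but one lets you solve for the remaining value. A minimal sketch with power.t.test(), leaving n unspecified so it is solved for (the 0.5 SD difference here is just an example SESOI):

# n per group needed for 80% power to detect a 0.5 SD difference at alpha = .05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8,
             type = "two.sample")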

Visual explanation: Kristoffer Magnusson's interactive visualization of hypothesis testing.

Note: the original e-book includes exercises at this point; see the original version.

The confidence interval is a range around some value (such as a mean) that has some probability (usually 95%, but you can calculate CIs for any percentage) of containing the parameter, if you repeated the process many times.

A 95% CI does not mean that there is a 95% probability that the true mean lies within this range, but that, if you repeated the study many times and calculated the CI this same way every time, you’d expect the true mean to be inside the CI in 95% of the studies. This seems like a subtle distinction, but can lead to some misunderstandings. See [@Morey2016](https://link.springer.com/article/10.3758/s13423-015-0947-8) for more detailed discussion.

Confidence interval: if a 95% confidence interval is calculated for every replication of a study, about 950 out of 1000 of those intervals will cover the same (true) mean. However, Morey et al. (2016) point out that a confidence interval is not the same thing as the effect being tested and must be interpreted with care.
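
A minimal simulation sketch of this coverage idea (the object name covered is just for illustration): draw many samples from a population with a known mean and count how often the 95% CI contains it.

covered <- replicate(1000, {
  x  <- rnorm(30)              # a sample from a population with true mean 0
  ci <- t.test(x)$conf.int     # the 95% CI from a one-sample t-test
  ci[1] < 0 & 0 < ci[2]        # does the interval contain the true mean?
})

mean(covered)  # should be close to 0.95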

Visual explanation: Kristoffer Magnusson's interactive visualization of confidence intervals.

Statistical tests

Binomial test

binom.test(x, n, p)

You can test a binomial distribution against a specific probability using the exact binomial test.

  • x = the number of "positive" results (successes)
  • n = the number of trials (the sample size)
  • p = the hypothesised probability of a "positive" result

Suppose ten people take a rapid screening test for an infectious disease and each has a 0.6 probability of being infected. From the number who test positive, can we tell whether the true probability differs from 0.5?

n <- 10
fair_coin <- rbinom(1, n, 0.5)
biased_coin <- rbinom(1, n, 0.6)

binom.test(fair_coin, n, p = 0.5)
## 
##  Exact binomial test
## 
## data:  fair_coin and n
## number of successes = 5, number of trials = 10, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.187086 0.812914
## sample estimates:
## probability of success 
##                    0.5
binom.test(biased_coin, n, p = 0.5)
## 
##  Exact binomial test
## 
## data:  biased_coin and n
## number of successes = 4, number of trials = 10, p-value = 0.7539
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1215523 0.7376219
## sample estimates:
## probability of success 
##                    0.4

Hypothesis-test results from simulated samples

To estimate these rates, we need to repeat the sampling above many times. A function is ideal for repeating the exact same procedure over and over. Set the arguments of the function to variables that you might want to change. Here, we will want to estimate power for:

  • different sample sizes (n)
  • different effects (bias)
  • different hypothesised probabilities (p, defaults to 0.5)
sim_binom_test <- function(n, bias, p = 0.5) {
  # simulate 1 coin flip n times with the specified bias
  coin <- rbinom(1, n, bias)
  # run a binomial test on the simulated data for the specified p
  btest <- binom.test(coin, n, p)
  # return the p-value of this test
  btest$p.value
}

Once you’ve created your function, test it a few times, changing the values.

sim_binom_test(100, 0.6)
## [1] 0.3682016

Estimating power

Then you can use the replicate() function to run it many times and save all the output values. You can calculate the power of your analysis by checking the proportion of your simulated analyses that have a p-value less than your alpha (the probability of rejecting the null hypothesis when the null hypothesis is true).

my_reps <- replicate(1e4, sim_binom_test(100, 0.6))

alpha <- 0.05 # this does not always have to be 0.05

mean(my_reps < alpha)
## [1] 0.46

1e4 is just scientific notation for a 1 followed by 4 zeros (10000). When you’re running simulations, you usually want to run a lot of them. It’s a pain to keep track of whether you’ve typed 5 or 6 zeros (100000 vs 1000000) and this will change your running time by an order of magnitude.

You can plot the distribution of p-values.

ggplot() + 
  geom_histogram(
    aes(my_reps), 
    binwidth = 0.05, 
    boundary = 0,
    fill = "white", 
    color = "black"
  )

t-test

t.test(x, y, alternative, mu, paired)

Use a t-test to compare the mean of one distribution to a null hypothesis (one-sample t-test), compare the means of two samples (independent-samples t-test), or compare pairs of values (paired-samples t-test).

You can run a one-sample t-test comparing the mean of your data to mu. Here is a simulated distribution with a mean of 0.5 and an SD of 1, creating an effect size of 0.5 SD when tested against a mu of 0. Run the simulation a few times to see how often the t-test returns a significant p-value (or run it in the shiny app).

sim_norm <- rnorm(100, 0.5, 1)
t.test(sim_norm, mu = 0)
## 
##  One Sample t-test
## 
## data:  sim_norm
## t = 4.355, df = 99, p-value = 3.249e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2575676 0.6887013
## sample estimates:
## mean of x 
## 0.4731344

Run an independent-samples t-test by comparing two lists of values.

a <- rnorm(100, 0.5, 1)
b <- rnorm(100, 0.7, 1)
t_ind <- t.test(a, b, paired = FALSE)
t_ind
## 
##  Welch Two Sample t-test
## 
## data:  a and b
## t = -1.191, df = 197.95, p-value = 0.2351
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4981642  0.1230172
## sample estimates:
## mean of x mean of y 
## 0.4949820 0.6825555

The paired argument defaults to FALSE, but it’s good practice to always explicitly set it so you are never confused about what type of test you are performing.
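
For completeness, here is a minimal sketch of a paired-samples t-test with simulated pre/post scores from the same (hypothetical) participants; the names pre and post are just for illustration.

pre  <- rnorm(100, 10, 2)          # baseline scores
post <- pre + rnorm(100, 0.5, 1)   # scores increase by about 0.5 on average
t.test(pre, post, paired = TRUE)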

We can use the names() function to find out the names of all the t.test parameters and use this to just get one type of data, like the test statistic (e.g., t-value).

names(t_ind)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
t_ind$statistic
##         t 
## -1.190953

Alternatively, use broom::tidy() to convert the output into a tidy table.

broom::tidy(t_ind)
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -0.188     0.495     0.683     -1.19   0.235      198.   -0.498     0.123
## # ... with 2 more variables: method <chr>, alternative <chr>

Hypothesis-test results from simulated samples

If you want to run the simulation many times and record information each time, first you need to turn your simulation into a function.

sim_t_ind <- function(n, m1, sd1, m2, sd2) {
  # simulate v1
  v1 <- rnorm(n, m1, sd1)
  
  #simulate v2
  v2 <- rnorm(n, m2, sd2)
    
  # compare using an independent samples t-test
  t_ind <- t.test(v1, v2, paired = FALSE)
  
  # return the p-value
  return(t_ind$p.value)
}

Run it a few times to check that it gives you sensible values.

sim_t_ind(100, 0.7, 1, 0.5, 1)
## [1] 0.005898957

Estimating power

Now replicate the simulation 10,000 times.

Estimate using simulated data

my_reps <- replicate(1e4, sim_t_ind(100, 0.7, 1, 0.5, 1))

alpha <- 0.05
power <- mean(my_reps < alpha)
power
## [1] 0.2925

Run the code above several times. How much does the power value fluctuate? How many replications do you need to run to get a reliable estimate of power?
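
One way to explore this yourself (a sketch, not from the original text) is to compare power estimates from different numbers of replications; estimates based on more replications fluctuate less from run to run.

# power estimates from 100, 1000, and 10,000 replications of the same design
sapply(c(100, 1000, 10000), function(reps) {
  mean(replicate(reps, sim_t_ind(100, 0.7, 1, 0.5, 1)) < 0.05)
})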

You can plot the distribution of p-values.

ggplot() + 
  geom_histogram(
    aes(my_reps), 
    binwidth = 0.05, 
    boundary = 0,
    fill = "white", 
    color = "black"
  )

Compare your power estimate from simulation to a power calculation using power.t.test(). Here, delta is the difference between m1 and m2 above.

Approximate using a formula

power.t.test(n = 100, 
             delta = 0.2, 
             sd = 1, 
             sig.level = alpha, 
             type = "two.sample")
## 
##      Two-sample t test power calculation 
## 
##               n = 100
##           delta = 0.2
##              sd = 1
##       sig.level = 0.05
##           power = 0.2902664
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

What do you think the distribution of p-values is when there is no effect (i.e., the means are identical)? Check this yourself.

Make sure the boundary argument is set to 0 for p-value histograms. See what happens with a null effect if boundary is not set.
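
If you want to check, here is a sketch of the null-effect case (identical means); the p-values should be roughly uniformly distributed between 0 and 1.

null_reps <- replicate(1e4, sim_t_ind(100, 0.5, 1, 0.5, 1))

ggplot() + 
  geom_histogram(
    aes(null_reps), 
    binwidth = 0.05, 
    boundary = 0,
    fill = "white", 
    color = "black"
  )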

Correlation test

You can test if two continuous variables are related to each other using the cor() and cor.test() functions.

Below is one way to generate two correlated variables: a is drawn from a normal distribution, while x and y are each the sum of a and another value drawn from a random normal distribution. We'll learn later how to generate specific correlations in simulated data.

n <- 100 # number of random samples

a <- rnorm(n, 0, 1)
x <- a + rnorm(n, 0, 1)
y <- a + rnorm(n, 0, 1)

cor(x, y)
## [1] 0.49585
cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 5.6525, df = 98, p-value = 1.557e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3317414 0.6308292
## sample estimates:
##     cor 
## 0.49585

Set n to a large number like 1e6 so that the correlations are less affected by chance. Change the value of the mean for a, x, or y. Does it change the correlation between x and y? What happens when you increase or decrease the sd for a? Can you work out any rules here?
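
One rule you might notice (a sketch, assuming the errors added to x and y are independent): the correlation equals the variance of the shared component a divided by the total variance of each variable, and the means have no effect on it.

n <- 1e6
a <- rnorm(n, 0, 2)       # shared component with sd 2 (variance 4)
x <- a + rnorm(n, 0, 1)   # independent error with sd 1 (variance 1)
y <- a + rnorm(n, 0, 1)

cor(x, y)                 # should be close to 4 / (4 + 1) = 0.8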

cor() defaults to Pearson’s correlations. Set the method argument to use Kendall or Spearman correlations.

cor(x, y, method = "spearman")
## [1] 0.5013741
cor.test(x, y, method="spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 83096, p-value = 1.634e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5013741

Hypothesis-test results from simulated samples

Create a function that creates two variables with n observations and r correlation. Use the function cor.test() to give you p-values for the correlation.

sim_cor_test <- function(n = 100, r = 0) {
  dat <- rnorm_multi(
    n = n, 
    vars = 2, 
    r = r,
    varnames = c("x", "y")
  )

  ctest <- cor.test(dat$x, dat$y)
  ctest$p.value
}

Once you’ve created your function, test it a few times, changing the values.

sim_cor_test(50, .5)
## [1] 2.874814e-08

Estimating power

Now replicate the simulation 10,000 times.

Estimate using simulated data

my_reps <- replicate(1e4, sim_cor_test(50, 0.5))

alpha <- 0.05
power <- mean(my_reps < alpha)
power
## [1] 0.9682

Compare to the value calculated by the pwr package.

Approximate using a formula

pwr::pwr.r.test(n = 50, r = 0.5)
## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 50
##               r = 0.5
##       sig.level = 0.05
##           power = 0.9669813
##     alternative = two.sided

A real research example

This example uses the Growth Chart Data Tables from the US CDC. The data consist of height in centimeters for the z-scores of –2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, and 2 by sex (1=male; 2=female) and half-month of age (from 24.0 to 240.5 months).

A database of height records for American children and adolescents.

Q: If we run a survey today to confirm that American boys are taller than American girls, how many boys and how many girls do we need to recruit?

Importing and wrangling the data

We have to do a little data wrangling first. Have a look at the data after you import it and relabel Sex to male and female instead of 1 and 2. Also convert Agemos (age in months) to years. Relabel the column 0 as mean and calculate a new column named sd as the difference between columns 1 and 0.

orig_height_age <- read_csv("https://www.cdc.gov/growthcharts/data/zscore/zstatage.csv") 
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Sex = col_character(),
##   Agemos = col_character(),
##   `-2` = col_double(),
##   `-1.5` = col_double(),
##   `-1` = col_double(),
##   `-0.5` = col_double(),
##   `0` = col_double(),
##   `0.5` = col_double(),
##   `1` = col_double(),
##   `1.5` = col_double(),
##   `2` = col_double()
## )
height_age <- orig_height_age %>%
  filter(Sex %in% c(1,2)) %>%
  mutate(
    sex = recode(Sex, "1" = "male", "2" = "female"),
    age = as.numeric(Agemos)/12,
    sd = `1` - `0`
  ) %>%
  dplyr::select(sex, age, mean = `0`, sd)

If you run the code above without putting dplyr:: before the select() function, you might get an error message. This is because the MASS package also has a function called select() and, since we loaded MASS after tidyverse, the MASS function becomes the default. When you loaded MASS, you should have seen a warning like “The following object is masked from ‘package:dplyr’: select”. You can use functions with the same name from different packages by specifying the package before the function name, separated by two colons.
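
For example, you can check which attached packages define select() and call the dplyr version explicitly (a small sketch; find() is a base R helper):

find("select")   # lists every attached package that defines a select() function

# call the dplyr version explicitly, regardless of masking
height_age %>% dplyr::select(sex, age) %>% head()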

Plotting / data visualization

Plot your new data frame to see how mean height changes with age for boys and girls.

ggplot(height_age, aes(age, mean, color = sex)) +
  geom_smooth(aes(ymin = mean - sd, ymax = mean + sd), stat="identity")

Getting group means and standard deviations

(This step is not in the original e-book.)

Create new variables for the means and SDs for 20-year-old men and women.

height_sub <- height_age %>% filter(age == 20)

m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd   <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd   <- height_sub %>% filter(sex == "female") %>% pull(sd)

height_sub
## # A tibble: 2 x 4
##   sex      age  mean    sd
##   <chr>  <dbl> <dbl> <dbl>
## 1 male      20  177.  7.12
## 2 female    20  163.  6.46

Simulating data for the hypothesis test

Simulate 50 random male heights and 50 random female heights using the rnorm() function and the means and SDs above. Plot the data.

sim_height <- tibble(
  male = rnorm(50, m_mean, m_sd),
  female = rnorm(50, f_mean, f_sd)
) %>%
  gather("sex", "height", male:female)

ggplot(sim_height) +
  geom_density(aes(height, fill = sex), alpha = 0.5) +
  xlim(125, 225)

Run the simulation above several times, noting how the density plot changes. Try changing the age you’re simulating.

Analyzing the simulated data

Use the sim_t_ind(n, m1, sd1, m2, sd2) function we created above to generate one simulation with a sample size of 50 in each group using the means and SDs of male and female 14-year-olds.

height_sub <- height_age %>% filter(age == 14)
m_mean <- height_sub %>% filter(sex == "male") %>% pull(mean)
m_sd   <- height_sub %>% filter(sex == "male") %>% pull(sd)
f_mean <- height_sub %>% filter(sex == "female") %>% pull(mean)
f_sd   <- height_sub %>% filter(sex == "female") %>% pull(sd)

sim_t_ind(50, m_mean, m_sd, f_mean, f_sd)
## [1] 0.06608042

Iterating the simulation analysis

(Estimating statistical power)

Now replicate this 1e4 times using the replicate() function. This function will save the returned p-values in a list (my_reps). We can then check what proportion of those p-values are less than our alpha value. This is the power of our test.

my_reps <- replicate(1e4, sim_t_ind(50, m_mean, m_sd, f_mean, f_sd))

alpha <- 0.05
power <- mean(my_reps < alpha)
power
## [1] 0.642

Evaluating prediction accuracy

This design has about 65% power to detect the sex difference in height (with a 2-tailed test). Modify the sim_t_ind function for a 1-tailed prediction.

You could just set alternative equal to "greater" in the function, but it might be better to add an alternative argument to your function (giving it the same default value as t.test()) and pass its value on to the alternative argument inside the t.test() call.

sim_t_ind <- function(n, m1, sd1, m2, sd2, alternative = "two.sided") {
  v1 <- rnorm(n, m1, sd1)
  v2 <- rnorm(n, m2, sd2)
  t_ind <- t.test(v1, v2, paired = FALSE, alternative = alternative)
  
  return(t_ind$p.value)
}

alpha <- 0.05
my_reps <- replicate(1e4, sim_t_ind(50, m_mean, m_sd, f_mean, f_sd, "greater"))
mean(my_reps < alpha)
## [1] 0.7593

Plotting the sample size-power curve

What if we want to find out what sample size will give us 80% power? We can try trial and error. We know the number should be slightly larger than 50. But you can search more systematically by repeating your power calculation for a range of sample sizes.

This might seem like overkill for a t-test, where you can easily look up sample size calculators online, but it is a valuable skill to learn for when your analyses become more complicated.

Start with a relatively low number of replications and/or more spread-out samples to estimate where you should be looking more specifically. Then you can repeat with a narrower/denser range of sample sizes and more iterations.

alpha <- 0.05
power_table <- tibble(
  n = seq(20, 100, by = 5)
) %>%
  mutate(power = map_dbl(n, function(n) {
    ps <- replicate(1e3, sim_t_ind(n, m_mean, m_sd, f_mean, f_sd, "greater"))
    mean(ps < alpha)
  }))

ggplot(power_table, aes(n, power)) +
  geom_smooth() +
  geom_point() +
  geom_hline(yintercept = 0.8)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Now we can narrow down our search to values around 55 (plus or minus 5) and increase the number of replications from 1e3 to 1e4.

power_table <- tibble(
  n = seq(50, 60)
) %>%
  mutate(power = map_dbl(n, function(n) {
    ps <- replicate(1e4, sim_t_ind(n, m_mean, m_sd, f_mean, f_sd, "greater"))
    mean(ps < alpha)
  }))

##ggplot(power_table, aes(n, power)) +
##  geom_smooth() +
##  geom_point() +
##  geom_hline(yintercept = 0.8) +
##  scale_x_continuous(breaks = power_table$n)

knitr::kable(power_table)
n power
50 0.773
51 0.781
52 0.749
53 0.798
54 0.787
55 0.797
56 0.812
57 0.809
58 0.808
59 0.827
60 0.823