要實施分析計畫之前，必須先整頓為要分析的單位。整頓包括篩選有需要的變數或觀察值，以及進行繪圖或重要分析之前的加工。

學習目標

dplyr總和變數的函式系列
dplyr變數加工的函式系列
dplyr新造變數的函式系列
運用字串、日期等字符資料整頓資料表單
推薦秘訣表：Data Import, Data Transformation, Regular Expressions, Dates and Times

預覽示範資料 `disgust`

These examples will use data from disgust.csv, which contains data from the Three Domain Disgust Scale. Each participant is identified by a unique user_id and each questionnaire completion has a unique id.

disgust <- read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust.csv")

glimpse(disgust)

## Rows: 20,000
## Columns: 24
## $ id        <dbl> 1199, 1, 1599, 13332, 23, 1160, 7980, 552, 37829, 6902, 6158~
## $ user_id   <dbl> 0, 1, 2, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 6066, 609~
## $ date      <date> 2008-10-07, 2008-07-10, 2008-10-27, 2012-01-02, 2008-07-15,~
## $ moral1    <dbl> 5, 2, 1, 0, 4, 1, 3, 2, 6, 0, 4, 1, 1, 4, 2, 1, NA, 3, 1, 3,~
## $ moral2    <dbl> 6, 2, 1, 1, 4, 5, 4, 4, 6, 1, 5, 2, 4, 4, 5, 3, NA, 5, 2, 3,~
## $ moral3    <dbl> 4, 1, 1, 1, 4, NA, 3, 3, 6, 3, 6, 2, 3, 4, 4, 2, NA, 4, 3, 3~
## $ moral4    <dbl> 6, 2, 1, 1, 4, 5, 4, 5, 0, 4, 5, 2, 4, 4, 6, 4, NA, 5, 2, 3,~
## $ moral5    <dbl> 5, 1, NA, 1, 4, 5, 4, 5, 6, 1, 5, 1, 3, 3, 5, 5, NA, 3, 2, 3~
## $ moral6    <dbl> 5, 1, NA, 2, 4, 5, 3, 5, 0, 0, 4, 2, 4, 3, 4, 4, NA, 4, 3, 0~
## $ moral7    <dbl> 6, 1, 1, 1, 4, 1, 3, 3, 0, 1, 4, 1, 3, 2, 5, 4, NA, 3, 4, 3,~
## $ sexual1   <dbl> 4, 3, 1, 4, 2, 0, 2, 4, 0, 3, 3, 1, 1, 4, 2, 0, NA, 1, 3, 2,~
## $ sexual2   <dbl> 0, 1, NA, 3, 1, 5, 4, 2, 0, 5, 0, 0, 0, 2, 1, 0, NA, 3, 0, 0~
## $ sexual3   <dbl> 1, 1, 1, 0, 2, 0, 3, 4, 0, 2, 1, 0, 0, 1, 0, 0, NA, 3, 0, 6,~
## $ sexual4   <dbl> 0, 2, NA, 6, 1, 2, 2, 6, 0, 4, 6, 3, 1, 3, 2, 0, NA, 5, 2, 6~
## $ sexual5   <dbl> 1, 1, 1, 0, 1, 0, 1, 6, 0, 6, 3, 0, 0, 2, 1, 0, NA, 5, 0, 2,~
## $ sexual6   <dbl> 4, 2, NA, 3, 1, 1, 5, 6, 0, 6, 5, 4, 2, 3, 4, 0, NA, 3, 3, 6~
## $ sexual7   <dbl> 5, 2, NA, 5, 5, 0, 4, 2, 0, 5, 3, 4, 1, 6, 3, 0, NA, 3, 5, 3~
## $ pathogen1 <dbl> 6, 3, NA, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA, 4, 4, 3~
## $ pathogen2 <dbl> 1, 2, NA, 6, 5, 3, 4, 6, 6, 2, 5, 3, 4, 2, 3, 4, NA, 5, 4, 2~
## $ pathogen3 <dbl> 6, 3, 1, 4, 4, 1, 4, 6, 6, 4, 5, 1, 3, 0, 2, 3, NA, 3, 3, 5,~
## $ pathogen4 <dbl> 5, 3, NA, 6, 4, 1, 3, 4, 6, 4, 5, 1, 4, 4, 4, 1, NA, 5, 4, 0~
## $ pathogen5 <dbl> 4, 2, NA, 5, 5, 3, 3, 6, 0, 2, 5, 4, 5, 5, 5, 4, NA, 4, 4, 5~
## $ pathogen6 <dbl> 5, 3, NA, 5, 4, 1, 2, 1, 0, 2, 5, 1, 4, 5, 1, 2, NA, 5, 5, 5~
## $ pathogen7 <dbl> 6, 3, NA, 4, 3, 0, 3, 6, 6, 6, 5, 3, 5, 4, 5, 3, NA, 5, 2, 6~

Questionnaire Instructions: The following items describe a variety of concepts. Please rate how disgusting you find the concepts described in the items, where 0 means that you do not find the concept disgusting at all, and 6 means that you find the concept extremely disgusting.

colname	question
moral1	Shoplifting a candy bar from a convenience store
moral2	Stealing from a neighbor
moral3	A student cheating to get good grades
moral4	Deceiving a friend
moral5	Forging someone’s signature on a legal document
moral6	Cutting to the front of a line to purchase the last few tickets to a show
moral7	Intentionally lying during a business transaction
sexual1	Hearing two strangers having sex
sexual2	Performing oral sex
sexual3	Watching a pornographic video
sexual4	Finding out that someone you don’t like has sexual fantasies about you
sexual5	Bringing someone you just met back to your room to have sex
sexual6	A stranger of the opposite sex intentionally rubbing your thigh in an elevator
sexual7	Having anal sex with someone of the opposite sex
pathogen1	Stepping on dog poop
pathogen2	Sitting next to someone who has red sores on their arm
pathogen3	Shaking hands with a stranger who has sweaty palms
pathogen4	Seeing some mold on old leftovers in your refrigerator
pathogen5	Standing close to a person who has body odor
pathogen6	Seeing a cockroach run across the floor
pathogen7	Accidentally touching a person’s bloody cut

`dplyr`總和變數的函式系列

Most of the data wrangling you’ll want to do with psychological data will involve the tidyr verbs you learned in Chapter 3 and the six main dplyr verbs: select, filter, arrange, mutate, summarise, and group_by.

select()

選出需要變數、建立小型資料集合

Select columns by name or number.

You can select each column individually, separated by commas (e.g., col1, col2). You can also select all columns between two columns by separating them with a colon (e.g., start_col:end_col).

moral <- disgust %>% select(user_id, moral1:moral7)
names(moral)

## [1] "user_id" "moral1"  "moral2"  "moral3"  "moral4"  "moral5"  "moral6" 
## [8] "moral7"

You can select columns by number, which is useful when the column names are long or complicated.

sexual <- disgust %>% select(2, 11:17)
names(sexual)

## [1] "user_id" "sexual1" "sexual2" "sexual3" "sexual4" "sexual5" "sexual6"
## [8] "sexual7"

You can use a minus symbol to unselect columns, leaving all of the other columns. If you want to exclude a span of columns, put parentheses around the span first (e.g., -(moral1:moral7), not -moral1:moral7).

pathogen <- disgust %>% select(-id, -date, -(moral1:sexual7))
names(pathogen)

## [1] "user_id"   "pathogen1" "pathogen2" "pathogen3" "pathogen4" "pathogen5"
## [7] "pathogen6" "pathogen7"

You can select columns based on criteria about the column names.{#select_helpers}

select()的好用參數

`starts_with()`

Select columns that start with a character string.

u <- disgust %>% select(starts_with("u"))
names(u)

## [1] "user_id"

`ends_with()`

Select columns that end with a character string.

firstq <- disgust %>% select(ends_with("7"))
names(firstq)

## [1] "moral7"    "sexual7"   "pathogen7"

`contains()`

Select columns that contain a character string.

pathogen <- disgust %>% select(contains("pathogen"))
names(pathogen)

## [1] "pathogen1" "pathogen2" "pathogen3" "pathogen4" "pathogen5" "pathogen6"
## [7] "pathogen7"

`num_range()`

Select columns with a name that matches the pattern prefix.

moral2_4 <- disgust %>% select(num_range("moral", 2:4))
names(moral2_4)

## [1] "moral2" "moral3" "moral4"

Use width to set the number of digits with leading zeros. For example, num_range(‘var_’, 8:10, width=2) selects columns var_08, var_09, and var_10.

filter()

選出符合需要的觀察值、建立小型資料集合

Select rows by matching column criteria.

Select all rows where the user_id is 1 (that’s Lisa).

disgust %>% filter(user_id == 1)

## # A tibble: 1 x 24
##      id user_id date       moral1 moral2 moral3 moral4 moral5 moral6 moral7
##   <dbl>   <dbl> <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1       1 2008-07-10      2      2      1      2      1      1      1
## # ... with 14 more variables: sexual1 <dbl>, sexual2 <dbl>, sexual3 <dbl>,
## #   sexual4 <dbl>, sexual5 <dbl>, sexual6 <dbl>, sexual7 <dbl>,
## #   pathogen1 <dbl>, pathogen2 <dbl>, pathogen3 <dbl>, pathogen4 <dbl>,
## #   pathogen5 <dbl>, pathogen6 <dbl>, pathogen7 <dbl>

Remember to use == and not = to check if two things are equivalent. A single = assigns the righthand value to the lefthand variable and (usually) evaluates to TRUE.

You can select on multiple criteria by separating them with commas.

多重條件篩選

amoral <- disgust %>% filter(
  moral1 == 0, 
  moral2 == 0,
  moral3 == 0, 
  moral4 == 0,
  moral5 == 0,
  moral6 == 0,
  moral7 == 0
)

You can use the symbols &, |, and ! to mean “and”, “or”, and “not”. You can also use other operators to make equations.

# everyone who chose either 0 or 7 for question moral1
moral_extremes <- disgust %>% 
  filter(moral1 == 0 | moral1 == 7)

# everyone who chose the same answer for all moral questions
moral_consistent <- disgust %>% 
  filter(
    moral2 == moral1 & 
      moral3 == moral1 & 
      moral4 == moral1 &
      moral5 == moral1 &
      moral6 == moral1 &
      moral7 == moral1
  )


# everyone who did not answer 7 for all 7 moral questions
moral_no_ceiling <- disgust %>%
  filter(moral1+moral2+moral3+moral4+moral5+moral6+moral7 != 7*7)

匹配運算元

Sometimes you need to exclude some participant IDs for reasons that can’t be described in code. the %in% operator is useful here for testing if a column value is in a list. Surround the equation with parentheses and put ! in front to test that a value is not in the list.

運用匹配運算元 %in% 設定篩選條件

length(unique(disgust$user_id))

## [1] 20000

no_researchers <- disgust %>%
  filter(!(user_id %in% c(1,2)))

glimpse(no_researchers)

## Rows: 19,998
## Columns: 24
## $ id        <dbl> 1199, 13332, 23, 1160, 7980, 552, 37829, 6902, 6158, 4850, 1~
## $ user_id   <dbl> 0, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 6066, 6093, 617~
## $ date      <date> 2008-10-07, 2012-01-02, 2008-07-15, 2008-10-06, 2011-09-05,~
## $ moral1    <dbl> 5, 0, 4, 1, 3, 2, 6, 0, 4, 1, 1, 4, 2, 1, NA, 3, 1, 3, 0, 1,~
## $ moral2    <dbl> 6, 1, 4, 5, 4, 4, 6, 1, 5, 2, 4, 4, 5, 3, NA, 5, 2, 3, 0, 5,~
## $ moral3    <dbl> 4, 1, 4, NA, 3, 3, 6, 3, 6, 2, 3, 4, 4, 2, NA, 4, 3, 3, 0, 4~
## $ moral4    <dbl> 6, 1, 4, 5, 4, 5, 0, 4, 5, 2, 4, 4, 6, 4, NA, 5, 2, 3, 4, 4,~
## $ moral5    <dbl> 5, 1, 4, 5, 4, 5, 6, 1, 5, 1, 3, 3, 5, 5, NA, 3, 2, 3, 0, 5,~
## $ moral6    <dbl> 5, 2, 4, 5, 3, 5, 0, 0, 4, 2, 4, 3, 4, 4, NA, 4, 3, 0, 0, 3,~
## $ moral7    <dbl> 6, 1, 4, 1, 3, 3, 0, 1, 4, 1, 3, 2, 5, 4, NA, 3, 4, 3, 0, 4,~
## $ sexual1   <dbl> 4, 4, 2, 0, 2, 4, 0, 3, 3, 1, 1, 4, 2, 0, NA, 1, 3, 2, 0, 0,~
## $ sexual2   <dbl> 0, 3, 1, 5, 4, 2, 0, 5, 0, 0, 0, 2, 1, 0, NA, 3, 0, 0, 0, 0,~
## $ sexual3   <dbl> 1, 0, 2, 0, 3, 4, 0, 2, 1, 0, 0, 1, 0, 0, NA, 3, 0, 6, 0, 0,~
## $ sexual4   <dbl> 0, 6, 1, 2, 2, 6, 0, 4, 6, 3, 1, 3, 2, 0, NA, 5, 2, 6, 0, 5,~
## $ sexual5   <dbl> 1, 0, 1, 0, 1, 6, 0, 6, 3, 0, 0, 2, 1, 0, NA, 5, 0, 2, 0, 0,~
## $ sexual6   <dbl> 4, 3, 1, 1, 5, 6, 0, 6, 5, 4, 2, 3, 4, 0, NA, 3, 3, 6, 0, 5,~
## $ sexual7   <dbl> 5, 5, 5, 0, 4, 2, 0, 5, 3, 4, 1, 6, 3, 0, NA, 3, 5, 3, 0, 1,~
## $ pathogen1 <dbl> 6, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA, 4, 4, 3, 5, 4,~
## $ pathogen2 <dbl> 1, 6, 5, 3, 4, 6, 6, 2, 5, 3, 4, 2, 3, 4, NA, 5, 4, 2, 5, 2,~
## $ pathogen3 <dbl> 6, 4, 4, 1, 4, 6, 6, 4, 5, 1, 3, 0, 2, 3, NA, 3, 3, 5, 1, 2,~
## $ pathogen4 <dbl> 5, 6, 4, 1, 3, 4, 6, 4, 5, 1, 4, 4, 4, 1, NA, 5, 4, 0, 0, 1,~
## $ pathogen5 <dbl> 4, 5, 5, 3, 3, 6, 0, 2, 5, 4, 5, 5, 5, 4, NA, 4, 4, 5, 4, 3,~
## $ pathogen6 <dbl> 5, 5, 4, 1, 2, 1, 0, 2, 5, 1, 4, 5, 1, 2, NA, 5, 5, 5, 1, 1,~
## $ pathogen7 <dbl> 6, 4, 3, 0, 3, 6, 6, 6, 5, 3, 5, 4, 5, 3, NA, 5, 2, 6, 4, 1,~

Dates

資料數值有日期時間時，使用lubridate的函式能事半功倍

You can use the lubridate package to work with dates. For example, you can use the year() function to return just the year from the date column and then select only data collected in 2010.

disgust2010 <- disgust  %>%
  filter(year(date) == 2010)

glimpse(disgust2010)

## Rows: 1,514
## Columns: 24
## $ id        <dbl> 6902, 6158, 6362, 6302, 5429, 6732, 6367, 6476, 5778, 6181, ~
## $ user_id   <dbl> 5469, 6066, 7129, 39318, 43029, 71955, 84622, 93120, 96537, ~
## $ date      <date> 2010-12-06, 2010-04-18, 2010-06-09, 2010-05-20, 2010-01-02,~
## $ moral1    <dbl> 0, 4, 4, 2, 1, 2, 4, 3, 5, 0, 1, 3, 3, 4, 1, 0, 3, 3, 6, 1, ~
## $ moral2    <dbl> 1, 5, 4, 4, 1, 5, 6, 6, 5, 6, 1, 6, 4, 4, 3, 2, 5, 4, 6, 3, ~
## $ moral3    <dbl> 3, 6, 4, 1, 1, 3, 6, 4, 3, 4, 1, 5, 2, 2, 2, 0, 3, 1, 6, 4, ~
## $ moral4    <dbl> 4, 5, 4, 4, 3, 6, 6, 6, 4, 6, NA, 6, 4, 4, 3, 1, 4, 3, 6, 4,~
## $ moral5    <dbl> 1, 5, 3, 5, 6, 3, 6, 5, 5, 0, 1, 5, 4, 4, 3, 0, 4, 3, 6, 1, ~
## $ moral6    <dbl> 0, 4, 3, 6, 4, 2, 6, 3, 5, 6, 2, 3, 4, 0, 5, 4, 4, 0, 5, 3, ~
## $ moral7    <dbl> 1, 4, 2, 0, 2, 5, 6, 4, 5, 6, 1, 6, 4, 4, 4, 1, 3, 3, 6, 2, ~
## $ sexual1   <dbl> 3, 3, 4, 1, 2, 4, 2, 5, 0, 0, 2, 3, 4, 1, 0, 0, 1, 3, 0, 1, ~
## $ sexual2   <dbl> 5, 0, 2, 0, 0, 3, 0, 4, 0, 0, 0, 5, 6, 0, 1, 0, 0, 1, 0, 4, ~
## $ sexual3   <dbl> 2, 1, 1, 0, 1, 3, 0, 5, 0, 0, 2, 0, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ sexual4   <dbl> 4, 6, 3, 1, 4, 6, 0, 5, 3, 0, 4, 6, 5, 3, 0, 4, 3, 5, 6, 3, ~
## $ sexual5   <dbl> 6, 3, 2, 0, 6, 6, 1, 5, 0, 0, 2, 6, 4, 0, 0, 5, 0, 5, 6, 4, ~
## $ sexual6   <dbl> 6, 5, 3, 0, 6, 6, 0, 4, 4, 0, 5, 6, 4, 1, 0, 5, 5, 5, 6, 5, ~
## $ sexual7   <dbl> 5, 3, 6, 1, 6, 5, 0, 3, 3, 6, 3, 6, 5, 5, 2, 5, 1, 6, 6, 5, ~
## $ pathogen1 <dbl> 5, 6, 5, 3, 4, 4, 6, 5, 6, 4, 4, 5, 6, 5, 2, 4, 2, 4, 6, NA,~
## $ pathogen2 <dbl> 2, 5, 2, 2, 6, 2, 5, 6, 0, 4, 5, 1, 3, 1, 5, 1, 1, 3, 5, 3, ~
## $ pathogen3 <dbl> 4, 5, 0, 3, 6, 6, 6, 4, 1, 0, 4, 4, 1, 4, 1, 4, 1, 3, 5, 2, ~
## $ pathogen4 <dbl> 4, 5, 4, 2, 6, 5, 2, 5, 4, 6, 4, 3, 3, 4, 4, 0, 1, 2, 6, 3, ~
## $ pathogen5 <dbl> 2, 5, 5, 3, 6, 6, 6, 6, 5, 6, 4, 5, 1, 4, 0, 4, 4, 4, 5, 4, ~
## $ pathogen6 <dbl> 2, 5, 5, 2, 6, 6, 5, 2, 1, 6, 4, 4, 6, 0, 1, 0, 3, 2, 6, 0, ~
## $ pathogen7 <dbl> 6, 5, 4, 4, 4, 3, 6, 6, 2, 6, 5, 6, 0, 4, 2, 0, 3, 6, 6, 4, ~

Or select data from at least 5 years ago. You can use the range function to check the minimum and maxiumum dates in the resulting dataset.

disgust_5ago <- disgust %>%
  filter(date < today() - dyears(5))

range(disgust_5ago$date)

## [1] "2008-07-10" "2016-03-22"

arrange()

資料按照指定變數數值重新排序

Sort your dataset using arrange().

disgust_order <- disgust %>%
  arrange(id)

head(disgust_order)

## # A tibble: 6 x 24
##      id user_id date       moral1 moral2 moral3 moral4 moral5 moral6 moral7
##   <dbl>   <dbl> <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1       1 2008-07-10      2      2      1      2      1      1      1
## 2     3  155324 2008-07-11      2      4      3      5      2      1      4
## 3     4  155366 2008-07-12      6      6      6      3      6      6      6
## 4     5  155370 2008-07-12      6      6      4      6      6      6      6
## 5     6  155386 2008-07-12      2      4      0      4      0      0      0
## 6     7  155409 2008-07-12      4      5      5      4      5      1      5
## # ... with 14 more variables: sexual1 <dbl>, sexual2 <dbl>, sexual3 <dbl>,
## #   sexual4 <dbl>, sexual5 <dbl>, sexual6 <dbl>, sexual7 <dbl>,
## #   pathogen1 <dbl>, pathogen2 <dbl>, pathogen3 <dbl>, pathogen4 <dbl>,
## #   pathogen5 <dbl>, pathogen6 <dbl>, pathogen7 <dbl>

Reverse the order using desc()

disgust_order <- disgust %>%
  arrange(desc(id))

head(disgust_order)

## # A tibble: 6 x 24
##      id user_id date       moral1 moral2 moral3 moral4 moral5 moral6 moral7
##   <dbl>   <dbl> <date>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 39456  356866 2017-08-21      1      1      1      1      1      1      1
## 2 39447  128727 2017-08-13      2      4      1      2      2      5      3
## 3 39371  152955 2017-06-13      6      6      3      6      6      6      6
## 4 39342   48303 2017-05-22      4      5      4      4      6      4      5
## 5 39159  151633 2017-04-04      4      5      6      5      3      6      2
## 6 38942  370464 2017-02-01      1      5      0      6      5      5      5
## # ... with 14 more variables: sexual1 <dbl>, sexual2 <dbl>, sexual3 <dbl>,
## #   sexual4 <dbl>, sexual5 <dbl>, sexual6 <dbl>, sexual7 <dbl>,
## #   pathogen1 <dbl>, pathogen2 <dbl>, pathogen3 <dbl>, pathogen4 <dbl>,
## #   pathogen5 <dbl>, pathogen6 <dbl>, pathogen7 <dbl>

mutate()

以原變數數值為內容重新組合計算，造出分析需要的新變數。

Add new columns. This is one of the most useful functions in the tidyverse.

Refer to other columns by their names (unquoted). You can add more than one column, just separate the columns with a comma. Once you make a new column, you can use it in further column definitions e.g., total below).

disgust_total <- disgust %>%
  mutate(
    pathogen = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7,
    moral = moral1 + moral2 + moral3 + moral4 + moral5 + moral6 + moral7,
    sexual = sexual1 + sexual2 + sexual3 + sexual4 + sexual5 + sexual6 + sexual7,
    total = pathogen + moral + sexual,
    user_id = paste0("U", user_id)
  )

You can overwrite a column by giving a new column the same name as the old column. Make sure that you mean to do this and that you aren’t trying to use the old column value after you redefine it.

summarise()

指定變數進行需要的描述統計，輸出到新的資料集合。

Create summary statistics for the dataset. Check the Data Wrangling Cheat Sheet or the Data Transformation Cheat Sheet for various summary functions. Some common ones are: mean(), sd(), n(), sum(), and quantile().

disgust_total %>%
  summarise(
    n = n(),
    q25 = quantile(total, .25, na.rm = TRUE),
    q50 = quantile(total, .50, na.rm = TRUE),
    q75 = quantile(total, .75, na.rm = TRUE),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )

## # A tibble: 1 x 8
##       n   q25   q50   q75 avg_total sd_total min_total max_total
##   <int> <dbl> <dbl> <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
## 1 20000    59    71    83      70.7     18.2         0       126

group_by()

指定變數為分組檢索值。

Create subsets of the data. You can use this to create summaries, like the mean value for all of your experimental groups.

Here, we’ll use mutate to create a new column called year, group by year, and calculate the average scores.

disgust_total %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )

## # A tibble: 10 x 6
##     year     n avg_total sd_total min_total max_total
##    <dbl> <int>     <dbl>    <dbl>     <dbl>     <dbl>
##  1  2008  2578      70.3     18.5         0       126
##  2  2009  2580      69.7     18.6         3       126
##  3  2010  1514      70.6     18.9         6       126
##  4  2011  6046      71.3     17.8         0       126
##  5  2012  5938      70.4     18.4         0       126
##  6  2013  1251      71.6     17.6         0       126
##  7  2014    58      70.5     17.2        19       113
##  8  2015    21      74.3     16.9        43       107
##  9  2016     8      67.9     32.6         0       110
## 10  2017     6      57.2     27.9        21        90

You can use filter after group_by. The following example returns the lowest total score from each year.

disgust_total %>%
  mutate(year = year(date)) %>%
  select(user_id, year, total) %>%
  group_by(year) %>%
  filter(rank(total) == 1) %>%
  arrange(year)

## # A tibble: 7 x 3
## # Groups:   year [7]
##   user_id  year total
##   <chr>   <dbl> <dbl>
## 1 U236585  2009     3
## 2 U292359  2010     6
## 3 U245384  2013     0
## 4 U206293  2014    19
## 5 U407089  2015    43
## 6 U453237  2016     0
## 7 U356866  2017    21

You can also use mutate after group_by. The following example calculates subject-mean-centered scores by grouping the scores by user_id and then subtracting the group-specific mean from each score. Note the use of gather to tidy the data into a long format first.

disgust_smc <- disgust %>%
  gather("question", "score", moral1:pathogen7) %>%
  group_by(user_id) %>%
  mutate(score_smc = score - mean(score, na.rm = TRUE))

All Together

在此示範清理到整頓資料一次完成。

A lot of what we did above would be easier if the data were tidy, so let’s do that first. Then we can use group_by to calculate the domain scores.

It is good practice to use ungroup() after using group_by and summarise. Forgetting to ungroup the dataset won’t affect some further processing, but can really mess up other things.

Then we can spread out the 3 domains, calculate the total score, remove any rows with a missing (NA) total, and calculate mean values by year.

## 計算每位參與者在各向度的平均評分
disgust_tidy <- read_csv("../data/disgust.csv") %>%
  gather("question", "score", moral1:pathogen7) %>%
  separate(question, c("domain","q_num"), sep = -1) %>%
  group_by(id, user_id, date, domain) %>%
  summarise(score = mean(score)) %>%
  ungroup()

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   date = col_date(format = "")
## )
## i<U+00A0>Use `spec()` for the full column specifications.

## `summarise()` has grouped output by 'id', 'user_id', 'date'. You can override using the `.groups` argument.

head(disgust_tidy)

## # A tibble: 6 x 5
##      id user_id date       domain   score
##   <dbl>   <dbl> <date>     <chr>    <dbl>
## 1     1       1 2008-07-10 moral     1.43
## 2     1       1 2008-07-10 pathogen  2.71
## 3     1       1 2008-07-10 sexual    1.71
## 4     3  155324 2008-07-11 moral     3   
## 5     3  155324 2008-07-11 pathogen  2.57
## 6     3  155324 2008-07-11 sexual    1.86

disgust_tidy2 <- disgust_tidy %>%
  spread(domain, score) %>%
  mutate(
    total = moral + sexual + pathogen, ## 計算三個向度的平均總分
    year = year(date)   ## 製造新變數year儲存年份資訊
  ) %>%
  filter(!is.na(total)) %>%  ## 排除沒有答完所有題目的參與者
  arrange(user_id) 

head(disgust_tidy2)

## # A tibble: 6 x 8
##      id user_id date       moral pathogen sexual total  year
##   <dbl>   <dbl> <date>     <dbl>    <dbl>  <dbl> <dbl> <dbl>
## 1  1199       0 2008-10-07  5.29     4.71   2.14 12.1   2008
## 2     1       1 2008-07-10  1.43     2.71   1.71  5.86  2008
## 3 13332    2118 2012-01-02  1        5      3     9     2012
## 4    23    2311 2008-07-15  4        4.29   1.86 10.1   2008
## 5  7980    4458 2011-09-05  3.43     3.57   3    10     2011
## 6   552    4651 2008-08-23  3.86     4.86   4.29 13     2008

disgust_tidy3 <- disgust_tidy2 %>%
  group_by(year) %>%       ## 依年分計算人次、各向度平均評分、第一位及最後一位參與者id
  summarise(
    n = n(),
    avg_pathogen = mean(pathogen),
    avg_moral = mean(moral),
    avg_sexual = mean(sexual),
    first_user = first(user_id),
    last_user = last(user_id)
  )

head(disgust_tidy3)

## # A tibble: 6 x 7
##    year     n avg_pathogen avg_moral avg_sexual first_user last_user
##   <dbl> <int>        <dbl>     <dbl>      <dbl>      <dbl>     <dbl>
## 1  2008  2392         3.70      3.81       2.54          0    188708
## 2  2009  2410         3.67      3.76       2.53       6093    251959
## 3  2010  1418         3.73      3.84       2.51       5469    319641
## 4  2011  5586         3.76      3.81       2.63       4458    406569
## 5  2012  5375         3.74      3.77       2.55       2118    458194
## 6  2013  1222         3.77      3.91       2.55       7646    462428

`dplyr`變數加工的函式系列

Use the code examples below and the help pages to figure out what the following one-table verbs do. Most have pretty self-explanatory names.

rename()

為變數改名字

iris_underscore <- iris %>%
  rename(sepal_length = Sepal.Length,
         sepal_width = Sepal.Width,
         petal_length = Petal.Length,
         petal_width = Petal.Width)

names(iris_underscore)

## [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "Species"

Almost everyone gets confused at some point with rename() and tries to put the original names on the left and the new names on the right. Try it and see what the error message looks like.

distinct()

移除重覆的觀察值

# create a data table with duplicated values
dupes <- tibble(
  id = rep(1:5, 2),
  dv = rep(LETTERS[1:5], 2)
)

distinct(dupes)

## # A tibble: 5 x 2
##      id dv   
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C    
## 4     4 D    
## 5     5 E

count()

依指定變數，計算觀察值個數

# how many observations from each species are in iris?
count(iris, Species)

##      Species  n
## 1     setosa 50
## 2 versicolor 50
## 3  virginica 50

slice()

移除指定列的觀察值

tibble(
  id = 1:10,
  condition = rep(c("A","B"), 5)
) %>%
  slice(3:6, 9)

## # A tibble: 5 x 2
##      id condition
##   <int> <chr>    
## 1     3 A        
## 2     4 B        
## 3     5 A        
## 4     6 B        
## 5     9 A

pull()

抽出指定的變數

iris %>%
  group_by(Species) %>%
  summarise_all(mean) %>%
  pull(Sepal.Length)

## [1] 5.006 5.936 6.588

`dplyr`新造變數的函式系列

必須置於總和變數函式的參數，才能發揮作用。

Window functions use the order of rows to calculate values. You can use them to do things that require ranking or ordering, like choose the top scores in each class, or acessing the previous and next rows, like calculating cumulative sums or means.

The dplyr window functions vignette has very good detailed explanations of these functions, but we’ve described a few of the most useful ones below.

Ranking functions

rank系列函數計算指定變數內數值的排序，輸出到指定新變數。

tibble(
  id = 1:5,
  "Data Skills" = c(16, 17, 17, 19, 20), 
  "Statistics"  = c(14, 16, 18, 18, 19)
) %>%
  gather(class, grade, 2:3) %>%
  group_by(class) %>%
  mutate(row_number = row_number(),
         rank       = rank(grade),
         min_rank   = min_rank(grade),
         dense_rank = dense_rank(grade),
         quartile   = ntile(grade, 4),
         percentile = ntile(grade, 100))

## # A tibble: 10 x 9
## # Groups:   class [2]
##       id class    grade row_number  rank min_rank dense_rank quartile percentile
##    <int> <chr>    <dbl>      <int> <dbl>    <int>      <int>    <int>      <int>
##  1     1 Data Sk~    16          1   1          1          1        1          1
##  2     2 Data Sk~    17          2   2.5        2          2        1          2
##  3     3 Data Sk~    17          3   2.5        2          2        2          3
##  4     4 Data Sk~    19          4   4          4          3        3          4
##  5     5 Data Sk~    20          5   5          5          4        4          5
##  6     1 Statist~    14          1   1          1          1        1          1
##  7     2 Statist~    16          2   2          2          2        1          2
##  8     3 Statist~    18          3   3.5        3          3        2          3
##  9     4 Statist~    18          4   3.5        3          3        3          4
## 10     5 Statist~    19          5   5          5          4        4          5

What are the differences among row_number(), rank(), min_rank(), dense_rank(), and ntile()?
Why doesn’t row_number() need an argument?
What would happen if you gave it the argument grade or class?
What do you think would happen if you removed the group_by(class) line above?
What if you added id to the grouping?
What happens if you change the order of the rows?
What does the second argument in ntile() do?

You can use window functions to group your data into quantiles.

使用rank系列函數分組，再做其他整頓工作。

iris %>%
  group_by(tertile = ntile(Sepal.Length, 3)) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length))

## # A tibble: 3 x 2
##   tertile mean.Sepal.Length
##     <int>             <dbl>
## 1       1              4.94
## 2       2              5.81
## 3       3              6.78

Offset functions

依指定變數內數值的登錄順序，紀錄前x位(lag())或後x位(lead())的數值，輸出到新變數

lag_lead <- tibble(x = 1:6) %>%
  mutate(lag = lag(x),
         lag2 = lag(x, n = 2),
         lead = lead(x, default = 0))
head(lag_lead)

## # A tibble: 6 x 4
##       x   lag  lag2  lead
##   <int> <int> <int> <dbl>
## 1     1    NA    NA     2
## 2     2     1    NA     3
## 3     3     2     1     4
## 4     4     3     2     5
## 5     5     4     3     6
## 6     6     5     4     0

trials <- tibble(
  trial = sample(1:10, 10),
  cond = sample(c("exp", "ctrl"), 10, T),
  score = rpois(10, 4)
) %>%
  mutate(
    score_change = score - lag(score, order_by = trial),
    change_cond = cond != lag(cond, order_by = trial, 
                              default = "no condition")
  ) %>%
  arrange(trial)

Look at the help pages for lag() and lead().

What happens if you remove the order_by argument or change it to cond?
What does the default argument do?
Can you think of circumstances in your own data where you might need to use lag() or lead()?

Cumulative aggregates

依指定變數內數值的登錄順序，逐項計算並輸出到新變數。

cumsum(), cummin(), and cummax() are base R functions for calcumaling cumulative means, minimums, and maximums. The dplyr package introduces cumany() and cumall(), which return TRUE if any or all of the previous values meet their criteria.

tibble(
  time = 1:10,
  obs = c(1, 0, 1, 2, 4, 3, 1, 0, 3, 5)
) %>%
  mutate(
    cumsum = cumsum(obs),
    cummin = cummin(obs),
    cummax = cummax(obs),
    cumany = cumany(obs == 3),
    cumall = cumall(obs < 4)
  )

## # A tibble: 10 x 7
##     time   obs cumsum cummin cummax cumany cumall
##    <int> <dbl>  <dbl>  <dbl>  <dbl> <lgl>  <lgl> 
##  1     1     1      1      1      1 FALSE  TRUE  
##  2     2     0      1      0      1 FALSE  TRUE  
##  3     3     1      2      0      1 FALSE  TRUE  
##  4     4     2      4      0      2 FALSE  TRUE  
##  5     5     4      8      0      4 FALSE  FALSE 
##  6     6     3     11      0      4 TRUE   FALSE 
##  7     7     1     12      0      4 TRUE   FALSE 
##  8     8     0     12      0      4 TRUE   FALSE 
##  9     9     3     15      0      4 TRUE   FALSE 
## 10    10     5     20      0      5 TRUE   FALSE

What would happen if you change cumany(obs == 3) to cumany(obs > 2)?
What would happen if you change cumall(obs < 4) to cumall(obs < 2)?
Can you think of circumstances in your own data where you might need to use cumany() or cumall()?

整頓資料

陳紹慶

2021/4/28

學習目標

預覽示範資料 `disgust`

`dplyr`總和變數的函式系列

select()

`starts_with()`

`ends_with()`

`contains()`

`num_range()`

filter()

匹配運算元

Dates

arrange()

mutate()

summarise()

group_by()

All Together

`dplyr`變數加工的函式系列

rename()

distinct()

count()

slice()

pull()

`dplyr`新造變數的函式系列

Ranking functions

Offset functions

Cumulative aggregates

整頓資料

陳紹慶

2021/4/28

學習目標

預覽示範資料 disgust

dplyr總和變數的函式系列

select()

starts_with()

ends_with()

contains()

num_range()

filter()

匹配運算元

Dates

arrange()

mutate()

summarise()

group_by()

All Together

dplyr變數加工的函式系列

rename()

distinct()

count()

slice()

pull()

dplyr新造變數的函式系列

Ranking functions

Offset functions

Cumulative aggregates

預覽示範資料 `disgust`

`dplyr`總和變數的函式系列

`starts_with()`

`ends_with()`

`contains()`

`num_range()`

`dplyr`變數加工的函式系列

`dplyr`新造變數的函式系列