要實施分析計畫之前,必須先整頓為要分析的單位。整頓包括篩選有需要的變數或觀察值,以及進行繪圖或重要分析之前的加工。
dplyr
總和變數的函式系列
dplyr
變數加工的函式系列
dplyr
新造變數的函式系列
運用字串、日期等字符資料整頓資料表單
推薦秘訣表:Data Import, Data Transformation, Regular Expressions, Dates and Times
disgust
These examples will use data from disgust.csv, which contains data from the Three Domain Disgust Scale. Each participant is identified by a unique user_id
and each questionnaire completion has a unique id
.
disgust <- read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust.csv")
glimpse(disgust)
## Rows: 20,000
## Columns: 24
## $ id <dbl> 1199, 1, 1599, 13332, 23, 1160, 7980, 552, 37829, 6902, 6158~
## $ user_id <dbl> 0, 1, 2, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 6066, 609~
## $ date <date> 2008-10-07, 2008-07-10, 2008-10-27, 2012-01-02, 2008-07-15,~
## $ moral1 <dbl> 5, 2, 1, 0, 4, 1, 3, 2, 6, 0, 4, 1, 1, 4, 2, 1, NA, 3, 1, 3,~
## $ moral2 <dbl> 6, 2, 1, 1, 4, 5, 4, 4, 6, 1, 5, 2, 4, 4, 5, 3, NA, 5, 2, 3,~
## $ moral3 <dbl> 4, 1, 1, 1, 4, NA, 3, 3, 6, 3, 6, 2, 3, 4, 4, 2, NA, 4, 3, 3~
## $ moral4 <dbl> 6, 2, 1, 1, 4, 5, 4, 5, 0, 4, 5, 2, 4, 4, 6, 4, NA, 5, 2, 3,~
## $ moral5 <dbl> 5, 1, NA, 1, 4, 5, 4, 5, 6, 1, 5, 1, 3, 3, 5, 5, NA, 3, 2, 3~
## $ moral6 <dbl> 5, 1, NA, 2, 4, 5, 3, 5, 0, 0, 4, 2, 4, 3, 4, 4, NA, 4, 3, 0~
## $ moral7 <dbl> 6, 1, 1, 1, 4, 1, 3, 3, 0, 1, 4, 1, 3, 2, 5, 4, NA, 3, 4, 3,~
## $ sexual1 <dbl> 4, 3, 1, 4, 2, 0, 2, 4, 0, 3, 3, 1, 1, 4, 2, 0, NA, 1, 3, 2,~
## $ sexual2 <dbl> 0, 1, NA, 3, 1, 5, 4, 2, 0, 5, 0, 0, 0, 2, 1, 0, NA, 3, 0, 0~
## $ sexual3 <dbl> 1, 1, 1, 0, 2, 0, 3, 4, 0, 2, 1, 0, 0, 1, 0, 0, NA, 3, 0, 6,~
## $ sexual4 <dbl> 0, 2, NA, 6, 1, 2, 2, 6, 0, 4, 6, 3, 1, 3, 2, 0, NA, 5, 2, 6~
## $ sexual5 <dbl> 1, 1, 1, 0, 1, 0, 1, 6, 0, 6, 3, 0, 0, 2, 1, 0, NA, 5, 0, 2,~
## $ sexual6 <dbl> 4, 2, NA, 3, 1, 1, 5, 6, 0, 6, 5, 4, 2, 3, 4, 0, NA, 3, 3, 6~
## $ sexual7 <dbl> 5, 2, NA, 5, 5, 0, 4, 2, 0, 5, 3, 4, 1, 6, 3, 0, NA, 3, 5, 3~
## $ pathogen1 <dbl> 6, 3, NA, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA, 4, 4, 3~
## $ pathogen2 <dbl> 1, 2, NA, 6, 5, 3, 4, 6, 6, 2, 5, 3, 4, 2, 3, 4, NA, 5, 4, 2~
## $ pathogen3 <dbl> 6, 3, 1, 4, 4, 1, 4, 6, 6, 4, 5, 1, 3, 0, 2, 3, NA, 3, 3, 5,~
## $ pathogen4 <dbl> 5, 3, NA, 6, 4, 1, 3, 4, 6, 4, 5, 1, 4, 4, 4, 1, NA, 5, 4, 0~
## $ pathogen5 <dbl> 4, 2, NA, 5, 5, 3, 3, 6, 0, 2, 5, 4, 5, 5, 5, 4, NA, 4, 4, 5~
## $ pathogen6 <dbl> 5, 3, NA, 5, 4, 1, 2, 1, 0, 2, 5, 1, 4, 5, 1, 2, NA, 5, 5, 5~
## $ pathogen7 <dbl> 6, 3, NA, 4, 3, 0, 3, 6, 6, 6, 5, 3, 5, 4, 5, 3, NA, 5, 2, 6~
Questionnaire Instructions: The following items describe a variety of concepts. Please rate how disgusting you find the concepts described in the items, where 0 means that you do not find the concept disgusting at all, and 6 means that you find the concept extremely disgusting.
colname | question |
---|---|
moral1 | Shoplifting a candy bar from a convenience store |
moral2 | Stealing from a neighbor |
moral3 | A student cheating to get good grades |
moral4 | Deceiving a friend |
moral5 | Forging someone’s signature on a legal document |
moral6 | Cutting to the front of a line to purchase the last few tickets to a show |
moral7 | Intentionally lying during a business transaction |
sexual1 | Hearing two strangers having sex |
sexual2 | Performing oral sex |
sexual3 | Watching a pornographic video |
sexual4 | Finding out that someone you don’t like has sexual fantasies about you |
sexual5 | Bringing someone you just met back to your room to have sex |
sexual6 | A stranger of the opposite sex intentionally rubbing your thigh in an elevator |
sexual7 | Having anal sex with someone of the opposite sex |
pathogen1 | Stepping on dog poop |
pathogen2 | Sitting next to someone who has red sores on their arm |
pathogen3 | Shaking hands with a stranger who has sweaty palms |
pathogen4 | Seeing some mold on old leftovers in your refrigerator |
pathogen5 | Standing close to a person who has body odor |
pathogen6 | Seeing a cockroach run across the floor |
pathogen7 | Accidentally touching a person’s bloody cut |
dplyr
總和變數的函式系列Most of the data wrangling you’ll want to do with psychological data will involve the tidyr
verbs you learned in Chapter 3 and the six main dplyr
verbs: select
, filter
, arrange
, mutate
, summarise
, and group_by
.
選出需要變數、建立小型資料集合
Select columns by name or number.
You can select each column individually, separated by commas (e.g., col1, col2
). You can also select all columns between two columns by separating them with a colon (e.g., start_col:end_col
).
moral <- disgust %>% select(user_id, moral1:moral7)
names(moral)
## [1] "user_id" "moral1" "moral2" "moral3" "moral4" "moral5" "moral6"
## [8] "moral7"
You can select columns by number, which is useful when the column names are long or complicated.
sexual <- disgust %>% select(2, 11:17)
names(sexual)
## [1] "user_id" "sexual1" "sexual2" "sexual3" "sexual4" "sexual5" "sexual6"
## [8] "sexual7"
You can use a minus symbol to unselect columns, leaving all of the other columns. If you want to exclude a span of columns, put parentheses around the span first (e.g., -(moral1:moral7)
, not -moral1:moral7
).
pathogen <- disgust %>% select(-id, -date, -(moral1:sexual7))
names(pathogen)
## [1] "user_id" "pathogen1" "pathogen2" "pathogen3" "pathogen4" "pathogen5"
## [7] "pathogen6" "pathogen7"
You can select columns based on criteria about the column names.{#select_helpers}
select()
的好用參數
starts_with()
Select columns that start with a character string.
u <- disgust %>% select(starts_with("u"))
names(u)
## [1] "user_id"
ends_with()
Select columns that end with a character string.
firstq <- disgust %>% select(ends_with("7"))
names(firstq)
## [1] "moral7" "sexual7" "pathogen7"
contains()
Select columns that contain a character string.
pathogen <- disgust %>% select(contains("pathogen"))
names(pathogen)
## [1] "pathogen1" "pathogen2" "pathogen3" "pathogen4" "pathogen5" "pathogen6"
## [7] "pathogen7"
num_range()
Select columns with a name that matches the pattern prefix
.
moral2_4 <- disgust %>% select(num_range("moral", 2:4))
names(moral2_4)
## [1] "moral2" "moral3" "moral4"
Use width
to set the number of digits with leading zeros. For example, num_range(‘var_’, 8:10, width=2)
selects columns var_08
, var_09
, and var_10
.
選出符合需要的觀察值、建立小型資料集合
Select rows by matching column criteria.
Select all rows where the user_id is 1 (that’s Lisa).
disgust %>% filter(user_id == 1)
## # A tibble: 1 x 24
## id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 2008-07-10 2 2 1 2 1 1 1
## # ... with 14 more variables: sexual1 <dbl>, sexual2 <dbl>, sexual3 <dbl>,
## # sexual4 <dbl>, sexual5 <dbl>, sexual6 <dbl>, sexual7 <dbl>,
## # pathogen1 <dbl>, pathogen2 <dbl>, pathogen3 <dbl>, pathogen4 <dbl>,
## # pathogen5 <dbl>, pathogen6 <dbl>, pathogen7 <dbl>
Remember to use ==
and not =
to check if two things are equivalent. A single =
assigns the righthand value to the lefthand variable and (usually) evaluates to TRUE
.
You can select on multiple criteria by separating them with commas.
多重條件篩選
amoral <- disgust %>% filter(
moral1 == 0,
moral2 == 0,
moral3 == 0,
moral4 == 0,
moral5 == 0,
moral6 == 0,
moral7 == 0
)
You can use the symbols &
, |
, and !
to mean “and”, “or”, and “not”. You can also use other operators to make equations.
# everyone who chose either 0 or 7 for question moral1
moral_extremes <- disgust %>%
filter(moral1 == 0 | moral1 == 7)
# everyone who chose the same answer for all moral questions
moral_consistent <- disgust %>%
filter(
moral2 == moral1 &
moral3 == moral1 &
moral4 == moral1 &
moral5 == moral1 &
moral6 == moral1 &
moral7 == moral1
)
# everyone who did not answer 7 for all 7 moral questions
moral_no_ceiling <- disgust %>%
filter(moral1+moral2+moral3+moral4+moral5+moral6+moral7 != 7*7)
Sometimes you need to exclude some participant IDs for reasons that can’t be described in code. the %in%
operator is useful here for testing if a column value is in a list. Surround the equation with parentheses and put !
in front to test that a value is not in the list.
運用匹配運算元 %in% 設定篩選條件
length(unique(disgust$user_id))
## [1] 20000
no_researchers <- disgust %>%
filter(!(user_id %in% c(1,2)))
glimpse(no_researchers)
## Rows: 19,998
## Columns: 24
## $ id <dbl> 1199, 13332, 23, 1160, 7980, 552, 37829, 6902, 6158, 4850, 1~
## $ user_id <dbl> 0, 2118, 2311, 3630, 4458, 4651, 4976, 5469, 6066, 6093, 617~
## $ date <date> 2008-10-07, 2012-01-02, 2008-07-15, 2008-10-06, 2011-09-05,~
## $ moral1 <dbl> 5, 0, 4, 1, 3, 2, 6, 0, 4, 1, 1, 4, 2, 1, NA, 3, 1, 3, 0, 1,~
## $ moral2 <dbl> 6, 1, 4, 5, 4, 4, 6, 1, 5, 2, 4, 4, 5, 3, NA, 5, 2, 3, 0, 5,~
## $ moral3 <dbl> 4, 1, 4, NA, 3, 3, 6, 3, 6, 2, 3, 4, 4, 2, NA, 4, 3, 3, 0, 4~
## $ moral4 <dbl> 6, 1, 4, 5, 4, 5, 0, 4, 5, 2, 4, 4, 6, 4, NA, 5, 2, 3, 4, 4,~
## $ moral5 <dbl> 5, 1, 4, 5, 4, 5, 6, 1, 5, 1, 3, 3, 5, 5, NA, 3, 2, 3, 0, 5,~
## $ moral6 <dbl> 5, 2, 4, 5, 3, 5, 0, 0, 4, 2, 4, 3, 4, 4, NA, 4, 3, 0, 0, 3,~
## $ moral7 <dbl> 6, 1, 4, 1, 3, 3, 0, 1, 4, 1, 3, 2, 5, 4, NA, 3, 4, 3, 0, 4,~
## $ sexual1 <dbl> 4, 4, 2, 0, 2, 4, 0, 3, 3, 1, 1, 4, 2, 0, NA, 1, 3, 2, 0, 0,~
## $ sexual2 <dbl> 0, 3, 1, 5, 4, 2, 0, 5, 0, 0, 0, 2, 1, 0, NA, 3, 0, 0, 0, 0,~
## $ sexual3 <dbl> 1, 0, 2, 0, 3, 4, 0, 2, 1, 0, 0, 1, 0, 0, NA, 3, 0, 6, 0, 0,~
## $ sexual4 <dbl> 0, 6, 1, 2, 2, 6, 0, 4, 6, 3, 1, 3, 2, 0, NA, 5, 2, 6, 0, 5,~
## $ sexual5 <dbl> 1, 0, 1, 0, 1, 6, 0, 6, 3, 0, 0, 2, 1, 0, NA, 5, 0, 2, 0, 0,~
## $ sexual6 <dbl> 4, 3, 1, 1, 5, 6, 0, 6, 5, 4, 2, 3, 4, 0, NA, 3, 3, 6, 0, 5,~
## $ sexual7 <dbl> 5, 5, 5, 0, 4, 2, 0, 5, 3, 4, 1, 6, 3, 0, NA, 3, 5, 3, 0, 1,~
## $ pathogen1 <dbl> 6, 5, 5, 6, 6, 5, 6, 5, 6, 4, 4, 5, 3, 5, NA, 4, 4, 3, 5, 4,~
## $ pathogen2 <dbl> 1, 6, 5, 3, 4, 6, 6, 2, 5, 3, 4, 2, 3, 4, NA, 5, 4, 2, 5, 2,~
## $ pathogen3 <dbl> 6, 4, 4, 1, 4, 6, 6, 4, 5, 1, 3, 0, 2, 3, NA, 3, 3, 5, 1, 2,~
## $ pathogen4 <dbl> 5, 6, 4, 1, 3, 4, 6, 4, 5, 1, 4, 4, 4, 1, NA, 5, 4, 0, 0, 1,~
## $ pathogen5 <dbl> 4, 5, 5, 3, 3, 6, 0, 2, 5, 4, 5, 5, 5, 4, NA, 4, 4, 5, 4, 3,~
## $ pathogen6 <dbl> 5, 5, 4, 1, 2, 1, 0, 2, 5, 1, 4, 5, 1, 2, NA, 5, 5, 5, 1, 1,~
## $ pathogen7 <dbl> 6, 4, 3, 0, 3, 6, 6, 6, 5, 3, 5, 4, 5, 3, NA, 5, 2, 6, 4, 1,~
資料數值有日期時間時,使用lubridate
的函式能事半功倍
You can use the lubridate
package to work with dates. For example, you can use the year()
function to return just the year from the date
column and then select only data collected in 2010.
disgust2010 <- disgust %>%
filter(year(date) == 2010)
glimpse(disgust2010)
## Rows: 1,514
## Columns: 24
## $ id <dbl> 6902, 6158, 6362, 6302, 5429, 6732, 6367, 6476, 5778, 6181, ~
## $ user_id <dbl> 5469, 6066, 7129, 39318, 43029, 71955, 84622, 93120, 96537, ~
## $ date <date> 2010-12-06, 2010-04-18, 2010-06-09, 2010-05-20, 2010-01-02,~
## $ moral1 <dbl> 0, 4, 4, 2, 1, 2, 4, 3, 5, 0, 1, 3, 3, 4, 1, 0, 3, 3, 6, 1, ~
## $ moral2 <dbl> 1, 5, 4, 4, 1, 5, 6, 6, 5, 6, 1, 6, 4, 4, 3, 2, 5, 4, 6, 3, ~
## $ moral3 <dbl> 3, 6, 4, 1, 1, 3, 6, 4, 3, 4, 1, 5, 2, 2, 2, 0, 3, 1, 6, 4, ~
## $ moral4 <dbl> 4, 5, 4, 4, 3, 6, 6, 6, 4, 6, NA, 6, 4, 4, 3, 1, 4, 3, 6, 4,~
## $ moral5 <dbl> 1, 5, 3, 5, 6, 3, 6, 5, 5, 0, 1, 5, 4, 4, 3, 0, 4, 3, 6, 1, ~
## $ moral6 <dbl> 0, 4, 3, 6, 4, 2, 6, 3, 5, 6, 2, 3, 4, 0, 5, 4, 4, 0, 5, 3, ~
## $ moral7 <dbl> 1, 4, 2, 0, 2, 5, 6, 4, 5, 6, 1, 6, 4, 4, 4, 1, 3, 3, 6, 2, ~
## $ sexual1 <dbl> 3, 3, 4, 1, 2, 4, 2, 5, 0, 0, 2, 3, 4, 1, 0, 0, 1, 3, 0, 1, ~
## $ sexual2 <dbl> 5, 0, 2, 0, 0, 3, 0, 4, 0, 0, 0, 5, 6, 0, 1, 0, 0, 1, 0, 4, ~
## $ sexual3 <dbl> 2, 1, 1, 0, 1, 3, 0, 5, 0, 0, 2, 0, 1, 1, 0, 0, 0, 1, 0, 0, ~
## $ sexual4 <dbl> 4, 6, 3, 1, 4, 6, 0, 5, 3, 0, 4, 6, 5, 3, 0, 4, 3, 5, 6, 3, ~
## $ sexual5 <dbl> 6, 3, 2, 0, 6, 6, 1, 5, 0, 0, 2, 6, 4, 0, 0, 5, 0, 5, 6, 4, ~
## $ sexual6 <dbl> 6, 5, 3, 0, 6, 6, 0, 4, 4, 0, 5, 6, 4, 1, 0, 5, 5, 5, 6, 5, ~
## $ sexual7 <dbl> 5, 3, 6, 1, 6, 5, 0, 3, 3, 6, 3, 6, 5, 5, 2, 5, 1, 6, 6, 5, ~
## $ pathogen1 <dbl> 5, 6, 5, 3, 4, 4, 6, 5, 6, 4, 4, 5, 6, 5, 2, 4, 2, 4, 6, NA,~
## $ pathogen2 <dbl> 2, 5, 2, 2, 6, 2, 5, 6, 0, 4, 5, 1, 3, 1, 5, 1, 1, 3, 5, 3, ~
## $ pathogen3 <dbl> 4, 5, 0, 3, 6, 6, 6, 4, 1, 0, 4, 4, 1, 4, 1, 4, 1, 3, 5, 2, ~
## $ pathogen4 <dbl> 4, 5, 4, 2, 6, 5, 2, 5, 4, 6, 4, 3, 3, 4, 4, 0, 1, 2, 6, 3, ~
## $ pathogen5 <dbl> 2, 5, 5, 3, 6, 6, 6, 6, 5, 6, 4, 5, 1, 4, 0, 4, 4, 4, 5, 4, ~
## $ pathogen6 <dbl> 2, 5, 5, 2, 6, 6, 5, 2, 1, 6, 4, 4, 6, 0, 1, 0, 3, 2, 6, 0, ~
## $ pathogen7 <dbl> 6, 5, 4, 4, 4, 3, 6, 6, 2, 6, 5, 6, 0, 4, 2, 0, 3, 6, 6, 4, ~
Or select data from at least 5 years ago. You can use the range
function to check the minimum and maxiumum dates in the resulting dataset.
disgust_5ago <- disgust %>%
filter(date < today() - dyears(5))
range(disgust_5ago$date)
## [1] "2008-07-10" "2016-03-22"
資料按照指定變數數值重新排序
Sort your dataset using arrange()
.
disgust_order <- disgust %>%
arrange(id)
head(disgust_order)
## # A tibble: 6 x 24
## id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 2008-07-10 2 2 1 2 1 1 1
## 2 3 155324 2008-07-11 2 4 3 5 2 1 4
## 3 4 155366 2008-07-12 6 6 6 3 6 6 6
## 4 5 155370 2008-07-12 6 6 4 6 6 6 6
## 5 6 155386 2008-07-12 2 4 0 4 0 0 0
## 6 7 155409 2008-07-12 4 5 5 4 5 1 5
## # ... with 14 more variables: sexual1 <dbl>, sexual2 <dbl>, sexual3 <dbl>,
## # sexual4 <dbl>, sexual5 <dbl>, sexual6 <dbl>, sexual7 <dbl>,
## # pathogen1 <dbl>, pathogen2 <dbl>, pathogen3 <dbl>, pathogen4 <dbl>,
## # pathogen5 <dbl>, pathogen6 <dbl>, pathogen7 <dbl>
Reverse the order using desc()
disgust_order <- disgust %>%
arrange(desc(id))
head(disgust_order)
## # A tibble: 6 x 24
## id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 39456 356866 2017-08-21 1 1 1 1 1 1 1
## 2 39447 128727 2017-08-13 2 4 1 2 2 5 3
## 3 39371 152955 2017-06-13 6 6 3 6 6 6 6
## 4 39342 48303 2017-05-22 4 5 4 4 6 4 5
## 5 39159 151633 2017-04-04 4 5 6 5 3 6 2
## 6 38942 370464 2017-02-01 1 5 0 6 5 5 5
## # ... with 14 more variables: sexual1 <dbl>, sexual2 <dbl>, sexual3 <dbl>,
## # sexual4 <dbl>, sexual5 <dbl>, sexual6 <dbl>, sexual7 <dbl>,
## # pathogen1 <dbl>, pathogen2 <dbl>, pathogen3 <dbl>, pathogen4 <dbl>,
## # pathogen5 <dbl>, pathogen6 <dbl>, pathogen7 <dbl>
以原變數數值為內容重新組合計算,造出分析需要的新變數。
Add new columns. This is one of the most useful functions in the tidyverse.
Refer to other columns by their names (unquoted). You can add more than one column, just separate the columns with a comma. Once you make a new column, you can use it in further column definitions e.g., total
below).
disgust_total <- disgust %>%
mutate(
pathogen = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7,
moral = moral1 + moral2 + moral3 + moral4 + moral5 + moral6 + moral7,
sexual = sexual1 + sexual2 + sexual3 + sexual4 + sexual5 + sexual6 + sexual7,
total = pathogen + moral + sexual,
user_id = paste0("U", user_id)
)
You can overwrite a column by giving a new column the same name as the old column. Make sure that you mean to do this and that you aren’t trying to use the old column value after you redefine it.
指定變數進行需要的描述統計,輸出到新的資料集合。
Create summary statistics for the dataset. Check the Data Wrangling Cheat Sheet or the Data Transformation Cheat Sheet for various summary functions. Some common ones are: mean()
, sd()
, n()
, sum()
, and quantile()
.
disgust_total %>%
summarise(
n = n(),
q25 = quantile(total, .25, na.rm = TRUE),
q50 = quantile(total, .50, na.rm = TRUE),
q75 = quantile(total, .75, na.rm = TRUE),
avg_total = mean(total, na.rm = TRUE),
sd_total = sd(total, na.rm = TRUE),
min_total = min(total, na.rm = TRUE),
max_total = max(total, na.rm = TRUE)
)
## # A tibble: 1 x 8
## n q25 q50 q75 avg_total sd_total min_total max_total
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20000 59 71 83 70.7 18.2 0 126
指定變數為分組檢索值。
Create subsets of the data. You can use this to create summaries, like the mean value for all of your experimental groups.
Here, we’ll use mutate
to create a new column called year
, group by year
, and calculate the average scores.
disgust_total %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise(
n = n(),
avg_total = mean(total, na.rm = TRUE),
sd_total = sd(total, na.rm = TRUE),
min_total = min(total, na.rm = TRUE),
max_total = max(total, na.rm = TRUE)
)
## # A tibble: 10 x 6
## year n avg_total sd_total min_total max_total
## <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2008 2578 70.3 18.5 0 126
## 2 2009 2580 69.7 18.6 3 126
## 3 2010 1514 70.6 18.9 6 126
## 4 2011 6046 71.3 17.8 0 126
## 5 2012 5938 70.4 18.4 0 126
## 6 2013 1251 71.6 17.6 0 126
## 7 2014 58 70.5 17.2 19 113
## 8 2015 21 74.3 16.9 43 107
## 9 2016 8 67.9 32.6 0 110
## 10 2017 6 57.2 27.9 21 90
You can use filter
after group_by
. The following example returns the lowest total score from each year.
disgust_total %>%
mutate(year = year(date)) %>%
select(user_id, year, total) %>%
group_by(year) %>%
filter(rank(total) == 1) %>%
arrange(year)
## # A tibble: 7 x 3
## # Groups: year [7]
## user_id year total
## <chr> <dbl> <dbl>
## 1 U236585 2009 3
## 2 U292359 2010 6
## 3 U245384 2013 0
## 4 U206293 2014 19
## 5 U407089 2015 43
## 6 U453237 2016 0
## 7 U356866 2017 21
You can also use mutate
after group_by
. The following example calculates subject-mean-centered scores by grouping the scores by user_id
and then subtracting the group-specific mean from each score. Note the use of gather
to tidy the data into a long format first.
disgust_smc <- disgust %>%
gather("question", "score", moral1:pathogen7) %>%
group_by(user_id) %>%
mutate(score_smc = score - mean(score, na.rm = TRUE))
在此示範清理到整頓資料一次完成。
A lot of what we did above would be easier if the data were tidy, so let’s do that first. Then we can use group_by
to calculate the domain scores.
It is good practice to use ungroup()
after using group_by
and summarise
. Forgetting to ungroup the dataset won’t affect some further processing, but can really mess up other things.
Then we can spread out the 3 domains, calculate the total score, remove any rows with a missing (NA
) total, and calculate mean values by year.
## 計算每位參與者在各向度的平均評分
disgust_tidy <- read_csv("../data/disgust.csv") %>%
gather("question", "score", moral1:pathogen7) %>%
separate(question, c("domain","q_num"), sep = -1) %>%
group_by(id, user_id, date, domain) %>%
summarise(score = mean(score)) %>%
ungroup()
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## date = col_date(format = "")
## )
## i<U+00A0>Use `spec()` for the full column specifications.
## `summarise()` has grouped output by 'id', 'user_id', 'date'. You can override using the `.groups` argument.
head(disgust_tidy)
## # A tibble: 6 x 5
## id user_id date domain score
## <dbl> <dbl> <date> <chr> <dbl>
## 1 1 1 2008-07-10 moral 1.43
## 2 1 1 2008-07-10 pathogen 2.71
## 3 1 1 2008-07-10 sexual 1.71
## 4 3 155324 2008-07-11 moral 3
## 5 3 155324 2008-07-11 pathogen 2.57
## 6 3 155324 2008-07-11 sexual 1.86
disgust_tidy2 <- disgust_tidy %>%
spread(domain, score) %>%
mutate(
total = moral + sexual + pathogen, ## 計算三個向度的平均總分
year = year(date) ## 製造新變數year儲存年份資訊
) %>%
filter(!is.na(total)) %>% ## 排除沒有答完所有題目的參與者
arrange(user_id)
head(disgust_tidy2)
## # A tibble: 6 x 8
## id user_id date moral pathogen sexual total year
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1199 0 2008-10-07 5.29 4.71 2.14 12.1 2008
## 2 1 1 2008-07-10 1.43 2.71 1.71 5.86 2008
## 3 13332 2118 2012-01-02 1 5 3 9 2012
## 4 23 2311 2008-07-15 4 4.29 1.86 10.1 2008
## 5 7980 4458 2011-09-05 3.43 3.57 3 10 2011
## 6 552 4651 2008-08-23 3.86 4.86 4.29 13 2008
disgust_tidy3 <- disgust_tidy2 %>%
group_by(year) %>% ## 依年分計算人次、各向度平均評分、第一位及最後一位參與者id
summarise(
n = n(),
avg_pathogen = mean(pathogen),
avg_moral = mean(moral),
avg_sexual = mean(sexual),
first_user = first(user_id),
last_user = last(user_id)
)
head(disgust_tidy3)
## # A tibble: 6 x 7
## year n avg_pathogen avg_moral avg_sexual first_user last_user
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 2392 3.70 3.81 2.54 0 188708
## 2 2009 2410 3.67 3.76 2.53 6093 251959
## 3 2010 1418 3.73 3.84 2.51 5469 319641
## 4 2011 5586 3.76 3.81 2.63 4458 406569
## 5 2012 5375 3.74 3.77 2.55 2118 458194
## 6 2013 1222 3.77 3.91 2.55 7646 462428
dplyr
變數加工的函式系列Use the code examples below and the help pages to figure out what the following one-table verbs do. Most have pretty self-explanatory names.
為變數改名字
iris_underscore <- iris %>%
rename(sepal_length = Sepal.Length,
sepal_width = Sepal.Width,
petal_length = Petal.Length,
petal_width = Petal.Width)
names(iris_underscore)
## [1] "sepal_length" "sepal_width" "petal_length" "petal_width" "Species"
Almost everyone gets confused at some point with rename()
and tries to put the original names on the left and the new names on the right. Try it and see what the error message looks like.
移除重覆的觀察值
# create a data table with duplicated values
dupes <- tibble(
id = rep(1:5, 2),
dv = rep(LETTERS[1:5], 2)
)
distinct(dupes)
## # A tibble: 5 x 2
## id dv
## <int> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
依指定變數,計算觀察值個數
# how many observations from each species are in iris?
count(iris, Species)
## Species n
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
移除指定列的觀察值
tibble(
id = 1:10,
condition = rep(c("A","B"), 5)
) %>%
slice(3:6, 9)
## # A tibble: 5 x 2
## id condition
## <int> <chr>
## 1 3 A
## 2 4 B
## 3 5 A
## 4 6 B
## 5 9 A
抽出指定的變數
iris %>%
group_by(Species) %>%
summarise_all(mean) %>%
pull(Sepal.Length)
## [1] 5.006 5.936 6.588
dplyr
新造變數的函式系列必須置於總和變數函式的參數,才能發揮作用。
Window functions use the order of rows to calculate values. You can use them to do things that require ranking or ordering, like choose the top scores in each class, or acessing the previous and next rows, like calculating cumulative sums or means.
The dplyr window functions vignette has very good detailed explanations of these functions, but we’ve described a few of the most useful ones below.
rank
系列函數計算指定變數內數值的排序,輸出到指定新變數。
tibble(
id = 1:5,
"Data Skills" = c(16, 17, 17, 19, 20),
"Statistics" = c(14, 16, 18, 18, 19)
) %>%
gather(class, grade, 2:3) %>%
group_by(class) %>%
mutate(row_number = row_number(),
rank = rank(grade),
min_rank = min_rank(grade),
dense_rank = dense_rank(grade),
quartile = ntile(grade, 4),
percentile = ntile(grade, 100))
## # A tibble: 10 x 9
## # Groups: class [2]
## id class grade row_number rank min_rank dense_rank quartile percentile
## <int> <chr> <dbl> <int> <dbl> <int> <int> <int> <int>
## 1 1 Data Sk~ 16 1 1 1 1 1 1
## 2 2 Data Sk~ 17 2 2.5 2 2 1 2
## 3 3 Data Sk~ 17 3 2.5 2 2 2 3
## 4 4 Data Sk~ 19 4 4 4 3 3 4
## 5 5 Data Sk~ 20 5 5 5 4 4 5
## 6 1 Statist~ 14 1 1 1 1 1 1
## 7 2 Statist~ 16 2 2 2 2 1 2
## 8 3 Statist~ 18 3 3.5 3 3 2 3
## 9 4 Statist~ 18 4 3.5 3 3 3 4
## 10 5 Statist~ 19 5 5 5 4 4 5
row_number()
, rank()
, min_rank()
, dense_rank()
, and ntile()
?
row_number()
need an argument?
grade
or class
?
group_by(class)
line above?
id
to the grouping?
ntile()
do?
You can use window functions to group your data into quantiles.
使用rank
系列函數分組,再做其他整頓工作。
iris %>%
group_by(tertile = ntile(Sepal.Length, 3)) %>%
summarise(mean.Sepal.Length = mean(Sepal.Length))
## # A tibble: 3 x 2
## tertile mean.Sepal.Length
## <int> <dbl>
## 1 1 4.94
## 2 2 5.81
## 3 3 6.78
依指定變數內數值的登錄順序,紀錄前x位(lag()
)或後x位(lead()
)的數值,輸出到新變數
lag_lead <- tibble(x = 1:6) %>%
mutate(lag = lag(x),
lag2 = lag(x, n = 2),
lead = lead(x, default = 0))
head(lag_lead)
## # A tibble: 6 x 4
## x lag lag2 lead
## <int> <int> <int> <dbl>
## 1 1 NA NA 2
## 2 2 1 NA 3
## 3 3 2 1 4
## 4 4 3 2 5
## 5 5 4 3 6
## 6 6 5 4 0
trials <- tibble(
trial = sample(1:10, 10),
cond = sample(c("exp", "ctrl"), 10, T),
score = rpois(10, 4)
) %>%
mutate(
score_change = score - lag(score, order_by = trial),
change_cond = cond != lag(cond, order_by = trial,
default = "no condition")
) %>%
arrange(trial)
Look at the help pages for lag()
and lead()
.
order_by
argument or change it to cond
?
default
argument do?
lag()
or lead()
?
依指定變數內數值的登錄順序,逐項計算並輸出到新變數。
cumsum()
, cummin()
, and cummax()
are base R functions for calcumaling cumulative means, minimums, and maximums. The dplyr package introduces cumany()
and cumall()
, which return TRUE
if any or all of the previous values meet their criteria.
tibble(
time = 1:10,
obs = c(1, 0, 1, 2, 4, 3, 1, 0, 3, 5)
) %>%
mutate(
cumsum = cumsum(obs),
cummin = cummin(obs),
cummax = cummax(obs),
cumany = cumany(obs == 3),
cumall = cumall(obs < 4)
)
## # A tibble: 10 x 7
## time obs cumsum cummin cummax cumany cumall
## <int> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 1 1 1 1 1 FALSE TRUE
## 2 2 0 1 0 1 FALSE TRUE
## 3 3 1 2 0 1 FALSE TRUE
## 4 4 2 4 0 2 FALSE TRUE
## 5 5 4 8 0 4 FALSE FALSE
## 6 6 3 11 0 4 TRUE FALSE
## 7 7 1 12 0 4 TRUE FALSE
## 8 8 0 12 0 4 TRUE FALSE
## 9 9 3 15 0 4 TRUE FALSE
## 10 10 5 20 0 5 TRUE FALSE
cumany(obs == 3)
to cumany(obs > 2)
?
cumall(obs < 4)
to cumall(obs < 2)
?
cumany()
or cumall()
?