任何收集資料的工作完成後,必須要先清理資料。
認識寬表單與長表單
dplyr
清理資料函式系列
使用管道運算子 %>%
清理前的資料表單
最後欄位total_meanRT
混合兩種數值,需要分開為兩個欄位。
id | score_1 | score_2 | score_3 | rt_1 | rt_2 | rt_3 | total_meanRT |
---|---|---|---|---|---|---|---|
1 | 4 | 3 | 7 | 857 | 890 | 859 | 14 (869) |
2 | 3 | 1 | 1 | 902 | 900 | 959 | 5 (920) |
3 | 2 | 5 | 4 | 757 | 823 | 901 | 11 (827) |
4 | 6 | 2 | 6 | 844 | 788 | 624 | 14 (752) |
5 | 1 | 7 | 2 | 659 | 764 | 690 | 10 (704) |
清理後的資料表單
id | trial | rt | score | total | mean_rt |
---|---|---|---|---|---|
1 | 1 | 857 | 4 | 14 | 869 |
1 | 2 | 890 | 3 | 14 | 869 |
1 | 3 | 859 | 7 | 14 | 869 |
2 | 1 | 902 | 3 | 5 | 920 |
2 | 2 | 900 | 1 | 5 | 920 |
2 | 3 | 959 | 1 | 5 | 920 |
3 | 1 | 757 | 2 | 11 | 827 |
3 | 2 | 823 | 5 | 11 | 827 |
3 | 3 | 901 | 4 | 11 | 827 |
4 | 1 | 844 | 6 | 14 | 752 |
4 | 2 | 788 | 2 | 14 | 752 |
4 | 3 | 624 | 6 | 14 | 752 |
5 | 1 | 659 | 1 | 10 | 704 |
5 | 2 | 764 | 7 | 10 | 704 |
5 | 3 | 690 | 2 | 10 | 704 |
id | Q1 | Q2 | Q3 |
---|---|---|---|
A | 1 | 2 | 3 |
B | 4 | 5 | 6 |
id | item | value |
---|---|---|
A | Q1 | 1 |
A | Q2 | 2 |
A | Q3 | 3 |
B | Q1 | 4 |
B | Q2 | 5 |
B | Q3 | 6 |
大部分R函式參數指向資料表單變項,因此資料分析採用長表單設定資料變項集合有許多好處。
將這份寬表單改成長表單
data.frame(
id = c("Lisa","Robbie","Steven"),
fav_colour = c("red","orange","green"),
fav_animal = c("echidna","babirusa","frog")
)
## id fav_colour fav_animal
## 1 Lisa red echidna
## 2 Robbie orange babirusa
## 3 Steven green frog
提示: fav_colour
與fav_animal
分別是兩個問題的反應。
Pivot()
預覽示範資料
personality <- read_csv("https://psyteachr.github.io/msc-data-skills/data/personality.csv")
user_id | date | Op1 | Ne1 | Ne2 | Op2 | Ex1 | Ex2 | Co1 | Co2 | Ne3 | Ag1 | Ag2 | Ne4 | Ex3 | Co3 | Op3 | Ex4 | Op4 | Ex5 | Ag3 | Co4 | Co5 | Ne5 | Op5 | Ag4 | Op6 | Co6 | Ex6 | Ne6 | Co7 | Ag5 | Co8 | Ex7 | Ne7 | Co9 | Op7 | Ne8 | Ag6 | Ag7 | Co10 | Ex8 | Ex9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2006-03-23 | 3 | 4 | 0 | 6 | 3 | 3 | 3 | 3 | 0 | 2 | 1 | 3 | 3 | 2 | 2 | 1 | 3 | 3 | 1 | 3 | 0 | 3 | 6 | 1 | 0 | 6 | 3 | 1 | 3 | 3 | 3 | 3 | NA | 3 | 0 | 2 | NA | 3 | 1 | 2 | 4 |
1 | 2006-02-08 | 6 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 6 | 6 | 0 | 6 | 0 | 0 | 0 | 0 | 6 | 6 | 0 | 6 | 0 | 6 | 0 | 6 | 6 | 6 | 6 | 0 | 6 | 0 | 6 | 6 | 0 | 6 | 0 | 6 | 0 | 6 |
2 | 2005-10-24 | 6 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 6 | 5 | 1 | 5 | 1 | 1 | 1 | 1 | 5 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 5 | 5 | 5 | 1 | 5 | 1 | 5 | 5 | 1 | 5 | 1 | 5 | 1 | 5 |
5 | 2005-12-07 | 6 | 4 | 4 | 4 | 2 | 3 | 3 | 3 | 1 | 4 | 0 | 2 | 5 | 3 | 5 | 3 | 6 | 6 | 1 | 5 | 5 | 4 | 2 | 4 | 1 | 4 | 3 | 1 | 1 | 0 | 1 | 4 | 2 | 4 | 5 | 1 | 2 | 1 | 5 | 4 | 5 |
8 | 2006-07-27 | 6 | 1 | 2 | 6 | 2 | 3 | 5 | 4 | 0 | 6 | 5 | 3 | 3 | 4 | 5 | 3 | 6 | 3 | 0 | 5 | 5 | 1 | 5 | 6 | 6 | 6 | 0 | 0 | 3 | 2 | 3 | 1 | 0 | 3 | 5 | 1 | 3 | 1 | 3 | 3 | 5 |
108 | 2006-02-28 | 3 | 2 | 1 | 4 | 4 | 4 | 4 | 3 | 1 | 5 | 4 | 2 | 3 | 4 | 4 | 3 | 3 | 3 | 4 | 3 | 3 | 1 | 4 | 5 | 4 | 5 | 4 | 1 | 4 | 5 | 4 | 2 | 2 | 4 | 4 | 1 | 4 | 3 | 5 | 4 | 2 |
寬表單轉長表單,可將變項名稱分解到兩個欄位。
pivot_longerdata, cols, names_to = "name", names_sep = NULL, values_to = ...)
cols
refers to the columns you want to make long You can refer to them by their names, like col1, col2, col3, col4
or col1:col4
or by their numbers, like 8, 9, 10
or 8:10
.names_to
is what you want to call the new columns that the gathered column headers will go into; it’s “domain” and “qnumber” in this example.names_sep
is an optional argument if you have more than one value for names_to. It specifies the characters or position to split the values of the cols headers.values_to
is what you want to call the values in the columns ...
; they’re “score” in this example.personality_long <- pivot_longer(
data = personality,
cols = Op1:Ex9, # columns to make long
names_to = c("domain", "qnumber"), # new column names for headers
names_sep = 2, # how to split the headers
values_to = "score" # new column name for values
) %>%
glimpse()
## Rows: 615,000
## Columns: 5
## $ user_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ date <date> 2006-03-23, 2006-03-23, 2006-03-23, 2006-03-23, 2006-03-23, 2~
## $ domain <chr> "Op", "Ne", "Ne", "Op", "Ex", "Ex", "Co", "Co", "Ne", "Ag", "A~
## $ qnumber <chr> "1", "1", "2", "2", "1", "2", "1", "2", "3", "1", "2", "4", "3~
## $ score <dbl> 3, 4, 0, 6, 3, 3, 3, 3, 0, 2, 1, 3, 3, 2, 2, 1, 3, 3, 1, 3, 0,~
長表單轉寛表單,可組合兩個以上變項。
pivot_wider(data, names_to = "name", values_to = ..., names_sep = NULL)
names_from
is the columns that contain your new column headers.values_from
is the column that contains the values for the new columns.names_sep
is the character string used to join names if names_from is more than one column.personality_wide <- pivot_wider(
data = personality_long,
names_from = c(domain, qnumber),
values_from = score,
names_sep = ""
) %>%
glimpse()
## Rows: 15,000
## Columns: 43
## $ user_id <dbl> 0, 1, 2, 5, 8, 108, 233, 298, 426, 436, 685, 807, 871, 881, 94~
## $ date <date> 2006-03-23, 2006-02-08, 2005-10-24, 2005-12-07, 2006-07-27, 2~
## $ Op1 <dbl> 3, 6, 6, 6, 6, 3, 3, 6, 6, 3, 4, 5, 5, 5, 6, 4, 1, 2, 5, 6, 4,~
## $ Ne1 <dbl> 4, 0, 0, 4, 1, 2, 3, 4, 0, 3, 3, 3, 2, 1, 1, 3, 4, 5, 2, 4, 5,~
## $ Ne2 <dbl> 0, 6, 6, 4, 2, 1, 2, 3, 1, 2, 5, 5, 3, 1, 1, 1, 1, 6, 1, 2, 5,~
## $ Op2 <dbl> 6, 0, 0, 4, 6, 4, 4, 0, 0, 3, 4, 3, 3, 4, 5, 3, 3, 4, 1, 6, 6,~
## $ Ex1 <dbl> 3, 0, 0, 2, 2, 4, 4, 3, 5, 4, 1, 1, 3, 3, 1, 3, 5, 1, 0, 4, 1,~
## $ Ex2 <dbl> 3, 0, 0, 3, 3, 4, 5, 2, 5, 3, 4, 1, 3, 2, 1, 6, 5, 3, 4, 4, 1,~
## $ Co1 <dbl> 3, 0, 0, 3, 5, 4, 3, 4, 5, 3, 3, 3, 1, 5, 5, 4, 4, 5, 6, 4, 2,~
## $ Co2 <dbl> 3, 0, 0, 3, 4, 3, 3, 4, 5, 3, 5, 3, 3, 4, 5, 1, 5, 4, 5, 2, 5,~
## $ Ne3 <dbl> 0, 0, 0, 1, 0, 1, 4, 4, 0, 4, 2, 5, 1, 2, 5, 5, 2, 2, 1, 2, 5,~
## $ Ag1 <dbl> 2, 0, 0, 4, 6, 5, 5, 4, 2, 5, 4, 3, 2, 4, 5, 3, 5, 5, 5, 4, 4,~
## $ Ag2 <dbl> 1, 6, 6, 0, 5, 4, 5, 3, 4, 3, 5, 1, 5, 4, 2, 6, 5, 5, 5, 5, 2,~
## $ Ne4 <dbl> 3, 6, 6, 2, 3, 2, 3, 3, 0, 4, 4, 5, 5, 4, 5, 3, 2, 5, 2, 4, 5,~
## $ Ex3 <dbl> 3, 6, 5, 5, 3, 3, 3, 0, 6, 1, 4, 2, 3, 2, 1, 2, 5, 1, 0, 5, 5,~
## $ Co3 <dbl> 2, 0, 1, 3, 4, 4, 5, 4, 5, 3, 4, 3, 4, 4, 5, 4, 2, 4, 5, 2, 2,~
## $ Op3 <dbl> 2, 6, 5, 5, 5, 4, 3, 2, 4, 3, 3, 6, 5, 5, 6, 5, 4, 4, 3, 6, 5,~
## $ Ex4 <dbl> 1, 0, 1, 3, 3, 3, 4, 3, 5, 3, 2, 0, 3, 3, 1, 2, NA, 4, 4, 4, 1~
## $ Op4 <dbl> 3, 0, 1, 6, 6, 3, 3, 0, 6, 3, 4, 5, 4, 5, 6, 6, 2, 2, 4, 5, 5,~
## $ Ex5 <dbl> 3, 0, 1, 6, 3, 3, 4, 2, 5, 2, 2, 4, 2, 3, 0, 4, 5, 2, 3, 1, 1,~
## $ Ag3 <dbl> 1, 0, 1, 1, 0, 4, 4, 4, 3, 3, 4, 4, 3, 4, 4, 5, 5, 4, 5, 3, 4,~
## $ Co4 <dbl> 3, 6, 5, 5, 5, 3, 2, 4, 3, 1, 4, 3, 1, 2, 4, 2, NA, 5, 6, 1, 1~
## $ Co5 <dbl> 0, 6, 5, 5, 5, 3, 3, 1, 5, 1, 2, 4, 4, 4, 2, 1, 6, 4, 3, 1, 3,~
## $ Ne5 <dbl> 3, 0, 1, 4, 1, 1, 4, 5, 0, 3, 4, 6, 2, 0, 1, 1, 0, 4, 3, 1, 5,~
## $ Op5 <dbl> 6, 6, 5, 2, 5, 4, 3, 2, 6, 6, 2, 4, 3, 4, 6, 6, 6, 5, 3, 3, 5,~
## $ Ag4 <dbl> 1, 0, 1, 4, 6, 5, 5, 6, 6, 6, 4, 2, 4, 5, 4, 5, 6, 4, 5, 6, 5,~
## $ Op6 <dbl> 0, 6, 5, 1, 6, 4, 6, 0, 0, 3, 5, 3, 5, 5, 5, 2, 5, 1, 1, 6, 2,~
## $ Co6 <dbl> 6, 0, 1, 4, 6, 5, 6, 5, 4, 3, 5, 5, 4, 6, 6, 1, 3, 4, 5, 4, 6,~
## $ Ex6 <dbl> 3, 6, 5, 3, 0, 4, 3, 1, 6, 3, 2, 1, 4, 2, 1, 5, 6, 2, 1, 2, 1,~
## $ Ne6 <dbl> 1, 6, 5, 1, 0, 1, 3, 4, 0, 4, 4, 5, 2, 1, 5, 6, 1, 2, 2, 3, 5,~
## $ Co7 <dbl> 3, 6, 5, 1, 3, 4, NA, 2, 3, 3, 2, 2, 4, 2, 5, 2, 5, 5, 3, 1, 1~
## $ Ag5 <dbl> 3, 6, 5, 0, 2, 5, 6, 2, 2, 3, 4, 1, 3, 5, 2, 6, 5, 6, 5, 3, 3,~
## $ Co8 <dbl> 3, 0, 1, 1, 3, 4, 3, 0, 1, 3, 2, 2, 1, 2, 4, 3, 2, 4, 5, 2, 6,~
## $ Ex7 <dbl> 3, 6, 5, 4, 1, 2, 5, 3, 6, 3, 4, 3, 5, 1, 1, 6, 6, 3, 1, 1, 3,~
## $ Ne7 <dbl> NA, 0, 1, 2, 0, 2, 4, 4, 0, 3, 2, 5, 1, 2, 5, 2, 2, 4, 1, 3, 5~
## $ Co9 <dbl> 3, 6, 5, 4, 3, 4, 5, 3, 5, 3, 4, 3, 4, 4, 2, 4, 6, 5, 5, 2, 2,~
## $ Op7 <dbl> 0, 6, 5, 5, 5, 4, 6, 2, 1, 3, 2, 4, 5, 5, 6, 3, 6, 5, 2, 6, 5,~
## $ Ne8 <dbl> 2, 0, 1, 1, 1, 1, 5, 4, 0, 4, 4, 5, 1, 2, 5, 2, 1, 5, 1, 2, 5,~
## $ Ag6 <dbl> NA, 6, 5, 2, 3, 4, 5, 6, 1, 3, 4, 2, 3, 5, 1, 6, 2, 6, 6, 5, 3~
## $ Ag7 <dbl> 3, 0, 1, 1, 1, 3, 3, 5, 0, 3, 2, 1, 2, 3, 5, 6, 4, 4, 6, 6, 2,~
## $ Co10 <dbl> 1, 6, 5, 5, 3, 5, 1, 2, 5, 2, 4, 3, 4, 4, 3, 2, 5, 5, 5, 2, 2,~
## $ Ex8 <dbl> 2, 0, 1, 4, 3, 4, 2, 4, 6, 2, 4, 0, 4, 4, 1, 3, 5, 4, 3, 1, 1,~
## $ Ex9 <dbl> 4, 6, 5, 5, 5, 2, 3, 3, 6, 3, 3, 4, 4, 3, 2, 5, 5, 4, 4, 0, 4,~
寬表單轉長表單,只能將變項名稱集中到一個欄位。
gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
key
is what you want to call the new column that the gathered column headers will go into; it’s “question” in this example.value
is what you want to call the values in the gathered columns; they’re “score” in this example....
refers to the columns you want to gather. You can refer to them by their names, like col1, col2, col3, col4
or col1:col4
or by their numbers, like 8, 9, 10
or 8:10
.na.rm
determines whether rows with NA
values should be removedconvert
whether to automatically convert the results to another data typefactor_key
whether to store the key values as a factor (with the same order as in the table) or character vectorpersonality
is in wide format, with a separate column for each question. Change it to long format, with a row for each user/question observation. The resulting dataframe should have the columns: user_id
, date
, question
, and score
.
personality_gathered <- gather(
data = personality,
key = "question", # new column name for gathered headers
value = "score", # new column name for gathered values
Op1:Ex9 # columns to gather
) %>%
glimpse()
## Rows: 615,000
## Columns: 4
## $ user_id <dbl> 0, 1, 2, 5, 8, 108, 233, 298, 426, 436, 685, 807, 871, 881, 9~
## $ date <date> 2006-03-23, 2006-02-08, 2005-10-24, 2005-12-07, 2006-07-27, ~
## $ question <chr> "Op1", "Op1", "Op1", "Op1", "Op1", "Op1", "Op1", "Op1", "Op1"~
## $ score <dbl> 3, 6, 6, 6, 6, 3, 3, 6, 6, 3, 4, 5, 5, 5, 6, 4, 1, 2, 5, 6, 4~
user_id | date | question | score |
---|---|---|---|
0 | 2006-03-23 | Op1 | 3 |
1 | 2006-02-08 | Op1 | 6 |
2 | 2005-10-24 | Op1 | 6 |
5 | 2005-12-07 | Op1 | 6 |
8 | 2006-07-27 | Op1 | 6 |
108 | 2006-02-28 | Op1 | 3 |
分裂一個變項與為兩個變項,通常搭配gather()
使用。
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn")
col
is the column you want to separateinto
is a vector of new column namessep
is the character(s) that separate your new columns. This defaults to anything that isn’t alphanumeric, like .,_-/:
remove
determines whether the separated column (col
) will be removed from the new data table. The default is to remove it.convert
whether to automatically convert the results to another data typeextra
controls what happens when there are too many piecesfill
controls what happens when there are not enough piecesSplit the question
column into two columns: domain
and qnumber
.
There is no character to split on, here, but you can separate a column after a specific number of characters by setting sep
to an integer. For example, to split “abcde” after the third character, use sep = 3
, which results in c("abc", "de")
. You can also use negative number to split before the nth character from the right. For example, to split a column that has words of various lengths and 2-digit suffixes (like “lisa03”“,”amanda38"), you can use sep = -2
.
personality_sep <- separate(
data = personality_gathered,
col = question, # column to separate
into = c("domain", "qnumber"), # new column names
sep = 2 # where to separate
) %>%
glimpse()
## Rows: 615,000
## Columns: 5
## $ user_id <dbl> 0, 1, 2, 5, 8, 108, 233, 298, 426, 436, 685, 807, 871, 881, 94~
## $ date <date> 2006-03-23, 2006-02-08, 2005-10-24, 2005-12-07, 2006-07-27, 2~
## $ domain <chr> "Op", "Op", "Op", "Op", "Op", "Op", "Op", "Op", "Op", "Op", "O~
## $ qnumber <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1~
## $ score <dbl> 3, 6, 6, 6, 6, 3, 3, 6, 6, 3, 4, 5, 5, 5, 6, 4, 1, 2, 5, 6, 4,~
user_id | date | domain | qnumber | score |
---|---|---|---|---|
0 | 2006-03-23 | Op | 1 | 3 |
1 | 2006-02-08 | Op | 1 | 6 |
2 | 2005-10-24 | Op | 1 | 6 |
5 | 2005-12-07 | Op | 1 | 6 |
8 | 2006-07-27 | Op | 1 | 6 |
108 | 2006-02-28 | Op | 1 | 3 |
If you want to separate just at full stops, you need to use sep = "\\."
, not sep = "."
. The two slashes escape the full stop, making it interpreted as a literal full stop and not the regular expression for any character.
組合兩個變項到一個新變項,並移除原始變項。通常用來修改變項數值。
unite(data, col, ..., sep = "_", remove = TRUE)
col
is your new united column...
refers to the columns you want to unitesep
is the character(s) that will separate your united columnsremove
determines whether the united columns (...
) will be removed from the new data table. The default is to remove them.Put the domain and qnumber columns back together into a new column named domain_n
. Make it in a format like “Op_Q1”.
personality_unite <- unite(
data = personality_sep,
col = "domain_n", # new column name
domain, qnumber, # columns to unite
sep = "_Q" # separation characters
) %>%
glimpse()
## Rows: 615,000
## Columns: 4
## $ user_id <dbl> 0, 1, 2, 5, 8, 108, 233, 298, 426, 436, 685, 807, 871, 881, 9~
## $ date <date> 2006-03-23, 2006-02-08, 2005-10-24, 2005-12-07, 2006-07-27, ~
## $ domain_n <chr> "Op_Q1", "Op_Q1", "Op_Q1", "Op_Q1", "Op_Q1", "Op_Q1", "Op_Q1"~
## $ score <dbl> 3, 6, 6, 6, 6, 3, 3, 6, 6, 3, 4, 5, 5, 5, 6, 4, 1, 2, 5, 6, 4~
user_id | date | domain_n | score |
---|---|---|---|
0 | 2006-03-23 | Op_Q1 | 3 |
1 | 2006-02-08 | Op_Q1 | 6 |
2 | 2005-10-24 | Op_Q1 | 6 |
5 | 2005-12-07 | Op_Q1 | 6 |
8 | 2006-07-27 | Op_Q1 | 6 |
108 | 2006-02-28 | Op_Q1 | 3 |
寬表單轉長表單,只能轉換一個變項。通常與unite()
一起使用。
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
You can reverse the processes above, as well. For example, you can convert data from long format into wide format.
key
is the column that contains your new column headersvalue
is the column that contains the values in the new spread columnspersonality_spread <- spread(
data = personality_unite,
key = domain_n, # column that contains new headers
value = score # column that contains new values
) %>%
glimpse()
## Rows: 15,000
## Columns: 43
## $ user_id <dbl> 0, 1, 2, 5, 8, 108, 233, 298, 426, 436, 685, 807, 871, 881, 94~
## $ date <date> 2006-03-23, 2006-02-08, 2005-10-24, 2005-12-07, 2006-07-27, 2~
## $ Ag_Q1 <dbl> 2, 0, 0, 4, 6, 5, 5, 4, 2, 5, 4, 3, 2, 4, 5, 3, 5, 5, 5, 4, 4,~
## $ Ag_Q2 <dbl> 1, 6, 6, 0, 5, 4, 5, 3, 4, 3, 5, 1, 5, 4, 2, 6, 5, 5, 5, 5, 2,~
## $ Ag_Q3 <dbl> 1, 0, 1, 1, 0, 4, 4, 4, 3, 3, 4, 4, 3, 4, 4, 5, 5, 4, 5, 3, 4,~
## $ Ag_Q4 <dbl> 1, 0, 1, 4, 6, 5, 5, 6, 6, 6, 4, 2, 4, 5, 4, 5, 6, 4, 5, 6, 5,~
## $ Ag_Q5 <dbl> 3, 6, 5, 0, 2, 5, 6, 2, 2, 3, 4, 1, 3, 5, 2, 6, 5, 6, 5, 3, 3,~
## $ Ag_Q6 <dbl> NA, 6, 5, 2, 3, 4, 5, 6, 1, 3, 4, 2, 3, 5, 1, 6, 2, 6, 6, 5, 3~
## $ Ag_Q7 <dbl> 3, 0, 1, 1, 1, 3, 3, 5, 0, 3, 2, 1, 2, 3, 5, 6, 4, 4, 6, 6, 2,~
## $ Co_Q1 <dbl> 3, 0, 0, 3, 5, 4, 3, 4, 5, 3, 3, 3, 1, 5, 5, 4, 4, 5, 6, 4, 2,~
## $ Co_Q10 <dbl> 1, 6, 5, 5, 3, 5, 1, 2, 5, 2, 4, 3, 4, 4, 3, 2, 5, 5, 5, 2, 2,~
## $ Co_Q2 <dbl> 3, 0, 0, 3, 4, 3, 3, 4, 5, 3, 5, 3, 3, 4, 5, 1, 5, 4, 5, 2, 5,~
## $ Co_Q3 <dbl> 2, 0, 1, 3, 4, 4, 5, 4, 5, 3, 4, 3, 4, 4, 5, 4, 2, 4, 5, 2, 2,~
## $ Co_Q4 <dbl> 3, 6, 5, 5, 5, 3, 2, 4, 3, 1, 4, 3, 1, 2, 4, 2, NA, 5, 6, 1, 1~
## $ Co_Q5 <dbl> 0, 6, 5, 5, 5, 3, 3, 1, 5, 1, 2, 4, 4, 4, 2, 1, 6, 4, 3, 1, 3,~
## $ Co_Q6 <dbl> 6, 0, 1, 4, 6, 5, 6, 5, 4, 3, 5, 5, 4, 6, 6, 1, 3, 4, 5, 4, 6,~
## $ Co_Q7 <dbl> 3, 6, 5, 1, 3, 4, NA, 2, 3, 3, 2, 2, 4, 2, 5, 2, 5, 5, 3, 1, 1~
## $ Co_Q8 <dbl> 3, 0, 1, 1, 3, 4, 3, 0, 1, 3, 2, 2, 1, 2, 4, 3, 2, 4, 5, 2, 6,~
## $ Co_Q9 <dbl> 3, 6, 5, 4, 3, 4, 5, 3, 5, 3, 4, 3, 4, 4, 2, 4, 6, 5, 5, 2, 2,~
## $ Ex_Q1 <dbl> 3, 0, 0, 2, 2, 4, 4, 3, 5, 4, 1, 1, 3, 3, 1, 3, 5, 1, 0, 4, 1,~
## $ Ex_Q2 <dbl> 3, 0, 0, 3, 3, 4, 5, 2, 5, 3, 4, 1, 3, 2, 1, 6, 5, 3, 4, 4, 1,~
## $ Ex_Q3 <dbl> 3, 6, 5, 5, 3, 3, 3, 0, 6, 1, 4, 2, 3, 2, 1, 2, 5, 1, 0, 5, 5,~
## $ Ex_Q4 <dbl> 1, 0, 1, 3, 3, 3, 4, 3, 5, 3, 2, 0, 3, 3, 1, 2, NA, 4, 4, 4, 1~
## $ Ex_Q5 <dbl> 3, 0, 1, 6, 3, 3, 4, 2, 5, 2, 2, 4, 2, 3, 0, 4, 5, 2, 3, 1, 1,~
## $ Ex_Q6 <dbl> 3, 6, 5, 3, 0, 4, 3, 1, 6, 3, 2, 1, 4, 2, 1, 5, 6, 2, 1, 2, 1,~
## $ Ex_Q7 <dbl> 3, 6, 5, 4, 1, 2, 5, 3, 6, 3, 4, 3, 5, 1, 1, 6, 6, 3, 1, 1, 3,~
## $ Ex_Q8 <dbl> 2, 0, 1, 4, 3, 4, 2, 4, 6, 2, 4, 0, 4, 4, 1, 3, 5, 4, 3, 1, 1,~
## $ Ex_Q9 <dbl> 4, 6, 5, 5, 5, 2, 3, 3, 6, 3, 3, 4, 4, 3, 2, 5, 5, 4, 4, 0, 4,~
## $ Ne_Q1 <dbl> 4, 0, 0, 4, 1, 2, 3, 4, 0, 3, 3, 3, 2, 1, 1, 3, 4, 5, 2, 4, 5,~
## $ Ne_Q2 <dbl> 0, 6, 6, 4, 2, 1, 2, 3, 1, 2, 5, 5, 3, 1, 1, 1, 1, 6, 1, 2, 5,~
## $ Ne_Q3 <dbl> 0, 0, 0, 1, 0, 1, 4, 4, 0, 4, 2, 5, 1, 2, 5, 5, 2, 2, 1, 2, 5,~
## $ Ne_Q4 <dbl> 3, 6, 6, 2, 3, 2, 3, 3, 0, 4, 4, 5, 5, 4, 5, 3, 2, 5, 2, 4, 5,~
## $ Ne_Q5 <dbl> 3, 0, 1, 4, 1, 1, 4, 5, 0, 3, 4, 6, 2, 0, 1, 1, 0, 4, 3, 1, 5,~
## $ Ne_Q6 <dbl> 1, 6, 5, 1, 0, 1, 3, 4, 0, 4, 4, 5, 2, 1, 5, 6, 1, 2, 2, 3, 5,~
## $ Ne_Q7 <dbl> NA, 0, 1, 2, 0, 2, 4, 4, 0, 3, 2, 5, 1, 2, 5, 2, 2, 4, 1, 3, 5~
## $ Ne_Q8 <dbl> 2, 0, 1, 1, 1, 1, 5, 4, 0, 4, 4, 5, 1, 2, 5, 2, 1, 5, 1, 2, 5,~
## $ Op_Q1 <dbl> 3, 6, 6, 6, 6, 3, 3, 6, 6, 3, 4, 5, 5, 5, 6, 4, 1, 2, 5, 6, 4,~
## $ Op_Q2 <dbl> 6, 0, 0, 4, 6, 4, 4, 0, 0, 3, 4, 3, 3, 4, 5, 3, 3, 4, 1, 6, 6,~
## $ Op_Q3 <dbl> 2, 6, 5, 5, 5, 4, 3, 2, 4, 3, 3, 6, 5, 5, 6, 5, 4, 4, 3, 6, 5,~
## $ Op_Q4 <dbl> 3, 0, 1, 6, 6, 3, 3, 0, 6, 3, 4, 5, 4, 5, 6, 6, 2, 2, 4, 5, 5,~
## $ Op_Q5 <dbl> 6, 6, 5, 2, 5, 4, 3, 2, 6, 6, 2, 4, 3, 4, 6, 6, 6, 5, 3, 3, 5,~
## $ Op_Q6 <dbl> 0, 6, 5, 1, 6, 4, 6, 0, 0, 3, 5, 3, 5, 5, 5, 2, 5, 1, 1, 6, 2,~
## $ Op_Q7 <dbl> 0, 6, 5, 5, 5, 4, 6, 2, 1, 3, 2, 4, 5, 5, 6, 3, 6, 5, 2, 6, 5,~
以下計算變項A,B平均數與變異數的R程式碼,總共有多少行?
Pipes are a way to order your code in a more readable format.
Let’s say you have a small data table with 10 participant IDs, two columns with variable type A, and 2 columns with variable type B. You want to calculate the mean of the A variables and the mean of the B variables and return a table with 10 rows (1 for each participant) and 3 columns (id
, A_mean
and B_mean
).
One way you could do this is by creating a new object at every step and using that object in the next step. This is pretty clear, but you’ve created 6 unnecessary data objects in your environment. This can get confusing in very long scripts.
# make a data table with 10 subjects
data_original <- tibble(
id = 1:10,
A1 = rnorm(10, 0),
A2 = rnorm(10, 1),
B1 = rnorm(10, 2),
B2 = rnorm(10, 3)
)
# gather columns A1 to B2 into "variable" and "value" columns
data_gathered <- gather(data_original, variable, value, A1:B2)
# separate the variable column at the _ into "var" and "var_n" columns
data_separated <- separate(data_gathered, variable, c("var", "var_n"), sep = 1)
# group the data by id and var
data_grouped <- group_by(data_separated, id, var)
# calculate the mean value for each id/var
data_summarised <- summarise(data_grouped, mean = mean(value))
## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# spread the mean column into A and B columns
data_spread <- spread(data_summarised, var, mean)
# rename A and B to A_mean and B_mean
data <- rename(data_spread, A_mean = A, B_mean = B)
data
## # A tibble: 10 x 3
## # Groups: id [10]
## id A_mean B_mean
## <int> <dbl> <dbl>
## 1 1 1.31 2.01
## 2 2 1.14 0.845
## 3 3 0.0665 2.54
## 4 4 0.310 2.48
## 5 5 0.0945 2.35
## 6 6 -0.645 2.41
## 7 7 0.495 2.36
## 8 8 0.441 2.66
## 9 9 0.250 3.12
## 10 10 0.557 3.43
You can name each object data
and keep replacing the old data object with the new one at each step. This will keep your environment clean, but I don’t recommend it because it makes it too easy to accidentally run your code out of order when you are running line-by-line for development or debugging.
One way to avoid extra objects is to nest your functions, literally replacing each data object with the code that generated it in the previous step. This can be fine for very short chains.
其實一個函式的輸出,可做為另一個函式的參數。
mean_petal_width <- round(mean(iris$Petal.Width), 2)
But it gets extremely confusing for long chains:
運用這個特性,我們可以將以上範例的R程式碼改寫成以下示範:
# do not ever do this!!
data <- rename(
spread(
summarise(
group_by(
separate(
gather(
tibble(
id = 1:10,
A1 = rnorm(10, 0),
A2 = rnorm(10, 1),
B1 = rnorm(10, 2),
B2 = rnorm(10,3)),
variable, value, A1:B2),
variable, c("var", "var_n"), sep = 1),
id, var),
mean = mean(value)),
var, mean),
A_mean = A, B_mean = B)
## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
The pipe lets you “pipe” the result of each function into the next function, allowing you to put your code in a logical order without creating too many extra objects.
但是這種程式寫作模式違反輸入到輸出的關連性,也不容易除錯,管道運算子幫了一些忙。
# calculate mean of A and B variables for each participant
data <- tibble(
id = 1:10,
A1 = rnorm(10, 0),
A2 = rnorm(10, 1),
B1 = rnorm(10, 2),
B2 = rnorm(10,3)
) %>%
gather(variable, value, A1:B2) %>%
separate(variable, c("var", "var_n"), sep=1) %>%
group_by(id, var) %>%
summarise(mean = mean(value)) %>%
spread(var, mean) %>%
rename(A_mean = A, B_mean = B)
## `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
You can read this code from top to bottom as follows:
data
with
variable
and value
column from columns A_1
to B_2
; and thenvariable
into 2 new columns called var
and var_n
, separate at character 1; and thenid
and var
; and thenmean
as the mean of the value
column for each group; and thenvar
and values in mean
; and thenA_mean
(old A
) and B_mean
(old B
)You can make intermediate objects whenever you need to break up your code because it’s getting too complicated or you need to debug something.
You can debug a pipe by highlighting from the beginning to just before the pipe you want to stop at. Try this by highlighting from data <-
to the end of the separate
function and typing cmd-return. What does data
look like now?
Chain all the steps above using pipes.
來看看如何用管道運算子清理資料
personality_sum <- read_csv("https://psyteachr.github.io/msc-data-skills/data/personality.csv") %>%
gather("question", "score", Op1:Ex9) %>%
separate(question, c("domain", "qnumber"), sep = 2) %>%
unite("domain_n", domain, qnumber, sep = "_Q") %>%
spread(domain_n, score)
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## date = col_date(format = "")
## )
## i<U+00A0>Use `spec()` for the full column specifications.
Get data on infant mortality rates from the CSV file infmort.csv
in the directory data
.
infmort <- read_csv("../data/infmort.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Country = col_character(),
## Year = col_double(),
## `Infant mortality rate (probability of dying between birth and age 1 per 1000 live births)` = col_character()
## )
head(infmort)
## # A tibble: 6 x 3
## Country Year `Infant mortality rate (probability of dying between birth a~
## <chr> <dbl> <chr>
## 1 Afghanist~ 2015 66.3 [52.7-83.9]
## 2 Afghanist~ 2014 68.1 [55.7-83.6]
## 3 Afghanist~ 2013 69.9 [58.7-83.5]
## 4 Afghanist~ 2012 71.7 [61.6-83.7]
## 5 Afghanist~ 2011 73.4 [64.4-84.2]
## 6 Afghanist~ 2010 75.1 [66.9-85.1]
infmort
資料集的待清理部分是什麼?
{#data-matmort} Get data on maternal mortality from from the excel file matmort.xls
in the directory data
matmort <- read_xls("../data/matmort.xls")
glimpse(matmort)
## Rows: 181
## Columns: 4
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "A~
## $ `1990` <chr> "1 340 [ 878 - 1 950]", "71 [ 58 - 88]", "216 [ 141 - 327]",~
## $ `2000` <chr> "1 100 [ 745 - 1 570]", "43 [ 33 - 56]", "170 [ 118 - 241]",~
## $ `2015` <chr> "396 [ 253 - 620]", "29 [ 16 - 46]", "140 [ 82 - 244]", "47~
matmort
資料集的待清理部分是什麼?
matmort
matmort
is in wide format, with a separate column for each year. Change it to long format, with a row for each County/Year observation.
This example is complicated because the column names to gather are numbers. If the column names are non-standard (e.g., have spaces, start with numbers, or have special characters), you can enclose them in backticks (`) like the example below.
轉換方法1
matmort_long <- matmort %>%
pivot_longer(cols = `1990`:`2015`,
names_to = "Year",
values_to = "stats")
glimpse(matmort_long)
## Rows: 543
## Columns: 3
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Albania", "Alban~
## $ Year <chr> "1990", "2000", "2015", "1990", "2000", "2015", "1990", "2000"~
## $ stats <chr> "1 340 [ 878 - 1 950]", "1 100 [ 745 - 1 570]", "396 [ 253 - ~
轉換方法2
matmort_long <- matmort %>%
gather("Year", "stats", `1990`:`2015`)
glimpse(matmort_long)
## Rows: 543
## Columns: 3
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "A~
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990"~
## $ stats <chr> "1 340 [ 878 - 1 950]", "71 [ 58 - 88]", "216 [ 141 - 327]",~
The data in the stats
column is in an unusual format with some sort of confidence interval in brackets and lots of extra spaces. We don’t need any of the spaces, so first we’ll remove them with mutate
.
The separate
function will separate your data on anything that is not a number or letter, so try it first without specifying the sep
argument. The into
argument is a list of the new column names.
matmort_split <- matmort_long %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(stats, c("rate", "ci_low", "ci_hi"))
## Warning: Expected 3 pieces. Additional pieces discarded in 543 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
glimpse(matmort_split)
## Rows: 543
## Columns: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "A~
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990"~
## $ rate <chr> "1340", "71", "216", "1160", "72", "58", "8", "8", "64", "46",~
## $ ci_low <chr> "878", "58", "141", "627", "64", "51", "7", "7", "56", "34", "~
## $ ci_hi <chr> "1950", "88", "327", "2020", "80", "65", "9", "10", "74", "61"~
留意輸出的警告訊息!
The gsub(pattern, replacement, x)
function is a flexible way to do search and replace. The example above replaces all occurances of the pattern
" " (a space), with the replacement
"" (nothing), in the string x
(the stats
column). Use sub()
instead if you only want to replace the first occurance of a pattern. We only used a simple pattern here, but you can use more complicated regex patterns to replace, for example, all even numbers (e.g., gsub(“[:02468:]”, "“,”id = 123456“)
) or all occurances of the word colour in US or UK spelling (e.g., gsub(”colo(u)?r“,”**“,”replace color, colour, or colours, but not collors")
).
extra
處理分割數值
The previous example should have given you an error warning about “Too many values at 543 locations”. This is because separate
splits the column at the brackets and dashes, so the text 100[90-110]
would split into four values c(“100”, “90”, “110”, "“)
, but we only specified 3 new columns. The fourth value is always empty (just the part after the last bracket), so we are happy to drop it, but separate
generates a warning so you don’t do that accidentally. You can turn off the warning by adding the extra
argument and setting it to “drop”. Look at the help for ??tidyr::separate
to see what the other options do.
matmort_split <- matmort_long %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(stats, c("rate", "ci_low", "ci_hi"), extra = "drop")
glimpse(matmort_split)
## Rows: 543
## Columns: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "A~
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990"~
## $ rate <chr> "1340", "71", "216", "1160", "72", "58", "8", "8", "64", "46",~
## $ ci_low <chr> "878", "58", "141", "627", "64", "51", "7", "7", "56", "34", "~
## $ ci_hi <chr> "1950", "88", "327", "2020", "80", "65", "9", "10", "74", "61"~
infmort
Now do the same with infmort
. It’s already in long format, so you don’t need to use gather
, but the third column has a crazy long name, so we can just refer to it by its column number (3).
infmort_split <- infmort %>%
separate(3, c("rate", "ci_low", "ci_hi"), extra = "drop")
glimpse(infmort_split)
## Rows: 5,044
## Columns: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A~
## $ Year <dbl> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 20~
## $ rate <chr> "66", "68", "69", "71", "73", "75", "76", "78", "80", "82", "8~
## $ ci_low <chr> "3", "1", "9", "7", "4", "1", "8", "6", "4", "3", "4", "7", "0~
## $ ci_hi <chr> "52", "55", "58", "61", "64", "66", "69", "71", "73", "75", "7~
留意輸出的資料內容!
Wait, that didn’t work at all! It split the column on spaces, brackets, and full stops. We just want to split on the spaces, brackets and dashes. So we need to manually set sep
to what the delimiters are. Also, once there are more than a few arguments specified for a function, it’s easier to read them if you put one argument on each line.
You can use regular expressions to separate complex columns. Here, we want to separate on dashes and brackets. You can separate on a list of delimiters by putting them in parentheses, separated by “|”. It’s a little more complicated because brackets have a special meaning in regex, so you need to “escape” the left one with two backslashes “\”.
sep
設定分隔字符再做一次清理infmort
infmort_split <- infmort %>%
separate(
col = 3,
into = c("rate", "ci_low", "ci_hi"),
extra = "drop",
sep = "(\\[|-|])"
)
glimpse(infmort_split)
## Rows: 5,044
## Columns: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A~
## $ Year <dbl> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 20~
## $ rate <chr> "66.3 ", "68.1 ", "69.9 ", "71.7 ", "73.4 ", "75.1 ", "76.8 ",~
## $ ci_low <chr> "52.7", "55.7", "58.7", "61.6", "64.4", "66.9", "69.0", "71.2"~
## $ ci_hi <chr> "83.9", "83.6", "83.5", "83.7", "84.2", "85.1", "86.1", "87.3"~
留意輸出資料內容與型態
convert
重設變項資料型態That’s better. Notice the Year
, rate
, ci_low
and ci_hi
. That means these columns hold characters (like words), not numbers or integers. This can cause problems when you try to do thigs like average the numbers (you can’t average words), so we can fix it by adding the argument convert
and setting it to TRUE
.
再做一次清理infmort
infmort_split <- infmort %>%
separate(col = 3,
into = c("rate", "ci_low", "ci_hi"),
extra = "drop",
sep = "(\\[|-|])",
convert = TRUE)
glimpse(infmort_split)
## Rows: 5,044
## Columns: 5
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A~
## $ Year <dbl> 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 20~
## $ rate <dbl> 66.3, 68.1, 69.9, 71.7, 73.4, 75.1, 76.8, 78.6, 80.4, 82.3, 84~
## $ ci_low <dbl> 52.7, 55.7, 58.7, 61.6, 64.4, 66.9, 69.0, 71.2, 73.4, 75.5, 77~
## $ ci_hi <dbl> 83.9, 83.6, 83.5, 83.7, 84.2, 85.1, 86.1, 87.3, 88.9, 90.7, 92~
留意輸出資料內容與型態
Do the same for matmort
.
再做一次清理matmort
matmort_split <- matmort_long %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(col = stats,
into = c("rate", "ci_low", "ci_hi"),
extra = "drop",
convert = TRUE)
glimpse(matmort_split)
## Rows: 543
## Columns: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "A~
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990"~
## $ rate <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58, 33, 9,~
## $ ci_low <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, 28, 7, 4~
## $ ci_hi <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 72, 38, 1~
留意輸出資料內容與型態
We can chain all the steps above together, since we don’t need those intermediate dataframes.
matmort2<- dataskills::matmort %>%
gather("Year", "stats", `1990`:`2015`) %>%
mutate(stats = gsub(" ", "", stats)) %>%
separate(
col = stats,
into = c("rate", "ci_low", "ci_hi"),
extra = "drop",
convert = TRUE
) %>%
glimpse()
## Rows: 543
## Columns: 5
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "A~
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", "1990", "1990"~
## $ rate <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58, 33, 9,~
## $ ci_low <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, 28, 7, 4~
## $ ci_hi <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 72, 38, 1~
Spread out the infant mortality rate by year.
matmort_wide <- matmort2 %>%
spread(key = Year, value = rate) %>%
print()
## # A tibble: 542 x 6
## Country ci_low ci_hi `1990` `2000` `2015`
## <chr> <int> <int> <int> <int> <int>
## 1 Afghanistan 253 620 NA NA 396
## 2 Afghanistan 745 1570 NA 1100 NA
## 3 Afghanistan 878 1950 1340 NA NA
## 4 Albania 16 46 NA NA 29
## 5 Albania 33 56 NA 43 NA
## 6 Albania 58 88 71 NA NA
## 7 Algeria 82 244 NA NA 140
## 8 Algeria 118 241 NA 170 NA
## 9 Algeria 141 327 216 NA NA
## 10 Angola 221 988 NA NA 477
## # ... with 532 more rows
留意新表單有太多遺漏值(NA)
Nope, that didn’t work at all, but it’s a really common mistake when spreading data. This is because spread
matches on all the remaining columns, so Afghanistan with ci_low
of 52.7 is treated as a different observation than Afghanistan with ci_low
of 55.7. We can fix this by merging the rate
, ci_low
and ci_hi
columns back together.
This is where pivot_wider()
can be very useful. You can set values_from
to multiple column names and their names will be added to the names_from
values.
matmort_wide <- matmort2 %>%
pivot_wider(
names_from = Year,
values_from = c(rate, ci_low, ci_hi)
)
glimpse(matmort_wide)
## Rows: 181
## Columns: 10
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina"~
## $ rate_1990 <int> 1340, 71, 216, 1160, 72, 58, 8, 8, 64, 46, 26, 569, 58, 33~
## $ rate_2000 <int> 1100, 43, 170, 924, 60, 40, 9, 5, 48, 61, 21, 399, 48, 26,~
## $ rate_2015 <int> 396, 29, 140, 477, 52, 25, 6, 4, 25, 80, 15, 176, 27, 4, 7~
## $ ci_low_1990 <int> 878, 58, 141, 627, 64, 51, 7, 7, 56, 34, 20, 446, 47, 28, ~
## $ ci_low_2000 <int> 745, 33, 118, 472, 54, 35, 8, 4, 42, 50, 18, 322, 38, 22, ~
## $ ci_low_2015 <int> 253, 16, 82, 221, 44, 21, 5, 3, 17, 53, 12, 125, 19, 3, 5,~
## $ ci_hi_1990 <int> 1950, 88, 327, 2020, 80, 65, 9, 10, 74, 61, 33, 715, 72, 3~
## $ ci_hi_2000 <int> 1570, 56, 241, 1730, 65, 46, 10, 6, 55, 74, 26, 496, 58, 3~
## $ ci_hi_2015 <int> 620, 46, 244, 988, 63, 31, 7, 5, 35, 124, 19, 280, 37, 6, ~
慎選使用寬表格的時機與方法!
Students in the Institute of Neuroscience and Psychology at the University of Glasgow can use the online experiment builder platform, Experimentum. The platform is also open source on github for anyone who can install it on a web server. It allows you to group questionnaires and experiments into projects with randomisation and counterbalancing. Data for questionnaires and experiments are downloadable in long format, but researchers often need to put them in wide format for analysis.
Look at the help menu for built-in dataset dataskills::experimentum_quests to learn what each column is. Subjects are asked questions about dogs to test the different questionnaire response types.
names(dataskills::experimentum_quests)
## [1] "session_id" "project_id" "quest_id" "user_id" "user_sex"
## [6] "user_status" "user_age" "q_name" "q_id" "order"
## [11] "dv" "starttime" "endtime"
dataskills::experimentum_quests$q_name %>%
unique()
## [1] "current" "past" "name" "good" "country"
## [6] "text" "good_borzoi" "time"
dataskills::experimentum_quests$dv %>% head(8)
## [1] "1"
## [2] "1"
## [3] "Ruby"
## [4] "7"
## [5] "BG"
## [6] "Double syllable names are best for dogs. One syllable like \"max\" does not contain enough notage/sound to engage the proper response. I do not know where borzois orginate... I assume eastern europe. Dogs can tell our emotions by tone and body language just like we can tell what type of bark they are barking."
## [7] "32"
## [8] "1"
To get the dataset into wide format, where each question is in a separate column, use the following code:
q <- dataskills::experimentum_quests %>%
pivot_wider(id_cols = session_id:user_age,
names_from = q_name,
values_from = dv) %>%
type.convert(as.is = TRUE) %>%
print()
## # A tibble: 24 x 15
## session_id project_id quest_id user_id user_sex user_status user_age current
## <int> <int> <int> <int> <chr> <chr> <dbl> <int>
## 1 34034 1 1 31105 female guest 28.2 1
## 2 34104 1 1 31164 male registered 19.4 1
## 3 34326 1 1 31392 female guest 17 0
## 4 34343 1 1 31397 male guest 22 1
## 5 34765 1 1 31770 female guest 44 1
## 6 34796 1 1 31796 female guest 35.9 0
## 7 34806 1 1 31798 female guest 35 0
## 8 34822 1 1 31802 female guest 58 1
## 9 34864 1 1 31820 male guest 20 0
## 10 35014 1 1 31921 female student 39.2 1
## # ... with 14 more rows, and 7 more variables: past <int>, name <chr>,
## # good <int>, country <chr>, text <chr>, good_borzoi <int>, time <chr>
The responses in the dv column have multiple types (e.g., r glossary(“integer”)
, r glossary(“double”)
, and r glossary(“character”)
), but they are all represented as character strings when they’re in the same column. After you spread the data to wide format, each column should be given the ocrrect data type. The function type.convert()
makes a best guess at what type each new column should be and converts it. The argument as.is = TRUE
converts columns where none of the numbers have decimal places to integers.