真實的科學研究資料不只存在於一份表單，合併表單也是清理整頓的程序之一。

學習目標

結合式合併:依指定欄位結合所有變數
刪除式合併:依指定欄位保留左表單的觀察值
強制合併
依集合規律合併
推薦秘訣表：Data Import, Data Transformation, Regular Expressions

示範資料

首先製造兩份資料表單，subject存放參與者基本資料，exp存放參與者反應資料，共有變數欄位是id。

First, we’ll create two small data tables.

subject has id, sex and age for subjects 1-5. Age and sex are missing for subject 3.

subject <- tibble(
  id = seq(1,5),
  sex = c("m", "m", NA, "f", "f"),
  age = c(19, 22, NA, 19, 18)
)

id	sex	age
1	m	19
2	m	22
3	NA	NA
4	f	19
5	f	18

exp has subject id and the score from an experiment. Some subjects are missing, some completed twice, and some are not in the subject table.

exp <- tibble(
  id = c(2, 3, 4, 4, 5, 5, 6, 6, 7),
  score = c(10, 18, 21, 23, 9, 11, 11, 12, 3)
)

id	score
2	10
3	18
4	21
4	23
5	9
5	11
6	11
6	12
7	3

結合式合併(Mutating Joins)

4種依指定欄位含併資料表單的函式，都會有的參數如下

All the mutating joins have this basic syntax:

使用這種函式合併的表單，必須是長表單。

****_join(x, y, by = NULL, suffix = c(".x", ".y"))

x = the first (left) table
y = the second (right) table
{#join-by} by = what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
{#join-suffix} suffix = if columns have the same name in the two tables, but you aren’t joining by them, they get a suffix to make them unambiguous. This defaults to “.x” and “.y”, but you can change it to something more meaningful.

You can leave out the by argument if you’re matching on all of the columns with the same name, but it’s good practice to always specify it so your code is robust to changes in the loaded data.

朝左表單合併

left_join()

Left Join

A left_join keeps all the data from the first (left) table and joins anything that matches from the second (right) table. If the right table has more than one match for a row in the right table, there will be more than one row in the joined table (see ids 4 and 5).

left_join(subject, exp, by = "id")

## # A tibble: 7 x 4
##      id sex     age score
##   <dbl> <chr> <dbl> <dbl>
## 1     1 m        19    NA
## 2     2 m        22    10
## 3     3 <NA>     NA    18
## 4     4 f        19    21
## 5     4 f        19    23
## 6     5 f        18     9
## 7     5 f        18    11

Left Join (reversed)

The order of tables is swapped here, so the result is all rows from the exp table joined to any matching rows from the subject table.

left_join(exp, subject, by = "id")

## # A tibble: 9 x 4
##      id score sex     age
##   <dbl> <dbl> <chr> <dbl>
## 1     2    10 m        22
## 2     3    18 <NA>     NA
## 3     4    21 f        19
## 4     4    23 f        19
## 5     5     9 f        18
## 6     5    11 f        18
## 7     6    11 <NA>     NA
## 8     6    12 <NA>     NA
## 9     7     3 <NA>     NA

朝右表單合併

right_join()

Right Join

A right_join keeps all the data from the second (right) table and joins anything that matches from the first (left) table.

right_join(subject, exp, by = "id")

## # A tibble: 9 x 4
##      id sex     age score
##   <dbl> <chr> <dbl> <dbl>
## 1     2 m        22    10
## 2     3 <NA>     NA    18
## 3     4 f        19    21
## 4     4 f        19    23
## 5     5 f        18     9
## 6     5 f        18    11
## 7     6 <NA>     NA    11
## 8     6 <NA>     NA    12
## 9     7 <NA>     NA     3

This table has the same information as left_join(exp, subject, by = “id”), but the columns are in a different order (left table, then right table).

變數內容交集合併

inner_join()

Inner Join

An inner_join returns all the rows that have a match in the other table.

inner_join(subject, exp, by = "id")

## # A tibble: 6 x 4
##      id sex     age score
##   <dbl> <chr> <dbl> <dbl>
## 1     2 m        22    10
## 2     3 <NA>     NA    18
## 3     4 f        19    21
## 4     4 f        19    23
## 5     5 f        18     9
## 6     5 f        18    11

變數內容聯集合併

full_join()

Full Join

A full_join lets you join up rows in two tables while keeping all of the information from both tables. If a row doesn’t have a match in the other table, the other table’s column values are set to NA.

full_join(subject, exp, by = "id")

## # A tibble: 10 x 4
##       id sex     age score
##    <dbl> <chr> <dbl> <dbl>
##  1     1 m        19    NA
##  2     2 m        22    10
##  3     3 <NA>     NA    18
##  4     4 f        19    21
##  5     4 f        19    23
##  6     5 f        18     9
##  7     5 f        18    11
##  8     6 <NA>     NA    11
##  9     6 <NA>     NA    12
## 10     7 <NA>     NA     3

刪除式合併(Filtering Joins)

依指定變數匹配內容保留左表單資料

semi_join()

Semi Join

A semi_join returns all rows from the left table where there are matching values in the right table, keeping just columns from the left table.

semi_join(subject, exp, by = "id")

## # A tibble: 4 x 3
##      id sex     age
##   <int> <chr> <dbl>
## 1     2 m        22
## 2     3 <NA>     NA
## 3     4 f        19
## 4     5 f        18

subject$id %in% exp$id

## [1] FALSE  TRUE  TRUE  TRUE  TRUE

Unlike an inner join, a semi join will never duplicate the rows in the left table if there is more than one maching row in the right table.

Semi Join (Reversed)

Order matters in a semi join.

semi_join(exp, subject, by = "id")

## # A tibble: 6 x 2
##      id score
##   <dbl> <dbl>
## 1     2    10
## 2     3    18
## 3     4    21
## 4     4    23
## 5     5     9
## 6     5    11

exp$id %in% subject$id

## [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

依指定變數不匹配內容保留左表單資料

anti_join()

Anti Join

A anti_join return all rows from the left table where there are not matching values in the right table, keeping just columns from the left table.

anti_join(subject, exp, by = "id")

## # A tibble: 1 x 3
##      id sex     age
##   <int> <chr> <dbl>
## 1     1 m        19

!(subject$id %in% exp$id)

## [1]  TRUE FALSE FALSE FALSE FALSE

Anti Join (Reversed)

Order matters in an anti join.

anti_join(exp, subject, by = "id")

## # A tibble: 3 x 2
##      id score
##   <dbl> <dbl>
## 1     6    11
## 2     6    12
## 3     7     3

!(exp$id %in% subject$id)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

強制合併(Binding Joins)

循列合併

bind_rows()

You can combine the rows of two tables with bind_rows.

Here we’ll add subject data for subjects 6-9 and bind that to the original subject table.

new_subjects <- tibble(
  id = seq(6, 9),
  sex = c("m", "m", "f", "f"),
  age = c(19, 16, 20, 19)
)

bind_rows(subject, new_subjects)

## # A tibble: 9 x 3
##      id sex     age
##   <int> <chr> <dbl>
## 1     1 m        19
## 2     2 m        22
## 3     3 <NA>     NA
## 4     4 f        19
## 5     5 f        18
## 6     6 m        19
## 7     7 m        16
## 8     8 f        20
## 9     9 f        19

The columns just have to have the same names, they don’t have to be in the same order. Any columns that differ between the two tables will just have NA values for entries from the other table.

If a row is duplicated between the two tables (like id 5 below), the row will also be duplicated in the resulting table. If your tables have the exact same columns, you can use union() (see below) to avoid duplicates.

new_subjects <- tibble(
  id = seq(5, 9),
  age = c(18, 19, 16, 20, 19),
  sex = c("f", "m", "m", "f", "f"),
  new = c(1,2,3,4,5)
)

bind_rows(subject, new_subjects)

## # A tibble: 10 x 4
##       id sex     age   new
##    <int> <chr> <dbl> <dbl>
##  1     1 m        19    NA
##  2     2 m        22    NA
##  3     3 <NA>     NA    NA
##  4     4 f        19    NA
##  5     5 f        18    NA
##  6     5 f        18     1
##  7     6 m        19     2
##  8     7 m        16     3
##  9     8 f        20     4
## 10     9 f        19     5

循欄合併

bind_cols()

You can merge two tables with the same number of rows using bind_cols. This is only useful if the two tables have their rows in the exact same order. The only advantage over a left join is when the tables don’t have any IDs to join by and you have to rely solely on their order.

new_info <- tibble(
  colour = c("red", "orange", "yellow", "green", "blue")
)

bind_cols(subject, new_info)

## # A tibble: 5 x 4
##      id sex     age colour
##   <int> <chr> <dbl> <chr> 
## 1     1 m        19 red   
## 2     2 m        22 orange
## 3     3 <NA>     NA yellow
## 4     4 f        19 green 
## 5     5 f        18 blue

依集合規律合併(Set Operations)

限用在合併欄位變數名稱，以及觀察值資料型態一致的兩份表單。

intersect()

合併保留觀察資料一致的欄位。

intersect() returns all rows in two tables that match exactly. The columns don’t have to be in the same order.

new_subjects <- tibble(
  id = seq(4, 9),
  age = c(19, 18, 19, 16, 20, 19),
  sex = c("f", "f", "m", "m", "f", "f")
)

dplyr::intersect(subject, new_subjects)

## # A tibble: 2 x 3
##      id sex     age
##   <int> <chr> <dbl>
## 1     4 f        19
## 2     5 f        18

id	age	sex
4	19	f
5	18	f
6	19	m
7	16	m
8	20	f
9	19	f

union()

合併保留所有觀察資料的欄位。

union() returns all the rows from both tables, removing duplicate rows.

dplyr::union(subject, new_subjects)

## # A tibble: 9 x 3
##      id sex     age
##   <int> <chr> <dbl>
## 1     1 m        19
## 2     2 m        22
## 3     3 <NA>     NA
## 4     4 f        19
## 5     5 f        18
## 6     6 m        19
## 7     7 m        16
## 8     8 f        20
## 9     9 f        19

setdiff()

合併保留只存在左表單的欄位。

setdiff returns rows that are in the first table, but not in the second table.

setdiff(subject, new_subjects)

## # A tibble: 3 x 3
##      id sex     age
##   <int> <chr> <dbl>
## 1     1 m        19
## 2     2 m        22
## 3     3 <NA>     NA

Order matters for setdiff.

setdiff(new_subjects, subject)

## # A tibble: 4 x 3
##      id   age sex  
##   <int> <dbl> <chr>
## 1     6    19 m    
## 2     7    16 m    
## 3     8    20 f    
## 4     9    19 f

結合資料表單

陳紹慶

2021/4/29