Hi. In this post, I demonstrate the use of the dplyr package in R to perform some SQL like functions. The math field of set theory is used here as it helps with understanding.
Setting Up The Data
To start, load in the dplyr package with the code library(dplyr). Next, I create some sample "fake" data in R.
> colours1 <- c("Red", "Green", "Blue", "Yellow", "Black", "Maroon", "Purple", "Orange")
>
> counts1 <- c(17, 10, 8, 12, 20, 15, 8, 11)
> # SQL & Set Theory Like Functions In R With dplyr
> # Ref: https://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters
>
> # Sample Data Of Two Classrooms:
>
> library(dplyr)
>
> colours1 <- c("Red", "Green", "Blue", "Yellow", "Black", "Maroon", "Purple", "Orange")
>
> counts1 <- c(17, 10, 8, 12, 20, 15, 8, 11)
>
> # Create table:
> class1 <- data.frame("Colour" = colours1, "Count" = counts1, stringsAsFactors = FALSE)
>
> # Show table:
> class1
Colour Count
1 Red 17
2 Green 10
3 Blue 8
4 Yellow 12
5 Black 20
6 Maroon 15
7 Purple 8
8 Orange 11
>
> colours2 <- c("Red", "Green", "Blue", "Grey", "Black", "Coral", "Yellow", "Orange")
>
> counts2 <- c(20, 9, 18, 17, 10, 5, 9, 21)
>
> # Create table:
> class2 <- data.frame("Colour" = colours2, "Count" = counts2, stringsAsFactors = FALSE)
>
> # Show table:
> class2
Colour Count
1 Red 20
2 Green 9
3 Blue 18
4 Grey 17
5 Black 10
6 Coral 5
7 Yellow 9
8 Orange 21
dplyr's Join Functions
Left Join
The left join shows rows from the left side data frame. In this case, it is from class1.
Right Join
With the right join, rows appear from the right side. The rows are from class2.
Inner Join
The inner join contains rows where there are no NA/missing values from each column.
Full Join
The full join contains all the rows and data from both class1 and class2.
For those familiar with Venn diagrams and set theory, here is a nice summary visual. Visuals such as these help with memorization.
This website link: http://r4ds.had.co.nz/relational-data.html contains a nice comparison chart between R's dplyr package and SQL queries. [Screenshot image below from Snipping tool.]
Semi Joins & Full Joins
The dplyr package also have semi joins and full joins. These are somewhat similar to left joins but not quite.
Semi Join
Anti Join
Set Operations In dplyr
For this section, I create two new sample tables in R.
The union, intersect and setdiff() functions are illustrated in the image below.
References & Resources
- R For Data Science Book [Online Version] - Chapter 13: http://r4ds.had.co.nz/relational-data.html (For Some Of The Images)
- R's dplyr Cheatsheet