dplyr: R Package - A Comprehensive List of Functions Available in dplyr, with Examples

dplyr is a grammar of data manipulation. It provides a consistent set of 'verbs' (functions) that cover the most common data manipulation tasks.

The dplyr package was designed with three main goals:

    Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.

    Provide blazing fast performance for in-memory data by writing key pieces in C++.

    Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table or database.

dplyr Functions along with Examples

Note - All examples below use the 'iris' dataset from the 'datasets' package. Load dplyr first with library(dplyr). If needed, load the dataset with data("iris").

1) glimpse()
Purpose : This dplyr function shows the structure of a data frame / tibble.
 It is very similar to the str() function, but it adapts the output to the width of the screen.

Example : 
> glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5....
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3....
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1....
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0....
$ Species      <fct> setosa, setosa, setosa, set...


2) select()
Purpose : This function selects a specific set of columns from a dataset.
  There are also special helper functions that are meant to be used inside select().

Example 1 : To select only 2 specific columns from the dataset (either by column name or by column position)
> select(iris , Sepal.Width , Species)
OR
> select(iris , 2 , 5)
    Sepal.Width    Species
1            3.5     setosa
2            3.0     setosa
3            3.2     setosa
4            3.1     setosa
5            3.6     setosa
6            3.9     setosa

Example 2 : To exclude certain column(s) from the selection
> select(iris , -Sepal.Width , -Species)
OR
> select(iris , -c(Sepal.Width , Species))
OR 
> select(iris , -c(2 , 5))
  Sepal.Length Petal.Length Petal.Width
1          5.1          1.4         0.2
2          4.9          1.4         0.2
3          4.7          1.3         0.2
4          4.6          1.5         0.2
5          5.0          1.4         0.2
6          5.4          1.7         0.4

Example 3 : starts_with() is one of the 'select helpers' available in dplyr for use with select().
    starts_with() selects only those columns whose names begin with the prefix given as argument. Matching ignores case by default, so ignore.case = TRUE is spelled out here only for clarity.
> select(iris , starts_with('sepal' , ignore.case = TRUE))
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

Example 4 : ends_with() is one of the 'select helpers' available in dplyr for use with select().
    ends_with() selects only those columns whose names end with the suffix given as argument.
> select(iris , ends_with('width' , ignore.case = TRUE))
  Sepal.Width Petal.Width
1         3.5         0.2
2         3.0         0.2
3         3.2         0.2
4         3.1         0.2
5         3.6         0.2
6         3.9         0.4

Example 5 : contains() is one of the 'select helpers' available in dplyr for use with select().
    contains() selects only those columns whose names contain the substring given as argument.
> select(iris , contains('Len' , ignore.case = TRUE))
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
6          5.4          1.7

Example 6 : everything() is one of the 'select helpers' available in dplyr for use with select().
    everything() selects all the columns in the dataset. It is handy for reordering: select(iris , Species , everything()) moves Species to the front.
> select(iris , everything())

Example 7 : one_of() is one of the 'select helpers' available in dplyr for use with select().
    one_of() selects all the columns whose names are given as values in a character vector. In recent dplyr releases it is superseded by all_of() and any_of().
> select(iris , one_of(c('Sepal.Length' , 'Species')))
  Sepal.Length Species
1          5.1  setosa
2          4.9  setosa
3          4.7  setosa
4          4.6  setosa
5          5.0  setosa
6          5.4  setosa
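
In newer code the same selection is usually written with all_of() or any_of(). Below is a minimal sketch, assuming dplyr 1.0 or later. The vector name 'wanted' and the column 'Not.A.Column' are made up for illustration.

```r
library(dplyr)

# Hypothetical vector of column names; 'Not.A.Column' does not exist in iris
wanted <- c("Sepal.Length", "Species", "Not.A.Column")

# any_of() silently skips names that are missing from the data
picked <- select(iris, any_of(wanted))
names(picked)

# all_of() is strict: it raises an error if any listed name is missing,
# so only existing names are passed here
strict <- select(iris, all_of(c("Sepal.Length", "Species")))
names(strict)
```

any_of() is the safer choice when the list of names comes from outside the script, while all_of() catches typos early.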

3) select_if()
Purpose : This function selects only those columns for which a predicate function (a function that returns TRUE or FALSE for a column) returns TRUE. In recent dplyr releases it is superseded by select() combined with where().

Example : Select only those columns from the iris dataset whose data type is 'factor'
> select_if(iris , is.factor)
  Species
1  setosa
2  setosa
3  setosa
4  setosa
5  setosa
6  setosa
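
The same idea can be written with select() and where(). This is a small sketch, assuming dplyr 1.0 or later. The object names are illustrative only.

```r
library(dplyr)

# Keep only factor columns (same result as select_if(iris, is.factor))
factor_cols <- select(iris, where(is.factor))

# Keep only numeric columns
numeric_cols <- select(iris, where(is.numeric))

names(factor_cols)
names(numeric_cols)
```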

4) filter()
Purpose : This function keeps only those rows of a dataset that satisfy the given condition(s).

Example 1: Show only those rows from the iris dataset where the value in the Species column is 'setosa'
> filter(iris , Species == 'setosa')

Example 2: Using the between() function with filter(). Both boundaries are inclusive, so this keeps rows where Sepal.Length lies from 5.0 to 5.8.
> iris %>% filter(between(Sepal.Length , 5.0 , 5.8))

Example 3: Using the near() function with its tolerance parameter inside filter().
The example below shows all observations whose Sepal.Length differs from 5.0 by less than 0.5.
> iris %>% filter(near(Sepal.Length , 5.0 , tol = 0.5))
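
To make clear what between() and near() test, here is a short sketch that rewrites both filters as plain comparisons. The object names are illustrative only.

```r
library(dplyr)

# between(x, left, right) keeps left <= x <= right (both ends inclusive)
in_range   <- iris %>% filter(between(Sepal.Length, 5.0, 5.8))
same_range <- iris %>% filter(Sepal.Length >= 5.0, Sepal.Length <= 5.8)

# near(x, y, tol) keeps rows where abs(x - y) < tol
close_to_5 <- iris %>% filter(near(Sepal.Length, 5.0, tol = 0.5))
same_close <- iris %>% filter(abs(Sepal.Length - 5.0) < 0.5)

# Each pair should contain the same rows
nrow(in_range) == nrow(same_range)
nrow(close_to_5) == nrow(same_close)
```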

5) arrange()
Purpose : This dplyr function reorders (sorts) a dataset based on one or more columns.

Example 1 : Sort in ascending order of 1 column
> iris %>% arrange(Sepal.Length)

Example 2 : Sort in descending order of 1 column
> iris %>% arrange(desc(Sepal.Length))
OR (numeric columns only)
> iris %>% arrange(-Sepal.Length)
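
Sorting on more than one column works the same way: later columns break ties in earlier ones. A minimal sketch, with an illustrative object name:

```r
library(dplyr)

# Sort by Species (ascending), then by Sepal.Length (descending) within each species
sorted <- iris %>% arrange(Species, desc(Sepal.Length))
head(sorted, 3)
```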

6) mutate()
Purpose : This dplyr function creates new columns in the dataset (or overwrites existing ones).

Example : Add a column that holds the sum of sepal length and sepal width
> iris %>% mutate(Sepal.Length.Width = Sepal.Length + Sepal.Width)

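7) summarise() / summarize()
Purpose : This dplyr function collapses column(s) into a summary, such as a mean or a count. summarise() and summarize() are synonyms.

Example : Mean of Sepal.Length and the number of rows
> iris %>% summarise(mean(Sepal.Length) , n())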
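
Unnamed summaries get the expression itself as column name. Naming them makes the result easier to reuse. A small sketch, where the column names are my own choice:

```r
library(dplyr)

# Name each summary so the result has readable column names
iris_summary <- iris %>%
  summarise(mean_sepal_length = mean(Sepal.Length),
            n_rows = n())
iris_summary
```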

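8) add_rownames()
Purpose : This dplyr function converts the row names of a dataset into an explicit column (named 'rowname' by default). It is deprecated in current dplyr releases in favour of tibble::rownames_to_column().

Example :
> iris %>% add_rownames() %>% head()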
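
The replacement from the tibble package works much the same way. A minimal sketch, assuming the tibble package is installed (it normally comes along with dplyr). The object name is illustrative.

```r
library(dplyr)
library(tibble)

# Move the row names of iris into a regular column called 'rowname'
iris_rn <- iris %>% rownames_to_column(var = "rowname")
head(iris_rn)
```

Note that the new column is of type character, even though the row names of iris look like numbers.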

9) inner_join()
Purpose : This function takes 2 datasets as input and returns all the rows from the first dataset that have a matching value in the second dataset, together with the columns from both datasets.
By default, matching is done on the columns that have the same name in both tables.
The matching columns can be stated explicitly with 'by='.

Example : inner_join(x, y) keeps only the rows of x that have a match in y. Here iris_Petal and iris_Sepal are two tables that share the 'rowname' and 'Species' columns.
> inner_join(iris_Petal , iris_Sepal , by = c('rowname' , 'Species'))
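
The join examples use two tables, iris_Petal and iris_Sepal, whose construction is not shown in this post. One plausible way to build them is sketched below. This is an assumption on my part, not necessarily the original setup: the row names become a key column and the measurements are split into two tables.

```r
library(dplyr)
library(tibble)

# Assumed setup: add a 'rowname' key column, then split into two tables
iris_keyed <- iris %>% rownames_to_column(var = "rowname")

iris_Petal <- iris_keyed %>% select(rowname, Species, Petal.Length, Petal.Width)
iris_Sepal <- iris_keyed %>% select(rowname, Species, Sepal.Length, Sepal.Width)

# Every key in iris_Petal has exactly one match in iris_Sepal
joined <- inner_join(iris_Petal, iris_Sepal, by = c("rowname", "Species"))
dim(joined)
```

With this setup every row finds a partner, so all four join types return the same rows. The joins only differ once one table lacks some keys.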

10) left_join()
Purpose : left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y get NA in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Example :
> left_join(iris_Petal , iris_Sepal , by = c('rowname' , 'Species'))

11) anti_join()
Purpose : Return all rows from x where there are no matching values in y, keeping just columns from x. This is a filtering join.

Example : 
anti_join(iris_Petal , iris_Sepal , by = c( 'rowname' , 'Species'))

12) full_join()
Purpose : Return all rows and all columns from both x and y. Where a row has no match in the other table, the columns coming from that table are filled with NA.

Example :
> full_join(iris_Petal , iris_Sepal , by = c('rowname' , 'Species'))
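
The difference between the join types shows best when the two tables only partly overlap. The sketch below uses two deliberately partial tables (rows 1 to 10 and rows 6 to 15 of iris). The tables and their names are my own assumption, made up for illustration.

```r
library(dplyr)
library(tibble)

iris_keyed <- iris %>% rownames_to_column(var = "rowname")

# Assumed, deliberately partial tables: rows 1-10 vs rows 6-15
petal_part <- iris_keyed[1:10, ] %>% select(rowname, Species, Petal.Length, Petal.Width)
sepal_part <- iris_keyed[6:15, ] %>% select(rowname, Species, Sepal.Length, Sepal.Width)

keys <- c("rowname", "Species")

n_inner <- nrow(inner_join(petal_part, sepal_part, by = keys))  # rows 6-10 only
n_left  <- nrow(left_join(petal_part, sepal_part, by = keys))   # rows 1-10, NA Sepal.* for rows 1-5
n_anti  <- nrow(anti_join(petal_part, sepal_part, by = keys))   # rows 1-5 only
n_full  <- nrow(full_join(petal_part, sepal_part, by = keys))   # rows 1-15

c(inner = n_inner, left = n_left, anti = n_anti, full = n_full)
```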

13) intersect()
Purpose : intersect() returns the rows that appear in both datasets as a new data frame.
intersect() also removes duplicate records.

Example : 
> iris_1_10 <- iris[1:10 , ]
> iris_5_15 <- iris[5:15 , ]
> intersect(iris_1_10 , iris_5_15)

14) union()
Purpose : This function combines all rows from both the tables and removes duplicate records from the combined dataset

Example : 
> iris_1_10 <- iris[1:10 , ]
> iris_5_15 <- iris[5:15 , ]
> union(iris_1_10 , iris_5_15)

15) union_all()
Purpose : This function combines all rows from both the tables without removing the duplicate records from the combined dataset.

Example : 
> iris_1_10 <- iris[1:10 , ]
> iris_5_15 <- iris[5:15 , ]
> union_all(iris_1_10 , iris_5_15)

16) setdiff()
Purpose : setdiff(x, y) returns the rows that appear in the first table but not in the second. The order of the arguments matters.

Example : 
> iris_1_10 <- iris[1:10 , ]
> iris_5_15 <- iris[5:15 , ]
> setdiff(iris_1_10 , iris_5_15)
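
Counting the rows returned by each set operation on the two subsets used above makes the differences concrete. A short sketch (the n_* names are illustrative):

```r
library(dplyr)

iris_1_10 <- iris[1:10, ]
iris_5_15 <- iris[5:15, ]

n_intersect <- nrow(intersect(iris_1_10, iris_5_15))  # rows 5-10, present in both
n_union     <- nrow(union(iris_1_10, iris_5_15))      # rows 1-15, duplicates removed
n_union_all <- nrow(union_all(iris_1_10, iris_5_15))  # 10 + 11 rows, duplicates kept
n_diff_xy   <- nrow(setdiff(iris_1_10, iris_5_15))    # rows 1-4, only in the first table
n_diff_yx   <- nrow(setdiff(iris_5_15, iris_1_10))    # rows 11-15, only in the second table

c(n_intersect, n_union, n_union_all, n_diff_xy, n_diff_yx)
```

Swapping the arguments changes the result of setdiff() but not that of intersect() or union().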

17) distinct()
Purpose : Use distinct() to find the unique rows, or the unique combinations of values in chosen columns, of a data table.

Example 1:  Find only unique / distinct complete rows from the iris dataset 
> distinct(iris)

Example 2:  Find only unique / distinct values in the 'Species' column of the 'iris' dataset 
> distinct(iris , Species)
     Species
1     setosa
2 versicolor
3  virginica

Example 3:  Find only unique / distinct values in the 'Species' column of the 'iris' dataset and keep all other columns. For each species, the first matching row is kept.
> distinct(iris , Species , .keep_all = TRUE)

18) sample_n()
Purpose : Use sample_n() to extract n random rows from the dataset. In recent dplyr releases it is superseded by slice_sample().

Example 1:  Extract 6 random rows from the iris dataset
> sample_n(iris , 6)

Example 2:  Extract 6 random rows from the iris dataset, allowing the same row to be selected more than once
> sample_n(iris , 6 , replace = TRUE)

19) sample_frac()
Purpose : Use sample_frac() to extract a given fraction of the rows from the dataset (0.10 is 10%). It is likewise superseded by slice_sample().

Example 1:  Extract a random 10% of the rows from the iris dataset
> sample_frac(iris , 0.10)
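
In newer dplyr code both kinds of sampling are done with slice_sample(). A minimal sketch, assuming dplyr 1.0 or later. The seed value and object names are arbitrary.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the random sample is reproducible

six_rows  <- slice_sample(iris, n = 6)                  # like sample_n(iris, 6)
ten_pct   <- slice_sample(iris, prop = 0.10)            # like sample_frac(iris, 0.10)
with_repl <- slice_sample(iris, n = 6, replace = TRUE)  # rows may repeat

nrow(six_rows)
nrow(ten_pct)
```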

20) group_by()
Purpose : Takes an existing table (tbl) and converts it into a grouped dataset, so that later operations are performed "by group".
ungroup() removes the grouping again.

Example 1:  creating a grouped dataset from 'iris' based on 'Species'
> iris_Species <- iris %>% group_by(Species)

Example 2:  to check the number of subgroups in a grouped table
> n_groups(iris_Species) 
[1] 3

Example 3:  to check the number of rows in each subgroup of a grouped table
> group_size(iris_Species) 
[1] 50 50 50

Example 4:  to summarize based on the group_by 
> iris_Species %>% summarize(Mean_Sepal_Ln = mean(Sepal.Length),
                                Mean_Petal_Ln = mean(Petal.Length))

  Species    Mean_Sepal_Ln Mean_Petal_Ln
  <fct>              <dbl>         <dbl>
1 setosa              5.01          1.46
2 versicolor          5.94          4.26
3 virginica           6.59          5.55

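21) top_n()
Purpose : Selects the top n or bottom n rows of a dataset, ranked by a column (the wt argument). If wt is not given, the last column of the dataset is used (Species for iris). Tied rows are all kept, so more than n rows can come back. In recent dplyr releases it is superseded by slice_max() and slice_min().

Example 1 : To select the 10 rows with the largest Sepal.Length
> top_n(iris , 10 , Sepal.Length)

Example 2 : To select the 10 rows with the smallest Sepal.Length
> top_n(iris , -10 , Sepal.Length)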
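
In newer dplyr code the ranking column is named explicitly with slice_max() and slice_min(). A small sketch, assuming dplyr 1.0 or later. The object names are illustrative.

```r
library(dplyr)

# Rows with the 10 largest Sepal.Length values; ties can push the count above 10
longest <- slice_max(iris, Sepal.Length, n = 10)

# Rows with the 10 smallest values; with_ties = FALSE returns exactly 10 rows
shortest <- slice_min(iris, Sepal.Length, n = 10, with_ties = FALSE)

nrow(longest)
nrow(shortest)
```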

Happy Learning !!
Priyaranjan Mohanty
@AUTHOR : Admin
