How to Clean Data in R

Data cleaning is one of the most important steps in data analysis. In this article, we learn how to clean variable names, remove empty rows and columns, and remove duplicate rows.

Data cleaning is the process of converting messy data into reliable data that can be analyzed. Clean data improves both the quality of the analysis and your productivity in R. This article covers the following steps for cleaning a messy R data set:

Format ugly data frame column names in R
Delete all blank rows in R
Remove duplicate rows in R


Before we start, we set the working directory from which the data will be imported. The path below is only an example; replace it with the folder that contains your file.

setwd("D:/DataScience")

First, we need data that requires cleaning. We use a portion of the iris data set as an example and modify it to illustrate how to clean a messy data set: we changed the variable names, added an empty row, and duplicated the last row. The data are imported into R with the read.csv() function.

data <- read.csv("iris.csv")
data
##   Sepal.LENGTH SEPAL.Width PETAL.LENGTH petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6           NA          NA           NA          NA    <NA>
## 7          5.4         3.9          1.7         0.4  setosa
## 8          5.4         3.9          1.7         0.4  setosa
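
If you do not have this iris.csv file, a data frame with the same contents can be rebuilt from R's built-in iris data set. The following is a minimal sketch: the object names top and messy are our own, and we assume the file holds the first six rows of iris with the changes described above.

```r
# Take the first six rows of the built-in iris data set.
top <- head(iris, 6)

# read.csv() returns text columns as character by default in R >= 4.0,
# so convert the Species factor to match.
top$Species <- as.character(top$Species)

# Indexing with NA creates a row in which every value is NA (the empty row);
# repeating index 6 duplicates the last row.
messy <- top[c(1:5, NA, 6, 6), ]
rownames(messy) <- NULL

# Apply the inconsistent column names used in this article.
names(messy) <- c("Sepal.LENGTH", "SEPAL.Width", "PETAL.LENGTH",
                  "petal.Width", "Species")
messy
```

Assigning this object to data (data <- messy) lets you follow the rest of the article without the CSV file.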


1. How to Format Variable Names of Data in R

In this part, we use the clean_names() function from the janitor R package (Firke, 2021) to clean the column names. It converts the names to a consistent lowercase format with underscores.

library(janitor)
data2 <- clean_names(data)
data2
##   sepal_length sepal_width petal_length petal_width species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6           NA          NA           NA          NA    <NA>
## 7          5.4         3.9          1.7         0.4  setosa
## 8          5.4         3.9          1.7         0.4  setosa
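
For the simple names in this example, a similar result can be approximated in base R by lowercasing the names and replacing dots with underscores. This is only a rough sketch and not how janitor works internally; clean_names() handles many more cases, such as spaces, special characters, and repeated names. The helper to_snake_names() is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: a rough base R approximation for simple names only.
to_snake_names <- function(x) {
  x <- tolower(x)                   # "Sepal.LENGTH" -> "sepal.length"
  gsub(".", "_", x, fixed = TRUE)   # "sepal.length" -> "sepal_length"
}

messy_names <- c("Sepal.LENGTH", "SEPAL.Width", "PETAL.LENGTH",
                 "petal.Width", "Species")
to_snake_names(messy_names)
```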

2. How to Remove Empty Rows and Columns of Data in R

If you want to remove rows and/or columns that are completely empty, you can use the remove_empty() function from the janitor R package (Firke, 2021). With quiet = FALSE, the function reports what it removed.

library(janitor)
data3 <- remove_empty(data2, which = c("rows", "cols"), quiet = FALSE)
## Removing 1 empty rows of 8 rows total (12.5%).
## No empty columns to remove.

data3
##   sepal_length sepal_width petal_length petal_width species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 7          5.4         3.9          1.7         0.4  setosa
## 8          5.4         3.9          1.7         0.4  setosa
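
The idea behind this step can be illustrated in base R. In the sketch below, a row or column counts as empty only when every value in it is NA; this is our simplifying assumption, and the small data frame df is made up for the example.

```r
# Small made-up data frame with one all-NA row and one all-NA column.
df <- data.frame(
  x = c(1, NA, 3),
  y = c("a", NA, "c"),
  z = c(NA, NA, NA)
)

# A row or column is kept if it holds at least one non-missing value.
keep_rows <- rowSums(!is.na(df)) > 0
keep_cols <- colSums(!is.na(df)) > 0

# drop = FALSE keeps the result a data frame even if one column remains.
df_trimmed <- df[keep_rows, keep_cols, drop = FALSE]
df_trimmed
```

As in the output above, the remaining rows keep their original row numbers, so a gap appears where the empty row was.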


3. How to Remove Duplicate Rows of Data in R

In this part, we use the distinct() function from the dplyr R package (Wickham et al., 2020) to remove duplicate rows.

library(dplyr)
data_cleaned <- distinct(data3)
data_cleaned
##   sepal_length sepal_width petal_length petal_width species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
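
Before dropping duplicates, it can help to see which rows would be removed. Base R's duplicated() flags every row that repeats an earlier one. The following is a minimal sketch on a made-up data frame df, not part of the original workflow.

```r
# Made-up data frame in which the third row repeats the second.
df <- data.frame(
  id    = c(1, 2, 2, 3),
  value = c("a", "b", "b", "c")
)

# TRUE marks a row that repeats an earlier row; the first occurrence is FALSE.
dup_flag <- duplicated(df)
dup_flag

# Number of rows that would be dropped.
sum(dup_flag)

# Base R way of dropping full-row duplicates.
df_unique <- df[!dup_flag, ]
df_unique
```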

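To remove rows that are duplicated with respect to a specific variable, we can use distinct() again and name that variable. Setting .keep_all = TRUE keeps all other columns, and the first row found for each value is retained. For example, we remove duplicate rows with respect to petal_length.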

data_cleaned2 <- distinct(data_cleaned, petal_length, .keep_all = TRUE)
data_cleaned2
##   sepal_length sepal_width petal_length petal_width species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.7         3.2          1.3         0.2  setosa
## 3          4.6         3.1          1.5         0.2  setosa
## 4          5.4         3.9          1.7         0.4  setosa
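
Two rows disappeared in this step because their petal_length value of 1.4 had already occurred in the first row. The base R sketch below reproduces that selection on two of the columns, using the values from the output above; the object names are our own.

```r
# The six rows left after removing full-row duplicates
# (two columns only, values copied from the output above).
df <- data.frame(
  sepal_length = c(5.1, 4.9, 4.7, 4.6, 5.0, 5.4),
  petal_length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7)
)

# Flag rows whose petal_length already appeared in an earlier row.
repeat_value <- duplicated(df$petal_length)

# Keep only the first row for each petal_length value.
df_first <- df[!repeat_value, ]
df_first
```

Because the first occurrence wins, the order of the rows decides which record survives. Sort the data first if a particular row should be kept.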


References

Firke, S. (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0.

Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.2.
