Categorizing numeric variables is a very common task in data processing. This comprehensive guide covers several ways of converting numerical data to categories in R.

In this guide, we will work through four ways of categorizing numerical variables in R. First, we will convert numerical data to categorical data using the cut() function. Second, we will categorize numeric values with the discretize() function available in the arules package (Hahsler et al., 2021). Third, we will categorize numerical variables using the group_var() function in the sjmisc package (Ludecke, 2018). Finally, we will convert numerical data into groups using the frq() function in the sjmisc package (Ludecke, 2018).

Let’s construct a numerical variable to learn categorization of numerical variables in R.

data <- seq(1, 90, 2)
class(data)
## [1] "numeric"
length(data)
## [1] 45


1. How to Categorize Numeric Data with cut() Function

In this section, we use the cut() function to convert numerical data into categories. Its breaks argument accepts either a vector of break points or a single number giving how many categories to create. In our example, we want to split the data into three groups. One option is to give the break points explicitly: we use 30 and 60 as the inner break points and -Inf and Inf as the end points, so that every value falls into some interval. The other option is to set breaks to 3, in which case cut() divides the range of the data into three intervals of equal width. In both cases, we can name the categories with the labels argument; here we label them Low, Medium and High.

Categories <- cut(data, breaks = c(-Inf,30,60,Inf), labels = c("Low","Medium","High"))
table(Categories)
## Categories
##    Low Medium   High 
##     15     15     15 

Categories <- cut(data, breaks = 3, labels = c("Low","Medium","High"))
table(Categories)
## Categories
##    Low Medium   High 
##     15     15     15
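
One detail worth checking is what happens to values that fall exactly on a break point. By default, cut() builds right-closed intervals, and the right argument switches this. The short sketch below uses a hypothetical vector named boundary (not part of the data above) to show the difference.

```r
# Hypothetical values that sit exactly on the break points 30 and 60
boundary <- c(30, 60)
breaks <- c(-Inf, 30, 60, Inf)
labels <- c("Low", "Medium", "High")

# Default (right = TRUE): intervals are (-Inf, 30], (30, 60], (60, Inf]
right_closed <- as.character(cut(boundary, breaks = breaks, labels = labels))

# right = FALSE: intervals are [-Inf, 30), [30, 60), [60, Inf)
left_closed <- as.character(cut(boundary, breaks = breaks, labels = labels,
                                right = FALSE))

print(right_closed)  # "Low" "Medium"
print(left_closed)   # "Medium" "High"
```

Our example data contain only odd numbers, so no value lands on 30 or 60 and both settings give the same counts.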


2. How to Categorize Numeric Data with discretize() Function

In this part, we use the discretize() function available in the arules package (Hahsler et al., 2021). We pass the number of categories through the breaks argument; since we want three groups, we set breaks to 3. By default, discretize() uses equal-frequency binning, so it aims to place roughly the same number of observations in each category. As with cut(), we can name the classes with the labels argument.

Categories <- arules::discretize(data, breaks = 3, labels = c("Low","Medium","High"))
table(Categories)
## Categories
##    Low Medium   High 
##     15     15     15
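
Because our data are evenly spaced, equal-width and equal-frequency binning happen to give identical counts here. Assuming discretize() is left at its default equal-frequency method, the contrast only becomes visible on skewed data. The following is a rough base R sketch of the two ideas, not the arules implementation; the vector skewed is hypothetical and chosen only for illustration.

```r
# Hypothetical skewed vector, chosen only to make the contrast visible
skewed <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
labels <- c("Low", "Medium", "High")

# Equal-width: split the range of the data into three intervals of equal length
equal_width <- cut(skewed, breaks = 3, labels = labels)

# Equal-frequency: use the tertiles of the data as break points
cuts <- quantile(skewed, probs = c(0, 1/3, 2/3, 1))
equal_freq <- cut(skewed, breaks = cuts, labels = labels,
                  include.lowest = TRUE)

table(equal_width)  # the outlier leaves the middle bin empty
table(equal_freq)   # three observations per bin
```

Equal-width bins are easy to interpret but sensitive to outliers, while equal-frequency bins keep group sizes balanced at the cost of uneven interval widths.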


3. How to Categorize Numeric Data with group_var() Function

In this section, we use the group_var() function available in the sjmisc package (Ludecke, 2018) to convert the numerical variable into classes. The size argument sets the width of each category; with size = 30, our data fall into three groups. To assign category labels, set as.num to FALSE so that the function returns a factor, whose levels we can then rename.

Categories <- sjmisc::group_var(data, size = 30, as.num = FALSE)
levels(Categories) <- c("Low","Medium","High")
table(Categories)
## Categories
##    Low Medium   High 
##     15     15     15
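
The idea of fixed-width grouping can also be sketched in base R by building the bin edges with seq() and passing them to cut(). This is only an illustration of the concept; the edges below start at 0 and the bins are right-closed, both of which are assumptions, and the exact boundary handling inside group_var() may differ.

```r
data <- seq(1, 90, 2)

# Bin edges every 30 units: 0, 30, 60, 90 (assumed starting point of 0)
edges <- seq(0, 90, by = 30)

# Right-closed bins (0,30], (30,60], (60,90]
Categories <- cut(data, breaks = edges, labels = c("Low", "Medium", "High"))
table(Categories)
```

With a fixed width, the number of groups follows from the range of the data, so a wider range produces more categories and the labels vector has to be extended to match.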

4. How to Categorize Numeric Data with frq() Function

In this part, we categorize numeric variables with the frq() function available in the sjmisc package (Ludecke, 2018). Setting auto.grp = 3 groups our variable into three categories. The function prints a frequency table with counts and raw, valid and cumulative percentages, together with the mean and standard deviation of the data; the number of NA values appears in the last row of the table. To give the groups names, we can combine frq() with group_var().

library(sjmisc)
frq(data, auto.grp = 3)

## x <numeric>
## # total N=45  valid N=45  mean=45.00  sd=26.27

## Value | Label |  N | Raw % | Valid % | Cum. %
## ---------------------------------------------
##     1 |  1-30 | 15 | 33.33 |   33.33 |  33.33
##     2 | 31-60 | 15 | 33.33 |   33.33 |  66.67
##     3 | 61-90 | 15 | 33.33 |   33.33 | 100.00
##  <NA> |  <NA> |  0 |  0.00 |    <NA> |   <NA>

Categories <- group_var(data, size = 30, as.num = FALSE)
levels(Categories) <- c("Low","Medium","High")
frq(Categories)

## x <categorical>
## # total N=45  valid N=45  mean=2.00  sd=0.83

## Value  |  N | Raw % | Valid % | Cum. %
## --------------------------------------
## Low    | 15 | 33.33 |   33.33 |  33.33
## Medium | 15 | 33.33 |   33.33 |  66.67
## High   | 15 | 33.33 |   33.33 | 100.00
## <NA>   |  0 |  0.00 |    <NA> |   <NA>
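
For readers who want similar numbers without sjmisc, a comparable table can be assembled in base R from table(), prop.table() and cumsum(). The sketch below is a rough analogue rather than a reproduction of frq(); the data frame name freq and its column names are chosen here for illustration.

```r
data <- seq(1, 90, 2)
Categories <- cut(data, breaks = c(-Inf, 30, 60, Inf),
                  labels = c("Low", "Medium", "High"))

# Counts and proportions per category
counts <- table(Categories)
props <- as.vector(prop.table(counts))

# Cumulate the unrounded proportions first, then round for display
freq <- data.frame(
  Value = names(counts),
  N = as.vector(counts),
  Percent = round(100 * props, 2),
  CumPercent = round(100 * cumsum(props), 2)
)
print(freq)
```

Rounding after cumulating avoids the small drift that would appear if already rounded percentages were summed.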


References

Hahsler, M., Buchta, C., Gruen, B., Hornik, K. (2021). arules: Mining Association Rules and Frequent Itemsets. R package version 1.6-7.

Ludecke, D. (2018). sjmisc: Data and Variable Transformation Functions. Journal of Open Source Software, 3(26), 754.


Dr. Osman Dag