Dummy variables are needed when a model includes categorical variables whose levels have no natural order (nominal variables). This tutorial walks through the steps for creating dummy variables in R.

Researchers sometimes use integer encoding to put a nominal variable into a regression model. Integer encoding assigns a unique integer to each level of a categorical variable. Applied to a nominal variable, however, it is misleading, because the model treats the integers as a natural ordering between categories. This can cause unexpected results and poor performance.
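To make the problem concrete, here is a minimal sketch in base R. The fruit and price values are invented purely for illustration. Converting the factor to integers turns the three fruits into the numbers 1, 2 and 3, so a linear model fits a single slope across them, as if banana sat exactly halfway between apple and carrot.

```r
# Hypothetical toy data: the values are invented for illustration only
fruit <- factor(c("apple", "banana", "carrot", "apple", "banana", "carrot"))
price <- c(1.0, 5.0, 2.0, 1.2, 5.2, 2.2)

# Integer encoding: levels are numbered in their (alphabetical) level order
fruit_int <- as.integer(fruit)
fruit_int

# One slope for the whole variable: the model assumes apple < banana < carrot
fit_int <- lm(price ~ fruit_int)
length(coef(fit_int))   # intercept + 1 slope

# Treating fruit as a factor estimates a separate effect for each level
fit_fac <- lm(price ~ fruit)
length(coef(fit_fac))   # intercept + 2 level effects
```

With the factor version, R builds the dummy variables internally, which is what the rest of this tutorial does explicitly.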

If we want to put a nominal variable in the model, we need to create dummy variables for it, i.e. one-hot encoding. If a categorical variable has k levels, k new dummy variables are created. Each dummy variable takes the value 0 or 1, representing the absence or presence of that level, respectively.
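As a rough illustration of what one-hot encoding produces, base R's model.matrix() can build the k indicator columns when the intercept is dropped from the formula. The vector name fruit is an assumption made for this sketch and is not part of the tutorial's data.

```r
# Hypothetical factor with k = 3 levels
fruit <- factor(c("apple", "banana", "carrot", "banana"))

# "- 1" drops the intercept, so every level gets its own 0/1 column
one_hot <- model.matrix(~ fruit - 1)
one_hot

# Each row contains exactly one 1: the level that observation belongs to
rowSums(one_hot)
```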

If a categorical variable has k levels and we create k new dummy variables, we may fall into the dummy variable trap. The dummy variable trap is a situation in which one variable can be exactly predicted from the values of the other variables (perfect multicollinearity). In a regression model with an intercept, the k dummy variables always sum to 1, which duplicates the intercept. Therefore, we need to exclude one dummy variable when constructing the regression model. **As a result, if we have k levels of a categorical variable, we need to create k-1 dummy variables.**
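The redundancy can be checked numerically. The sketch below, using base R only and an invented fruit factor, puts all k dummy variables next to an intercept column and compares the number of columns with the rank of the resulting design matrix.

```r
# Hypothetical factor with k = 3 levels
fruit <- factor(rep(c("apple", "banana", "carrot"), each = 2))

# All k = 3 dummies, plus the intercept column a regression normally adds
k_dummies <- model.matrix(~ fruit - 1)
design    <- cbind(intercept = 1, k_dummies)

# The dummies sum to the intercept column, so one column is redundant
all(rowSums(k_dummies) == design[, "intercept"])

ncol(design)      # 4 columns
qr(design)$rank   # rank 3: the columns are linearly dependent
```

Dropping any one of the dummy columns brings the number of columns back in line with the rank.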

In this tutorial, we learn how to use the dummy_cols() function from the fastDummies package (Kaplan, 2020). First, we learn how to create dummy variables. Second, we go over how to remove the original nominal variables from the data after creating the dummy variables. Finally, we learn how to avoid the dummy variable trap.

Let’s construct a data frame with two categorical variables that have no ordinal relationship.

```
x <- factor(rep(c("apple","banana","carrot"), each = 2))
y <- factor(rep(c("A","B","C"), 2))
data <- data.frame(x, y)
data
##        x y
## 1  apple A
## 2  apple B
## 3 banana C
## 4 banana A
## 5 carrot B
## 6 carrot C
```

## 1) How to Create Dummy Variables in R

In this part, we use the select_columns argument to specify which variables are converted into dummy variables.

```
library(fastDummies)
dummy_cols(data, select_columns = c("x","y"))
##        x y x_apple x_banana x_carrot y_A y_B y_C
## 1  apple A       1        0        0   1   0   0
## 2  apple B       1        0        0   0   1   0
## 3 banana C       0        1        0   0   0   1
## 4 banana A       0        1        0   1   0   0
## 5 carrot B       0        0        1   0   1   0
## 6 carrot C       0        0        1   0   0   1
```
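If select_columns is left out, our understanding of the fastDummies documentation is that dummy_cols() converts every character and factor column and leaves numeric columns untouched. The sketch below relies on that default. The price column and the data_num name are invented for illustration.

```r
library(fastDummies)

# Hypothetical data frame: two factors plus one numeric column
data_num <- data.frame(
  x     = factor(rep(c("apple", "banana", "carrot"), each = 2)),
  y     = factor(rep(c("A", "B", "C"), 2)),
  price = c(1.0, 1.2, 5.0, 5.2, 2.0, 2.2)
)

# No select_columns: factor/character columns are expanded, price is kept as is
out <- dummy_cols(data_num)
names(out)
```

Passing select_columns explicitly, as in the example above, is still the safer choice when the data contain character columns that should not be expanded, such as IDs or free text.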

## 2) How to Remove Nominal Variables After Creating Dummy Variables

We can set the remove_selected_columns argument to TRUE to remove the original categorical variables from the data after the dummy variables are created.

```
dummy_cols(data, select_columns = c("x","y"), remove_selected_columns = TRUE)
##   x_apple x_banana x_carrot y_A y_B y_C
## 1       1        0        0   1   0   0
## 2       1        0        0   0   1   0
## 3       0        1        0   0   0   1
## 4       0        1        0   1   0   0
## 5       0        0        1   0   1   0
## 6       0        0        1   0   0   1
```

## 3) How to Avoid the Dummy Variable Trap in R

Finally, we can avoid the dummy variable trap by setting the remove_first_dummy argument to TRUE, which drops the first dummy variable of each categorical variable.

```
dummy_cols(data, select_columns = c("x","y"), remove_selected_columns = TRUE, remove_first_dummy = TRUE)
##   x_banana x_carrot y_B y_C
## 1        0        0   0   0
## 2        0        0   1   0
## 3        1        0   0   1
## 4        1        0   0   0
## 5        0        1   1   0
## 6        0        1   0   1
```
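As a cross-check, base R's default treatment contrasts also drop the first level of each factor, so model.matrix() with an intercept should yield the same k-1 indicator columns under slightly different names. The sketch below uses base R only and rebuilds the tutorial's data frame under the assumed name df_check so that it runs on its own.

```r
# Same data as in the tutorial, rebuilt so the block is self-contained
df_check <- data.frame(
  x = factor(rep(c("apple", "banana", "carrot"), each = 2)),
  y = factor(rep(c("A", "B", "C"), 2))
)

# Default treatment contrasts: the first level of each factor is the baseline
mm <- model.matrix(~ x + y, data = df_check)
colnames(mm)

# Dropping the intercept column leaves the k - 1 dummies per variable
mm[, -1]
```

A row in which all four dummies are 0 belongs to the baseline levels (apple and A), as in the first row of the output above.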

**References**

Kaplan, J. (2020). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. R package version 1.6.3.
