The Art of Feature Engineering: Encoding Categorical Variables with Code Examples
In many data analysis and machine learning tasks, we deal with different types of data, including categorical and quantitative variables. If, for example, we want to calculate distances, as in clustering, classification, or other distance-based algorithms, we typically need all of our variables to be quantitative, because most distance metrics (such as Euclidean or Manhattan distance) are defined only for numeric data. Categorical variables represent discrete categories or classes, which cannot be compared in the same way as numbers.
To overcome this, we convert categorical variables into a form that can be meaningfully compared using distance metrics. This conversion is what we call encoding. The goal is to convert each category or class into a number or a set of numbers that preserves the information needed to calculate the distance between data points.
What is a Categorical Variable?
A categorical variable is a type of random variable that can take on a limited set of distinct values. Examples include attributes such as gender (Male, Female), job titles (Data Analyst, Data Engineer, Data Scientist), and satisfaction ratings (Satisfied, Dissatisfied). Categorical variables are typically stored as strings or binary indicators and are commonly referred to as qualitative variables.
There are two main types of categorical data: nominal and ordinal.
Nominal data consists of categories that have no intrinsic order. For instance, state names like California, New York, and Washington, or varieties of fruit such as apple, orange, and banana, belong to this category. Each group is unique, and there is no ranking or sequence among them.
Ordinal data, on the other hand, includes categories that possess a specific order. Examples include levels of education such as High School, Bachelor’s, and Master’s, as well as customer feedback ratings like Poor, Average, and Excellent. In these cases, the categories can be ranked and their order carries significance for analysis.
Why Do We Need Categorical Encoding?
Categorical variables represent discrete categories (such as color, gender, or type) that cannot be directly used by most machine learning models, which primarily interpret numerical values. To make these variables usable, they need to be transformed into a numerical format. This is where categorical encoding comes in, serving as a crucial preprocessing step in machine learning.
For variables with two categories (like Male and Female), simple binary encoding (1 and 0) works well. But for variables with more than two categories, methods like one-hot encoding or ordinal encoding are needed. It is important to avoid assigning arbitrary numeric values to categories (as seen in label encoding), since this can mislead the model. For instance, assigning 2 to Washington, 1 to California, and 0 to New York would suggest an order that does not exist in the data. Label encoding is typically reserved for target variables, not input features.
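To make that concrete, here is a small illustrative sketch (not part of the dataset workflow) comparing label-encoded states with one-hot vectors: the integer codes imply that Washington is twice as far from New York as California is, whereas one-hot vectors keep every pair of states equally distant.
import numpy as np

# Label encoding implies a spurious ordering and magnitude between states
labels = {'New York': 0, 'California': 1, 'Washington': 2}
print(abs(labels['Washington'] - labels['New York']))  # 2
print(abs(labels['California'] - labels['New York']))  # 1

# One-hot vectors treat every pair of states as equally distant
vectors = {'New York':   np.array([1, 0, 0]),
           'California': np.array([0, 1, 0]),
           'Washington': np.array([0, 0, 1])}
print(np.linalg.norm(vectors['Washington'] - vectors['New York']))  # ~1.41
print(np.linalg.norm(vectors['California'] - vectors['New York']))  # ~1.41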
By properly encoding categorical variables, we enable models to process and learn from these features, ensuring better predictions and more effective analysis.
In this article, I will demonstrate how to encode categorical variables using the HR attrition dataset from Kaggle. For clarity, I will focus only on the object columns and highlight a few key categorical features. This approach will showcase how to apply encoding techniques effectively, while addressing common challenges and best practices.
import pandas as pd

# Load only the columns needed for this walkthrough
df = pd.read_csv('./Datasets/HR_Attrition_Dataset',
                 usecols=['Department', 'Gender', 'Education',
                          'EducationField', 'JobSatisfaction', 'Attrition'])
df.head()
Label Encoding
Label encoding converts categorical values into unique integers, but it may not be ideal for nominal data. Since label encoding assigns numeric values based on the categories, it can introduce an artificial hierarchy, which is inappropriate when there is no inherent order between the categories.
In scikit-learn, both LabelEncoder and LabelBinarizer are designed for encoding target labels (y): LabelEncoder converts the classes into integers ranging from 0 to n_classes-1, while LabelBinarizer produces binary indicator columns. Neither should be used for encoding feature values (X), as they are specifically intended for target variable transformation.
For example, in our dataset with Attrition as the target variable (with values “Yes” or “No”), we can use LabelEncoder to convert the labels into binary values — where “Yes” becomes 1 (attrition) and “No” becomes 0 (no attrition).
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder
le = LabelEncoder()
# Fit and transform the data
df['Attrition'] = le.fit_transform(df['Attrition'])
df.head()
Custom Binary Encoding
In this dataset, the Gender column is binary, with unique values of Male and Female.
To prepare it for modeling, we will use custom binary encoding, transforming Female to 1 and Male to 0. This transformation uses the np.where function from the NumPy library.
import numpy as np

# Encode Female as 1 and Male as 0
df['Gender'] = np.where(df['Gender'] == 'Female', 1, 0)
df.head()
Splitting the dataset:
Before encoding the features, we split the data into training and test sets. Encoders should be fitted on the training set only and then applied to the test set; this prevents information from the test set leaking into training, ensures an unbiased model evaluation, and gives a more accurate assessment of performance.
from sklearn.model_selection import train_test_split

# Separate the features from the target
X = df.drop(columns=['Attrition'])
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
print("X_train shape:", X_train.shape, "\nX_test shape:", X_test.shape, "\n")
X_train.head(5)
Ordinal Encoding
Some categorical variables have an intrinsic order, and to preserve this order, we use ordinal encoding. This method converts categorical variables into numerical format by assigning integer values based on their ordinal relationships.
For instance, for the Education variable with categories like Below College, College, Bachelor, Master, and Doctor, we might assign values as follows:
Below College: 0, College: 1, Bachelor: 2, Master: 3, Doctor: 4.
This encoding reflects the ranking of education levels, with higher numbers indicating greater educational attainment.
Ordinal encoding is useful when the target variable (example: salary) consistently increases or decreases with respect to the categorical variable. For instance, with an ordinal variable like JobLevel (like Intern, Junior, Senior, Manager), salary typically rises with each level of experience. However, a drawback is that the differences between consecutive levels may not be equal, which could impact the model’s accuracy.
By default, the OrdinalEncoder from scikit-learn infers the categories from the data and sorts them lexicographically, which can result in an arbitrary integer mapping. To preserve the inherent order of ordinal variables, it is important to explicitly define this order using the categories argument in the encoder.
We will apply ordinal encoding to the Education and JobSatisfaction columns due to their inherent ordering.
from sklearn.preprocessing import OrdinalEncoder

# Define the category order explicitly for each ordinal column
oe = OrdinalEncoder(categories=[['Below College', 'College', 'Bachelor',
                                 'Master', 'Doctor'],
                                ['Low', 'Medium', 'High', 'Very High']],
                    dtype=int)
# Fit on the training data and transform it
X_train[['Education', 'JobSatisfaction']] = oe.fit_transform(
    X_train[['Education', 'JobSatisfaction']])
# Transform the test data with the encoder fitted on the training data
X_test[['Education', 'JobSatisfaction']] = oe.transform(
    X_test[['Education', 'JobSatisfaction']])
X_train.head()
One-Hot Encoding
One-hot encoding is a widely used method for categorical data transformation. It converts each category into a new binary column, where each column represents the presence (1) or absence (0) of a particular category. This method is beneficial because it prevents the model from assuming any ordinal relationships between categories, as it doesn’t assign any inherent order.
However, one potential drawback is that it can significantly increase the dimensionality of the dataset, especially when there are many categories. This can lead to higher computational costs and, in some cases, increase the risk of overfitting due to the larger number of features.
In Pandas, the get_dummies function simplifies this process by generating dummy (or indicator) variables for categorical data. For example, the EducationField column with six distinct categories—Life Sciences, Other, Medical, Marketing, Technical Degree, and Human Resources—can be converted into six binary columns, where each column represents a category with a 1 or 0, indicating the presence or absence of that category.
However, one-hot encoding can introduce multicollinearity, a situation known as the dummy variable trap. This occurs because, with the inclusion of all dummy variables, the categories become perfectly correlated with each other. For instance, if there are n categories, the sum of all dummy variables for each observation will always equal 1, which creates redundancy in the data. This redundancy can distort the model and affect its ability to estimate relationships accurately. To address this, we typically drop one of the dummy variables, as it is redundant. This technique is also known as dummy encoding.
In Pandas, the pd.get_dummies function offers a convenient drop_first=True argument, which automatically removes the first column, thereby preventing multicollinearity.
pd.get_dummies(df,
columns=['EducationField'],
prefix=["Field"],
dtype=int).head()
We can pass multiple categorical columns to the get_dummies function and specify a prefix to label the new columns as desired. Note that get_dummies returns a new DataFrame rather than modifying the original, so to keep the result we assign it back to a variable (for example, df = pd.get_dummies(...)).
pd.get_dummies(df, columns=[ 'Department', 'EducationField'],
prefix=['Dept','Field'],
drop_first= True,
dtype=int).head()
One-hot encoding can increase dataset size, so it’s best suited for categorical features with few distinct values. To reduce dimensionality, we can group less frequent categories (with value counts below a threshold) into a single “uncommon” category.
# Count category frequencies on the training data only
counts = X_train['EducationField'].value_counts()
print("Number of unique categories in EducationField: ",
      X_train['EducationField'].nunique(), "\n")
threshold = 100
# Categories appearing at most `threshold` times are pooled into 'uncommon'
repl = counts[counts <= threshold].index
pd.get_dummies(X_train['EducationField'].replace(repl, 'uncommon'),
               dtype=int).head(5)
One-Hot Encoding in Scikit-Learn
Scikit-learn provides the OneHotEncoder, which is used to apply one-hot encoding to categorical features, particularly nominal features that do not have an inherent order.
from sklearn.preprocessing import OneHotEncoder
one_hot = OneHotEncoder( drop='first',
handle_unknown='ignore',
sparse_output=False,
dtype=int)
OH_cols = ['Department','EducationField']
# One-hot encode the training and test data, keeping the original index
# and labelling the new columns with get_feature_names_out
one_hot_train = pd.DataFrame(one_hot.fit_transform(X_train[OH_cols]),
                             index=X_train.index,
                             columns=one_hot.get_feature_names_out(OH_cols))
one_hot_test = pd.DataFrame(one_hot.transform(X_test[OH_cols]),
                            index=X_test.index,
                            columns=one_hot.get_feature_names_out(OH_cols))
# Remove original categorical columns from training and test data
num_X_train = X_train.drop(OH_cols, axis=1)
num_X_test = X_test.drop(OH_cols, axis=1)
# Combine numerical features with one-hot encoded columns
X_train_new = pd.concat([num_X_train, one_hot_train], axis=1)
X_test_new = pd.concat([num_X_test, one_hot_test], axis=1)
X_train_new[one_hot_train.columns].head()
When using OneHotEncoder, we can set the min_frequency parameter to group categories that appear less frequently than a specified threshold into a single infrequent category. This helps reduce the dimensionality of the dataset while still retaining meaningful information from the more common categories.
For instance, setting min_frequency=100 groups the less frequent categories (here, Other and Human Resources) into one pooled column. This reduces the dimensionality of the EducationField variable while preserving the important information from the more common categories. As before, specifying sparse_output=False makes the encoder return a dense array instead of a sparse matrix.
one_hot = OneHotEncoder(drop='first',
                        handle_unknown='ignore',
                        sparse_output=False,
                        dtype=int,
                        min_frequency=100)
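As a quick check (a sketch; min_frequency requires scikit-learn 1.1 or later, and this assumes a version where it can be combined with drop='first'), fitting the encoder on the training column and printing the feature names shows the frequent categories plus one pooled column, which scikit-learn labels with an infrequent_sklearn suffix rather than literally "Other".
# Fit on the training column and inspect the resulting feature names
one_hot.fit(X_train[['EducationField']])
print(one_hot.get_feature_names_out(['EducationField']))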
ColumnTransformer in Scikit-Learn
We can apply both Ordinal Encoding and One-Hot Encoding simultaneously using ColumnTransformer by specifying different encoders for ordinal and nominal categorical variables within the same transformation pipeline.
from sklearn.compose import ColumnTransformer

# Re-split the raw data so the categorical columns are unencoded again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
transformer = ColumnTransformer(transformers=[
    ('oe', OrdinalEncoder(categories=[['Below College', 'College', 'Bachelor',
                                       'Master', 'Doctor'],
                                      ['Low', 'Medium', 'High', 'Very High']],
                          dtype=int),
     ['Education', 'JobSatisfaction']),
    ('ohe', OneHotEncoder(drop='first',
                          handle_unknown='ignore',
                          sparse_output=False,
                          dtype=int),
     ['EducationField', 'Department'])
], remainder='passthrough')
transformer.fit(X_train)
pd.DataFrame(transformer.transform(X_train))
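The transformed output comes back as an unlabelled array, so the DataFrame above has only integer column names. With scikit-learn 1.0 or later, get_feature_names_out on the fitted ColumnTransformer can restore readable labels, roughly as follows:
# Rebuild a labelled DataFrame from the transformed array
X_train_enc = pd.DataFrame(transformer.transform(X_train),
                           columns=transformer.get_feature_names_out(),
                           index=X_train.index)
X_train_enc.head()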
Visualizing a ColumnTransformer
Scikit-learn provides a convenient way to visualize a ColumnTransformer.
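One way to do this (a minimal sketch, assuming the code runs in a Jupyter notebook) is to switch scikit-learn's display setting to 'diagram'; after that, simply evaluating the fitted transformer renders it as an interactive HTML diagram of its steps.
from sklearn import set_config

# Render estimators as interactive HTML diagrams in notebook output
set_config(display='diagram')
transformer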
Frequency Encoding
Frequency encoding is a technique that converts categorical values into numerical values based on the frequency of their occurrence in the dataset. Each category is replaced by the relative frequency (or count) of that category, typically scaled between 0 and 1.
This encoding preserves information about the distribution of categories, allowing algorithms to understand how common or rare a particular category is. It is especially useful when the frequency of a category correlates with the target variable.
# Relative frequency of each category, computed on the training data
encoded = X_train['EducationField'].value_counts(normalize=True)
print(encoded, "\n")
# Replace each category with its training-set frequency
X_train['EducationField'] = X_train['EducationField'].map(encoded)
X_train.head()
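To keep the train/test separation intact, the test set should be mapped with the frequencies learned from the training data. The sketch below assumes that categories unseen during training are filled with 0, which is one reasonable convention rather than a fixed rule.
# Map test-set categories using training-set frequencies;
# categories not seen in training become NaN and are filled with 0 here
X_test['EducationField'] = X_test['EducationField'].map(encoded).fillna(0)
X_test.head()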
In conclusion, we explored various encoding techniques for categorical variables using the HR attrition dataset. We covered methods like Label Encoding, Ordinal Encoding, One-Hot Encoding, and Frequency Encoding, each suited for different types of data. The key takeaway is that choosing the right encoding method helps avoid problems like creating unnecessary order in data or making the dataset too large. Proper encoding helps models learn from the data more effectively and make better predictions.