The Art of Feature Engineering: A Guide to Handling Missing Values with Code Examples
What is Feature Engineering?
Feature engineering is the process of creating or modifying features from raw data to improve the performance and interpretability of machine learning models. It includes the transformation, construction, extraction, and selection of variables that best contribute to an accurate algorithm. This involves identifying meaningful attributes that have a logical relationship with the target variable. For instance, multiplying two unrelated columns, such as age and salary, may not produce a valuable feature because there’s no real-world connection between them. Effective feature engineering is an art that requires careful consideration of these relationships to maximize predictive power.
Why is Feature Engineering Important?
- Machine learning operates on the principle of “Garbage In, Garbage Out”. Training a model on poor-quality data will result in poor outcomes.
- Well-engineered features can reveal patterns and insights that raw data may not readily expose.
- Filtering out irrelevant or noisy data ensures the model focuses on the most relevant information, contributing to better generalization on unseen data.
- Simpler models with well-chosen features often outperform complex models with irrelevant features, as they are easier to interpret and less likely to overfit. Focusing on the most impactful features allows for strong performance without unnecessary complexity.
Feature engineering involves various techniques to enhance data for better model performance.
- Feature Transformation
- Feature Construction
- Feature Selection
- Feature Extraction
Feature Transformation
Feature transformation involves modifying existing features to enhance their effectiveness for modeling, ensuring the model can flexibly handle a variety of data. This process includes addressing missing values, scaling data (such as normalizing or standardizing), encoding categorical variables (like one-hot encoding), and applying mathematical functions (such as logarithmic or polynomial transformations). The primary goal is to adjust features to better align with model assumptions and improve overall performance.
In this article, I will focus on methods for handling missing values. Dealing with missing data is challenging, as there is no single solution; this article summarizes common imputation methods, including median imputation, time series methods, KNN, MICE, and MissForest.
Handling Missing Values
One of the biggest challenges when working with real-world datasets is dealing with missing values. Sometimes, certain pieces of information are not available. For example, in a dataset about people, the Middle Name field might be empty for those who don’t have a middle name. This situation is called non-data, meaning that the information simply doesn’t exist for those individuals — it is not erroneous.
In contrast, missing data refers to information that should have been collected but is not available. In a dataset, if a passenger’s age is recorded as null, then this could result from an oversight during data collection or an instance where the passenger did not provide their age. Missing data can introduce bias and inaccuracies in analysis, potentially skewing results and leading to incorrect conclusions. Understanding the reasons behind missing values is crucial as it can significantly impact analysis outcomes and predictive accuracy.
Missing data can generally be categorized into three types:
1. Missing Completely at Random (MCAR):
Data is considered MCAR when the absence of values is unrelated to both observed and unobserved data. It may also be classified as MCAR if the missingness is due to external factors, such as test design or recording failures. For example, if some participants in a survey randomly skip questions due to distraction or confusion, the resulting missing responses are MCAR. While it is often safe to remove MCAR data from analysis, this might lower the overall efficiency of the model. However, it can still produce reliable and unbiased results. It is important to recognize that while this assumption (MCAR) simplifies analysis, it may not be realistic in real-world datasets, where the reasons for missing data can be much more complex.
2. Missing at Random (MAR):
MAR occurs when the missingness is related to observed data but not to the missing values themselves. For instance, if female participants in a survey are less likely to report their age compared to male participants, the missingness is related to gender, an observed variable. In this case, if we account for gender in our analysis, we can use the complete observed data to make informed predictions about the missing values. This scenario is more common than MCAR in real-world datasets and underscores the importance of understanding the patterns of missingness to enhance the accuracy of analyses.
3. Missing Not at Random (MNAR):
Data is classified as MNAR when the missingness is directly related to the unobserved values themselves. For instance, in a mental health survey, individuals with severe anxiety may be less likely to answer questions about their mental health, making the missing responses directly influenced by the severity of their condition. Handling MNAR data poses significant challenges, as simply omitting these cases can lead to biased results. Addressing MNAR often requires careful modeling of the missingness and may involve incorporating domain knowledge to understand the underlying reasons for the missing values. This understanding can help develop more accurate estimates and improve the overall quality of the analysis.
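To make the three mechanisms concrete, here is a small, self-contained simulation (the column names and probabilities are purely illustrative): the same income column is masked completely at random (MCAR), masked depending on an observed gender column (MAR), and masked depending on the income value itself (MNAR).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    'gender': rng.choice(['female', 'male'], size=n),
    'income': rng.normal(50000, 10000, size=n)
})

# MCAR: every value has the same 10% chance of being missing
income_mcar = df['income'].mask(rng.random(n) < 0.10)

# MAR: the probability of missingness depends on an observed variable (gender)
p_mar = np.where(df['gender'] == 'female', 0.25, 0.05)
income_mar = df['income'].mask(rng.random(n) < p_mar)

# MNAR: the probability of missingness depends on the unobserved income value itself
p_mnar = np.where(df['income'] > 60000, 0.40, 0.05)
income_mnar = df['income'].mask(rng.random(n) < p_mnar)

print(income_mcar.isnull().mean(), income_mar.isnull().mean(), income_mnar.isnull().mean())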
To identify missing values in a Pandas DataFrame, we can use the command df.isnull().sum(), which gives a count of null entries for each column.
Since most machine learning algorithms cannot handle missing data directly, it is essential to manage these data before fitting a model. We can address missing values using strategies like removal, imputation or by creating flags to indicate their absence. Preprocessing our data to handle missing values is essential, as Scikit-learn models require complete datasets for effective training and accurate predictions.
For this demo, I’ll use the famous Titanic dataset from Kaggle and some manually created time series data to explore and tackle missing values. Before addressing the missing data, it’s essential to understand the reasons behind their missingness, as this will guide us in selecting the most effective handling methods.
# importing libraries and data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
sns.set_palette("Spectral")
# import
train = pd.read_csv('./titanic/train.csv')
test = pd.read_csv('./titanic/test.csv')
print("Shape of Train data: ", train.shape, "\nShape pf Test Data: ",
test.shape)
Calculate Missing Values Numerically
# find the missing value count and percentage
def missing_value_df(df):
    missing_count = df.isnull().sum()
    missing_pct = df.isnull().mean()
    missing_summary = pd.DataFrame({
        'missing_count': missing_count,
        'missing_pct': missing_pct
    })
    missing_summary = (
        missing_summary.sort_values('missing_pct', ascending=False)
        .query('missing_count > 0')
        .style.format('{:.2%}', subset=['missing_pct'])
    )
    return missing_summary

missing_value_df(train)
missing_value_df(test)
Visualizing missing data
To analyze missing data visually, we can use a library called Missingno. This tool allows us to quickly identify and understand the patterns of missing values in our datasets.
import missingno as msno
#Bar Chart
msno.bar(train)
plt.savefig('bar.png', dpi=150)
In the Missingno bar visualization, each column represents a feature. We observe that three features in our train dataset contain missing values: Cabin has a large amount of missing data, Age has a moderate level of missingness, and Embarked has only minimal missing values.
Visualizing the Position of Missing Data:
To better understand the locations of missing data, we can use the msno.matrix command. This allows us to easily identify the distribution and locations of missing data.
#Matrix Chart
msno.matrix(train)
This command creates a black-and-white visualization where white lines indicate missing values in each feature. For example, the Embarked column has only two missing values, so you see two white lines. However, in the Age and Cabin columns, the missing values are spread out.
If the dataset is very large, we can also look at just a smaller part of it to check for missing values more easily.
msno.matrix(train.sample(200))
The msno.heatmap command creates a heatmap that shows how missing values are related across different features. It doesn’t show how the features are correlated, but it does help us see how missing values in one feature can affect missing values in another.
msno.heatmap(train)
In our dataset, the heatmap shows no strong correlations between the missing values of different features, so the exact correlation values are not particularly informative here.
Now, let’s sort the values by the Age column to check for any patterns in the missing values.
sorted_by_age = train.sort_values('Age')  # avoid shadowing the built-in sorted()
msno.matrix(sorted_by_age)
The missing values in the Cabin column are spread out, and sorting by the Age column shows no correlation between the two. This indicates that there is no relationship between the missingness in the Age and Cabin columns, as also seen in the heatmap.
Analyzing Missing Values Numerically
Missing Data in the Age Column
i) Relationship with the Survived Column (Target Variable):
# Create 'missing_age' column (1 if Age is missing, 0 otherwise)
train['missing_age'] = np.where(train['Age'].isnull(), 1, 0)
# Calculate the proportion of missing Age values by survival status
pd.DataFrame(train.groupby('Survived')['missing_age'].mean())
It looks like the missing Age values are connected to whether people survived. For those who survived, 15.2% of Age values are missing, while for those who didn’t survive, 22.77% are missing. This difference suggests that missing Age data is related to survival status.
Because the missingness depends on whether someone survived, which is the target variable (and effectively unobserved data at prediction time), we might initially classify this as Missing Not At Random (MNAR): the missingness is linked to the target variable. Next, we need to check whether the missingness also depends on any observed variable.
Next, I’ll explore whether the missing Age data is related to other observed variables in the dataset. If I find that certain variables are linked to the missingness, it would indicate that the missing Age values are Missing At Random (MAR) instead of Missing Not At Random (MNAR).
ii) Relationship with the Sex Column:
Let us investigate whether the Sex variable affects the missingness of Age, potentially indicating it as MAR (Missing At Random). The idea is that females might be less willing to disclose their age, resulting in a higher percentage of missing Age values among them.
total_counts = (
    train['Sex'].value_counts()
    .reset_index()
    .rename(columns={'count': 'Total_Count'})
    .set_index('Sex')
)
total_counts['Percentage_gender(%)'] = (
    (total_counts['Total_Count'] / len(train) * 100).round(1)
)
missing_age_counts = (
    pd.DataFrame(train.groupby('Sex')['missing_age'].sum(),
                 index=['female', 'male'])
    .rename(columns={'missing_age': 'Missing_Age'})
)
combined_df = pd.concat([total_counts, missing_age_counts], axis=1)
combined_df['Percent_Missing_Age (%)'] = (
    (combined_df['Missing_Age'] / combined_df['Total_Count'] * 100).round(1)
)
combined_df
However, our analysis shows that a greater proportion of males (21.5%) have missing Age values compared to females (16.9%), suggesting that this is not a MAR case.
iii) Relationship with the Embarked Column:
In the Titanic dataset, the letters in Embarked column stand for different boarding locations: S is for Southampton, C is for Cherbourg, and Q is for Queenstown. So, the order of boarding stations on the ship was S-C-Q.
Let’s check the proportion of passengers who boarded from Southampton (S), Cherbourg (C), and Queenstown (Q).
# Total counts of passengers by Embarked location
total_embarked_counts = (
    train['Embarked'].value_counts()
    .reset_index()
    .rename(columns={'index': 'Embarked', 'count': 'Total_Count'})
    .set_index('Embarked')
)
# Calculate the percentage of each Embarked location
total_embarked_counts['Percentage_Embarked(%)'] = (
    (total_embarked_counts['Total_Count'] / len(train) * 100).round(1)
)
# Count of missing Age values by Embarked location
missing_age_embarked_counts = (
    pd.DataFrame(train.groupby('Embarked')['missing_age'].sum())
    .rename(columns={'missing_age': 'Missing_Age'})
)
# Combine the total counts and missing Age counts
combined_embarked_df = pd.concat([total_embarked_counts, missing_age_embarked_counts], axis=1)
# Calculate the percentage of missing Age values for each Embarked location
combined_embarked_df['Percent_Missing_Age (%)'] = (
    (combined_embarked_df['Missing_Age'] / combined_embarked_df['Total_Count'] * 100).round(1)
)
combined_embarked_df
The data indicates that while fewer people (8%) boarded from Queenstown (Q) compared to Southampton (S) and Cherbourg (C), a significant percentage (63%) of those from Q have missing Age information.
Since the likelihood of missing Age data is influenced by the observed variable (Embarked), this suggests that the missingness should not be considered as MNAR. The missing Age data might be classified as Missing At Random (MAR), as it appears to be related to the boarding location rather than being completely random or dependent on unobserved factors.
Missing Data of Embarked Column in train set
- With only 2 missing values in the Embarked column, we can think of this missingness as Missing Completely at Random (MCAR).
- This means that the missing values are not connected to any other data in the dataset, and their absence doesn’t create any bias in our analysis.
- Usually, we can delete MCAR data if it doesn’t exceed 5% of the total data, or we can impute it with a suitable alternative.
train[train["Embarked"].isnull()]
Missing Data of Fare Column in test set
- Since there is only one missing value in the Fare column, it can be considered Missing Completely at Random (MCAR).
- This small amount of missing data is not likely to affect our analysis much
- We can choose to either delete this missing entry or impute it with an appropriate value, such as the mean or median Fare
test[test['Fare'].isnull()]
Missing Data of Cabin Column
i) Relationship with the Survived Column (Target Variable):
# Create 'missing_cabin' column: 1 if missing, 0 if available
train['missing_cabin'] = np.where(train['Cabin'].isnull(), 1, 0)
# Calculate the proportion of missing Cabin values by survival status
missing_cabin_by_survival = train.groupby('Survived')['missing_cabin'].mean()
It looks like the missing Cabin values are connected to whether people survived. For those who survived, 60.2% of Cabin values are missing, while for those who didn’t survive, 87.6% are missing. It is logical as survivors are more likely to have recorded Cabin numbers, while non-survivors may have incomplete data. Since the missing values are related to whether someone survived, we can classify this as Missing Not At Random (MNAR).
ii) Relationship with Pclass:
Let’s investigate whether the missingness is connected to other observed factors rather than the missing Cabin values themselves or target variable.
(
    pd.DataFrame(
        train.groupby('Pclass')['missing_cabin'].mean()
    ).style.format('{:.2%}', subset=['missing_cabin'])
)
It seems that the percentage of missing Cabin values varies by passenger class (Pclass):
- Pclass 1: 18.52% missing
- Pclass 2: 91.30% missing
- Pclass 3: 97.56% missing
It’s reasonable to think that first-class passengers, who generally have more access to better service, may have their cabin information recorded more reliably. Since the likelihood of missing Cabin data is related to the observed variable (Pclass), this suggests that the missingness is not MNAR. Instead, it appears to be influenced by the passenger class, indicating a potential relationship between the missing values and the observed data.
Therefore, this situation is likely classified as Missing At Random (MAR).
Handling Missing Data
Deletion Strategies:
1. Deleting Data Points with Missing Values (Listwise Deletion):
This method removes entire rows with any missing values, allowing analyses to be conducted on complete datasets. It is also called Complete Case Analysis (CCA) meaning analysing only those observations in the dataset that contain values in all the variables.
While it simplifies analysis and maintains consistent sample sizes, it can lead to significant data loss and potential bias if the missingness is not completely random. Deleting data points can substantially reduce the dataset size, resulting in inadequate training and poorer model performance.
This approach is advisable only when the amount of missing data is small and is assumed to be Missing Completely at Random (MCAR). In real-world datasets, CCA is rarely an option, as the amount of missing data is rarely that small.
Titanic — Train data missing values:
If we remove all the missing data, we will end up with a very small dataset, as 77.1% of the entries are missing in the Cabin column. Therefore, complete case analysis (CCA) is not a viable option for this dataset.
Test data missing values:
If we focus on a few variables from the dataset, we can analyze each one separately to determine the best approach for handling missing values. Since the missing values in the Embarked and Fare columns represent only 0.22% and 0.24% of the data, respectively, we can drop them, considering them as missing completely at random (MCAR).
train.dropna(subset=['Embarked'],how='any',inplace=True)
test.dropna(subset=['Fare'],how='any',inplace=True)
2. Dropping Features with Missing Values: Another strategy for handling missing data is to identify and remove features with a high proportion of missing values, typically over 80%. However, it is important to retain valuable features that contribute to the outcome. Instead of discarding entire rows, selectively removing features with excessive missingness is often preferable, especially in larger datasets where the missingness may not be informative. This approach helps maintain a robust dataset, enhancing analysis and potentially improving model performance.
Removing features can lead to the loss of important information and create uncertainty about how this impacts the results. Additionally, this approach does not address missing data during prediction, which can cause issues later. While it may be acceptable in some situations, relying on this method as a primary strategy for handling missing data is risky and generally not advisable.
Titanic Dataset:
We can drop the Cabin column, as 77.1% of data is missing in the training dataset and 78.2% in the test dataset.
train.drop(columns=['Cabin'], axis = 1, inplace = True)
test.drop(columns=['Cabin'], axis = 1, inplace = True)
3. Pairwise Deletion: This method handles missing data by using all available data points for each specific analysis, rather than discarding entire rows or columns that contain any missing value. By keeping every observation that is complete for the particular variables being analyzed, pairwise deletion retains as much information as possible, as shown below.
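For example, pandas computes correlations with pairwise deletion by default: each pairwise correlation uses every row where both columns are observed, so a row with a missing Age still contributes to the Fare and SibSp pair (the column choice here is purely illustrative).
# Pairwise deletion: each correlation uses all rows where both columns are present
subset = train[['Age', 'Fare', 'SibSp']]
print(subset.corr())           # pairwise complete observations
print(subset.dropna().corr())  # listwise deletion (CCA), for comparison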
Imputation Strategies
The second approach to handling missing data is imputation, which involves filling in missing values with informed estimates instead of discarding data points. This process replaces missing values in a column with calculated non-null values, resulting in a complete dataset that preserves valuable information and minimizes the drawbacks of data loss.
a) Univariate Imputation:
This method involves replacing missing values in a variable using information from that variable’s non-missing values.
1. Arbitrary Value Imputation:
There are two missing values in the Embarked column of the training set. We will fill these missing values with an arbitrary value, ‘S’.
# using pandas
train['Embarked'] = train['Embarked'].fillna('S')
Using Sklearn — Creating ColumnTransformer to apply different imputers: Using ColumnTransformer from Scikit-learn allows for the efficient application of different imputers to specific columns in the dataset. In this example, distinct constant values are used to fill missing entries in the Embarked, Cabin, Age and Fare columns. This approach aids in preprocessing data with varying imputation needs.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Define the imputers for each column
embarked_imputer = SimpleImputer(strategy='constant', fill_value='S')
cabin_imputer = SimpleImputer(strategy='constant', fill_value='T20')
age_imputer = SimpleImputer(strategy='constant', fill_value=30)
fare_imputer = SimpleImputer(strategy='constant', fill_value=8)

# Create a ColumnTransformer to apply different imputers
imputer = ColumnTransformer(
    transformers=[
        ('embarked', embarked_imputer, ['Embarked']),
        ('cabin', cabin_imputer, ['Cabin']),
        ('fare', fare_imputer, ['Fare']),
        ('age', age_imputer, ['Age'])
    ],
    remainder='drop'  # Drop columns not included in transformers
)
# preprocessing train data
train1 = train.drop(columns=['Survived']).copy()
columns_to_impute = ['Embarked', 'Cabin', 'Fare', 'Age']
# Impute missing values
train_imputed = pd.DataFrame(imputer.fit_transform(train1), columns=columns_to_impute)
# Remove columns with missing values from the original dataset
train1 = train1.drop(columns_to_impute, axis=1)
# Concatenate imputed columns with the original dataset (keeping the target column)
train = pd.concat([train[['Survived']], train1, train_imputed[columns_to_impute]], axis=1)
# preprocessing test data
test1 = test.copy()
# Impute missing values
test_imputed = pd.DataFrame(imputer.transform(test1), columns=columns_to_impute)
# Remove columns with missing values from the original dataset
test1 = test1.drop(columns_to_impute, axis=1)
# Concatenate imputed columns with original dataset
test = pd.concat([test1, test_imputed[columns_to_impute]], axis=1)
train.isnull().sum()
test.isnull().sum()
2. Imputation with Mean, Median or Mode:
These methods replace missing values with the mean, median, or mode of the available data. For categorical data, the mode (the most frequent value) is used; if there is no dominant value, a label such as “Other” or “Missing” can be imputed instead. For numerical data, the mean is suitable for normally distributed data, while the median is preferred in the presence of outliers, as it is less sensitive to them.
Mean or median imputation relies on the assumption that the missing data is missing completely at random (MCAR). This means that the absence of data is unrelated to both observed and unobserved variables, ensuring that imputation does not introduce bias. While effective for small amounts of missing data, these approaches can reduce variability and overlook important relationships between features.
i) Imputing missing value with median:
Using pandas:
X_train = train.drop(columns = ['Survived']).copy()
X_test = test.copy()
# plot the distribution of age in train and test data before imputation
plt.figure(figsize=(8, 4))
# X_train
plt.subplot(1, 2, 1)
sns.histplot(x='Age', data=X_train, kde=True)
plt.title('X_train')
plt.ylabel('Number of passengers')
plt.xlabel('Age')
# X_test
plt.subplot(1, 2, 2)
sns.histplot(x='Age', data=X_test, kde=True)
plt.title('X_test')
plt.ylabel('Number of passengers')
plt.xlabel('Age')
plt.suptitle('Age Distribution in Training and Test Sets before Imputation', fontsize=16)
plt.tight_layout()
plt.show()
Imputation should use the training set’s statistics to prevent data leakage. By calculating the mean or median from the training set, we ensure that imputation relies only on the information available during model training. Using values from the test set would introduce knowledge about the test data, leading to overfitting and an inaccurate evaluation of the model.
# calculating median age of Training data
median_age = X_train['Age'].median()
# filling missing values with median
for df in [X_train, X_test]:
    df['Age'] = df['Age'].fillna(median_age)
print("Number of missing values in train after imputation: ",
X_train['Age'].isnull().sum())
print("Number of missing values in test after imputation: ",
X_test['Age'].isnull().sum())
Fare:
There is only one missing value in Fare column in test set.
X_test = test.copy()
X_test[X_test['Fare'].isnull()]
Fare often depends on factors like ticket class (Pclass), siblings or spouses (SibSp), and parents or children (Parch). It makes sense to fill the missing Fare value with the median Fare for a male in third class who has no family (no siblings, spouses, or parents/children).
# calculating median for a male in 3rd class without family
median_fare = X_train[(X_train['Sex'] == 'male') &
                      (X_train['Pclass'] == 3) &
                      (X_train['SibSp'] == 0) &
                      (X_train['Parch'] == 0)]['Fare'].median()

# filling with median_fare
X_test['Fare'] = X_test['Fare'].fillna(median_fare)
# Checking the imputed value
X_test[X_test.index == 152]
Using Sklearn- Imputing with median
from sklearn.impute import SimpleImputer
columns_to_impute = ['Fare','Age']
train_median = train[columns_to_impute]
test_median = test[columns_to_impute]
# Imputation strategies
imputer = SimpleImputer(strategy='median')
# Impute missing values
train_median = pd.DataFrame(imputer.fit_transform(train_median), columns=columns_to_impute)
test_median = pd.DataFrame(imputer.transform(test_median), columns=columns_to_impute)
#concatenate Columns
train1 = train.copy()
train1.drop(columns= columns_to_impute, axis=1, inplace=True)
train_median = pd.concat([train1,train_median], axis= 1)
test1 = test.copy()
test1.drop(columns= columns_to_impute, axis=1, inplace=True)
test_median = pd.concat([test1,test_median], axis= 1)
# Example:
train_median[(train_median.index == 5) | (train_median.index == 17) | (train_median.index == 19)]
Imputing both categorical and numerical columns:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
columns_to_impute = ['Embarked', 'Cabin', 'Fare', 'Age']
train_mixed = train[columns_to_impute]
test_mixed = test[columns_to_impute]
# Identify categorical and numerical columns
categorical_cols = train_mixed.select_dtypes(include=['category', 'object']).columns
numerical_cols = train_mixed.select_dtypes(include=['number']).columns
# Imputation strategies
categorical_transformer = SimpleImputer(strategy='most_frequent')
numerical_transformer = SimpleImputer(strategy='mean') # for median imputer, use strategy='median'
# Column Transformer
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, categorical_cols),
('num', numerical_transformer, numerical_cols)
])
# Impute missing values
train_mixed = pd.DataFrame(preprocessor.fit_transform(train_mixed), columns=columns_to_impute)
test_mixed = pd.DataFrame(preprocessor.transform(test_mixed), columns=columns_to_impute)
Imputing missing value with mode:
#setting strategy to 'most_frequent'
imputer = SimpleImputer(strategy='most_frequent')
3. Random Sample Imputation:
Random sample imputation involves replacing missing data by randomly selecting values from existing data within the same variable. This technique preserves the variable’s distribution and is suitable for data that is missing completely at random. It is essential to set a random seed to ensure consistent imputation across different runs; this creates a reproducible environment where the same random values are generated each time, leading to stable model predictions.
While this method is easy to implement using pandas and maintains data variance, it can affect covariance and is memory-intensive since the original dataset must be stored for future imputations. Additionally, random sampling can be applied to both numerical and categorical data, ensuring that the frequency of existing categories is preserved.
train['Age_imputed'] = train['Age']
test['Age_imputed'] = test['Age']

# Impute missing values in train
missing_train = train['Age'].isnull().sum()
if missing_train > 0:
    samples_train = train['Age'].dropna().sample(missing_train, random_state=42).values
    train.loc[train['Age_imputed'].isnull(), 'Age_imputed'] = samples_train

# Impute missing values in test
missing_test = test['Age'].isnull().sum()
if missing_test > 0:
    samples_test = train['Age'].dropna().sample(missing_test, random_state=42).values
    test.loc[test['Age_imputed'].isnull(), 'Age_imputed'] = samples_test
Imputation with Time Series Data:
Techniques such as Last Observation Carried Forward (LOCF), Next Observation Carried Backward (NOCB), filling with the mean, and interpolation are commonly used to replace missing values in time series data. While these methods can effectively address missingness, they may introduce bias, particularly in trending datasets.
Setting up the data:
I have created a DataFrame with a time series index and a feature labeled Sales.
# Create a date range
date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='W')
# Generate random sales data
np.random.seed(42)  # for reproducibility
sales_data = np.random.uniform(12000, 15000, size=len(date_range)).round(2)
# setting missing data
num_rows = len(sales_data)
num_sales_missing = 10
missing_sales_indices = np.random.choice(num_rows, size=min(num_sales_missing, num_rows), replace=False)
sales_data[missing_sales_indices] = np.nan
# Create the DataFrame
time_series_df = pd.DataFrame({
'Date': date_range,
'Sales': sales_data
})
# Set the Date column as the index
time_series_df.set_index('Date', inplace=True)
time_series_df.head()
df = time_series_df.copy()
print("Number of missing values: ",df.isnull().sum())
df['Sales'].tail(11)
There are 3 missing entries in Sales column in the last 11 rows.
Dealing with missing values:
1. Fill NaN with Zero (0) Value
df = time_series_df.copy()
#plotting sales comparing with modified values (NaNs filled with 0) against the original data (NaNs filled with infinity)
plt.figure(figsize=(10, 3))
sns.lineplot(x=df.index, y=df['Sales'].fillna(0), color='darkorange', label='modified')
sns.lineplot(x=df.index, y=df['Sales'].fillna(np.inf), color='dodgerblue', label='original')
plt.title('Fill NaN with 0', fontsize=14)
plt.ylabel('Sales Volume ($)', fontsize=14) # Updated ylabel
plt.legend()
plt.show()
# filling missing values
df.fillna(0,inplace=True)
df['Sales'].tail(11)
2. Fill NaN with Mean Value
df = time_series_df.copy()
# Calculating Average value of Sales column
avg_sales = df['Sales'].mean()
#plotting sales comparing with modified values (NaNs filled with the mean) against the original data (NaNs filled with infinity)
plt.figure(figsize=(10, 3))
sns.lineplot(x=df.index, y=df['Sales'].fillna(avg_sales), color='darkorange', label='modified')
sns.lineplot(x=df.index, y=df['Sales'].fillna(np.inf), color='dodgerblue', label='original')
plt.title('Fill NaN with Average Sales', fontsize=14)
plt.ylabel('Sales Volume ($)', fontsize=14) # Updated ylabel
plt.legend()
plt.show()
# Filling the missing values
df.fillna(avg_sales,inplace=True)
df['Sales'].tail(11)
3. Last observation carried forward (LOCF) — Fill NaN with ‘ffill’
ffill (forward fill) fills in missing values with the last known value, so if a value is missing, it uses the most recent one before it.
df = time_series_df.copy()
#plotting sales comparing with modified values (NaNs forward-filled) against the original data (NaNs filled with infinity)
plt.figure(figsize=(10, 3))
sns.lineplot(x=df.index, y=df['Sales'].ffill(), color='darkorange', label='modified')
sns.lineplot(x=df.index, y=df['Sales'].fillna(np.inf), color='dodgerblue', label='original')
plt.title('Fill NaN with ffill', fontsize=14)
plt.ylabel('Sales Volume ($)', fontsize=14)
plt.legend()
plt.show()
# Filling the missing values
df.ffill(inplace=True)
df['Sales'].tail(11)
4. Next observation carried backward (NOCB) — Fill NaN with ‘bfill’
bfill (backward fill) fills missing values with the next known value in the column, so if a value is missing, it uses the next one after it.
df = time_series_df.copy()
#plotting sales comparing with modified values (NaNs backward-filled) against the original data (NaNs filled with infinity)
plt.figure(figsize=(10, 3))
sns.lineplot(x=df.index, y=df['Sales'].bfill(), color='darkorange', label='modified')
sns.lineplot(x=df.index, y=df['Sales'].fillna(np.inf), color='dodgerblue', label='original')
plt.title('Fill NaN with bfill', fontsize=14)
plt.ylabel('Sales Volume ($)', fontsize=14)
plt.legend()
plt.show()
# Filling the missing value
df.bfill(inplace=True)
df['Sales'].tail(11)
5. Fill NaN with Linear Interpolation
Filling time series data with linear interpolation involves estimating missing values by connecting the dots of existing data points with straight lines. This method smoothly fills in gaps based on the trends in the surrounding data, providing a more accurate representation of the values over time.
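As with the earlier fills, a minimal sketch using pandas' interpolate (assuming the same time_series_df created above):
df = time_series_df.copy()
#plotting sales comparing with modified values (NaNs linearly interpolated) against the original data (NaNs filled with infinity)
plt.figure(figsize=(10, 3))
sns.lineplot(x=df.index, y=df['Sales'].interpolate(method='linear'), color='darkorange', label='modified')
sns.lineplot(x=df.index, y=df['Sales'].fillna(np.inf), color='dodgerblue', label='original')
plt.title('Fill NaN with Linear Interpolation', fontsize=14)
plt.ylabel('Sales Volume ($)', fontsize=14)
plt.legend()
plt.show()
# Filling the missing values by drawing a straight line between the surrounding observed points
df['Sales'] = df['Sales'].interpolate(method='linear')
df['Sales'].tail(11)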
Multivariate Imputation
This method utilizes information from multiple variables to estimate and fill in missing data, resulting in more accurate and less biased estimates. It trains machine learning models, such as k-nearest neighbors, random forests, or linear regression, to predict missing values. For instance, in a dataset with features like age, income, and education level, a regression model can use age and education to predict missing income (a small sketch of this idea on the Titanic data follows below). This approach is particularly effective for addressing the MAR category of missing data, as it leverages relationships between variables to create imputations that reflect the patterns of missingness.
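As a simple illustration of regression-based imputation (a minimal sketch, assuming train still holds the raw Titanic columns with missing Age values; the choice of predictor columns is my own), a linear model trained on the rows with a known Age can predict Age for the rows where it is missing:
from sklearn.linear_model import LinearRegression

# Predictor columns assumed to be fully observed in the train set
predictors = ['Pclass', 'SibSp', 'Parch', 'Fare']
known = train[train['Age'].notnull()]
unknown = train[train['Age'].isnull()]

# Fit the regression on rows where Age is observed
reg = LinearRegression().fit(known[predictors], known['Age'])

# Predict Age where it is missing and fill it in on a copy
train_reg = train.copy()
train_reg.loc[train_reg['Age'].isnull(), 'Age'] = reg.predict(unknown[predictors])
print(train_reg['Age'].isnull().sum())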
Common techniques for multivariate imputation include:
i) K-Nearest Neighbors (KNN):
The KNNImputer class in Scikit-learn fills in missing values using the k-Nearest Neighbors method. It estimates each missing value based on the values of the n_neighbors nearest neighbors that have available data for that feature. Imputation can be performed using either a simple average or a weighted average, where closer neighbors have a greater influence on the result. The KNNImputer requires numerical data to perform calculations.
train[train['Age'].isnull()].head(2)
test[test['Age'].isnull()].head(2)
from sklearn.impute import KNNImputer

# Create copies of datasets
train_knn = train.drop(columns=['Survived']).copy(deep=True)
test_knn = test.copy(deep=True)

# KNNImputer works on numerical data; include related numeric features so that
# neighbours are found from them rather than from 'Age' alone
num_cols = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Age']

# Initialize the KNN imputer
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Fit the imputer on the training data and keep the imputed 'Age' column (last column)
train_knn['Age'] = knn_imputer.fit_transform(train_knn[num_cols])[:, -1]

# Transform the test data using the imputer fitted on the training data
test_knn['Age'] = knn_imputer.transform(test_knn[num_cols])[:, -1]
Results:
train_knn[ (train_knn.index == 5) | (train_knn.index ==17)]
test_knn[(test_knn.index == 10)|(test_knn.index == 22)]
Distribution of original and KNN Imputed data:
# Create the figure and axis
fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(111)
# Plot KDE for each dataset
sns.kdeplot(x=train['Age'], ax=ax, label='Original Train Data')
sns.kdeplot(x=train_knn['Age'], ax=ax, label='KNN Imputation', color='blue')
# add legends
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.title('KDE of Age Distribution')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()
ii) Multivariate Imputation by Chained Equations (MICE):
This method handles missing data by creating multiple imputed datasets. It operates iteratively, treating each feature with missing values as a dependent variable and using other features to predict its missing values. In each iteration, the algorithm updates the imputed values based on predictions from regression models, cycling through all variables multiple times to refine the estimates. This approach helps preserve relationships among variables and provides more robust estimates of missing data compared to simpler methods. The IterativeImputer expects numerical data for its operations.
# Enable the experimental feature
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Create copies of the datasets
train_mice = train.drop(columns=['Survived']).copy(deep=True)
test_mice = test.copy(deep=True)

# IterativeImputer regresses each feature with missing values on the other
# features, so give it several numeric columns rather than 'Age' alone
num_cols = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Age']

# Initialize the Iterative Imputer
imputer = IterativeImputer(random_state=42)

# Fit the imputer on the training data and keep the imputed 'Age' column (last column)
train_mice['Age'] = imputer.fit_transform(train_mice[num_cols])[:, -1]

# Transform the test data using the fitted imputer
test_mice['Age'] = imputer.transform(test_mice[num_cols])[:, -1]
Results:
iii) Miss Forest:
Miss Forest is an advanced imputation method that uses Random Forests to accurately fill in missing data. It begins by initially imputing missing values with the mean for continuous variables and the most frequent category for categorical ones. The dataset is then divided into observed and missing parts, with the Random Forest model trained on the observed data to predict the missing values. This process is repeated for several iterations, improving the imputed data each time. The iterations continue until the changes in imputation become minimal or a specified limit is reached, typically producing well-imputed data after about 5 to 6 iterations. Overall, Miss Forest offers a precise and iterative approach to handling missing values.
First you have to install MissForest:
pip install MissForest
Using MissForest on titanic data:
# the import path below is an assumption and may vary across MissForest versions
from missforest import MissForest

train = pd.read_csv('./Datasets/titanic/train.csv')
train.drop(columns=['PassengerId', 'Survived', 'Name','Ticket'], inplace=True)
# Initialize the imputer
imputer = MissForest()
# Specify categorical columns
categorical_columns = train.select_dtypes('O').columns
# Fit and transform the data
df_imputed = imputer.fit_transform(train, categorical=categorical_columns)
df_imputed.isnull().sum()
Adding Missing Indicator
Adding missing indicators is a technique for handling missing data by creating binary columns that flag whether a value is missing. For each feature with missing values, a new column is added where 1 indicates the value is missing and 0 means it’s present. This helps track missing information, which can enhance predictions. For example, in an “Age” column with missing values, an indicator can inform the model which ages are missing, potentially improving its predictions.
We can still impute the original variable with the mean or median. Using both methods together allows us to leverage the predictive power of the variable, captured by the missing indicator, while the imputation addresses the missing values directly.
This approach works well with linear models. However, adding these indicators increases the number of features, and if many variables have missing values for the same data points, they might end up being highly correlated.
Using Pandas:
X_train = train.drop(columns=['Survived'])
y_train = train['Survived']
X_test = test.copy()
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)
Using Sklearn:
from sklearn.impute import MissingIndicator
X_train = train.drop(columns=['Survived'])
y_train = train['Survived']
X_test = test.copy()
mi = MissingIndicator()
# Fit on training data
mi.fit(X_train[['Age']])
X_train_missing = mi.transform(X_train[['Age']])
X_test_missing = mi.transform(X_test[['Age']])
# Add the indicator to the original DataFrames
X_train['Age_NA'] = X_train_missing
X_test['Age_NA'] = X_test_missing
X_train[['Age_NA']].sample(5)
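If we want the imputation and the indicator in a single step, SimpleImputer's add_indicator parameter returns both the imputed column and a missingness flag; a minimal sketch:
from sklearn.impute import SimpleImputer

# Median imputation plus a missing-value indicator in one transformer
imputer = SimpleImputer(strategy='median', add_indicator=True)

# The output has two columns: the imputed 'Age' and the binary indicator
age_train = imputer.fit_transform(X_train[['Age']])
age_test = imputer.transform(X_test[['Age']])

X_train[['Age', 'Age_NA']] = age_train
X_test[['Age', 'Age_NA']] = age_test
X_train[['Age', 'Age_NA']].sample(5)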
Choosing the Right Imputation Method: A Guide
If missing values constitute less than 5% of a variable, mean or median imputation or random sample replacement can be used.
Types of Missing Data and Imputation Methods:
a) Missing Completely At Random (MCAR): Use mean, median, mode, or other imputation methods.
b) Missing At Random (MAR): Effective methods include multivariate imputation like regression imputation, KNN, MICE, and MissForest.
c) Missing Not At Random (MNAR):
i) Modeling Missingness: Requires explicit models to address the relationship.
ii) Pattern Substitution: Fills in missing data based on identified patterns.
iii) Maximum Likelihood Estimation (MLE): Estimates missing values by maximizing the likelihood of observed data.
Algorithms That Can Tackle Missing Data:
Various machine learning algorithms can effectively manage missing data, each employing unique strategies to ensure robust performance.
- Naive Bayes Classifier: Handling missing data is straightforward with Naive Bayes. It does not require filling in or explicitly modeling missing values. Instead, the algorithm simply ignores data points for the features where the value is missing and computes the likelihood based on the observed features. We can calculate conditional probabilities independently for each feature, using only the non-missing rows for that feature.
- Decision Tree: Some decision tree implementations handle missing values natively. Scikit-learn’s DecisionTreeClassifier supports missing values from version 1.3 onwards by learning, at each split, which child node samples with missing values should be sent to; other CART-style implementations instead use surrogate splits (backup splitting rules based on correlated features) to route observations when the splitting attribute is missing.
- XGBoost: XGBoost inherently handles missing values during training by learning branch directions in its tree algorithms; see the sketch after this list. The “missing” parameter, which defaults to NaN, allows specific values in the predictors to be designated as missing.
- LightGBM: LightGBM handles missing values by default using NA (NaN) representation, but this can be changed to treat zeros as missing by setting “zero_as_missing=true”; unrecorded values in sparse matrices are treated as zeros unless specified otherwise.
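As a quick illustration (a minimal sketch, assuming the raw Titanic train DataFrame with its original missing Age values), XGBoost can be trained directly on features that still contain NaN:
import xgboost as xgb

# Numeric features, with missing Age values left as NaN
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
X = train[features]
y = train['Survived']

# XGBoost learns a default branch direction for missing values at each split,
# so no imputation is required beforehand
model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)
print(model.score(X, y))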
In conclusion, managing missing values is essential for analyzing real-world datasets, as they can greatly affect our results’ accuracy. By identifying whether data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR), we can choose the right strategies to address these gaps. Using appropriate imputation methods or removal techniques improves model performance and leads to more reliable insights in predictive analytics.
Thank you for reading! Don’t hesitate to share your experiences or questions in the comments below — I’d love to hear from you!
Reference:
- https://stefvanbuuren.name/fimd/sec-MCAR.html
- https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/mi.html
- https://www.kaggle.com/code/prashant111/a-reference-guide-to-feature-engineering-methods
- https://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python#Finding-reason-for-missing-data-using-a-Heatmap
- https://medium.com/@shiptechamit/missing-data-part-2-understanding-missing-values-in-pyhton-c7faae023980
- https://xgboost.readthedocs.io/en/stable/faq.html