Missing Value Imputation Using MICE

Background

Missing values are common in real-world datasets. They can bias the results of machine learning models and reduce their accuracy. Missing value imputation (MVI) is the most commonly used remedy for incomplete data: statistical or machine learning techniques replace the missing entries with substituted values. The main limitation of simple statistical imputation with a measure of central tendency (mean, median, or mode) is that it leads to biased estimates of variance and covariance. Machine learning techniques emerged as an alternative that overcomes these weaknesses. One such technique is MICE.
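The variance bias mentioned above can be seen in a small synthetic experiment. This is only an illustrative sketch: the data is randomly generated and the 30% missing rate is an arbitrary assumption, not something taken from the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)

# Hide about 30% of the values completely at random (assumed rate).
mask = rng.random(x.size) < 0.3
observed = x[~mask]

# Mean imputation: every missing entry receives the observed mean.
x_imputed = x.copy()
x_imputed[mask] = observed.mean()

var_observed = observed.var(ddof=1)
var_imputed = x_imputed.var(ddof=1)
print(f"variance of observed values: {var_observed:.2f}")
print(f"variance after mean fill:    {var_imputed:.2f}")
```

Because every filled-in value sits exactly at the mean, the imputed column adds no spread, so its sample variance is always smaller than that of the observed values alone.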

Multiple Imputation by Chained Equations (MICE)

MICE (Multiple Imputation by Chained Equations) is a multiple imputation method recognized as a leading strategy for imputing missing data, owing to its ease of implementation and its ability to maintain unbiased effect estimates and valid inferences (Ref). In short, MICE imputes missing values by fitting a series of regression models.

In MICE, the missing values are filled in several times, producing several complete datasets. The method assumes that the data is Missing at Random (MAR): the probability that a value is missing depends on the observed data but not on the missing values themselves. The imputation process can therefore draw on the information in the other observed columns. The algorithm runs a series of regression models in which each variable with missing values is modeled conditionally on the other variables in the data.

The main characteristic of MICE is that it performs multiple imputations using a chained equations approach. Multiple imputation accounts for the statistical uncertainty in the imputed values, while the chained equations approach is very flexible and can handle variables of various types (e.g. continuous or binary).

The following are the steps of the MICE technique:

Step 1: All missing values are first filled in with a simple statistic (e.g. the mean for numeric columns, the mode for categorical columns). These imputations can be thought of as "placeholders" (temporary values).

Step 2: One column at a time, the placeholder values are set back to missing (NA), starting with the variable ("var") that has the fewest missing values.

Step 3: The observed values of "var" are used as the dependent variable in a regression (or classification) model, with all other variables as the independent variables.

Step 4: The missing values of "var" are then replaced with predictions (imputations) from the fitted model. When "var" is later used as an independent variable in the model for another variable, both its observed and its imputed values are used.

Step 5: Move on to the variable with the next fewest missing values and repeat steps 2–4 for each variable that has missing data. Cycling through every variable once constitutes one iteration or "cycle." At the end of one cycle, all missing values have been replaced with predictions from regressions that reflect the observed relationships in the data. The cycles are then repeated a number of times, with the imputations updated at each cycle. The idea is that by the end of the cycles, the distribution of the parameters governing the imputations (e.g. the coefficients in the regression models) should have converged in the sense of being stable.
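The steps above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the implementation used in this notebook: it assumes all columns are numeric, uses ordinary least squares for every variable, and produces a single deterministic imputation rather than several. The names `chained_imputation` and `n_cycles` are hypothetical.

```python
import numpy as np

def chained_imputation(X, n_cycles=10):
    """Simplified chained-equations imputation for a numeric 2-D array with NaNs."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: placeholder fill with the mean of each column's observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])

    # Visit only columns that have missing values, fewest missing first.
    n_missing = missing.sum(axis=0)
    order = [j for j in np.argsort(n_missing, kind="stable") if n_missing[j] > 0]

    for _ in range(n_cycles):          # Step 5: repeat the cycle several times.
        for j in order:
            rows_missing = missing[:, j]
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])
            # Steps 2-3: fit "var" on the other columns, using rows where it is observed.
            coef, *_ = np.linalg.lstsq(
                design[~rows_missing], filled[~rows_missing, j], rcond=None
            )
            # Step 4: replace the placeholders with the model's predictions.
            filled[rows_missing, j] = design[rows_missing] @ coef
    return filled

# Synthetic demo data with linear relationships between the columns.
rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.1, size=200)
c = a - b + rng.normal(scale=0.1, size=200)
X_miss = np.column_stack([a, b, c])
X_miss[rng.random(200) < 0.2, 1] = np.nan
X_miss[rng.random(200) < 0.1, 2] = np.nan

X_filled = chained_imputation(X_miss)
print(np.isnan(X_filled).sum())
```

A full MICE run would add random draws to each prediction and repeat the whole procedure to obtain several imputed datasets. The sketch omits that part to keep the chained structure visible.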

Based on its reported performance, MICE is a promising imputation method that can produce imputed values close to the original values. Even so, it is important to test the technique in order to understand its strengths and characteristics.

Libraries

Read Data

The data was taken from Kaggle: https://www.kaggle.com/code/nezarabdilahprakasa/resign-prediction-accuracy-92/data?select=Train.csv

Exploratory Data Analysis

Correlation Check

All correlations are low, except the one between Age and Time of Service.
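A correlation check of this kind might look like the sketch below. The dataframe is synthetic and the column names (`Age`, `Time_of_service`, `Pay_Scale`) are assumptions chosen to resemble the dataset, so the numbers it prints are not the notebook's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = rng.integers(20, 60, size=300)
toy = pd.DataFrame({
    "Age": age,
    # Assumed relationship: service time grows with age, minus some noise.
    "Time_of_service": np.clip(age - 20 - rng.integers(0, 10, size=300), 0, None),
    # Assumed to be unrelated to the other two columns.
    "Pay_Scale": rng.integers(1, 11, size=300),
})

# Pearson correlation between all numeric columns.
corr = toy.select_dtypes("number").corr(method="pearson")
print(corr.round(2))
```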

Check Missing Value
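One common way to count missing values per column with pandas is shown below, as a minimal sketch on a tiny hand-made dataframe. The column names and values are illustrative assumptions, not rows from the Kaggle data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [25, np.nan, 41, 37, np.nan],
    "Time_of_service": [2, 10, np.nan, 12, 5],
    "Decision_skill_possess": ["Analytical", None, "Directive", "Conceptual", "Analytical"],
})

# Count and percentage of missing entries per column.
missing_count = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

summary = pd.DataFrame({"n_missing": missing_count, "pct_missing": missing_pct})
print(summary.sort_values("n_missing", ascending=False))
```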

Label Encoding for Category Variables

The Decision Skill Possess category column is recorded so that changes in its values can be checked later.
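A label encoding that keeps missing entries as NaN and remembers the label-to-code mapping could be sketched as follows. The column name and category labels are assumptions for illustration, and an explicit dictionary is used here so that the mapping can be inspected and reversed later.

```python
import pandas as pd

toy = pd.DataFrame({
    "Decision_skill_possess": ["Conceptual", "Analytical", None, "Directive", "Analytical"],
})
col = "Decision_skill_possess"

# Build a stable label -> integer mapping from the observed categories only.
categories = sorted(toy[col].dropna().unique())
mapping = {label: code for code, label in enumerate(categories)}

# Missing entries have no key in the mapping, so they stay NaN after encoding.
toy[col + "_code"] = toy[col].map(mapping)
print(mapping)
print(toy)
```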

Artificially Create Missing Value
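To evaluate an imputer, known values can be hidden on purpose and compared with the imputations afterwards. The sketch below masks 10% of each column at random. The rate, the column names, and the synthetic data are all assumptions, since the notebook's original masking code is not shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
complete = pd.DataFrame({
    "Time_of_service": rng.integers(0, 30, size=100).astype(float),
    "Decision_skill_code": rng.integers(0, 4, size=100).astype(float),
})

masked = complete.copy()
hidden = {}  # remembers which rows were hidden in each column
for col in complete.columns:
    idx = rng.choice(complete.index.to_numpy(), size=int(0.1 * len(complete)), replace=False)
    hidden[col] = idx
    masked.loc[idx, col] = np.nan

print(masked.isna().sum())
```

Keeping `complete` untouched and storing the hidden row indices makes it possible to compare the true and imputed values later.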

Check Artificial Missing Value

Evaluation

The model predicts the Time of Service column with an MSE of just 1.66% and the Decision Skill Possess column with an accuracy of 92.4%.
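The two metrics can be computed on the artificially hidden cells as in the sketch below: mean squared error for a numeric column and accuracy for a categorical one. The helper name `imputation_scores` and the small input arrays are hypothetical, so the printed values are unrelated to the figures reported above.

```python
import numpy as np

def imputation_scores(true_num, imputed_num, true_cat, imputed_cat):
    """Return (MSE for a numeric column, accuracy for a categorical column)."""
    true_num = np.asarray(true_num, dtype=float)
    imputed_num = np.asarray(imputed_num, dtype=float)
    mse = float(np.mean((true_num - imputed_num) ** 2))
    accuracy = float(np.mean(np.asarray(true_cat) == np.asarray(imputed_cat)))
    return mse, accuracy

# Hypothetical true vs. imputed values for the hidden cells only.
mse, accuracy = imputation_scores(
    true_num=[5, 10, 3, 8], imputed_num=[6, 10, 2, 8],
    true_cat=[0, 1, 2, 1, 0], imputed_cat=[0, 1, 2, 0, 0],
)
print(mse, accuracy)
```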

Impute the Original Dataset

The imputation performed well on the artificially masked data, so the next step is to impute the original dataset.

MICE is used to fill in the missing values.

Data Preparation

Impute using MICE
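One widely available MICE-style imputer is scikit-learn's `IterativeImputer`. Whether this notebook originally used it or another package is not stated, so the sketch below is an assumption, run on synthetic data with assumed column names.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
age = rng.integers(20, 60, size=200).astype(float)
df = pd.DataFrame({
    "Age": age,
    "Time_of_service": age - 20 + rng.normal(0, 2, size=200),
    "Pay_Scale": rng.integers(1, 11, size=200).astype(float),
})
df.loc[rng.choice(200, size=20, replace=False), "Time_of_service"] = np.nan
df.loc[rng.choice(200, size=10, replace=False), "Age"] = np.nan

imputer = IterativeImputer(
    max_iter=10,
    initial_strategy="mean",        # placeholder fill (step 1)
    imputation_order="ascending",   # fewest missing values first
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(imputed.isna().sum())
```

With the default `sample_posterior=False`, this yields a single imputed dataset. For several imputations, one option is to set `sample_posterior=True` and rerun with different `random_state` values.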

Replace Missing Values

For categorical columns that had missing values, convert the imputed codes back to their original category labels.
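Because a regression-based imputer returns real numbers, imputed category codes usually need to be rounded and clipped to a valid code before being mapped back to labels. The sketch below shows one way to do that. The category list and the imputed values are hypothetical, and rounding is only a simple heuristic.

```python
import pandas as pd

# Hypothetical label order used when the column was encoded.
categories = ["Analytical", "Behavioral", "Conceptual", "Directive"]

# Hypothetical imputer output for the encoded categorical column.
imputed_codes = pd.Series([0.0, 2.6, 1.2, 3.4, -0.3])

# Round to the nearest code and clip into the valid range 0..len(categories)-1.
codes = imputed_codes.round().clip(0, len(categories) - 1).astype(int)

# Map integer codes back to the original labels.
labels = codes.map(dict(enumerate(categories)))
print(labels.tolist())
```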

The new dataframe, imputed with the MICE method, is now ready to use.