Let’s say you have a dataset with several numerical features, and some of those features have missing values. The first thing you would do is figure out why the values are missing: are they missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Depending on which category the missingness falls into, the handling will be different.
If you have a lot of missing data, dropping the rows with missing values would not be a good choice. This is where you have to think of another way to handle them, and if the feature is numerical, the strategy is different from the one you would use for a categorical feature. Then comes the question of whether univariate or multivariate imputation would work.
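Before deciding, it helps to quantify how much data is actually missing per feature. Here is a minimal sketch, assuming a pandas DataFrame loaded from a hypothetical data.csv:

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset containing missing values

# Fraction of missing values per feature, largest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Rows that would be lost by simply dropping every row with any missing value.
print(f"dropna() would remove {len(df) - len(df.dropna())} of {len(df)} rows")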
For univariate imputing of numerical features, you have a choice between one of the following strategies:
1- Mean/median imputation
2- Arbitrary value imputation
3- End of distribution imputation
I’ll go over the theory, assumptions, benefits, and drawbacks of each technique before showing you how to use it in Python.
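As a quick preview, here is a minimal sketch of all three strategies applied to a toy pandas column (the column name age and its values are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 47.0, np.nan, 51.0]})

# 1- Mean/median imputation: fill NaNs with the column mean (or median).
df["age_mean"] = df["age"].fillna(df["age"].mean())

# 2- Arbitrary value imputation: fill NaNs with a fixed flag value such as -999.
df["age_arbitrary"] = df["age"].fillna(-999)

# 3- End of distribution imputation: fill NaNs with a value in the tail,
#    e.g. mean + 3 standard deviations.
df["age_end_tail"] = df["age"].fillna(df["age"].mean() + 3 * df["age"].std())

print(df)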
You are welcome to join and bring along all the questions you have ever had while dealing with missing values in your dataset. I would really appreciate it if you could spread the word in your community; it would help me reach more people who want to learn data science and machine learning in depth without skipping the hard mathematical concepts.
I will be streaming live every Monday, Wednesday, and Friday at 5 a.m. EST, and those streamed videos will be available for you to watch on Autodidact Scientists.
I’d also like to add here that, for univariate as well as multivariate imputation, when the percentage of missing values in a predictor exceeds 5%, regression over the remaining samples of that predictor is often employed. However, a good practice and improvement to this would be -
Use a Bayesian regressor.
For all the relevant predictors that have a part to play in predicting the target variable, impute the missing values for each of those predictors in a single model, since the Bayesian machinery at the back end will take care of optimization across all predictors.
For this, there is an easy tool called PyMC, which relies on Monte Carlo approximations of posteriors and lends itself naturally to imputation tasks.
The general structure for your imputation models might look something like this -
import numpy as np
import pymc as pm

# x and y are 1-D NumPy arrays in which missing entries are coded as -1.
# Masking them lets PyMC impute the masked values automatically; with several
# predictors, build one masked array each (x1_miss, x2_miss, ...).
x_miss = np.ma.masked_values(x, value=-1)
y_miss = np.ma.masked_values(y, value=-1)

with pm.Model() as m:
    # Priors for the noise precision and the regression coefficients.
    lik_tau = pm.Gamma("tau", 0.0001, 0.0001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.0001)
    beta1 = pm.Normal("beta1_temp", 0, tau=0.0001)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    # Prior for each predictor xi with missing values; masked entries become latent
    # variables. This is an informative prior - it may be non-informative as well.
    x_imputed = pm.TruncatedNormal(
        "x_imputed", mu=80, sigma=10, lower=0, observed=x_miss
    )

    # Curve fitting - this depends on what kind of regressor curve you're fitting.
    mu = beta0 + beta1 * x_imputed  # X also has its missing values taken care of

    # Likelihood for y - again flexible, adapt the distribution as needed.
    likelihood = pm.Normal("likelihood", mu=mu, tau=lik_tau, observed=y_miss)

    # Sampling takes the prior and likelihood distributions into account and returns
    # missing values for all predictors; it also predicts the target y in a single model.
    trace = pm.sample(2000)
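After sampling, each masked entry shows up in the trace as its own latent variable. The exact variable name depends on the PyMC version (for example x_imputed_missing in PyMC3, or x_imputed_unobserved in more recent releases), so treat the name below as an assumption and inspect the trace first. A minimal sketch, assuming a recent PyMC where pm.sample returns an ArviZ InferenceData:

import arviz as az

# List every sampled variable to confirm what the imputed entries are called.
print(az.summary(trace))

# Posterior-mean imputations for the masked entries of x (variable name assumed).
x_filled = trace.posterior["x_imputed_unobserved"].mean(dim=("chain", "draw"))
print(x_filled.values)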