Demystify classify

Yash Raj
Jun 23, 2020
3 min read

Updated: Jun 29, 2020

How the hell is Instagram showing me ads for shirts I viewed on Amazon?

Why do banks keep declining my credit card requests?

Well, by the end of this post you will not only have the "why" answered but also have a roadmap for implementing a classification model.

The beginning

One thing society has to thank gamblers for, apart from paying huge taxes on alcohol and helping the state's economy (pun intended), is the use of statistical models. Thanks to increased processing power and the accessibility of smart devices, almost everybody now uses products based on artificial intelligence or machine learning, which is nothing but glorified statistics.

The reason

Let's say your friend got a big discount voucher from Uber but you didn’t. Or your friend is getting match after match on Tinder while you are shown only a few profiles. Well, this is not happening by chance but by KARMA: it's your past features (activities) that judge your present digital life. Marketers use models to classify people based on their profiles.

The recipe to cook a logistic model

Let me give you a roadmap that you can follow while building a classification model from scratch.

Data cleaning

Raw data is similar to an engineering student’s room: everything you need is there, plus the garbage, and cleaning it will be the most tiring job you are going to come across. I believe it's the part where not only your statistical knowledge but also your intuition is tested. Be vigilant and look into the following methods while building a logistic model.

Imputation: Missing data points can keep your model from representing the dependent variable closely. Common practices are mean, median, and proxy imputation. The right choice largely depends on the data type; e.g., one can opt for mode or proxy imputation to fill the null values of a categorical variable.
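To make the imputation choices concrete, here is a minimal sketch using only Python's standard library. The `impute` helper and the `income` and `city` columns are made up for illustration; in a real project you would more likely reach for a dataframe library.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Return a copy of `values` with every None replaced by a fill value.

    Use "mean" or "median" for numeric columns and "mode" for categorical ones.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns, invented for this example.
income = [30, 42, None, 38, 250]         # numeric, with one extreme value
city = ["Pune", "Delhi", None, "Delhi"]  # categorical

income_mean = impute(income, "mean")      # the gap is filled with 90
income_median = impute(income, "median")  # the gap is filled with 40
city_mode = impute(city, "mode")          # the gap is filled with "Delhi"
```

Notice how the single extreme income drags the mean up to 90 while the median stays at 40. That is one reason the median is often the safer choice for skewed numeric columns.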

Outliers: Your model is only as good as your data, and the presence of outliers can significantly change the output of a regression-based model, so it becomes imperative to deal with them in the first phase. Methods used for identifying and removing outliers are:

Figure: Outlier detection methods (methods used for outlier detection and replacement; IQR and Isolation Forest discussed).
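The IQR rule can be sketched in a few lines. This is a rough illustration, assuming the common 1.5 × IQR fences and replacing outliers by clipping them to the nearest fence; the `spend` values and the helper names are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, k=1.5):
    """Replace every value outside the fences with the nearest fence."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]

# Hypothetical daily spend figures with one suspicious entry.
spend = [12, 15, 14, 13, 16, 15, 14, 120]

lower, upper = iqr_bounds(spend)
outliers = [v for v in spend if v < lower or v > upper]  # [120]
cleaned = clip_outliers(spend)
```

Clipping is only one way to "replace" an outlier. Dropping the row or treating the value as missing and imputing it are equally valid, depending on why the value is extreme.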

Balancing: The training dataset may contain a disproportionate number of members from one group, as in the case shown in the figure below. In such scenarios, a model built on the imbalanced dataset tends to produce biased results.

Figure: Imbalanced dataset (an example of how an imbalanced dataset may look).

Two resampling approaches for getting a balanced dataset are:

Figure: Data resampling methods for balancing a dataset (upsampling and downsampling discussed).
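Both resampling directions can be sketched with the standard library alone. The `resample` helper and the 5-versus-95 toy dataset below are assumptions made for illustration, not a reference implementation.

```python
import random
from collections import Counter

def resample(rows, label_of, how="up", seed=0):
    """Balance `rows` across classes.

    how="up":   duplicate random rows of smaller classes up to the largest class size.
    how="down": keep a random subset of larger classes, down to the smallest class size.
    """
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    sizes = [len(g) for g in groups.values()]
    target = max(sizes) if how == "up" else min(sizes)

    balanced = []
    for group in groups.values():
        if len(group) < target:
            # Keep every original row and add random duplicates.
            balanced += group + rng.choices(group, k=target - len(group))
        elif len(group) > target:
            # Sample without replacement so no row is repeated.
            balanced += rng.sample(group, target)
        else:
            balanced += group
    rng.shuffle(balanced)
    return balanced

# Hypothetical dataset: (row_id, label) with 5 positives and 95 negatives.
rows = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 100)]

upsampled = resample(rows, lambda r: r[1], how="up")
downsampled = resample(rows, lambda r: r[1], how="down")
up_counts = Counter(label for _, label in upsampled)      # 95 of each class
down_counts = Counter(label for _, label in downsampled)  # 5 of each class
```

Resample only the training split. If duplicated rows leak into the test set, the evaluation will look better than it really is.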

Regularisation: The data we use to train our models can be represented as the sum of a true value and a random error. When we train a model on very high-dimensional data, it tries to explain the true values along with the attached random errors, i.e., it overfits.

Such models may show a very low in-sample error and a very high R-squared, and may trick you into thinking you are a pro, but believe me, if you trust such models you are nothing but a noob.

Figure: Fitting types (underfit, optimal, and overfit models).

These models have a very high out-of-sample error and fail to predict the dependent variable. A few approaches to reduce dimensionality and limit overfitting are:

Figure: Dimension reduction techniques (techniques to reduce the data dimension; PCA, MDS, PCoA, and VIF discussed).
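Of the techniques named above, VIF is the easiest to sketch by hand. The full variance inflation factor regresses each predictor on all the others and computes 1 / (1 - R²). The simplified version below assumes only two predictors at a time, where that R² is just the squared Pearson correlation. The feature names and values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def pairwise_vif(x, y):
    """VIF of one predictor given a single other predictor: 1 / (1 - r^2)."""
    r_squared = pearson_r(x, y) ** 2
    return float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)

# Hypothetical predictors for a credit model.
age = [25, 32, 41, 29, 50, 38]
years_employed = [2, 8, 19, 6, 26, 15]  # moves almost in lockstep with age
monthly_visits = [3, 7, 2, 7, 3, 5]     # only loosely related to age

vif_redundant = pairwise_vif(age, years_employed)
vif_independent = pairwise_vif(age, monthly_visits)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is nearly redundant and is a candidate for dropping. Here `years_employed` lands far above that range while `monthly_visits` stays close to 1.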

Model building

Statistical model building is overrated! I mean, you can build a model in just one line of code, but will that model be useful or provide the insights needed? The answer is a big NO.

Model building is no less than dating: you have to be patient and dedicated, able to pick up on hidden cues, and willing to keep updating your approach.

Bootstrapping

The model you created may have some random bias attached to it; however, this can be limited using bootstrapping. Under bootstrapping, multiple models are fitted on different training sets, each drawn from the original data by sampling with replacement. The coefficients of all the models are then averaged to reduce the random bias.

Figure: Resampling techniques.
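Here is a bare-bones sketch of that idea: a one-feature logistic regression fitted by plain gradient descent, refitted on bootstrap samples, with the coefficients averaged at the end. The data is synthetic, and the learning rate, epoch count, and number of models are arbitrary assumptions chosen to keep the example fast.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit p(y=1 | x) = sigmoid(b0 + b1*x) by batch gradient descent on log loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y
            grad0 += error
            grad1 += error * x
        b0 -= lr * grad0 / n
        b1 -= lr * grad1 / n
    return b0, b1

def bootstrap_fit(xs, ys, n_models=20, seed=0):
    """Fit one model per bootstrap sample and average the coefficients."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    for _ in range(n_models):
        # Draw n row indices with replacement: one bootstrap sample.
        idx = [rng.randrange(n) for _ in range(n)]
        fits.append(fit_logistic([xs[i] for i in idx], [ys[i] for i in idx]))
    avg_b0 = sum(b0 for b0, _ in fits) / n_models
    avg_b1 = sum(b1 for _, b1 in fits) / n_models
    return avg_b0, avg_b1, fits

# Synthetic data: the chance of y = 1 rises with x (true slope assumed to be 1.5).
data_rng = random.Random(42)
xs = [data_rng.uniform(-3, 3) for _ in range(120)]
ys = [1 if data_rng.random() < sigmoid(1.5 * x) else 0 for x in xs]

avg_b0, avg_b1, fits = bootstrap_fit(xs, ys)
```

The spread of the individual slopes in `fits` is useful in its own right: a wide spread hints that the coefficient is unstable and should not be over-interpreted.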

Model evaluation

You need some metrics that indicate how good your model is and whether it needs to be trained again.

Well, there are certain metrics that you can rely on. A few of them are:

Figure: Model evaluation (brief information about the metrics that evaluate model performance; R-squared, McFadden R-squared, ROC, AUC, and F1 score discussed).
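Three of these metrics are short enough to write out by hand. The sketch below assumes binary labels coded as 0 and 1 and a 0.5 cut-off for turning scores into class predictions; the labels and scores are made up for illustration.

```python
import math

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcfadden_r2(y_true, probs):
    """1 - LL(model) / LL(null), where the null model always predicts the base rate."""
    base = sum(y_true) / len(y_true)
    ll_model = sum(math.log(p if t == 1 else 1 - p) for t, p in zip(y_true, probs))
    ll_null = sum(math.log(base if t == 1 else 1 - base) for t in y_true)
    return 1 - ll_model / ll_null

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1 = f1_score(y_true, y_pred)      # 0.75
auc = roc_auc(y_true, scores)      # 0.9375
r2 = mcfadden_r2(y_true, scores)
```

F1 depends on the cut-off you pick, while AUC looks only at how well the scores rank positives above negatives. That is why the two can disagree about which model is better.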

These metrics are just the tip of the iceberg, and one should not follow them blindly. However, you can tag along and subscribe to our mailing list; we will be explaining many more metrics and algorithms used in business analytics.
