Demystify classify
- Yash Raj
- Jun 23, 2020
- 3 min read
Updated: Jun 29, 2020
How the hell is Instagram showing me ads for the shirts I viewed on Amazon?
OR
Why do banks keep declining my credit card requests?
Well, by the end of this post you will not only have the why answered but also a roadmap for implementing a classification model.
The beginning
One thing society has to thank gamblers for, apart from paying huge taxes on alcohol and propping up the state's economy (pun intended), is the use of statistical models. Thanks to increased processing power and the accessibility of smart devices, almost everybody now uses products based on artificial intelligence or machine learning, which is nothing but glorified statistics.
The reason
Let's say your friend got a big discount voucher from Uber but you didn't, or your friend is getting match after match on Tinder while you are shown only a few profiles. Well, this is not happening by chance but by KARMA: it's your past features (activities) that judge your present digital life. Marketers use models to classify people based on their profiles.
The recipe to cook a logistic model
Let me give you a roadmap that you can follow while building a classification model from scratch.
Data cleaning
Raw data is like an engineering student's room: everything you need is there, along with the garbage, and cleaning it will be the most tiring job you come across. I believe it's the part where not only your statistical knowledge but also your intuition is tested. Stay vigilant and look into the following methods while building a logistic model:
Imputation: Because of missing data points, your model may not represent the dependent variable closely. Common practices are mean, median, and proxy imputation. The right choice largely depends on the data type, e.g. one can opt for mode or proxy imputation to fill null values of a categorical variable.
Outliers: Your model is only as good as your data, and the presence of outliers can significantly change a regression-based model's output, so it becomes imperative to deal with them in the first phase. Methods used for identifying and removing outliers are:
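To make the idea concrete, here is a minimal sketch of mean, median, and mode imputation using only Python's standard library. The `impute` helper and the toy `ages` and `cities` columns are made up for illustration, and missing values are assumed to be stored as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Pick the summary function by name; mode also works for categorical data.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [22, None, 35, 41, None, 22]        # hypothetical numeric column with gaps
cities = ["Pune", None, "Delhi", "Pune"]   # hypothetical categorical column with a gap

ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
cities_mode = impute(cities, "mode")
```

Mean and median only make sense for numeric columns, which is why the categorical column is filled with its most frequent value instead.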
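One widely used rule of thumb, assumed here purely for illustration, is the interquartile range (IQR) fence: anything further than 1.5 times the IQR below the first quartile or above the third quartile is treated as an outlier. The function names, the multiplier `k`, and the `incomes` data below are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences of the IQR rule."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

incomes = [32, 35, 36, 38, 40, 41, 43, 45, 400]   # made-up data, one extreme value
cleaned = drop_outliers(incomes)
```

Whether to drop, cap, or keep a flagged point is still a judgment call; the rule only tells you where to look.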

Balancing: The training dataset may have a disproportionate number of members in each group, as in the case shown in the figure below. In such scenarios, a model built on the imbalanced dataset tends to produce biased results.

Two resampling approaches to get a balanced dataset are:
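Assuming the two approaches meant here are random undersampling of the majority class and random oversampling of the minority class, a minimal standard-library sketch could look like this. The row format (dicts with a `label` key) and all function names are assumptions for the example.

```python
import random

def split_by_class(rows, label_key="label"):
    """Group rows by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    return groups

def undersample(rows, label_key="label", seed=0):
    """Shrink every class to the size of the smallest one (sampling without replacement)."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    target = min(len(g) for g in groups.values())
    balanced = [r for g in groups.values() for r in rng.sample(g, target)]
    rng.shuffle(balanced)
    return balanced

def oversample(rows, label_key="label", seed=0):
    """Grow every class to the size of the largest one by re-drawing rows with replacement."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))
    rng.shuffle(balanced)
    return balanced

# Made-up imbalanced data: 2 positives, 8 negatives.
rows = [{"x": i, "label": 1 if i < 2 else 0} for i in range(10)]
under = undersample(rows)
over = oversample(rows)
```

Undersampling throws data away and oversampling repeats rows, so it is usually safest to resample the training split only and leave the evaluation data untouched.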

Regularisation: The data we use to train our models can be represented as the sum of a true value and a random error. When we train a model on data of very high dimension, the model tries to explain the true values along with the attached random errors, i.e. it overfits.
Such models may show a very low in-sample error and a very high R-squared, and may trick you into thinking you are a pro, but believe me, if you trust such models you are nothing but a noob.

These models have a very high out-of-sample error and fail to predict the dependent variable. A few approaches to reduce dimensionality and limit overfitting are:
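One common approach, assumed here as an illustration, is to add a penalty on the size of the coefficients to the logistic loss. In the notation below (all symbols are introduced for this sketch), $y_i \in \{0, 1\}$ is the label, $x_i$ the feature vector, $\beta$ the coefficients, and $\lambda \ge 0$ the penalty strength:

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
\;+\; \lambda \sum_{j=1}^{d} |\beta_j|^{q},
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```

With $q = 1$ this is the lasso (L1) penalty, which can push some coefficients exactly to zero and so trims the dimensions; with $q = 2$ it is the ridge (L2) penalty, which shrinks all coefficients without removing any. Setting $\lambda = 0$ gives back the plain logistic model.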

Model building
Statistical model building is overrated! I mean, you can build a model in just a line of code, but will that model be useful or provide the insights you need? The answer is a big NO.
Model building is no less than dating: you have got to be patient and dedicated, able to pick up on hidden cues, and willing to keep updating your approach.
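For a feel of what that one line of code hides, here is a rough from-scratch sketch of a logistic model fitted by batch gradient descent. The learning rate, epoch count, function names, and toy data are all assumptions for illustration, not tuned values.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, xi):
    """Probability of class 1 for one row; w is [bias, w1, ..., wd]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the average log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi      # gradient of log loss w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Made-up data: one standardised feature; x=0.5 and x=-0.5 carry the
# "wrong" label on purpose so the two classes overlap.
X = [[-2.0], [-1.5], [-1.0], [0.5], [-0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights = fit_logistic(X, y)
```

Calling `predict_proba(weights, [1.0])` then gives the estimated probability that a new row with feature value 1.0 belongs to class 1.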
Bootstrapping
The model you created may have some random bias attached to it; however, this can be limited using bootstrapping. Under bootstrapping, multiple models are fitted on different training sets, each resampled from the original data with replacement. The coefficients of all the models are then averaged to smooth out the random bias.
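The averaging idea can be sketched as follows: draw rows with replacement, refit, and average the coefficients. This is a toy version under stated assumptions: a single feature, a small hand-rolled gradient-descent fit, and made-up data, with `n_models`, `lr`, and `epochs` picked arbitrarily.

```python
import math
import random

def fit_logistic_1d(xs, ys, lr=0.1, epochs=500):
    """Return (intercept, slope) fitted by gradient descent on the log loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def bootstrap_coefficients(xs, ys, n_models=50, seed=0):
    """Fit one model per bootstrap sample and average the coefficients."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    for _ in range(n_models):
        # A bootstrap sample: n row indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        fits.append(fit_logistic_1d([xs[i] for i in idx], [ys[i] for i in idx]))
    avg_b0 = sum(f[0] for f in fits) / n_models
    avg_b1 = sum(f[1] for f in fits) / n_models
    return avg_b0, avg_b1

# Made-up data with overlapping classes.
xs = [-2.0, -1.5, -1.0, 0.5, -0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
intercept, slope = bootstrap_coefficients(xs, ys)
```

The spread of the individual fits is also worth a look: coefficients that swing wildly from one resample to the next are a hint that the model leans on noise.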

Model evaluation
You need some metrics that indicate how good your model is and whether it needs to be trained again.
Well, there are certain metrics that you can rely on; a few of them are:
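Commonly used ones, assumed here for illustration, include accuracy, precision, recall, and the F1 score, all of which fall out of the confusion matrix. A minimal sketch with made-up labels and hypothetical function names:

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions
metrics = classification_metrics(y_true, y_pred)
```

On an imbalanced dataset, accuracy alone can look great while the model misses most of the minority class, so it helps to read precision and recall alongside it.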

These metrics are just the tip of the iceberg, and one should not follow them blindly. However, you can tag along and subscribe to our mailing list; we will be explaining many more metrics and algorithms used in business analytics.