Fraud detection using Machine learning

Fraud Detection as a Classification Problem

To illustrate how a fraud detection problem can be solved using machine learning, I will use data available on Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud
The purpose of this article is to discuss the biggest challenges in predicting fraudulent transactions and how to overcome them.

In machine learning, problems like fraud detection are typically framed as classification problems: predicting a discrete class label given an input observation. Other examples of classification problems include spam detectors, recommender systems, and loan default prediction.
For credit card payment fraud detection, the classification problem involves building models that are intelligent enough to accurately categorize transactions as either legitimate or fraudulent, based on transaction details such as amount, merchant, location, time and others.
Hackers and criminals around the world are continually looking for new ways to commit financial fraud. Relying exclusively on rule-based, traditionally programmed systems to detect financial fraud would not provide an adequate time-to-market, since new rules take time to write and deploy.
This is where machine learning shines as a solution for this kind of problem.
The primary obstacle in modeling fraud detection as a classification problem is that in real-world data, the vast majority of transactions are not fraudulent. This brings a big challenge: imbalanced data.
You can perform exploratory data analysis to find out how imbalanced your data is, using a few simple plots and groupings.

Group by class
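As a rough illustration of such a grouping, the sketch below counts rows per class with pandas. The tiny DataFrame is made up for the example; it only assumes that the label column is called "Class" (0 = legitimate, 1 = fraud), as in the Kaggle file.

```python
import pandas as pd

# Made-up stand-in for the real data; only the "Class" column name is
# taken from the Kaggle dataset (0 = legitimate, 1 = fraud).
df = pd.DataFrame({
    "Amount": [12.5, 3.0, 250.0, 40.0, 9.99, 1200.0, 18.0, 75.0, 5.0, 60.0],
    "Class":  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Number of rows per class and each class's share of the data.
class_counts = df.groupby("Class").size()
class_share = class_counts / len(df)

print(class_counts)
print(class_share)
```

On the real data, the same two lines show at a glance how small the fraudulent class is compared to the legitimate one.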

Dimensionality Reduction With t-SNE for Visualization

Visualizing our classes would be quite interesting and would show us whether they are clearly separable. However, it is not possible to produce a 30-dimensional plot using all of our predictors. Instead, using a dimensionality reduction technique such as t-SNE, we can project these higher-dimensional distributions into lower-dimensional visualizations.

Projecting our data set into a two-dimensional space, we are able to produce a scatter plot showing the clusters of fraudulent and non-fraudulent transactions:

Scatter plot shows imbalanced data
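A projection like this can be sketched with scikit-learn's TSNE class. The data below is randomly generated (300 "legitimate" and 15 "fraud" rows with 30 features each) and only stands in for the real predictors; the perplexity value is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 "legitimate" rows and 15 "fraud" rows, 30 features each.
X_legit = rng.normal(loc=0.0, scale=1.0, size=(300, 30))
X_fraud = rng.normal(loc=3.0, scale=1.0, size=(15, 30))
X = np.vstack([X_legit, X_fraud])
y = np.array([0] * 300 + [1] * 15)

# Project the 30-dimensional rows into 2 dimensions.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)

# embedding[:, 0] and embedding[:, 1] can be passed to a scatter plot,
# colored by y, to see how the two classes are distributed.
print(embedding.shape)
```

t-SNE is slow on large data sets, so on the full data it is common to run it on a sample of the majority class plus all fraudulent rows.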

Techniques to handle imbalanced data

The most common ways of dealing with imbalanced data are:

  1. Oversampling — SMOTE
  2. Undersampling — UnderSampler from imblearn

Oversampling

To oversample means to create observations in our data set that belong to the class that is underrepresented in our data, in this case fraudulent transactions.

One common technique is SMOTE — Synthetic Minority Over-sampling Technique. At a high level, SMOTE creates synthetic observations of the minority class (in this case, fraudulent transactions). At a lower level, SMOTE performs the following steps:

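  • Finding the k-nearest-neighbors for minority class observations (finding similar observations)
  • Randomly choosing one of the k-nearest-neighbors and using it to create a similar, but randomly tweaked, new observation: the new point is placed at a random position on the line segment between the original observation and the chosen neighbor.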
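The two steps above can be sketched in plain NumPy. This is a simplified illustration of the idea, not the imblearn implementation; the function name smote_sketch and its parameters are made up for the example.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # Pairwise Euclidean distances between minority observations.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Indices of the k nearest neighbors of every minority observation.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # pick a minority observation
        neighbor = rng.choice(neighbors[base])  # pick one of its k neighbors
        gap = rng.random()                      # random position between the two
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic

# Six made-up minority observations with two features each.
X_fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                    [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
X_new = smote_sketch(X_fraud, n_new=10, k=3)
print(X_new.shape)
```

Because every synthetic row lies between two existing minority rows, the new points stay inside the region already covered by the minority class.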

There are many SMOTE implementations out there. In our case, we will leverage the SMOTE class from the imblearn library. The imblearn library is a really useful toolbox for dealing with imbalanced data problems.
In Python I use these two packages: imblearn.over_sampling or smote-variants
Here is a good article about SMOTE and NEAR MISS: https://medium.com/@saeedAR/smote-and-near-miss-in-python-machine-learning-in-imbalanced-datasets-b7976d9a7a79

Undersampling

Undersampling works by sampling the dominant class to reduce its number of samples. A simple way to undersample your data is to randomly select a handful of samples from the class that is overrepresented.

A more systematic way is to use a library in Python, R, or any other tool of choice.
I use the imblearn.under_sampling module from the imblearn library. Its RandomUnderSampler class provides a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes. Another class in the same module, ClusterCentroids, works differently: it performs k-means clustering on the majority class and replaces each cluster of samples with its centroid.
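To make the random variant concrete, here is a small NumPy sketch of random undersampling. It is an illustration of the idea only; the helper random_undersample and the generated data are made up for the example and are not part of imblearn.

```python
import numpy as np

def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep n_keep randomly chosen majority rows and all other rows."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    # Sample without replacement so no majority row is kept twice.
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([kept_majority, other_idx]))
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 980 + [1] * 20)  # 980 majority rows, 20 minority rows

X_res, y_res = random_undersample(X, y, majority_label=0, n_keep=20)
print(np.bincount(y_res))
```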

Notice how the dominant class (the yellow) is undersampled on the rightmost plot.

under-sampling

Near miss

The general idea behind near miss is to sample only the points from the majority class that are necessary to distinguish it from the other classes.

NearMiss-1 selects samples from the majority class for which the average distance to the N closest samples of the minority class is smallest.

near miss 1

NearMiss-2 selects samples from the majority class for which the average distance to the N farthest samples of the minority class is smallest.
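Both selection rules can be sketched in a few lines of NumPy. This is a simplified illustration under the definitions above, not the imblearn NearMiss class; the function near_miss, its parameters, and the toy points are made up for the example.

```python
import numpy as np

def near_miss(X_major, X_minor, n_keep, n_neighbors=3, version=1):
    """Keep the n_keep majority rows with the smallest average distance
    to their N closest (version 1) or N farthest (version 2) minority rows."""
    # Distance from every majority sample to every minority sample.
    diffs = X_major[:, None, :] - X_minor[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    sorted_d = np.sort(dists, axis=1)
    if version == 1:
        selected = sorted_d[:, :n_neighbors]   # N closest minority samples
    else:
        selected = sorted_d[:, -n_neighbors:]  # N farthest minority samples
    avg = selected.mean(axis=1)
    keep = np.argsort(avg)[:n_keep]
    return X_major[keep]

X_minor = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
X_major = np.array([[0.5, 0.5], [5.0, 5.0], [1.0, 1.0], [10.0, 10.0]])

kept = near_miss(X_major, X_minor, n_keep=2, n_neighbors=2, version=1)
print(kept)
```

In this toy example the two majority points that sit next to the minority points are kept, while the two distant ones are dropped.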

Be careful when undersampling not to shrink the dominant class too much, or your model will be prone to underfitting. For example, say you have a positive class of 4000 samples and a negative class of only 200. If you undersample the positive class to 400, you need to make sure that the model you develop is generic enough to represent the remaining 3600 samples.
One way to solve this is by creating an ensemble model from different models trained with different undersampled data sets.
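One possible way to build such an ensemble is sketched below: each model is trained on all minority rows plus a different random subset of the majority class, and the predicted probabilities are averaged. The data is synthetic, and the choice of decision trees, five models, and a 0.5 threshold are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 95% class 0 and 5% class 1.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

models = []
for _ in range(5):
    # Each model sees all minority rows plus a different random majority subset.
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([minority_idx, sampled_majority])
    model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[idx], y[idx])
    models.append(model)

# Average the minority-class probabilities of all models, then threshold.
minority_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
y_pred = (minority_proba >= 0.5).astype(int)
print(y_pred.sum(), "rows flagged as minority class")
```

Taken together, the models see far more of the majority class than any single undersampled training set contains.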

Creating a Machine Learning Model for Fraud Detection

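Once we have reasonably balanced data, we can approach fraud detection like any supervised machine learning problem. If you resample before splitting, synthetic or duplicated rows can end up in both the train and the test set, so it is safer to resample only the training portion.
First, divide the data set into a train and a test set: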

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data, y, test_size=0.3, random_state=1)
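If the classes are still uneven at this point, one option (not used in the split above, so treat it as an assumption) is the stratify argument of train_test_split, which keeps the class ratio the same in both parts. The arrays below are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 950 rows of class 0 and 50 rows of class 1.
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 950 + [1] * 50)

# stratify=y keeps the 95/5 class ratio in both the train and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_train.mean(), y_test.mean())
```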

My preferred practice is to create an ensemble model and perform a grid search. In my experience, this gives the best results overall.

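from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Base estimators for the ensemble
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
AB = AdaBoostClassifier()
GBM = GradientBoostingClassifier()
ET = ExtraTreesClassifier()
SGD = SGDClassifier()  # left out of soft voting: the default hinge loss has no predict_proba
MLP = MLPClassifier()
XGB = XGBClassifier()
kNN = KNeighborsClassifier()

# Candidate grid for the MLP; it is not used in the search below
param_grid = [
    {
        'activation': ['identity', 'logistic', 'tanh'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(n,) for n in range(1, 22)],
    }
]

# Assumption: rescaledX is the standardized training set (this step is not shown in the original)
rescaledX = StandardScaler().fit_transform(X_train)

print('Grid Search - soft')
eclf = VotingClassifier(
    estimators=[('rf', clf2), ('gnb', clf3), ('AB', AB), ('GBM', GBM),
                ('ET', ET), ('MLP', MLP), ('XGB', XGB), ('kNN', kNN)],
    voting='soft',
)
# 'rf__n_estimators' addresses the n_estimators parameter of the 'rf' estimator
params = {'rf__n_estimators': [20, 1000]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=10, n_jobs=20)

grid.fit(rescaledX, y_train)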
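After fitting, the search object can be inspected and scored on the held-out data. The sketch below is a scaled-down version of the same pattern (two estimators, a small grid, synthetic data) so that it runs quickly; the estimator choice and grid values are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Small synthetic data set so the search finishes in seconds.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

eclf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=1)),
                ("gnb", GaussianNB())],
    voting="soft",
)
# The "rf__" prefix routes the parameter to the estimator named "rf".
grid = GridSearchCV(eclf, param_grid={"rf__n_estimators": [5, 20]}, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best parameter combination found
print(grid.best_score_)    # mean cross-validated score of that combination
test_accuracy = grid.score(X_test, y_test)  # score of the refitted best model
print(test_accuracy)
```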

Conclusions on predicting Fraud data using Machine learning

As you can see, once we identify the most challenging part of predicting fraudulent data (unbalanced samples), the problem comes down to solving a classification problem. As I stated before, when undersampling the data, take care to build ensemble models that represent most of the population that was left out of training.
Another thing to be careful about is when the undersampled class still has far too many representatives. In that case, you need business knowledge combined with outlier detection to create rules that classify some of the data (or newly arriving data).

There are also other ways to predict unbalanced data, for example using Kalman filters. This is something I’ll write about later.
