5 steps to prep your data for machine learning models to tackle financial fraud

In our previous post, we shared how financial crime impacts financial institutions around the world and how institutions are turning to advanced machine learning and data science for solutions. In this post, we will dive deeper into how institutions use data to get ahead of financial crime.

Before you can even start to look at ways of using machine learning (ML) to solve a problem, you need access to the data. More is often better, but if you are looking to build supervised models then you need data that has been validated and verified to optimize the process. Getting the right data at the right time is a major hurdle for most organizations. Let’s look at how to accomplish this.

Step 1: Data access

The first step is to unleash all of the data that a company owns. In today’s world, this needs to include real-time data, transactional data, CRM data, historical data, social data, geodata, demographic data, and more. Instead of cramming it into data lakes or data warehouses, often a better way of making data accessible is using a data virtualization tool, accessing it efficiently from its source.

Step 2: Wrangle, validate, rinse, repeat

Accessing the data is just the start. Data from disparate systems needs to be wrangled into a common format so that connections can be made across sources. Data cleansing and master data management enhance the value of the data. For use in the world of ML, data is best when augmented with additional context. Added values such as the location where a payment originated, relationships between payees and target accounts, frequency patterns, and dormancy times can all help reveal sophisticated new fraud patterns. Augmenting verified transactional data before feeding it to supervised models gives those models more features to assess, which helps build better models.
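
As a rough illustration of the wrangling and augmentation described above, here is a minimal Python sketch. The two source systems ("core_banking" and "card_system"), their field names, and the 24-hour frequency window are all assumptions made up for this example, not part of any specific product.

```python
from datetime import datetime, timedelta, timezone


def normalize(record, source):
    """Map a source-specific record onto one common schema.

    Source names and field names are hypothetical.
    """
    if source == "core_banking":
        return {
            "account": record["acct_id"],
            "amount": float(record["amt"]),
            "ts": datetime.fromisoformat(record["timestamp"]),
        }
    if source == "card_system":
        return {
            "account": record["account"],
            "amount": record["amount_cents"] / 100,
            "ts": datetime.fromtimestamp(record["epoch"], tz=timezone.utc),
        }
    raise ValueError(f"unknown source: {source}")


def augment(transactions):
    """Add dormancy and frequency features to each transaction."""
    history = {}  # account -> timestamps of earlier transactions
    out = []
    for tx in sorted(transactions, key=lambda t: t["ts"]):
        earlier = history.setdefault(tx["account"], [])
        # Days since the previous transaction on this account (None if first).
        dormancy_days = (tx["ts"] - earlier[-1]).days if earlier else None
        # Number of earlier transactions on this account in the last 24 hours.
        window_start = tx["ts"] - timedelta(hours=24)
        tx_last_24h = sum(1 for ts in earlier if ts >= window_start)
        out.append({**tx, "dormancy_days": dormancy_days, "tx_last_24h": tx_last_24h})
        earlier.append(tx["ts"])
    return out


raw = [
    ("core_banking", {"acct_id": "A1", "amt": "25.00",
                      "timestamp": "2024-01-01T10:00:00+00:00"}),
    ("card_system", {"account": "A1", "amount_cents": 990000,
                     "epoch": 1711965600}),
]
rows = augment([normalize(rec, src) for src, rec in raw])
```

In this made-up sample, the second payment arrives after a long quiet period on the account, which is the kind of dormancy signal a model can then pick up as a feature.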

Step 3: Get your data in shape through categorization

Data is the root of model accuracy. More data is good for unsupervised models, but for supervised model training, it also needs to be accurate. If the source data is incorrectly categorized, it can have a huge impact on model effectiveness, so get the data in shape before creating your models.
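
One way to catch categorization problems before training is a simple label audit. The sketch below is only an assumption about what such a check might look like: the label spellings, the "tx_id" and "label" field names, and the two canonical categories are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from label variants found in source systems
# to one canonical category per class.
CANONICAL = {
    "fraud": "fraud", "fraudulent": "fraud", "f": "fraud",
    "legit": "legitimate", "legitimate": "legitimate", "ok": "legitimate",
}


def audit_labels(rows):
    """Normalize labels; set aside rows whose label cannot be mapped."""
    clean, rejected = [], []
    for row in rows:
        label = CANONICAL.get(str(row.get("label", "")).strip().lower())
        if label is None:
            rejected.append(row)
        else:
            clean.append({**row, "label": label})
    return clean, rejected, Counter(r["label"] for r in clean)


def conflicting_ids(rows):
    """Return transaction ids that appear with more than one label."""
    seen, conflicts = {}, set()
    for row in rows:
        first_label = seen.setdefault(row["tx_id"], row["label"])
        if first_label != row["label"]:
            conflicts.add(row["tx_id"])
    return conflicts


sample = [
    {"tx_id": 1, "label": "Fraudulent"},
    {"tx_id": 2, "label": " ok "},
    {"tx_id": 2, "label": "fraud"},
    {"tx_id": 3, "label": "unknown"},
]
clean, rejected, counts = audit_labels(sample)
conflicts = conflicting_ids(clean)
```

Rejected rows and conflicting ids would go back to a human for review rather than into the training set.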

Step 4: Build both supervised and unsupervised models to detect fraud

The next step is to build models that can detect fraud. This is done with training datasets pulled from the vast array of historical data sources that you have so lovingly preserved. Supervised models (which use validated “good” data) and unsupervised models are best used in combination to build the heart of an artificial intelligence system. Depending on the use case, the choice of algorithm may be obvious, or many different algorithms may need to be validated before the best combination is found. Tools like AutoML in Spotfire® Data Science can speed up this process and help the data science team select the models that give the best results.
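
To make the idea of combining the two model types concrete, here is a deliberately tiny sketch. It uses stand-ins rather than real ML algorithms: a label-based fraud rate per channel plays the supervised part, and a z-score outlier measure on amounts plays the unsupervised part. The field names, the cap of 4 standard deviations, and the "take the higher of the two" rule are all assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_supervised(labeled):
    """Learn the observed fraud rate per channel from labeled history."""
    totals, frauds = {}, {}
    for row in labeled:
        ch = row["channel"]
        totals[ch] = totals.get(ch, 0) + 1
        frauds[ch] = frauds.get(ch, 0) + (row["label"] == "fraud")
    return {ch: frauds[ch] / totals[ch] for ch in totals}


def fit_unsupervised(amounts):
    """Summarize what 'normal' amounts look like, without any labels."""
    return mean(amounts), pstdev(amounts)


def score(tx, rates, amount_stats, z_cap=4.0):
    """Combine both signals into one score between 0 and 1."""
    mu, sigma = amount_stats
    z = abs(tx["amount"] - mu) / sigma if sigma else 0.0
    anomaly = min(z, z_cap) / z_cap            # unsupervised signal
    known_risk = rates.get(tx["channel"], 0.0)  # supervised signal
    return max(anomaly, known_risk)


history = [
    {"amount": 20, "channel": "card", "label": "legitimate"},
    {"amount": 30, "channel": "card", "label": "legitimate"},
    {"amount": 25, "channel": "wire", "label": "fraud"},
    {"amount": 25, "channel": "wire", "label": "legitimate"},
]
rates = fit_supervised(history)
stats = fit_unsupervised([row["amount"] for row in history])
new_score = score({"amount": 5000, "channel": "card"}, rates, stats)
```

The point of the combination is that the supervised part flags patterns already seen in verified data, while the unsupervised part flags transactions that simply look unusual, even on a channel with no fraud history.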

Step 5: But it doesn’t stop there…

These models, once trained, validated, and approved, need to be deployable on demand into live systems to inspect transactions in real time. The fraud system then scores each transaction against the model and uses thresholds to determine the processing path. The results need to be monitored for effectiveness, and verified outcomes need to be fed back into the model building process as known “good” data to constantly refine the models and adapt to changing patterns.
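
The threshold-based routing and feedback loop described above could be sketched as follows. The threshold values (0.5 and 0.9), the three processing paths, and the function names are hypothetical; real systems would tune these against monitored results.

```python
def route(score, review_at=0.5, block_at=0.9):
    """Pick a processing path for a transaction based on its fraud score."""
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "approve"


def record_outcome(training_rows, tx, confirmed_label):
    """Feed a verified outcome back into the data used for retraining."""
    training_rows.append({**tx, "label": confirmed_label})
    return training_rows


training_rows = []
tx = {"tx_id": 42, "amount": 5000, "score": 0.93}
path = route(tx["score"])
# Later, once an investigator has confirmed the outcome:
record_outcome(training_rows, tx, "fraud")
```

Keeping the thresholds as parameters makes it easier to adjust them as monitoring shows how many transactions end up in each path.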

Augmented investigations

A good fraud prevention platform enables you to monitor transactions as they occur and easily generate real-time views of transactions and frauds. But more than this, it needs to assist the inevitable human investigations by ensuring that the context and data of suspicious transactions are captured and logged for the investigators. This expedites investigations across your organization, so suspect transactions can be evaluated and the right decisions made quickly.