How to develop a scalable categorisation engine for open banking

Written by Jamie Sims · June 17th, 2020

Open banking relies on the availability of financial data, which arrives in many shapes and sizes. Adding bank connections is one thing, but increasingly, clients are demanding more from Open Banking: the ability to derive meaning and insights from transactional data.

A cornerstone of this is the ability to categorise transactions into income and expense categories. By creating and utilising services like these, our clients can better understand their customers’ financial behaviours.

In this categorisation series, we share top tips and the challenges faced when we embarked on our journey to develop Yapily’s categorisation engine.

The “Product” behind the Product

You never know how big something is until you start building it. That was certainly true when we embarked on our journey to build our Categorisation Engine. As a relatively new Product Manager for Machine Learning products, it was a steep learning curve for me.

The journey to develop this product was as much about developing the individual parts of a larger system as about the end result. After initial scoping sessions, we found that what we actually needed to design was:

  • A system to develop Transaction Categorisation Models.
  • Our first model: to target UK Retail transactions.

This first point is particularly important for a business like Yapily - a financial gateway that will need to serve multiple markets and therefore handle financial data from a variety of sources. The requirement is as much about building a scalable process to deliver multiple categorisation models for future markets as it is about the first model itself.

Given this was our first step, it quickly became apparent that success hinged on developing our core competencies - the “Products behind the Product” - that could be re-used or extended alongside our final deliverable: a Categorisation Engine for UK Retail Banks.

  • Data Treatment
  • Data Classification
  • Model Training

Data Treatment:

Our first challenge was the varying quality of data from the banks, and finding a core set of features from which we could build our Machine Learning models. Given that we were targeting the UK for our first release, the Open Banking spec went some way towards solving this problem, but the number of optional fields and the subtle differences in what each bank returned still presented a challenge.
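To make this concrete, here is a minimal sketch of what normalising raw bank payloads onto a fixed feature set might look like. The field names are assumptions, loosely modelled on the UK Open Banking transaction resource; real payloads vary per bank, which is exactly why the fallbacks and optional fields are needed.

```python
# Sketch: map one bank's transaction payload onto a fixed core feature set,
# tolerating the optional fields that differ from bank to bank.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransactionFeatures:
    amount: float
    is_debit: bool
    description: str          # free-text narrative, the main ML feature
    merchant: Optional[str]   # often absent, so keep it optional


def normalise(raw: dict) -> TransactionFeatures:
    """Field names here are illustrative assumptions, not a real bank schema."""
    # Fall back through narrative fields, since banks populate different ones.
    info = raw.get("TransactionInformation") or raw.get("Reference") or ""
    merchant = (raw.get("MerchantDetails") or {}).get("MerchantName")
    return TransactionFeatures(
        amount=float(raw["Amount"]["Amount"]),
        is_debit=raw.get("CreditDebitIndicator", "Debit") == "Debit",
        description=info.strip().upper(),
        merchant=merchant,
    )


print(normalise({
    "Amount": {"Amount": "4.50", "Currency": "GBP"},
    "CreditDebitIndicator": "Debit",
    "TransactionInformation": "Costa Coffee London",
}))
```

The point of the dataclass is that everything downstream - anonymisation, labelling, training - sees one stable shape, regardless of which bank produced the data.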

The second challenge was the removal of personal information from transactional data before we classified it, ensuring that only anonymised data is accessible to our internal teams for classification.
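As a rough illustration of the kind of masking involved, the sketch below replaces obvious identifiers in a transaction narrative before a human ever sees it. The patterns and tokens are assumptions for illustration only - real PII removal needs far more than a few regexes (names, addresses, postcodes, and so on).

```python
import re

# Sketch: mask obvious personal identifiers in a transaction narrative.
# Patterns are illustrative assumptions, not a complete PII strategy.
PATTERNS = [
    (re.compile(r"\b\d{2}-\d{2}-\d{2}\b"), "<SORT_CODE>"),  # UK sort codes
    (re.compile(r"\b\d{8,}\b"), "<ACCOUNT_NO>"),            # long digit runs
    (re.compile(r"\*{2,}\d{4}\b"), "<CARD>"),               # masked card tails
]


def anonymise(description: str) -> str:
    for pattern, token in PATTERNS:
        description = pattern.sub(token, description)
    return description


print(anonymise("TFR 12-34-56 12345678 REF RENT"))
# -> "TFR <SORT_CODE> <ACCOUNT_NO> REF RENT"
```

Running the masking step before classification means labellers only ever see anonymised narratives, while the category-relevant words (merchant names, “RENT”, “TFR”) survive for training.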

Data Classification:

Having experimented with both manual and automatic ways of classifying transactions, it became clear that manually classified data was far better for training our model. Given the size of the task, we developed a lightweight tool that enabled us to crowdsource labellers from “willing” participants at Yapily - eventually running a series of Labelling Grand Prix at work to create training sets from anonymised transactional data.

Model Training:

Having recently watched “The Professor and the Madman” (I’d highly recommend it as midweek COVID movie fodder to fill up an evening), I was struck by the similarities between their task and ours. In the film, a group of scholars embarks on the daunting task of creating the Oxford English Dictionary. They quickly come to the same conclusion that we did: “the work would never be finished”.

“Building a categorisation engine is much like writing a dictionary: its work is never truly done”

Once we reached our first milestone, it quickly became apparent that models require re-training and constant optimisation: new merchants can appear at any time, and new categories of spending may emerge. As a result, the Data and Insights team needed to invest in tooling, and opted to use Kubeflow to help us continue to refine our model.
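One simple signal for deciding when re-training is due is the share of recent predictions the model is not confident about - unseen merchants tend to show up as low-confidence predictions. The sketch below is an illustration of that idea only; the thresholds are assumptions, and the real pipeline runs on Kubeflow rather than an ad-hoc check like this.

```python
# Sketch: flag a model for re-training when too many recent predictions are
# low-confidence (e.g. transactions from merchants it has never seen).
# Both thresholds are illustrative assumptions.


def needs_retraining(confidences: list,
                     threshold: float = 0.5,
                     max_low_share: float = 0.1) -> bool:
    """True if the share of low-confidence predictions exceeds the budget."""
    if not confidences:
        return False
    low = sum(1 for c in confidences if c < threshold)
    return low / len(confidences) > max_low_share


print(needs_retraining([0.9, 0.8, 0.2, 0.95, 0.1]))  # True: 2/5 below 0.5
print(needs_retraining([0.9, 0.8, 0.85, 0.95]))      # False
```

In a managed pipeline, a check like this would sit at the end of a scheduled evaluation run and trigger the training pipeline automatically.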

What did we learn?

For me personally, I took three main lessons from this experience:

First, thinking of the Product as a system with component parts, and applying systems thinking principles, really helped us plan the Product roadmap for our initial delivery. It also helped us identify common components that we could use again if Yapily decided to develop a categorisation model for another market.

Secondly, deciding on a Taxonomy by which we can classify transactions into various categories was extremely contentious. Some of the common problems we ran into:

  • Different people segment their spend differently.
  • Even when giving people a finite choice of categories, there are still differences in how people categorise transactions.
  • Organising spend into categories is difficult e.g. how do you categorise a trip to the post office?

After much trial and error, adapting card sorting exercises and lots of testing, we arrived at a taxonomy that made sense to us.

Finally, there was the importance of data. To continue the theme of film references, Robert Downey Jr (playing Sherlock Holmes) summarises this lesson nicely: “Data, data, data - I cannot make bricks without clay”. Obtaining good-quality, diverse data to train our model was a huge problem to solve. As well as controlling for diversity of transactions, we had a larger business problem: how do we classify data at scale, not just now but in the future?

The answer came in the form of developing our labelling tool into its own standalone product - turning what started out as a solution to an internal problem into what I believe is a competitive advantage for Yapily.

What’s next?

This is a really exciting time for Yapily and Data Science. If you look at the rate at which banks are producing Open Banking APIs across the UK and Europe, combined with the improved quality of data compared to screen scraping, this space really lends itself to Machine Learning and AI enrichments.

The most exciting use case from my perspective is in the lending sector. The combination of quality Open Banking API data and an accurate categorisation model can be extremely powerful in lending decisions, and presents a win-win situation:

  • Customers benefit by getting a more tailored lending decision.
  • Businesses benefit by being able to lend more confidently, or to serve customer segments that they couldn’t previously.

In terms of where we can take categorisation at Yapily, model refinement and using the model for enrichment purposes is what I find most interesting. Understanding how often a payment occurs, and what type of transaction it is, is just one example of how combining enrichment services can offer real insight into customers’ finances.
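As a final sketch, here is one way the “how often a payment occurs” idea could work: treat a payee as recurring when the gaps between successive transactions are roughly regular, as with rent or a subscription. The tolerance is an illustrative assumption, not a production rule.

```python
from datetime import date

# Sketch: flag a payee as recurring when the gaps between successive payments
# are roughly equal (monthly rent, subscriptions, etc.).
# The tolerance value is an illustrative assumption.


def is_recurring(dates: list, tolerance_days: int = 5) -> bool:
    """True if consecutive gaps between payments are roughly equal."""
    if len(dates) < 3:
        return False  # need at least two gaps to judge regularity
    ordered = sorted(dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return max(gaps) - min(gaps) <= tolerance_days


rent = [date(2020, 3, 1), date(2020, 4, 1), date(2020, 5, 2)]
print(is_recurring(rent))  # True: gaps of 31 and 31 days
```

Combined with a category label (“Rent”, “Utilities”), a signal like this is the raw material for the affordability views that make Open Banking data useful to lenders.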

What are you going to build today?