(By Nick Lik Khian Yeo)
In the world of finance, you will never run out of reading material. Dozens of documents, reports, and studies show up in your inbox daily demanding your attention. However, not all of these are relevant to your interests, especially if you are an analyst for a specific product in a specific region. In this project which was completed in Oct’ 19, I was part of a team that trained and deployed a machine learning system to classify tax update documents by topic and location.
The Problem: More documents than you could shake a stick at
Our client, GIC, is a sovereign wealth fund established by the Singapore government, with a portfolio spanning dozens of countries and territories. They were one of the project sponsors in AISG’s 100 Experiments (100E) programme. We were engaged by their tax advisory team whose job (among others) was to keep track of changes in the tax code and to study its implications to their portfolio of investments. This was a time consuming task as they had to sift through mountains of documents in their inbox to identify information specific to changes in their specialised tax categories before they could even get started on the analysis.
Our solution was to build a document labeling algorithm that would parse a document and identify the specific tax topics it related to, as well as the geographical region it affected. This is known in machine learning as a multi-label classification problem as each document could cover multiple topics.
Data drives the solution to every AI problem
Before we can train our machine learning model, we first need data. Due to content sensitivity, our client could not simply give us a dump of their emails. Instead we worked with them to construct an initial labeled dataset of 200 publicly-available documents. This dataset is too small to perform any significant training, but simply serves as a ‘gold standard’ to help validate our model accuracy, and for us to do some exploratory data analysis.
Our initial exploration of the data identified 10 main categories, and over 100 sub-categories that fell under these 10 categories. In the course of our discovery process, we found that the 10 main categories were easily distinguishable and in fact, the updates the analysts received were already sorted according to these main categories. The real value thus lay in identifying which of the sub-categories each document belong to, and this requires a deeper understanding of each document.
To deal with the lack of training data, we went online to find documents for each sub-category. We downloaded all the documents that had words matching that sub-category in the title. This gave us a ‘weakly’ labeled training dataset of several thousand documents.
One problem remained: our training data had one label for each document, but we were supposed to build a model that could predict multiple labels per document.
Training the machine learning model
I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.
– Bruce Lee
The problem of training on a multi-class dataset and creating a multi-label output was treated in the model design: Instead of training one model that could predict 100 labels, I trained 100 models that each predicted one label. Each model will train only on one topic and become the expert at identifying whether that particular topic exists. When a new document is encountered, every model makes a prediction, and the results are collated to retrieve multiple labels for the document. This system had the added benefit of future-proofing the model. If I wanted to add additional categories after the model has been trained, I do not need to retrain the entire model. Instead, I only have to train a model on that additional category, and then add it to the original model.
The model itself was actually an ensemble of several models. Some models focused on the number of times each word occurs (known as term frequency-inverse document frequency, or TF-IDF), while others tried to gain a semantic understanding of the document with a language model pre-trained on the English language.
Additional features were generated in the following manner:
- The spaCy matcher was used to highlight certain important keywords identified by subject matter experts
- An algorithm called k-means clustering was used to automatically group documents into unsupervised categories
Will this model be useful?
We decided to evaluate the performance of the model by a classic human vs. computer comparison, with the target that a satisfactory model should perform no worse than a human analyst. We collected a batch of unseen documents and had the model predict its labels. At the same time, 3 analysts also worked to label these documents.
With these data points, we could look at both how the model performed as compared to humans, and how each analyst compared to each other. The intra-analyst comparison was necessary because at this high level of topic-granularity, many topics overlap and there is some degree of subjectivity to the labels.
Our model achieved an F1 score of 0.65, which was essentially the same as the intra-analyst F1 score of 0.64. We have successfully built a model that performed no worse than an analyst at identifying document topics!
Deploying the model
The model is deployed in a Docker container to make it work across different environments, and consists of 3 key services
- An automated training script that can be used to add additional categories or incorporate user feedback
- A prediction API that is triggered when a new document is added
- A feedback module, which collects feedback from analysts, accounts for conflicting feedback, and updates the database
Deploying a model in this manner is what makes the system ‘intelligent’ as it is able to get better over time by learning from user feedback and improving itself. This ensures that the model will remain relevant as new tax topics are introduced, or the discussion surrounding a particular topic changes over time (concept drift).
The model has proven effective during user acceptance tests and has since been deployed into production for their local and overseas offices. This is just one of the ways that artificial intelligence can be used to augment workflows and improve efficiency. Artificial intelligence is now at a point where many of the techniques originally found in research papers are ready to be adopted by industry.
This article is a reproduction of the original by Nick Lik Khian Yeo. He is a graduate of the AI Apprenticeship Programme (AIAP®) and is plying his craft as a data scientist at GIC at the time of publication.