Using ML, OCR, and RPA to Automate the Processing of Financial Reports

Finance
Google Cloud
Python
TensorFlow
AI / ML

Description

Brief results of the collaboration:

  • A provider of investment management services turned to Altoros to automate manual aggregation of financial reports.
  • The company cut the time spent analyzing each document from 12 minutes to 10 seconds.
  • With 99% precision, the delivered solution allowed the customer to optimize its analyst team by refocusing it on more important business tasks.

The customer

The company is involved in investment management, helping organizations to allocate their financial assets to gain value. Headquartered in Boston, the customer has affiliates in London, Singapore, Tokyo, and Sydney. Operating globally, the company serves customers across 25 countries in Europe, Asia, the Middle East, North America, and Australia.

The need

To find optimal investment opportunities, the company was manually analyzing publicly available financial reports. Turning to Altoros, the customer wanted to automate the recognition and extraction of explicit tables of contents (ToCs) from reports in PDF format.

The challenge

Under the project, the team at Altoros had to address the following issues:

  • The entries in tables of contents varied greatly from company to company, so engineers at Altoros needed to unify them for better recognition of the contents.
  • In many cases, it was impossible to extract text from a PDF file directly, so developers at Altoros needed to rely on optical character recognition (OCR), treating a PDF page as an image and parsing text from it.
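The fallback described in the second point can be sketched as follows. This is a minimal illustration, not the delivered implementation: the function names, the character-count threshold, and the alphanumeric-ratio threshold are all assumptions. The two extractors are passed in as callables, so in practice they might wrap pdfminer and Tesseract from the technology stack.

```python
from typing import Callable, Tuple


def looks_extractable(text: str, min_chars: int = 50,
                      min_alnum_ratio: float = 0.6) -> bool:
    """Heuristic check (hypothetical thresholds) that directly
    extracted page text is usable rather than empty or garbled."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio


def page_text(extract_direct: Callable[[], str],
              extract_ocr: Callable[[], str]) -> Tuple[str, str]:
    """Try direct extraction first; fall back to OCR when the
    result does not look like real text. Returns (text, source)."""
    text = extract_direct()
    if looks_extractable(text):
        return text, "direct"
    return extract_ocr(), "ocr"
```

Running OCR only when direct extraction fails keeps the slower image-based path for the pages that actually need it.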

The solution

At the preprocessing stage, our engineers parsed PDF files into individual characters to recover human-readable text and to extract geometrical and formatting features of text lines, such as fonts and coordinates. Using a classifier trained with scikit-learn, TensorFlow, and XGBoost, experts at Altoros then identified the pages containing tables of contents.

Our team also built a second classifier to extract ToCs from file metadata, which was available in 10% of the documents.

To detect a table of contents in a file, developers at Altoros trained a classifier on bounding boxes from a subset of documents, which label the areas containing tables of contents. While parsing, our team employed different features based on the styles of ToCs. For each text line, all potentially relevant features were calculated and stored.
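To illustrate what per-line features might look like, here is a small sketch. The specific features (a trailing page number, dot leaders, line length, left coordinate, font size) and all names are assumptions chosen for illustration; the case study does not list the features actually used. Vectors like these could then be fed to a classifier built with scikit-learn or XGBoost.

```python
import re
from typing import Dict, List, Tuple

# A ToC entry often ends with a page number, e.g. "Governance .... 30".
TRAILING_PAGE = re.compile(r"(\d{1,4})\s*$")
# Dot leaders: "....." or ". . . . ."
DOT_LEADER = re.compile(r"\.{3,}|(?:\. ){3,}")


def line_features(text: str, x0: float, font_size: float) -> Dict[str, float]:
    """Compute illustrative features for one text line.
    x0 is the left coordinate of the line on the page."""
    stripped = text.strip()
    return {
        "ends_with_number": 1.0 if TRAILING_PAGE.search(stripped) else 0.0,
        "has_dot_leader": 1.0 if DOT_LEADER.search(stripped) else 0.0,
        "n_chars": float(len(stripped)),
        "x0": x0,
        "font_size": font_size,
    }


def page_features(lines: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """Aggregate line features into page-level fractions."""
    feats = [line_features(text, x0, size) for text, x0, size in lines]
    n = len(feats) or 1  # avoid division by zero on empty pages
    return {
        "frac_ends_with_number": sum(f["ends_with_number"] for f in feats) / n,
        "frac_dot_leader": sum(f["has_dot_leader"] for f in feats) / n,
    }
```

A page where most lines end in a number and contain dot leaders is a likely ToC candidate, although the final decision would be left to the trained classifier rather than a fixed rule.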

Then, engineers at Altoros identified the exact page to which each ToC entry referred. The extracted table of contents contained a sequence of printed page numbers, and the algorithms created by our experts detected the difference between the page number printed in the ToC and the actual page number in the PDF file.
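One simple way to estimate that difference is to let every ToC entry vote for an offset. The sketch below assumes that a section title appears verbatim on the page where the section starts. The voting approach and all names are illustrative assumptions, not a description of the algorithms actually delivered.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def detect_page_offset(toc: Sequence[Tuple[str, int]],
                       page_texts: Sequence[str]) -> Optional[int]:
    """Estimate (PDF page number - printed page number).

    toc: (title, printed_page) pairs from the extracted ToC.
    page_texts: text of each PDF page; index 0 is PDF page 1.
    Returns the offset with the most votes, or None if no title
    was found on any page.
    """
    votes: Counter = Counter()
    for title, printed_page in toc:
        needle = title.casefold()
        for index, text in enumerate(page_texts):
            if needle in text.casefold():
                pdf_page = index + 1
                votes[pdf_page - printed_page] += 1
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset


def resolve_pdf_page(printed_page: int, offset: int) -> int:
    """Map a printed page number to the actual PDF page number."""
    return printed_page + offset
```

The ToC page itself contains every title, but each of those matches votes for a different offset, so the consistent offset from the real section pages still wins.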

Finally, developers at Altoros implemented a searchable database to easily access and search through the information contained in PDF reports.

The outcome

Partnering with Altoros, the customer automated manual processing of financial reports, cutting the time spent analyzing each document from 12 minutes to 10 seconds. With 99% precision, the delivered solution allowed the customer to optimize its analyst team by refocusing it on more important business tasks.

Technology stack

Programming language

Python

Technologies

TensorFlow, scikit-learn, XGBoost, Google BigQuery, Google Dataproc, Tesseract, pdfminer

Database

Google Cloud Storage
