Analyzing bank statements: Where the machine fails but the human eye succeeds
Here’s how a small error by HDFC Bank put our entire product development on hold.
Bank statement PDFs, or any PDFs for that matter, are not meant to be machine readable. They are designed to be read by humans. To a machine, a PDF looks like a soup of letters with no demarcation of where one word ends and the next begins. So when you try to convert a bank statement PDF into machine-readable CSV/JSON, you are bound to face challenges.
After some trial and error, we settled on this process:
- Find identifiers/headers for the transaction table
- Identify which chunk of the text belongs to which column/header
- Put these chunks into JSON/CSV format
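The three steps above can be sketched roughly as follows. This is only an illustration, not CollectR's actual code: it assumes a PDF text extractor has already produced rows of (text, x position) pairs, and the names `HEADER_WORDS`, `find_header`, `assign_columns` and `parse`, as well as the sample rows, are hypothetical.

```python
import json

# Assumption: these are the header words we expect in a transaction table.
HEADER_WORDS = {"date", "narration", "debit", "credit", "balance"}

def find_header(rows):
    """Step 1: return the index of the first row with at least 3 known header words."""
    for i, row in enumerate(rows):
        texts = {text.lower() for text, _ in row}
        if len(texts & HEADER_WORDS) >= 3:
            return i
    return None

def assign_columns(header_row, data_row):
    """Step 2: attach each chunk to the header whose x position is closest."""
    record = {text: "" for text, _ in header_row}
    for text, x in data_row:
        nearest = min(header_row, key=lambda h: abs(h[1] - x))[0]
        record[nearest] = (record[nearest] + " " + text).strip()
    return record

def parse(rows):
    """Step 3: turn every row below the header into a dict ready for JSON/CSV."""
    idx = find_header(rows)
    if idx is None:
        raise ValueError("no transaction table header found")
    header = rows[idx]
    return [assign_columns(header, row) for row in rows[idx + 1:]]

# Made-up sample data: (text, x position) pairs per row.
rows = [
    [("Statement", 10), ("of", 60), ("account", 80)],
    [("Date", 10), ("Narration", 80), ("Debit", 300), ("Credit", 380), ("Balance", 460)],
    [("01/04/21", 10), ("UPI", 80), ("payment", 110), ("500.00", 305), ("9500.00", 462)],
]
print(json.dumps(parse(rows), indent=2))
```

Note that `find_header` raises an error as soon as too few known header words appear in any row, which is exactly the kind of failure a single misspelled heading can trigger in a stricter parser.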
Easy enough, or so we thought.
Most bank statements are, to a large extent, similar in structure. They typically have 3 columns for transactions: one for debit, one for credit and the last one for balance. (Kotak Mahindra has 2. They are quite the odd one out, as you will find in my future posts.)
Based on this principle, we started finding the headers. “Debit”, “DR” or “Dr” would mean debit. “Balance” or “Closing Balance” would mean balance. Very quickly we were able to do a basic parse on statements from most private banks this way. Axis, Kotak, ICICI: they all followed this rule. Except for HDFC. For some reason, unknown to us at that point, our parser refused to work on HDFC bank statements. And that’s where we hit a brick wall. Even though HDFC has the same column structure and no apparent differences, the parser kept telling us that it was unable to find anything called “balance”.
For 3 days this was the only thing we focussed on. When all tricks failed, we eventually started pulling out our training data set and reading these statements manually. After screening many such files, not sure what to look for, it hit us. It was right in front of our eyes, yet something we had been ignoring.
In what was in all likelihood an unintentional mistake, HDFC Bank, in the monthly statements it sends to its entire customer base, had written the “Balance” column heading as “Balnace”. Yes. “Balnace”. A spelling mistake by HDFC, of all things, caused us to pull our hair out for 3 days.
So now CollectR’s code actually lists “Balnace” as a synonym for “Balance”.
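A synonym list like this can be sketched as a small lookup table. This is a hypothetical illustration rather than CollectR's real code: the names `HEADER_SYNONYMS` and `canonical_header` are made up, and the credit variants are an assumption, since only the debit and balance variants are mentioned above.

```python
# Map each canonical column name to the header spellings seen in statements.
HEADER_SYNONYMS = {
    "debit": {"debit", "dr"},
    "credit": {"credit", "cr"},  # assumption: credit variants mirror the debit ones
    "balance": {"balance", "closing balance", "balnace"},  # "balnace": the HDFC typo
}

def canonical_header(label):
    """Return the canonical column name for a header label, or None if unknown."""
    # Normalize case, collapse repeated whitespace, drop a trailing period.
    key = " ".join(label.lower().split()).rstrip(".")
    for canonical, variants in HEADER_SYNONYMS.items():
        if key in variants:
            return canonical
    return None

print(canonical_header("Balnace"))
print(canonical_header("Closing Balance"))
print(canonical_header("Dr"))
```

Keeping the misspelling in the table as plain data means the next odd header a bank ships can be handled by adding one more entry, without touching the parsing logic.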
We have spent months refining CollectR and we don’t want other fintechs to reinvent the wheel. CollectR, which used to be an internal tool at GalaxyCard, is now available to everyone. It is by far the easiest way to collect bank statements from your users: just 1 click, with no document upload and no passwords needed. A 1-click, magical experience.
Want to know more? Join us for a short webinar on Bank statement analysis | Strategies for loan underwriting. Go beyond analyzing basic transactions. Identify lenders, auto-debits and other key ratios. Register here
Product @ IndusInd | Ex-Creditvidya (CRED entity) | IMT Ghaziabad | BITS
Hemanth Vaddi