• by jonahbenton on 9/14/2023, 6:44:11 PM

    Field report: the problem is subtle. I wrote code to parse my own statement PDFs rather than use CSV exports, because the statement is a regulated document, which a CSV is not, and it carries start and end balances for validation, which CSVs also lack.

    I wound up with a pipeline: pdftotext -> configurable regexes that capture the transactions within their respective sections (banks list credits and debits separately, without indicating the sign in the amount field) -> a BNF parser that turns transaction lines into data -> a check that start balance + transactions = end balance.
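
    A minimal sketch of that kind of pipeline, for illustration only. It assumes the text has already been produced by pdftotext, and it swaps a single regex in for the BNF parser to stay short. The section headers, line layout, and every name here (SECTION_SIGNS, TXN_LINE, parse_statement, reconciles, SAMPLE) are hypothetical, not any real bank's format or the code described above.

```python
import re
from decimal import Decimal

# Hypothetical section headers mapped to the sign their amounts carry.
# Real statements differ per bank, so this would be configuration.
SECTION_SIGNS = {
    "DEPOSITS AND CREDITS": Decimal(1),
    "WITHDRAWALS AND DEBITS": Decimal(-1),
}

# Assumed line layouts: "MM/DD description amount" and "<Beginning|Ending> Balance amount".
TXN_LINE = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s+([\d,]+\.\d{2})$")
BALANCE_LINE = re.compile(r"^(Beginning|Ending) Balance\s+([\d,]+\.\d{2})$")


def to_decimal(text):
    """Convert '1,234.56' to an exact Decimal (no float rounding)."""
    return Decimal(text.replace(",", ""))


def parse_statement(text):
    """Return (balances, transactions) from statement text.

    The sign of each amount comes from the section it appears in,
    since the amount field itself is unsigned.
    """
    balances = {}
    transactions = []
    sign = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.upper() in SECTION_SIGNS:
            sign = SECTION_SIGNS[line.upper()]
            continue
        match = BALANCE_LINE.match(line)
        if match:
            balances[match.group(1).lower()] = to_decimal(match.group(2))
            continue
        match = TXN_LINE.match(line)
        if match and sign is not None:
            date, description, amount = match.groups()
            transactions.append((date, description, sign * to_decimal(amount)))
    return balances, transactions


def reconciles(balances, transactions):
    """Check that start balance + signed transactions == end balance."""
    total = sum((amount for _, _, amount in transactions), Decimal(0))
    return balances["beginning"] + total == balances["ending"]


# Made-up sample text standing in for pdftotext output.
SAMPLE = """\
Beginning Balance 1,000.00
Deposits and Credits
09/01 PAYROLL ACME 2,500.00
Withdrawals and Debits
09/03 RENT 1,200.00
09/05 GROCERY STORE 85.40
Ending Balance 2,214.60
"""

balances, transactions = parse_statement(SAMPLE)
print(reconciles(balances, transactions))
```

    Decimal is used instead of float so the balance check is an exact comparison; a missed or misread transaction line then shows up as a failed reconciliation rather than passing silently.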

    PITB but works well.

    Over the winter I'll be standing up a local model to see whether a sophisticated prompt can reliably accomplish the same thing.

    I'm not going to base any workflow involving my transaction data on hosted models.

  • by andrewio on 9/14/2023, 7:52:34 PM

    To extract tables from PDFs, you can use the following tools:

    1. Tabula (https://tabula.technology): a free and open-source tool.

    2. Parsio (https://parsio.io): uses pre-trained AI models for data extraction from PDFs, emails, and other formats.

    3. Airparser (https://airparser.com): uses a GPT-based approach, similar to ChatGPT, for data extraction from PDFs, emails, and other formats.