Hacker News Clone

Show HN: Parsing horse racing charts with Apache PDFBox

by robinhowlett on 7/7/2017, 2:18:09 PM with 31 comments

by joosters on 7/7/2017, 9:33:43 PM
Very interesting! I had never heard of Apache PDFBox before; I must give it a try. I have a similar program that parses horse racing PDFs from sites such as www.racehorserunner.com, which use a much simpler format but cause endless problems for me when the PDFs have layout problems. For example, one column can run too long and overlap with another, e.g. the last race on http://www.racehorserunner.com/Archives/ELP/ELP170702.pdf
All the PDF parsers I have tried cope very badly with these kinds of situations, and often try to be 'too clever' in that they value the final layout of the text over and above the individual strings.
Have you experienced similar problems with PDFBox, or does it handle formatting and layout fairly reliably?
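To make the "individual strings over final layout" idea concrete, here is a minimal sketch in plain Java. It assumes an extractor has already produced text fragments with page coordinates, and that the left edge of each column is known in advance. The names `ColumnGrouper`, `Fragment`, `groupByColumn` and the tolerance parameter are all hypothetical; none of this is PDFBox's API or the chart-parser's actual method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: assigns positioned text fragments to columns by the
// x-coordinate where each fragment STARTS, ignoring how far it runs.
final class ColumnGrouper {

    // A piece of text plus the position of its first glyph (assumed input shape).
    static final class Fragment {
        final String text;
        final double x; // left edge of the fragment
        final double y; // vertical position; assumed to grow down the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // columnStarts must be sorted ascending. A fragment belongs to the last
    // column whose left edge is at or before its own start (within tolerance).
    static List<List<String>> groupByColumn(List<Fragment> fragments,
                                            double[] columnStarts,
                                            double tolerance) {
        List<List<Fragment>> columns = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            columns.add(new ArrayList<>());
        }
        for (Fragment f : fragments) {
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i] - tolerance) {
                    col = i;
                }
            }
            columns.get(col).add(f);
        }
        // Order each column top to bottom, then keep only the strings.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> column : columns) {
            column.sort(Comparator.comparingDouble((Fragment f) -> f.y));
            List<String> texts = new ArrayList<>();
            for (Fragment f : column) {
                texts.add(f.text);
            }
            result.add(texts);
        }
        return result;
    }
}
```

With column edges at x = 50 and x = 300, a long string that starts at x = 50 stays in the first column even if it visually runs into the second, which is the overlapping-column case described above.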
by maxxxxx on 7/7/2017, 9:58:14 PM
I still don't understand how PDF could become one of the standards for publishing documents. Well-structured content gets converted into PDF, which loses most of that structure. And then a lot of work is done to guess that structure from the PDF and convert it back to a better file format. It just shows that successful solutions don't have to be technically good.
by beager on 7/7/2017, 11:11:36 PM
Very neat, and gets me curious about PDFBox, but every time I see something that converts a consistent-layout PDF back to structured data, I just bemoan the fact that this would all be trivial with an API for these kinds of things.
by 0x445442 on 7/8/2017, 2:42:06 PM
Great job!
I was just looking at collecting race information and historical results data a month or two ago and was struck by the lack of available structured data. Heck, I couldn't easily find any for-pay options either.
by Cyph0n on 7/7/2017, 9:29:47 PM
Firstly, what an interesting library. Secondly, this is among the best TLDR readmes I've ever seen! I lack exposure to this area, so I'm actually quite impressed with the complexity of it.
Keep up the great work.
by richiverse on 7/8/2017, 2:00:35 AM
As a Python programmer, I found R's pdftools to be indispensable for messy text-based PDFs. I couldn't find a Python lib that worked as consistently across a variety of formats.
by hbcondo714 on 7/8/2017, 12:20:27 AM
Impressive! Seems like you can't just use PDFBox out of the box (no pun intended) and need to write some custom code specific to the PDF itself, per the chart-parser commits[1].
[1] https://github.com/robinhowlett/chart-parser/tree/master/src...
by JabavuAdams on 7/7/2017, 9:20:17 PM
Crazy! I was just looking into this topic a few weeks ago, for a friend. Thanks!
by vbuwivbiu on 7/7/2017, 5:58:38 PM
What I would love is an app that would reformat portrait PDFs as 2-column landscape for reading on my screen.
by ocrimgproc on 7/8/2017, 7:37:07 PM
Can it be used for invoices?