Hacker News Clone

SmolDocling: An ultra-compact VLM for end-to-end multi-modal document conversion

by prats226 on 3/21/2025, 1:06:42 AM with 12 comments

by Oras on 3/21/2025, 2:43:58 PM
After many posts on my feed, I decided to give it a spin.
The good:
- Open source.
- Can run locally (Apple Silicon) at a fair speed.
- Image detection is good.
The bad:
- Not detecting tables.
- Text in a perfectly clean PDF (resume) is not detected.
I know it's in preview, small, and open source, which is great, but it's far from being usable.
by daemonologist on 3/21/2025, 5:10:40 AM
Well, it's certainly small. Absolutely bombs my KTANE test though - poor character recognition, poor handling of even mildly complex tables, and prone to getting stuck in repetition loops. (The task was "convert to docling", in the official HF Space.)
That said, I'm definitely glad to see work in this area, particularly with open weights.
by jeyzolo on 3/28/2025, 1:29:31 PM
It can also be used here: https://www.smoldocling.net. Works well!
by bugglebeetle on 3/21/2025, 3:06:29 AM
What’s the best library for fine-tuning VLMs at the moment, and does it support this architecture or that of the IBM Granite vision models? Document understanding tasks seem in special need of fine-tuning.
by th0ma5 on 3/21/2025, 4:36:08 AM
Does it seem comparable to Tesseract? I feel like the accuracy results are still not significantly improved overall.