• by nnurmanov on 6/13/2025, 7:24:38 AM

    For example, markitdown from Microsoft can't recognize Cyrillic text. When I looked into it, I found that it uses pdfminer.six under the hood, and there is an unresolved issue there with language support.

    Docling is OK with tables, but fails on Cyrillic text;

    marker-pdf is also OK with tables, but it fails on Cyrillic text too.

    What other PDF parser libraries exist? I would prefer an on-premise solution, but if I can't find one that is reliable and accurate, I might consider cloud-based solutions as well.
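
    To compare candidates on the same files, a rough check like the one below might help. It is only a sketch: it takes text that some parser has already extracted and reports what share of its letters are Cyrillic. The function name `cyrillic_ratio` is my own, it uses only the Python standard library, and what counts as "too low" is an assumption you would have to tune per document.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Cyrillic (0.0 to 1.0)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # Empty or symbol-only output: treat as "no Cyrillic recovered".
        return 0.0
    cyrillic = sum(
        1 for ch in letters
        # Unicode character names for Cyrillic letters start with "CYRILLIC".
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    )
    return cyrillic / len(letters)
```

    Running every parser's output for a known-Russian PDF through this and flagging the ones that score near zero would at least separate "extracts garbage" from "extracts Cyrillic", though it says nothing about whether the recovered text is actually correct.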