• by tmpfs on 5/22/2025, 10:40:32 PM

    Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being JavaScript didn't suit my project.

    In the end I found that the Python trafilatura library extracted the best-quality content, with accurate metadata.

    You might want to compare your implementation to trafilatura to see if there is room for improvement.

  • by creakingstairs on 5/22/2025, 11:14:14 PM

    I was just looking at Obsidian Web Clipper's source code because I've been quite impressed by its Markdown conversion results, and came across Defuddle in there. I'll be using it for my bespoke read-it-later/knowledge-base app, so thank you in advance :D

  • by Tsarp on 5/23/2025, 3:18:06 AM

    Been using the Obsidian clipper since it came out and this is really neat. The per-website, profile-based extraction is awesome.

    Even if you are not an Obsidian user, the Markdown extraction quality is the most reliable I've seen.

  • by jeanlucas on 5/23/2025, 1:54:56 AM

    Obsidian Web Clipper is a great tool to turn ChatGPT conversations into Markdown, or to just print them (believe me, it is a use case).

  • by binarymax on 5/23/2025, 1:44:39 PM

    Really nice work. I appreciate the example with JSDOM as that’s exactly how I use readability, and this looks like a nice drop-in replacement.

    Question: How did you validate this? You say it works better than readability but I don’t see any tests or datasets in the repo to evaluate accuracy or coverage. Would it be possible to share that as well?
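
    One way such a comparison could be set up, as a rough sketch only (nothing here comes from the Defuddle or Readability repos): score each extractor's output against hand-cleaned reference text using token-overlap F1. The function name `token_f1` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Token-overlap F1 between extracted text and a hand-checked reference.

    Tokens are lowercase whitespace-separated words (an assumption; a real
    benchmark might normalize punctuation or use a proper tokenizer).
    """
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())  # how much of the output is wanted
    recall = overlap / sum(ref.values())     # how much of the article survived
    return 2 * precision * recall / (precision + recall)
```

    Averaging this score over a set of saved pages with known-good article text would give one number per extractor; low precision points to leftover boilerplate, low recall to dropped content.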

  • by shrinks99 on 5/23/2025, 1:18:38 AM

    I've been super happy with Obsidian Web Clipper! It's worked really well for me, with the one exception of importing publish dates (which is more than forgivable!)

  • by acrophobic on 5/23/2025, 2:23:03 AM

    Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was just 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.

  • by rcarmo on 5/22/2025, 9:50:21 PM

    The Python analogues seem to be well maintained. I did my own implementation of the Readability algorithm years ago and dropped it in favor of them, and I have a few scrapers going strong with regular updates.

  • by novoreorx on 5/24/2025, 10:04:47 AM

    I built a similar project called Substance [^0]. Unlike most readability tools that try to solve the problem once and for all, it takes a different approach. It provides a framework to define how each website should be handled, ensuring better results for each website covered.

    [^0]: https://substance.reorx.com/
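
    The per-website idea can be sketched roughly like this. This is an illustration of the general pattern only, not Substance's actual API: the `rule` decorator, the `extract` function, and the `example.com` rule are all hypothetical names.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

Extractor = Callable[[str], str]

# Hypothetical registry mapping a hostname to its hand-written extractor.
_RULES: Dict[str, Extractor] = {}


def rule(hostname: str) -> Callable[[Extractor], Extractor]:
    """Register an extractor for one specific website."""
    def register(fn: Extractor) -> Extractor:
        _RULES[hostname] = fn
        return fn
    return register


def generic_extract(html: str) -> str:
    """Fallback for sites without a rule: return the page unchanged."""
    return html


@rule("example.com")
def extract_example(html: str) -> str:
    # Hypothetical site rule: keep only what sits between <article> tags.
    start = html.find("<article>")
    end = html.find("</article>")
    if start == -1 or end == -1:
        return generic_extract(html)
    return html[start + len("<article>"):end].strip()


def extract(url: str, html: str) -> str:
    """Dispatch to the site-specific rule, or fall back to the generic one."""
    host = urlparse(url).hostname or ""
    return _RULES.get(host, generic_extract)(html)
```

    The trade-off is the usual one: covered sites get predictable output, while every uncovered site falls through to whatever the generic fallback does.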

  • by Andr2Andr on 5/23/2025, 8:19:21 AM

    Serious question: who would be using this tool, and why? What is the use case? In the other comments I have only seen exporting ChatGPT conversations to Markdown.

  • by jonplackett on 5/23/2025, 7:02:21 AM

    Does anyone know why readers don’t work for some websites where it looks like they should, i.e. a normal article with lots of text?

    You just get a completely white page (on the iPhone reader). Usually it’s a news website.

    Is this the website intentionally obscuring the content to ensure they can serve their ads? If so how do they go about it?

  • by severusdd on 5/23/2025, 10:18:50 AM

    This is very cool! Given how messy and busy many websites have become, we really need a robust markdown converter that lets readers focus on reading the content. Nice to see something stepping up where Readability left off.

    Thank you for picking up this work :-)

  • by busymom0 on 5/22/2025, 9:49:43 PM

    In the playground, after I enter a URL, I can't figure out how to submit it to fetch the page. I tried pressing the return key on the iOS keyboard but it didn't do anything. Am I missing something?

  • by ricardonunez on 5/23/2025, 11:19:25 AM

    I’ll give it a try. I’m not happy with my current setup for Markdown-to-HTML conversion in the WYSIWYG editor I’m using; this may provide better results if I go with my own toolbar and editor.

  • by ulrischa on 5/23/2025, 9:01:17 AM

    I have built something similar: https://devkram.de/markydown but with PHP. Easy to self-host.

  • by inhumantsar on 5/22/2025, 11:40:08 PM

    can confirm that readability seems to be on life support. I used it in slurp, an obsidian plugin which serves the same basic purpose as web clipper, and always had a hard time getting PRs reviewed and merged.

    i started working on my own alternative but life (and web clipper) derailed the work.

    it's funny. somehow slurp keeps gaining new users even though web clipper exists. so i might have to refactor it to use your library sometime soon even though I don't use slurp myself anymore.

  • by billconan on 5/22/2025, 10:35:52 PM

    Are you using AI models behind the scenes? I saw Gemini and others in the code. I am asking mainly to understand the cost of using yours vs. Readability. Thanks!

  • by ahsd1 on 5/23/2025, 4:47:19 PM

    Cool. I'm looking for something similar, but for stripping signatures and boilerplate disclaimers from HTML email. Could this work for that?

  • by timdeve on 5/23/2025, 8:50:52 AM

    Looks good, I'm gonna try to swap readability in my RSS reader with this.

    And with Pocket going away I might have to add a save-it-later feature to it...

  • by 90s_dev on 5/23/2025, 12:04:37 AM

    Neat. With ~3 more lines of code, you could get a URL and render it in simpler HTML and be a full-fledged replacement.

  • by infogulch on 5/23/2025, 5:39:08 PM

    Since it's written in JavaScript, is there any chance it could be packaged as a bookmarklet?

  • by khaki54 on 5/23/2025, 1:18:13 AM

    seems pretty much perfect including obsidian clipper. Thanks!

  • by revskill on 5/23/2025, 8:07:47 AM

    Interesting that Markdown does not support form elements.

  • by miketromba on 5/23/2025, 6:25:33 PM

    Excellent work. A modern alternative to readability was much needed. This is especially useful for building clean web context for LLMs. Thanks for open-sourcing this!

  • by ioma8 on 5/23/2025, 10:01:33 AM

    Tried it on some webpages; it doesn't work well.

  • by input_sh on 5/22/2025, 11:17:24 PM

    A bit off-topic, but I'm very excited to see the launch of Bases! I've obsessively followed the roadmap for like a year awaiting this day and have been frequently disappointed to still see it stuck somewhere under "planned".

    Not that I didn't already implement a read-it-later solution with Obsidian+Dataview, but this definitely makes things simpler!

  • by fkfyshroglk on 5/22/2025, 10:26:20 PM

    For those not in the know: [Readability](https://github.com/mozilla/readability)