Readability.js (github.com/mozilla)
164 points by stefankuehnel on Feb 25, 2024 | 23 comments



I like using Readability.js as a demo for my shot-scraper CLI utility: https://shot-scraper.datasette.io/en/stable/javascript.html#...

Running this in a terminal (after installing shot-scraper):

    shot-scraper javascript \
      https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/ "
    async () => {
      const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
      return (new readability.Readability(document)).parse();
    }"

Outputs this:

    {
        "title": "Scraping web pages from the command line with shot-scraper",
        "byline": null,
        "dir": null,
        "lang": "en-gb",
        "content": "... long string of HTML ...",
        "length": 7104,
        "excerpt": "I\u2019ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript \u2026",
        "siteName": null,
        "publishedTime": null
    }


Wow, that's really cool! Thanks for sharing. I knew about "shot-scraper" before, but I didn't know you could do something so cool with it.


This is really cool, thanks for sharing!


I have used and love readability.js. I used it in an application that lets you run various NLP analyses over a web page (surprisals, reading time, word counts, etc.). For that, I needed only the main page content. readability.js retrieves main page content well, consistently.

The Alan Turing Institute maintains a Python wrapper around readability.js, too: https://github.com/alan-turing-institute/ReadabiliPy.


I had not heard the term "surprisal" before. It's a delightful word. https://en.wikipedia.org/wiki/Information_content


I love this feature so much that I built a website to let me share clean-text URLs with others.

The first one uses a JS library, but it's kinda limited. The second one uses Go, compiled to wasm. Both are deployed to Cloudflare Workers:

- https://github.com/tuananh/reader

- https://github.com/tuananh/reader2


> minScore (number, default 20): the minimum cumulated 'score' used to determine if the document is readerable;

Ahaaa, so that's why sometimes I get the reader icon and sometimes not (especially on mobile).
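
To make the threshold idea concrete, here is a rough sketch of how a cumulative score like this could behave. It is not the library's code: `looksReaderable` is a hypothetical name, the input is just a list of paragraph text lengths, and the `minContentLength` default of 140 plus the square-root scoring are my recollection of the library's heuristic rather than something stated in this thread. The real check also filters nodes by visibility and class/id, which is left out here.

```javascript
// Simplified, dependency-free sketch of a cumulative "readerable" score.
// Assumption: each sufficiently long paragraph adds sqrt(length - minContentLength)
// to a running score, and the page counts as readerable once the score
// exceeds minScore. Names and the array-of-lengths input are illustrative only.
function looksReaderable(paragraphLengths, options = {}) {
  const { minContentLength = 140, minScore = 20 } = options;
  let score = 0;
  for (const length of paragraphLengths) {
    // Paragraphs shorter than the minimum contribute nothing.
    if (length < minContentLength) continue;
    score += Math.sqrt(length - minContentLength);
    // Stop as soon as the accumulated score clears the threshold.
    if (score > minScore) return true;
  }
  return false;
}
```

Under these assumptions, a page made of many short paragraphs never accumulates any score, while one long paragraph or a handful of medium ones is enough. That would fit the icon showing up on some pages and not others, for example when a mobile layout splits or trims the text.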


This is pretty good for stuffing web scrapings into an LLM context.

Can also use `JSDOM ---> element.textContent` depending on your needs. Useful for snagging all the text content or a specific element's.


I wrote a browser plugin that does this. Could never figure out how to deal with paged articles though.


Readability is awesome! I used it to build Smort.io [1] to easily read articles & ArXiv papers.

[1] https://smort.io


This is a good chance to mention Postlight Reader [1], a browser extension (Firefox and Chrome versions are both available) that uses Readability to give you a better reading experience. I recommend it.

[1] https://reader.postlight.com/


Excellent library. This is what I'm using to build a full-text index of every website I visit [1]. It doesn't work perfectly, but well enough.

[1]: https://github.com/iansinnott/full-text-tabs-forever


Clipper.js is built on top of Mozilla's Readability library and uses Turndown to convert HTML to Markdown: https://github.com/philschmid/clipper.js


Is there a way to use it as a copy-paste in the console, or link to paste in the URL bar, in order to convert a poorly formatted webpage to a nicer one?


Is there a way of using this in conjunction with, say, curl? I'd love to be able to grab clean web pages for offline use and printing.



If you run it inside a container, it's fairly simple: https://github.com/phpdocker-io/readability-js-server


This is the type of work ML should excel at.


Has anyone been able to get this to work on Cloudflare Workers?


I did this a few years back: https://github.com/tuananh/reader

Also, the 2nd version uses Go and compiles to wasm: https://github.com/tuananh/reader2


I seem to remember it can be run with JSDOM rather than a real DOM. I've never tried with CF Workers specifically, though.
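
For anyone who wants to try the JSDOM route in plain Node, a minimal sketch might look like the following. It assumes the `jsdom` and `@mozilla/readability` npm packages are installed. `extractArticle`, `loadDeps`, and the `deps` parameter are names made up for this example, and whether any of this runs on Cloudflare Workers is not something it answers.

```javascript
// Hypothetical helper names; only the JSDOM and Readability constructors and
// Readability's parse() method come from the two libraries.
function loadDeps() {
  // Loaded lazily so the packages are only needed when the defaults are used.
  // Assumes `npm install jsdom @mozilla/readability` has been run.
  return {
    JSDOM: require("jsdom").JSDOM,
    Readability: require("@mozilla/readability").Readability,
  };
}

function extractArticle(html, url, deps = loadDeps()) {
  const { JSDOM, Readability } = deps;
  // Passing the page URL lets relative links and images resolve against it.
  const dom = new JSDOM(html, { url });
  // parse() returns an object (title, content, textContent, ...) or null
  // when no article could be extracted.
  return new Readability(dom.window.document).parse();
}
```

Given HTML fetched some other way (curl, for instance), calling `extractArticle(html, pageUrl)` should return the same kind of object shown in the shot-scraper output earlier in the thread. The `deps` argument is only there so the function can be exercised with stand-ins.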


See also the C port here: https://github.com/eafer/rdrview/

It works well with text-mode browsers like w3m.


huh that is probably good for ereaders



