{
  "title": "Scraping web pages from the command line with shot-scraper",
  "byline": null,
  "dir": null,
  "lang": "en-gb",
  "content": "... long string of HTML ...",
  "length": 7104,
  "excerpt": "I\u2019ve added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript \u2026",
  "siteName": null,
  "publishedTime": null
}
I have used readability.js and love it. I used it in an application that lets you run various NLP analyses over a web page (surprisals, reading time, word counts, etc.). For that, I needed only the main page content, and readability.js extracts it reliably.
This seems like a good chance to mention Postlight Reader [1], a browser extension (Firefox and Chrome versions are available) that uses Readability to give you a better reading experience. I recommend it.
Is there a way to use it as a snippet to paste into the console, or as a link to paste into the URL bar, to convert a poorly formatted web page into a nicer one?
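One possible approach, as a rough sketch rather than a tested recipe: keep the in-page code as a string, paste that string into the devtools console, or wrap it in a `javascript:` URL to save as a bookmark. The CDN URL, the `readerSnippet` and `makeBookmarklet` names, and the way the page body is replaced are all assumptions made for illustration. The only things taken from the thread are that Readability's `parse()` result has `title` and `content` fields.

```javascript
// Assumption: this CDN serves @mozilla/readability as an ES module
// with a named `Readability` export.
const READABILITY_URL = "https://cdn.skypack.dev/@mozilla/readability";

// Code intended to run inside the page. It is kept as a string so it can be
// pasted into the console as-is or wrapped into a bookmarklet.
const readerSnippet = `(async () => {
  const { Readability } = await import(${JSON.stringify(READABILITY_URL)});
  // Readability modifies the DOM it is given, so parse a clone.
  const article = new Readability(document.cloneNode(true)).parse();
  if (!article) return;
  document.title = article.title;
  document.body.innerHTML = "<h1></h1>" + article.content;
  document.body.querySelector("h1").textContent = article.title;
})();`;

// Hypothetical helper: turn a snippet into a javascript: URL.
function makeBookmarklet(source) {
  return "javascript:" + encodeURIComponent(source);
}

const bookmarklet = makeBookmarklet(readerSnippet);
console.log(bookmarklet);
```

Some browsers strip or block the `javascript:` prefix when it is pasted into the URL bar, and a site's Content Security Policy may refuse the dynamic `import()`, so saving the output as a bookmark, or pasting `readerSnippet` straight into the console, may work more reliably.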
Running this in a terminal (after installing shot-scraper):
Outputs this: