Does it also include logic to download JS-driven sites properly or is this out o...

simonw · 2024-11-09T11:54:17 1731153257

It doesn't. For that you would need to run a full headless browser first, extract the HTML (document.body.innerHTML after the page has finished loading can work), and then process the result.
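
A minimal sketch of that first step, assuming Playwright for Python is installed (`pip install playwright`, then `playwright install chromium`). The function name `render_body_html` is hypothetical, and waiting for "networkidle" is only one approximation of "the page has finished loading"; some sites will need an explicit wait for a selector instead.

```python
def render_body_html(url: str, wait_until: str = "networkidle") -> str:
    """Load `url` in headless Chromium and return document.body.innerHTML.

    Hypothetical helper for illustration; not part of any tool in this thread.
    """
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        try:
            page = browser.new_page()
            # Assumption: "networkidle" is a good enough signal that the
            # page's JavaScript has finished rendering its content.
            page.goto(url, wait_until=wait_until)
            return page.evaluate("document.body.innerHTML")
        finally:
            browser.close()


# Example (requires network access and an installed browser):
# html = render_body_html("https://example.com")
```

The returned string can then be handed to whatever HTML-to-Markdown converter you prefer.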

If you're already running a headless browser you may as well run the conversion in JavaScript though - I use this recipe pretty often with my shot-scraper tool: https://shot-scraper.datasette.io/en/stable/javascript.html#... - adding https://github.com/mixmark-io/turndown to the mix will get you Markdown conversion as well.
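
A rough sketch of the "convert inside the browser" idea, using Playwright for Python rather than shot-scraper itself (that swap is my assumption, as are the names `page_to_markdown` and `TURNDOWN_URL`). It injects the Turndown browser build from the unpkg URL given in Turndown's README and calls `new TurndownService().turndown(...)` on the live DOM. Pages with a strict Content-Security-Policy may block the injected script.

```python
# Browser build of Turndown, as referenced in the Turndown README.
TURNDOWN_URL = "https://unpkg.com/turndown/dist/turndown.js"


def page_to_markdown(url: str) -> str:
    """Render `url` in headless Chromium and convert its body to Markdown.

    Hypothetical helper: the conversion runs in the page's own JavaScript
    context, so no HTML has to be shipped back to Python first.
    """
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            # Load Turndown into the page; this defines a global TurndownService.
            page.add_script_tag(url=TURNDOWN_URL)
            # turndown() accepts either an HTML string or a DOM node.
            return page.evaluate(
                "new TurndownService().turndown(document.body)"
            )
        finally:
            browser.close()


# Example (requires network access and an installed browser):
# print(page_to_markdown("https://example.com"))
```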

jot · 2024-11-09T14:38:19 1731163099

We do that with Urlbox’s markdown feature: https://urlbox.com/extracting-text

JohannesKauf · 2024-11-09T11:54:22 1731153262

That is unfortunately out of scope. I like the philosophy of doing one thing really well.

But nowadays, with Playwright and Puppeteer, there are great choices for browser automation.

bni · 2024-11-09T15:23:09 1731165789

I used https://github.com/mozilla/readability for this.