Hacker News

Does it also include logic to download JS-driven sites properly or is this out of scope?


It doesn't. For that you would need to run a full headless browser first, extract the HTML (document.body.innerHTML after the page has finished loading can work) and process the result.
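
A rough sketch of that flow in Python, assuming Playwright as the headless browser (the thread doesn't prescribe one). The names `render_html` and `process` are made up for illustration, and `process` is only a stand-in that collects visible text; in practice you would pass the rendered HTML to the actual converter instead.

```python
from html.parser import HTMLParser


def render_html(url: str) -> str:
    """Load a page in headless Chromium and return the rendered body HTML."""
    # Imported lazily so the rest of this module works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        # "networkidle" is only a heuristic for "finished loading"; some
        # sites may need an explicit wait for a specific element instead.
        page.goto(url, wait_until="networkidle")
        html = page.evaluate("document.body.innerHTML")
        browser.close()
    return html


class _TextCollector(HTMLParser):
    """Stand-in for a real converter: keeps visible text, skips scripts/styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def process(html: str) -> str:
    """Hypothetical processing step; swap in the real HTML converter here."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)
```

Usage would be something like `process(render_html("https://example.com"))`, which needs `pip install playwright` and `playwright install chromium` first.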

If you're already running a headless browser you may as well run the conversion in JavaScript though - I use this recipe pretty often with my shot-scraper tool: https://shot-scraper.datasette.io/en/stable/javascript.html#... - adding https://github.com/mixmark-io/turndown to the mix will get you Markdown conversion as well.
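
For illustration, here is a sketch of doing the conversion inside the page itself. This is not the shot-scraper recipe linked above (that link is truncated here); it drives Playwright directly from Python. The function name `page_to_markdown` is hypothetical, and the CDN URL for the Turndown browser build is an assumption to verify against the Turndown README before relying on it.

```python
# Assumed location of Turndown's browser build; check the project's README.
TURNDOWN_URL = "https://unpkg.com/turndown/dist/turndown.js"

# Runs inside the page: Turndown accepts a DOM node as well as an HTML string.
CONVERT_JS = "() => new TurndownService().turndown(document.body)"


def page_to_markdown(url: str, script_url: str = TURNDOWN_URL) -> str:
    """Render a page headlessly and convert its body to Markdown in-browser."""
    # Imported lazily so importing this module does not require Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Inject Turndown into the already-rendered page. A strict
        # Content-Security-Policy on the target site can block this.
        page.add_script_tag(url=script_url)
        markdown = page.evaluate(CONVERT_JS)
        browser.close()
    return markdown
```

The upside of this design is that the HTML never has to leave the browser: the conversion sees the same DOM the scripts produced.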


We do that with Urlbox’s markdown feature: https://urlbox.com/extracting-text


That is unfortunately out of scope. I like the philosophy of doing one thing really well.

But nowadays, with Playwright and Puppeteer, there are great choices for browser automation.




