I'm working on pure.md[1], which lets your scripts, APIs, apps, agents, etc reliably access web content in markdown format. Simply prefix any URL with `pure.md/` and you get the unblocked markdown content of that webpage. It avoids bot detection and renders JavaScript-heavy websites, and can convert HTML, PDFs, images, and more into pure markdown.
pure.md acts as a global caching layer between LLMs and web content. I like to think of it like a CDN for LLMs, similar to how Cloudinary is a CDN for images.
By default, href values of <a> tags are removed, because they add significant token length without adding more context. Coming soon, you can specify a request header to set whether or not you want links removed from the response. Those underscores you mentioned are from the italics.
Thanks for sharing. I was planning on building something like this in April after hitting too many issues with Jina and Tavily but it looks like you've already done the hard work!
With AWS Athena, you can query the contents of someone else’s public S3 bucket. You pay per read, but if you craft your query the right way then it’s very inexpensive. Each query I run only scans about 1MB of data.
The service seems designed to bypass anti-scraping measures. If site owners don't want their content scraped by AI this is subverting their will.
It also obfuscates responsibility between the AI vendor and the scraping service. One can imagine unethical AI providers using a series of ephemeral "gateways" to access content while avoiding any legal or reputational harm.
pure.md acts as a global caching layer between LLMs and web content. I like to think of it like a CDN for LLMs, similar to how Cloudinary is a CDN for images.
[1] https://pure.md