Hacker News | welanes's comments

This guide may be useful for anyone interested in building their own spec-compliant MCP server: https://simplescraper.io/blog/how-to-mcp.

Though it's a few weeks old, so already in need of an update!


> I'm always a bit wary of approaching these ideas though because I feel like nobody would ever pay for small web stuff?

Build them and find out! There's almost always room for something better, and if what you build is that, then some people will pay.

Unless the tools have large running costs, consider offering them for free (at least at first).

A few things are likely to happen:

Nobody cares enough to sign up – fine, zero users, zero costs.

People use it but don’t stick around – ask them what would make the tool more valuable.

People use it a lot – great, now you can charge for the value.

Two examples come to mind: spaced repetition app Mochi and markdown editor Typora. Mochi is $5 a month, I think. Typora was free and is now a $15 one-time purchase.

Both compete with free alternatives and still have many paying users.


1. Clicking the box programmatically – possible but inconsistent

2. Outsourcing the task to one of the many CAPTCHA-solving services (2Captcha etc) – better

3. Using a pool of reliable IP addresses so you don't encounter checkboxes or turnstiles – best

I run a web scraping startup (https://simplescraper.io) and this is usually the approach[0]. It has become more difficult, and I think a lot of the AI crawlers are peeing in the pool with aggressive scraping, which is making the web a little bit worse for everyone.

[0] Worth mentioning that once you're "in" past the captcha, a smart scraper will try to use fetch to access more pages on the same domain so you only need to solve a fraction of possible captchas.
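To make the idea concrete, here's a minimal sketch of the fetch-from-within-the-page trick (all names are illustrative; assumes it runs in the page context of a headless browser, e.g. via Puppeteer's page.evaluate, after the captcha has been solved once):

```javascript
// Only same-origin URLs benefit from the already-solved captcha session.
function sameOrigin(url, origin) {
  return new URL(url, origin).origin === origin;
}

// Fetch sibling pages through the page's own session so the clearance
// cookies set by the solved captcha are attached automatically,
// without triggering a full page navigation per URL.
async function fetchSiblingPages(urls, origin) {
  const pages = {};
  for (const url of urls.filter((u) => sameOrigin(u, origin))) {
    const res = await fetch(url, { credentials: "include" });
    pages[url] = await res.text();
  }
  return pages;
}
```

The same-origin check matters because the clearance only applies to the domain where the captcha was solved; cross-domain URLs still need their own solve.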


That's awesome. Thanks for sharing.

First time hearing of the fetch() approach! If I understand correctly, regular browser automation typically makes a separate GET request (a full page navigation) for each page, whereas the fetch() strategy makes a GET for the first page only, then, after satisfying Cloudflare, uses fetch(<url>) from within that page to retrieve the rest of the pages you're after.

This approach is less noisy and has less impact on the server, so it's less likely to be noticed by bot detection.

This is fascinating stuff. (I'd previously used very little javascript in scrapes, preferring ruby, R, or python, but this may tilt my tooling preferences toward using more js.)


Almost. It's not like fetch(..) leads to some esoteric kind of HTTP request method. I'm guessing the parent comment says what it says because fetch will utilize the cookies and other crumbs set by the successful completion of the captcha. If you can take all those crumbs and include them in your next GET request, you don't need to resort to fetch.
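That cookie-reuse point can be sketched in a few lines (names are illustrative; assumes `cookies` is whatever your automation tool exported after the captcha was solved, e.g. the result of Puppeteer's page.cookies()):

```javascript
// Serialize exported cookies into a Cookie request header.
function buildCookieHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join("; ");
}

// The same "crumbs" can then ride along on an ordinary GET request,
// outside the browser entirely. The User-Agent should match the
// browser that solved the captcha, or the session may be rejected.
async function plainGet(url, cookies, userAgent) {
  return fetch(url, {
    headers: {
      Cookie: buildCookieHeader(cookies),
      "User-Agent": userAgent,
    },
  });
}
```

Caveat: some anti-bot systems also check TLS fingerprints and other signals beyond cookies, so this won't always be enough on its own.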


Scammers will use fingerprints from their victim's browser/IP/geolocation to try and impersonate them. You can basically buy not only stolen credentials but also the environment in which to run them "safely" from such vendors.


first time hearing about fetch too. but i don't see the advantage. is fetch reusing the connection and a manual page load not?


Making my data extraction SaaS (https://simplescraper.io) more LLM friendly.

Markdown extraction, improved Google search, workflows - search for these terms, visit the first N links, summarize, etc. Big demand for (or rather, expectation of) this lately.


Things are much easier for one-person startups these days—it's a gift.

I remember building a todo app as my first SaaS project, and choosing something called Stormpath for authentication. It subsequently shut down, forcing me to do a last-minute migration from a hostel in Japan using Nitrous Cloud IDE (which also shut down). Just pain upon pain.[1]

Now, you can just pick a full-stack cloud service and run with it. My latest SaaS[2] is built on Google Cloud, so authentication, Cloud Functions, Docker containers, logging, etc. come straight out of the box.

Not to mention, modern JavaScript and CSS are finally good. With so many fewer headaches, it’s a great time to build.

[1] Admittedly, I was new to software dev and made some rather poor tech choices

[2] https://simplescraper.io


How do you find the costs involved with a cloud service like that? I remember getting bitten with a bill from Azure once because a service went wild with logging.


Also a solo founder here. For myself, you just have to watch costs like a hawk when you make any big changes. AWS and friends have calculators, but there is only so much you can estimate, and it's hard to know the usage patterns till something is live.

I’m lucky that my work is event-based, as it is used for in-person live events, so my usage comes in waves (pre-sales a month or two out, steady traffic the week leading up to the event, and high traffic the day before or week/day of the event). This means that at worst I only have to ride out the current “wave”, and then I have some amount of time before the next event (which gives me an opportunity to fix runaway costs).

One of my big runaway costs came when I tried to use something like Datadog/NewRelic/Baseline. You work yourself up to the cost of the service and make your peace with it (as best you can, since it’s also hard to estimate), then get hit with AWS fees (that none of the providers call out) for things like CloudWatch when they pull logs/metrics out. I’ve had the CloudWatch bill be 4-6x as expensive as the service itself, and it was a complete surprise (or was the first time). I caught it after two days and had run up a few hundred dollars in that time; I could have handled it, but thankfully AWS refunded it.

The second runaway cost was Google Maps; once you fall off the free tier, the costs accumulate quickly. In just a few days I had a couple hundred dollars in fees from that. I scrambled to switch to Protomaps and took my costs down to a couple of dollars a month.


Hah - yes DataDog is what we moved to after the grand Azure fail. That was better, in that costs were more controllable, but you have to learn their billing world, which is in itself very painful.

All these services are predicated on the idea that you never want your site to go down, and will pay anything to keep it running. So if you start logging gigabytes a second, in the old world your VPS would've started failing and your website's buttons would start showing errors to the user. Now, the user doesn't see an error, but you get charged hundreds or thousands of dollars a month to keep it up, even if your website generates you $50/mo.


Sounds like you've learned a lot of what not to do, which will be invaluable when you try again.

It feels like the only areas where a solo builder can gain traction are in B2B products, or B2C if you're riding the latest trend (crypto yesterday, AI today).

Everything else requires an audience or a ton of capital if you hope to capture even a sliver of attention. Patience and persistence may get you there too, but it's going to take more than a year.


> perhaps you can simply ask the API to create Python or JS code that is deterministic, instead.

Had a conversation last week with a customer who did exactly that - spent 15 minutes in ChatGPT generating working Scrapy code. Neat to see people solve their own problem so easily, but it doesn't yet erode our value.

I run https://simplescraper.io and a lot of value is integrations, scale, proxies, scheduling, UI, not-having-to-maintain-code etc.

More important than that though is time-saved. For many people, 15 minutes wrangling with ChatGPT will always remain less preferable than paying a few dollars and having everything Just Work.

AI is still a little too unreliable at extracting structured data from HTML, but excellent at auxiliary tasks like identifying randomized CSS selectors.

This will change, of course, so the opportunity right now is one of arbitrage: use AI to improve your offering before it has a chance to subsume it.


I'm making https://simplescraper.io - a no-code web scraping tool.

Saved up, quit my job and went all in...on a todo app. Needless to say that idea didn't go far, but it taught me how to code.

When I was close to broke I pivoted to this product and finally gained traction and now it's doing well enough to be my main source of income.

I'm kind of following the "1000 true fans" ethos that pops up here occasionally. There's a dedicated group of customers who benefit from the ease and speed of the tool and they're like my product team.

I check in with them often, make sure they're happy and build features for them. Turns out, what they value other people value too, and so the product slowly but surely grows.

Learning to code was definitely one of the best decisions I've made. Felt like gaining wings.


What is your stack and what libraries are you using?


Cool. Will try it out. Can I ask you a question?


Don't ask to ask, just ask! https://dontasktoask.com/


Nice job. No RSS though as far as I can tell.

For something slightly less minimal but very nice to use, I stumbled across this the other day: https://blogstatic.io/

Feels like - after being crowded out by FB and Twitter (and the death of Google Reader) - blogs are making a comeback.


Thanks! You're right, there's currently no RSS. Might change that in the future, but I haven't given it much thought just yet.

I agree that blogs are making a comeback. It's about time. We're in need of some sort of detox after a decade of social media-fuelled lives. Hopefully it leads to a more interesting, less overwhelming internet.


Blogstatic looks nice. I thought it was ghost.org. Their guide is straight up how to set up your domain with Cloudflare.


Congrats on the persistence! The lesson: things take time.

Took me about a decade. I also tried a note-taking app (every new developer's rite of passage it seems), an event guide, digital magazine, and an assortment of other projects whose domain names no longer resolve.

Suck as it may, each failure is a lesson in what not to do. If you find yourself back at square one and still have the drive to keep trying then it's only a matter of time before you create enough value to open wallets.

