I had the same experience: my typing speed with two hands is 90-120 WPM, but with one hand I can still get 50-70. The hard part is punctuation, but with AI these days you could try just prompting and let the AI deal with the syntax for you.
I have some thoughts on this (in the context of modern SaaS companies).
The most expensive parts of fixing a bug are discovering/diagnosing/triaging it, cleaning up corrupted records, and customer communication. If you discover a bug in development, or better yet while you are writing the function or during code review, you get to bypass triaging, customer calls, escalations, RCAs, etc. At a SaaS company with enterprise customers, each of those steps involves multiple meetings with Support, the Account Manager, a Senior Engineer, the Product Manager, the Engineering Manager, the Department Manager, sometimes Legal or a Security Engineer, and then finally the actual coder. So of course, if you can resolve an issue during development (at a modern SaaS company), it can be 10-100x less expensive simply because of how much bureaucracy is involved in running a large-scale enterprise SaaS company.
It also raises an interesting side effect of companies adopting non-deterministic coding (AI code): bugs that would have been caught during design/development by a human engineer while writing the code can now leak all the way into prod.
I've been doing this as well. It also works well when you hook it up to EDGAR or feed in investor relations documents or earnings transcripts. You can extract a lot of data at scale for regressions using small models with few-shot prompts running locally.
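As a rough illustration of the kind of few-shot extraction I mean, here is a minimal sketch that assumes a local ollama server on the default port; the prompt examples, JSON fields, and model choice are made up and just stand in for whatever you actually want to pull out:

    # Sketch: few-shot extraction from an earnings-transcript snippet
    # using a small model served by a local ollama instance.
    # The few-shot examples and JSON fields below are illustrative only.
    import json
    import requests

    FEW_SHOT = """Extract revenue guidance as JSON.
    Text: "We expect full-year revenue of $1.2B, up 8% YoY."
    JSON: {"revenue_guidance_usd": 1200000000, "yoy_growth_pct": 8}
    Text: "Guidance for FY25 is $850M in revenue."
    JSON: {"revenue_guidance_usd": 850000000, "yoy_growth_pct": null}
    """

    def extract(snippet: str) -> dict:
        prompt = FEW_SHOT + f'Text: "{snippet}"\nJSON:'
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.2:3b", "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        # Assumes the model's completion is a single JSON object.
        return json.loads(resp.json()["response"].strip())

    print(extract("We are guiding to $2.4B in revenue, roughly 12% growth."))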
I think from a business perspective, the hiring pool for Rails is small and younger engineers don't have much interest in learning Rails (check recent university hackathons). It takes a decently long time (2-3+ months) to upskill a non-Ruby engineer to be productive in Rails (although this is dampened by AI tools these days), and many senior non-Ruby engineers aren't interested in Rails jobs, whereas you can get a Node or Java engineer to come to your Go shop and vice versa. Rails can also be hard to debug if you work in a multi-language shop: you can't map your understanding of Java or TypeScript onto a Rails codebase and expect to find your way around.
All that being said, I still use (and like) Rails; I'm currently comparing Phoenix/Elixir to Rails 8 in a side project. But I use TypeScript with Node and Bun in my day job.
Usually you only need some subset of the data per page load. If you invest some time looking at dev tools, you can probably find the API call you need and save yourself a few MB.
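For example, once you've spotted the underlying JSON endpoint in the network tab, you can often hit it directly instead of pulling and parsing the whole page. The URL, query parameters, and response fields below are purely hypothetical:

    # Sketch: call the JSON endpoint found via the browser's network tab
    # directly, rather than downloading the full HTML page.
    # The endpoint, params, and response shape are hypothetical.
    import requests

    resp = requests.get(
        "https://example.com/api/v2/products",
        params={"page": 1, "per_page": 50},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        print(item.get("name"), item.get("price"))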
>Providers like Oxylabs can be quite restrictive, preventing access to many of the common sites that scrapers choose to target.
Most of them seem pretty reasonable?
"Entertainment & streaming" - who's trying to scrape netflix's library?
"Banking and other financial institutions" / "Government websites" / "Mailing" - seems far more likely it'll be used for credential stuffing than for "scraping".
"Ticketing" - seems far more likely that it'll get used by scalpers than for scraping
The main targets of scraping - e-commerce sites (for price comparisons) and social media networks (for user generated content) are fine to scrape. Is there some use case I'm missing here? Is there a huge contingent of people wanting to scrape ticketmaster or bank of america?
I used the term "scrapers" pretty loosely, but yes, in many cases they are more bad actors than actual scrapers. However, since they say the list may include other sites, I suspect Oxylabs adds sites to the list at the site owners' request (Amazon, Target, etc. are likely to be on those lists).
Every day it is looking better and better to self-host. I really don't feel like ChatGPT at $200/mo is 10x better than $20/mo, and after trying QwQ 32B myself I just can't justify paying for these AI subscriptions. Gemini Flash is also dirt cheap as an API, and Groq is dirt cheap and fast for QwQ and other OSS models.
Yes, the quality of the models is increasing at a slower rate and the race will transition to performance and efficiency.
This is good for self-hosters and devs who will be able to run near-SOTA models like QwQ locally. I'm near the point where I'm going to cancel my ChatGPT Plus and Claude subscriptions.
If you're not already trying to self-host, build your own local agents, and build your own MCPs/tools, I would encourage you to try it (simple stack: ollama, pydanticAI, fastmcp, QwQ 32B, Llama 3.2 3B). If you don't have a fancy GPU or an M1+, try out QwQ on Groq or Flash 2.0 Lite via the Gemini API; it's super cheap and fast, and they are basically equivalent to (if not better than) the ChatGPT you were paying for 16 months ago.
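To give a sense of the MCP/tools piece, a local tool server with fastmcp can be just a few lines. This is only a sketch; the tool itself is a toy example, and you'd swap in whatever your agent actually needs:

    # Sketch: a tiny local MCP tool server with fastmcp.
    # The word_count tool is a made-up example.
    from fastmcp import FastMCP

    mcp = FastMCP("local-tools")

    @mcp.tool()
    def word_count(text: str) -> int:
        """Count the words in a piece of text."""
        return len(text.split())

    if __name__ == "__main__":
        # Runs over stdio by default, so an MCP-capable client can attach to it.
        mcp.run()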
I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.
I think the 32B models are actually good enough that I might stop paying for ChatGPT Plus and Claude.
I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is the sweet spot for interactive usage; if you go above that, you basically can't read as fast as it generates.
I also really like the QwQ reasoning model. I haven't gotten around to trying out locally hosted models for agents and RAG; coding agents especially are what I'm interested in. I feel like 20 tok/second is fine if it's just running in the background.
Anyway, I would love to know others' experiences; that was mine this weekend. The way it's going, I really don't see a point in paying. I think on-device is the near future, and they should just charge a licensing fee, like a DB provider does, for enterprise support and updates.
If you were paying $20/mo for ChatGPT a year ago, the 32B models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.
Are there any good sources I can read to estimate the hardware specs required for 7B, 13B, 32B, etc. model sizes if I want to run them locally? I'm a grad student on a budget, but I want to host one locally and am trying to build a PC that could run one of these models.
"B" just means "billion". A 7B model has 7 billion parameters. Most models are trained in fp16, so each parameter takes two bytes at full precision. Therefore, 7B = 14GB of memory. You can easily quantize models to 8 bits per parameter with very little quality loss, so then 7B = 7GB of memory. With more quality loss (making the model dumber), you can quantize to 4 bits per parameter, so 7B = 3.5GB of memory. There are ways to quantize at other levels too, anywhere from under 2 bits per parameter up to 6 bits per parameter are common.
There is additional memory used for context / KV cache. So, if you use a large context window for a model, you will need to factor in several additional gigabytes for that, but it is much harder to provide a rule of thumb for that overhead. Most of the time, the overhead is significantly less than the size of the model, so not 2x or anything. (The size of the context window is related to the amount of text/images that you can have in a conversation before the LLM begins forgetting the earlier parts of the conversation.)
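To make the arithmetic above concrete, here's a rough back-of-the-envelope calculator. The fixed overhead number is just a placeholder assumption; real KV-cache usage depends on the model and context length:

    # Sketch: rough VRAM estimate for running a model locally.
    # weights_gb = params (billions) * bits per param / 8
    # overhead_gb is a crude stand-in for KV cache / runtime buffers.
    def vram_estimate_gb(params_billion: float, bits_per_param: float,
                         overhead_gb: float = 2.0) -> float:
        weights_gb = params_billion * bits_per_param / 8
        return weights_gb + overhead_gb

    print(vram_estimate_gb(7, 16))   # ~16 GB: 7B at fp16
    print(vram_estimate_gb(7, 4))    # ~5.5 GB: 7B at 4-bit
    print(vram_estimate_gb(32, 5))   # ~22 GB: 32B at 5-bit, fits a 24GB card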
The most important thing for local LLM performance is typically memory bandwidth. This is why GPUs are so much faster for LLM inference than CPUs, since GPU VRAM is many times the speed of CPU RAM. Apple Silicon offers rather decent memory bandwidth, which makes the performance fit somewhere between a typical Intel/AMD CPU and a typical GPU. Apple Silicon is definitely not as fast as a discrete GPU with the same amount of VRAM.
That's about all you need to know to get started. There are obviously nuances and exceptions that apply in certain situations.
A 32B model at 5 bits per parameter will comfortably fit onto a 24GB GPU and provide decent speed, as long as the context window isn't set to a huge value.
Assuming the same model sizes in gigabytes, which one to choose: a higher-B lower-bit or a lower-B higher-bit? Is there a silver bullet? Like “yeah always take 4-bit 13B over 8-bit 7B”.
Or are same-sized models basically equal in this regard?
I would say 9 times out of 10, you will get better results from a Q4 model that’s a size class larger than a smaller model at Q8. But it’s best not to go below Q4.
But it's still surprising they haven't. People would be motivated as hell if they launched GPUs with twice the amount of VRAM. It's not as simple as just soldering more on, but still.
They sort of have. I'm using a 7900 XTX, which has 24GB of VRAM. The next competitor up would be a 4090, which would cost more than double today; granted, it would be much faster.
Technically there is also the 3090, which is more comparable price-wise. I don't know about its performance, though.
VRAM is supply-limited enough that going bigger isn't as easy as it sounds. AMD can probably sell as much as they can get their hands on, so they may as well sell more GPUs, too.
Go to r/LocalLLaMA; they have the most info. There are also lots of good YouTube channels that have done benchmarks on Mac minis for this (another good-value option with a student discount).
Since you're a student, most of the providers/clouds offer student credits, and you can also get loads of credits from hackathons.
Generally, unquantized: double the number and that's the amount of VRAM in GB you need, plus some extra, because most models use fp16 weights, so it's 2 bytes per parameter -> 32B parameters = 64GB.
Typical quantization to 4-bit will cut a 32B model down to 16GB of weights plus some runtime data, which makes it possibly usable (if slow) on a 16GB GPU. You can sometimes viably use even smaller quantizations, which will reduce memory use further.
You always want a bit of headroom for context. It's a problem I keep bumping into with 32B models on a 24GB card: the decent quants fit, but the context you have available on the card isn't quite as much as I'd like.
qwq:32b + qwen2.5-coder:32b is a nice combination for aider, running locally on a 4090. It has to swap models between the architect and edit steps, so it's not especially fast, but it's capable enough to be useful. qwen2.5-coder does screw up the edit format sometimes, though, which is a pain.