
> Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at

I like this comment because it gets to the heart of a misunderstanding. I'd further correct it to say "for something it wasn't trying to be good at". DeWitt and Stonebraker just didn't understand why anyone would want this, and I can see why: change was coming faster than it ever had, from many angles. Let's travel back in time:

The decade after mapreduce appeared - when I came of age as a programmer - was a fascinating time of change:

The backdrop is the post-dotcom bubble, when the hype cycle had come to a close and the web market had consolidated around a smaller set of winners, now more proven and ready to go all in on a new business model that elevated doing business on the web above all else, in a way that would truly threaten brick and mortar.

Alongside that we have CPU manufacturers struggling to keep escalating clock speeds, jamming more transistors into a single die to keep up with Moore's law and consumer demand, which led to the first commodity dual- and multi-core CPUs.

But I remember that most non-scientific software just couldn't make effective use of multiple CPUs or cores yet. So we were ripe for a programming model that engineers who had never heard of Lamport could actually understand and work with: threads, locks, and socket programming in C and C++ were a rough proposition, and MPI was certainly a thing, but the scientific computing people working on supercomputers, grids, and Beowulf clusters were not the same people as the dotcom engineers using commodity hardware.

Companies pushing these boundaries wanted to do things that traditional DBMSes could not offer at a certain scale, at least not cheaply enough. The RDBMS vendors and priesthood countered that it's hard to offer that while also offering ACID and everything else a database provides, which was certainly not wrong: it's hard to support an OLAP use case with the OLTP-style, System-R-ish design that dominated the market in those days. This was some of the most complicated and sophisticated software ever made, imbued with magic-like qualities from decades of academic research hardened by years of industrial use.

Then there were data-warehouse-style solutions: "appliances" locked into a specific and expensive combination of hardware and software, optimized to work well together and also to extract millions and billions of dollars from the Fortune 500s that could afford them.

So the ethos at the booming post-dotcoms definitely was "do we really need all this crap that's getting in our way?", and we would soon find out. Couching it in formalism and calling it "mapreduce" made it sound fancier than what it really was: some glue that made it easy for engineers to declaratively define how to split work into chunks, shuffle them around, and assemble them again across many computers, without having to worry about the pedestrian details of the glue in between. A corporate drone didn't have to understand /how/ it worked, just how to fill in the blanks for each step properly: a much more viable proposition than thousands of engineers writing software together that involves finicky locks and semaphores.
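To make the "fill in the blanks" point concrete, here is roughly the shape of that contract as a toy, single-process Python sketch (the function names and glue here are illustrative, not Google's actual API):

    from collections import defaultdict

    # The engineer fills in map_fn and reduce_fn; the framework owns the
    # glue: splitting the input, shuffling intermediate pairs by key
    # (across machines, in the real thing), and reassembling the results.
    def map_fn(document):
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        yield word, sum(counts)

    def run_mapreduce(documents):
        groups = defaultdict(list)            # the "shuffle" phase
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        return dict(kv for key, values in groups.items()
                    for kv in reduce_fn(key, values))

    print(run_mapreduce(["the cat sat", "the dog sat"]))
    # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}

The real system distributed exactly this shape of computation across thousands of cheap machines, with the shuffle happening over the network.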

The DBMS crowd thumbed their noses at this because it was truly SO primitive and wasteful compared to the sophisticated mechanisms built to preserve efficiency that dated back to the 70s: indexes, access patterns, query optimizers, optimized storage layouts. What they didn't get was that every million dollars you didn't waste on what was essentially the space shuttle of computer software - fabulously expensive and complicated - could now buy a /lot/ more cheapo computing power duct-taped together. The question was how to leverage that. Plus, with things changing at the pace they did back then, last year's CPU could be obsolete by next year, so how well could vendors building custom hardware even keep up, after you paid them their hefty fees? The value proposition was "it's so basic that it will run on anything, and it's future-proof"; the democratization aspect could be hard to understand for an observer at that point, because the tidal wave hadn't hit yet.

What came was the start of a transition from datacenters to rack mounts in colos, from dedicated hosts to virtualization, and very soon after to the first programmable commodity clouds: why settle for an administered unixlike timesharing environment when you can manage everything yourself and don't have to ask for permission? Why deal with buying and maintaining hardware? This lowered the barrier for smaller companies and startups that previously couldn't afford access to such things, or to the markets that required them, and it unleashed what can only be described as a hunger for anything that could leverage that model.

So it's not so much that worse was better, but that worse was briefly more appropriate for the times. "Do we really need all this crap that's getting in our way?" really took hold for a moment, and programmers were willing to dump anything and everything that was previously sacred (schemas and complex queries, to start) if they thought it'd buy them scalability.

Soon after, people started figuring out how to keep all the benefits they'd gained (democratized massively parallel commodity computing) while bringing back some of the good stuff from the past. Only 2 years later, Google itself published the BigTable paper, describing a more sophisticated storage mechanism that optimized accesses better; admittedly it was tailored for a different use case, but it could work in conjunction with mapreduce. Academia and the VLDB / CIDR crowd were more interested now.

Some years after that came the papers for F1 and Spanner, which added back a SQL-like query engine, transactions, secondary indexes, etc. on top of a similar distributed model, in the context of WAN-distributed datacenters. Everyone preached the end of nosql and document databases, whitepapers were written about "newsql", and frustrated veterans complained about yet another fad cycle where what was old was new again.

Of course that's not what happened: the story here was how a software paradigm failed to adapt to the changing hardware climate and business needs, so capitalism ripped its guts apart and slowly reassembled them in a more context-applicable way. Instead of storage engines we got so many things it's hard to keep up, but leveldb comes to mind as an ancestor. For locks we got chubby and zookeeper. For log structures we got kafka and its ilk. For query optimizer engines we got presto. For in-memory storage we got arrow. We got a cambrian explosion of all kinds of variations and combinations of these, but eventually the market started to settle again, and now we're in a new generation of "no, really, our product can do it all". It's the lifecycle of unbundling and rebundling. It will happen again. Very curious what will come next.


AE Studio | LA Office | Multiple Roles | Full-Time | Remote | ae.studio

We are a development, data science, and design studio that works closely with founders and executives to create custom software and machine learning solutions. We are hiring top-notch professionals passionate about software development, data science, design, or product management to work with our amazing team creating software that increases human agency!

Full descriptions: https://ae.studio/join-us


Upnext | 100% Remote | Full Time / Contract | Software Eng / Design / ML

At Upnext, we are passionate about solving information overload. Every day we get bombarded with content from social networks, news sites, blogs, messages, etc. It's hard to keep up, and even harder to find the content that really matters to you. That's why we created Upnext, our flagship product, which lets users easily organize reading, audio, and video content in one neat space. We're not stopping there: our latest app helps you stay up to date on the topics and news you care about by aggregating updates into a single place. Using our own AI models, we're building in deep personalization from the beginning, so Upnext users will always have the most important updates on the topics they care about. We've got open roles for:

- Software developers: our tech stack is TypeScript / Node / React / Python
- Designers: we're creating a seamless, beautiful experience across desktop, native, and web
- ML research / ML engineers: help us design, build, and deploy our first generation of recommendation and understanding systems
- Marketing / growth: help us get the word out!

If you'd like to chat, email me at joe [at] upnext [dot] in!


Ingram Technologies | Remote

Ingram (https://ingram.tech/) is an AI lab based in Europe. We have a hybrid team with cells in Paris & Brussels, and individuals scattered across 3 continents.

We pride ourselves on getting large companies to transition to AI in ethical, privacy-respecting ways.

We're growing rapidly and looking to expand our team in early 2024.

[email protected] - Your emails are read by humans.

PS - We're also starting an AI startup incubator in Brussels. If you're BE-based and want to join the incubator's team, mention it in your email.


RINSE | REMOTE or San Francisco, Los Angeles, Chicago, Boston, New York, New Jersey, Seattle, Austin, Dallas, or Washington DC | Software Engineers | Full-Time | https://www.rinse.com

Rinse provides dry cleaning and laundry delivery services to customers in nine metropolitan areas in the US. We have sophisticated logistics optimization software, a polished consumer product, and firm business fundamentals. We're now almost a decade old: a stable yet consistently growing and innovating company.

Our engineering team is distributed across the United States and internationally, and has been entirely remote for years now, but a desk can be provided in the above cities if you'd prefer.

We're open to both newly graduated and more senior engineers, provided they meet our bar.

Search term bingo: Logistics, Django, Python, Optimization, React, React Native, Postgres

https://www.rinse.com/careers/software-engineer/

Interested? Email us at [email protected], or my first name at rinse.com


Ah yes—as the saying goes: “keep your friends at the Bayes-optimal distance corresponding to your level of confidence in their out-of-distribution behavior, and your enemies closer”

I think it's due to how Discord evolved as a platform.

Discord started as "your private place for your friends to talk" during a time when there were a lot of privacy issues with other communication methods.

Then, as it grew beyond the scope of being a private place for friends, it would have been good for indexing to be added. But indexing a normal text channel is really hard, since you don't know where a conversation starts and stops when submitting it to a sitemap.

Now that there are large public communities and forum channels, it's possible they'll roll out their own version soon. But it still goes slightly against how the product was originally conceived, so there may be some hesitation, since they don't know what the community reaction will be.


> What has everyone been doing wrong all these years

So it's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model, whereas the recent headline developments in large language models have been the product of hundreds of thousands of dollars in rented compute time. Once you start putting six figures into a pile of model weights, it becomes a capital cost that the business either needs to recoup or turn into a competitive advantage. So nobody who scales up to this point releases model weights.

The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.

Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state of the art is now effectively a trade secret. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set, to prevent open replication.

[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.

[1] AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation, but they are ridiculously inefficient for any training task. There are techniques like ZeRO that give pagefile-like state partitioning to GPU training, but they require additional engineering.
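To put rough numbers on that footnote, here's a back-of-the-envelope sketch (assumed setup: fp16 weights and gradients plus fp32 Adam state, ignoring activation memory and framework overhead, which only make things worse):

    # Rough estimate of the GPU memory needed just for model state when
    # training with Adam; activations come on top of this.
    def training_memory_gb(n_params,
                           weight_bytes=2,    # fp16 weights
                           grad_bytes=2,      # fp16 gradients
                           optim_bytes=12):   # fp32 master copy + Adam m and v
        return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9

    # A 7B-parameter model: ~112 GB of training state, versus ~14 GB to
    # merely run fp16 inference. Hence a 12GB consumer card can't train
    # at this scale without partitioning/offloading tricks like ZeRO.
    print(training_memory_gb(7e9))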


A line item under the reasons that Binance falls under US jurisdiction: "Binance relies on the Google suite of products for information management and email services."

That's got to make companies around the world feel more comfortable using it!

Oh, and AWS too!


Outside of lawyers, what communities do you think should have an "understanding" of intellectual property law, and to what degree? Or, maybe the fact that it takes a lawyer to truly understand it indicates that the complexity of applicable laws and regulations isn't beneficial to the communities they're ostensibly meant to protect?

Is Strassen's algorithm actually used in practice? Oddly enough, I find myself doing a lot of matrix multiplication recently. But I am just using cublasCgemm3mStridedBatched from Nvidia's cuBLAS library, and how it's implemented doesn't appear to be public information. Does anyone know if it's actually using Strassen?

Basically the library described at:

https://developer.nvidia.com/blog/cublas-strided-batched-mat...

I am a bit not-sold-yet on the AlphaTensor stuff, because in practice it often seems like shuffling the data around in GPU memory is more expensive than doing the actual multiplications. It takes longer to move values between regular GPU memory and shared memory than it does to do a multiply, right? So for all these algorithms that optimize the number of arithmetic operations, it isn't even clear to me that they're optimizing the right thing, because they require you to shuffle your data around in weird ways, and they don't generally measure the number of "memory moves" needed.
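For reference, here's one level of Strassen as a toy NumPy sketch (purely illustrative; this says nothing about how cuBLAS actually works). The seven block products replace the naive eight, but each one needs freshly formed sums and differences of blocks, which is exactly the extra data movement in question:

    import numpy as np

    # One level of Strassen for an even-sized square matrix: 7 block
    # multiplies instead of 8, paid for with extra additions and temporaries.
    def strassen_once(A, B):
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)

        C = np.empty_like(A)
        C[:n, :n] = M1 + M4 - M5 + M7   # C11
        C[:n, n:] = M3 + M5             # C12
        C[n:, :n] = M2 + M4             # C21
        C[n:, n:] = M1 - M2 + M3 + M6   # C22
        return C

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    assert np.allclose(strassen_once(A, B), A @ B)

Every M term above materializes a temporary that the naive algorithm never needs, so the arithmetic savings only win once the blocks are large enough to amortize that traffic.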

That said, I would be happy to drop in a replacement for cublasCgemm3mStridedBatched and test whether it works better for me! It doesn't seem like these new AlphaTensor matrix multiplication routines are available as plain old C/C++ libraries yet, though.


It's possible to take anything good to excess. For every piece of advice, someone probably needs to hear the opposite.

But... if your manager is displeased with most ICs for brainstorming, considering edge cases, researching possible solutions, refactoring, helping other people, testing, and automating... you should probably consider working somewhere else. These are all things the industry does too little of, and that my company puts significant effort into encouraging and rewarding. They are also important steps in growing into (technical or people) leadership roles. And they are nowhere near the top of the list of things that people waste time on.


Crypto is a planet-destroying Ponzi scheme, but in a collapsed economy like Russia's, I wonder if everyone could just have virtual wallets where you buy goods by sending the seller some "coins". Which is basically the WeChat payment system, or how the Brazilian Real got established: https://www.npr.org/sections/money/2010/10/04/130329523/how-...

We already just exchange "coins" electronically, but with the Visa or MasterCard network getting involved...


I'd just add that, in general, I don't think it's a good idea to train cats to learn behaviors that are unnatural for them. One could say that eating processed cat food and living with humans are already unnatural, which is true. But to the extent possible, I want to give my cats the freedom to act as naturally as they can.

On this particular cat toilet-training trick: cats are in fact quite particular about the litterbox -- where it's positioned, how safe it feels inside, how it smells. A human toilet is a decidedly unnatural setup for a cat, and who knows what can happen. Maybe the cat learns to use it but constantly feels anxious about it (e.g. the slippery surface). Maybe it will avoid pooping until it gets really uncomfortable. None of this can be good for its health.


There are even special litter boxes sold that are intended to go on the toilet seat for this type of training. But as cool as this is, I believe it can be dangerous, especially for owners of male cats.

Cats' urinary habits are important to keep an eye on. A change in urination volume or frequency can be the sign of a critical issue that requires veterinary attention. You just can't easily track urine volume if the cat is going to the toilet, even if they don't flush.

With a litterbox, you can get an idea of their normal clump size and notice if it changes. As for frequency, maybe you can hear your cat jump on the toilet, but I'm not sure that's as reliable as hearing them repeatedly scratching the litterbox. And if you're out during the day, with a litterbox you can still see the pee clumps when you come home and get an idea of how much they've peed - unlike with a toilet.

One aspect where the toilet may actually be better is that you _might_ be able to spot blood more easily in the water.

I also wanted to train my first cat to use the toilet, until he suffered from a urinary blockage as a kitten. If he'd been using the toilet, I likely wouldn't have noticed that anything was wrong until it was far too late. Blockage is more likely in male cats, and it is _not_ an uncommon problem. It is already far too easy to not notice if a cat is blocked if you're not careful, and it doesn't take long at all to become life threatening. I believe a toilet would only exacerbate this risk.

If I see blood in the litter, no clumps, or clumps that are smaller than usual (or heck, even much bigger than usual), I know to pay extra attention to my cat and either book a vet appointment or go to the emergency room. It's allowed me to catch bouts of infection, crystals, and idiopathic cystitis early (all of which he is now more prone to after his initial urinary problems). A cat's kidneys and urinary system can also be impacted by stress levels, with unexplained straining and bladder inflammation thought to be in some cases caused by stress. Having them pee in a toilet in my opinion decreases visibility of a very important and sensitive health factor.


How can one learn about advanced Python concepts like this (slots, metaprogramming, etc.)? Is there a good book for this?
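For anyone unsure what "slots" refers to here, a minimal illustration (metaclasses are a deeper rabbit hole):

    # __slots__ replaces the per-instance __dict__ with fixed storage,
    # cutting memory use and rejecting attributes you never declared.
    class Point:
        __slots__ = ("x", "y")

        def __init__(self, x, y):
            self.x, self.y = x, y

    p = Point(1, 2)
    p.x = 3           # fine: "x" is a declared slot
    try:
        p.z = 4       # no __dict__ to spill into
    except AttributeError as e:
        print(e)      # 'Point' object has no attribute 'z'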
