Hacker News | emcq's comments

The 1.4s is _after_ the kernel has loaded the file into RAM. Because this is mostly I/O bound, it's not a fair comparison to skip the read time. If you were running on an M3 Mac you might get less than 100ms if the dataset were stored in RAM.

If you account for the time loading from disk, the C implementation would take more like ~5s, as reported in the blog post [1]. Speculating that their laptop's SSD is in the 3GB/s range, there is perhaps another second or so of optimization left there (which would roughly work out to the 1.4s in-memory time).
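A rough sanity check of that I/O budget (the ~13 GB measurements-file size and the 3 GB/s throughput are both assumptions here, matching the speculation above):

```python
# Back-of-envelope I/O budget, assuming a ~13 GB measurements file
# and the ~3 GB/s SSD speculated above.
file_gb = 13.0
ssd_gb_per_s = 3.0

io_seconds = file_gb / ssd_gb_per_s
print(f"{io_seconds:.1f}s just to read the file")  # ~4.3s
```

That puts the pure disk read in the same ballpark as the gap between the ~5s cold time and the 1.4s in-memory time.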

Because there are a lot of variable-width row reads, this will be more difficult on a GPU than on a CPU.

[1] https://www.dannyvankooten.com/blog/2024/1brc/


The performance report followed the initial request: run 6 times and remove the best and worst outliers, so the mmap optimization is fair game. Agreed that the C code has room left for some additional optimization.


If we consider it fair to rely on prior runs of the program having loaded the file into RAM via the kernel's page cache, why stop there?

Let's say I create a "cache" where I store the min/mean/max output for each city. On the first run I compute the results by whatever method, persist them to disk, mmap the file, and read it at least once to make sure it is in RAM. If the cache is available, I simply write it to standard out. The first run could take 20 hours and gets discarded.
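A minimal sketch of that degenerate "solution" (the cache file name and the placeholder computation are made up for illustration):

```python
import os
import sys

CACHE = "brc.cache"  # hypothetical cache file name

def compute_measurements() -> bytes:
    # Stand-in for the real first-run work, which could take 20 hours.
    return b"{Hamburg=12.0/23.1/34.2, Oslo=-10.1/5.5/20.9}\n"

def run() -> bytes:
    if os.path.exists(CACHE):
        # Warm run: skip all computation and replay the stored answer.
        with open(CACHE, "rb") as f:
            return f.read()
    out = compute_measurements()
    with open(CACHE, "wb") as f:  # persist the answer for every later run
        f.write(out)
    return out

if __name__ == "__main__":
    sys.stdout.buffer.write(run())
```

Every run after the first is pure file-replay, which is exactly why keeping state between runs makes the benchmark meaningless.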

By technicality it might fit the rules of the original request but it isn't an interesting solution. Feel free to submit it :)


This actually doesn't fit the rules. I've designed the challenge so that disk I/O is not part of the measured runtime (initially by relying on the fact that the first run, which pulls the file into the page cache, is the slowest and thus discarded; later by loading the file from a RAM disk). But keeping state between runs is explicitly ruled out in the README, for the reason you describe.


Having write access to storage or spawning persistent daemons is an extra requirement and that is often not available in practice when evaluating contest code :-)

This is a fun project for learning CUDA and I enjoyed reading about it. I just wanted to point out that the performance tuning here is really about parsing, hashing, memory transfers, and I/O. Taking I/O out of the picture with specialized hardware or Linux kernel caching still leaves an interesting problem: minimizing the memory transfers, parsing, and hashing.


Also, this uses 16 threads while the contest restricts entries to 8 cores. The benchmarks need to be run in the same environment for a fair comparison.


The AMD Ryzen 4800U has 8 cores total so the author follows the contest restriction. This CPU supports hyperthreading. (I’d be very interested in seeing hyperoptimized CUDA code using unlimited GPU cores FWIW.)


Good to know. I didn't know the contest had no limit on hyperthreading.


The 1brc contest had SMT disabled [0]. (Hyperthreading is Intel's marketing name and trademark for its SMT implementation, but the benchmark was run on an AMD CPU.)

[0] https://github.com/gunnarmorling/1brc/issues/189#issuecommen...


Ok. So it's indeed restricted to 8 cores (1 thread per core). Then the benchmark above using 16 threads was not really a fair comparison.


Be wary of using this model - its licensing seems sketchy. Several of the datasets used for training, like WSJ and TED-LIUM, have clear non-commercial clauses. I'm not a lawyer, but releasing the model as "MIT" seems dubious, and hopefully OpenAI has paid for the appropriate licenses during training, as they are no longer a research-only nonprofit.


This is a big dispute right now: OpenAI and other AI companies generally take the position that models learning from data does not make the output of the models a derivative work of that data. For example, GitHub Copilot uses all publicly available GitHub code regardless of license, and DALL-E 2/Stable Diffusion/etc. use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.


I think it might be even less problematic with something like Whisper than with DALL-E/SD? Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist). It's the publication of copyrighted content that's thorny, and that is something you can begin to achieve with results from visual models that include the Getty Images logo, etc.

I think it'd be a lot harder to make a case for an accurate audio to text transcription being seen to violate the copyright of any of the training material in the way a visual could.


They're not just training a system but publishing the trained system.


> models learning from data does not make the output of the models a derivative work of that data

Most of the debate seems to be happening on the question of whether everything produced by models trained on copyrighted work represents a derivative work. I argue that at the very least some of it does; so the claim said to be made by the AI companies (see quote above) is clearly a false one.

We're in a weird place now where AI is able to generate "near verbatim" work in a lot of cases, but I don't see an obvious case for treating this any differently than a human reproducing IP with slight modifications. (I am not a lawyer.)

For example, copyright law currently prevents you from selling a T-shirt with the character Spider-Man on it. But plenty of AI models can give you excellent depictions of Spider-Man that you could put on a T-shirt and try to sell. It's quite silly to think that any judge is going to take you seriously when you argue that your model, which was trained on a dataset that included pictures of Spider-Man, and was then asked to output images using "Spider-Man" as a search term, has magically circumvented copyright law.

(I think there's a valid question about whether models represent "derivative work" in the GPL sense specifically, but I'm using the idea more generally here.)


That's right: the model is definitely capable of creating things that are clearly derivative works of what it was trained on. But this still leaves a few questions:

* Does the model require a copyright license? Personally I think it's very likely a derivative work, but that doesn't necessarily mean you need a license. The standard way this works in the US is the four factors of fair use (https://copyright.columbia.edu/basics/fair-use.html) where Factor 1 is strongly in favor of the model being unrestricted while 2-4 are somewhat against (and in some cases 4 is strongly against).

* Is all output from the model a derivative work of all of the input? I think this is pretty likely no, but unclear.

* Does the model reliably only emit derivative works of specific inputs when the user is trying to get it to do that? Probably no, which makes using one of these models risky.

(Not a lawyer)


This is even slightly more direct: access to WSJ data requires paying LDC for the download, and the pricing varies depending on what institution / license you're from. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses are transferable to the end product. We might be a couple court cases away from finding out but I wouldn't want to be inviting one of those cases :)


I think they didn't use WSJ for training, only for evaluation. The paper includes WSJ under "Evaluation datasets".


Are there any AI/ML models that don't use sketchily licensed datasets? Everything seems to be either "downloaded from the internet, no license" or more explicitly proprietary. The only exception I can think of would be coqui/DeepSpeech?


Shockingly, the human genome itself has not been fully sequenced, despite the Human Genome Project completing years ago [0]. There are difficult-to-map regions of the genome, some of which are interesting. Only recent advances in long-read sequencers have helped to solve some of these issues [1].

For the future of sequencing to truly be amazing, the lab prep, chemistry, and equipment required need to advance. Oxford Nanopore has some advancements here [2], but there's still a ways to go before a sample could be prepared as easily as getting an ultrasound or x-ray.

[0] https://www.statnews.com/2017/06/20/human-genome-not-fully-s...
[1] https://www.ecseq.com/support/ngs/are-there-regions-in-the-g...
[2] http://nanoporetech.com/products/voltrax


The Telomere-to-Telomere consortium has made fantastic progress on this in the last year or two: https://genomeinformatics.github.io/CHM13v1/

Thanks to them we now have a nearly complete genome, only missing the deconvoluted rDNA array segments (~12 Mb or so; we know the sequences since they're basically identical, but no one has accurately placed the individual array variants yet).


I wonder if genes are just like "functions" whose inputs come from the environment. That would make it easy to avoid changing the function often, since the output changes based on the input received from the environment; only in very rare cases does the function itself need to be modified, or simply wrapped inside another function to give the inner function access to more environmental inputs.


Indeed, DNA controls the conditions of its own expression using sequences called promoters (and other moving parts). Follow your curiosity: https://en.wikipedia.org/wiki/Promoter_(genetics)


I used to work for a drone company. I made sure to go watch the mechanical testing for prop safety. One of the test objects was a chicken leg, and the prop cut through more than 2mm of bone. The props can do some serious damage!

I don't know what approaches, if any, are implemented, but one solution is to design a safety feature into the motor control that turns off the motor when it detects a blockage.
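One hedged sketch of such a cutoff is an overcurrent check (the threshold and the toy motor model here are invented for illustration; real ESCs use calibrated motor curves and filtering):

```python
STALL_FACTOR = 2.0  # assumed: trip when current is 2x the expected draw

def expected_current(throttle: float) -> float:
    # Toy linear motor model: amps expected at a given throttle (0..1).
    return 0.5 + 10.0 * throttle

def should_cut_motor(throttle: float, measured_amps: float) -> bool:
    # A prop hitting an obstruction stalls the motor, and a stalled
    # motor draws far more current than free spinning would predict.
    return measured_amps > STALL_FACTOR * expected_current(throttle)
```

The idea is simply that a blocked prop shows up as a current spike well above what the commanded throttle predicts.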


Check out sse2neon as a way to replace your x86 SIMD intrinsics with ARM NEON equivalents. Perf was good enough to ship in one of the past projects I worked on.


Thanks! Interesting, I'll look it up. One thing I know our devs didn't find an equivalent for is ippsPowx_32fc_A11.


This article quotes employee conversations from 2016. It wasn't until late 2016 that Google published its research on wide-and-deep recommender systems. I'd be interested to see whether people still think behavioral data is poor and that ML has not provided advances.

That said, some demographics are broadly correct from behavioral tracking (e.g. a child who watches Baby Shark on repeat), but good luck telling the difference between a rich child and a less rich child.


You can get all this functionality and more working beautifully with Android with Garmin watches. Mine lasts for many days on a charge (perhaps a week without activity tracking), has better sport tracking features, but the screen isn't quite as good. There are currently a few good deals on the Vivoactive 3, but also many higher end models available too.


I use a Vivoactive 3, and the battery definitely doesn't last days if I actually use it to track a specific exercise. I'm quite unimpressed with the battery life, but the watch is fairly feature-rich, otherwise.

If I just wear it without running anything, then yes, it can last days. But if all I want is to track heart rate, sleep, and such, I can get that with a much cheaper watch.


Don't forget Ron Graham, who co-wrote Magical Mathematics!


Yeah! And Ron was a pro juggler too (as have I been!). Sadly, Ron died this year.


This is easy to resolve by staggering doses, which I know at least some other hospitals are doing.


It is, but are they actually planning to do this?

The key is that you need to stagger them now, maybe over a 10-day period to be safe (plan for 2 days off per person, so 20% out of action across the 10 days; try to use "weekend" time for this).

But in the middle of a pandemic, even 20% is a lot of lost capacity

And then some people need to be in the day 8, 9 and 10 groups so won’t “get the vaccine first”

So it's actually not quite as simple as just spreading them out over a few days and hoping for the best.
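The staffing arithmetic above can be sketched like this (all numbers are the assumptions from the comment, not real hospital data):

```python
# Stagger vaccinations over 10 days; each person is assumed to be out
# of action for the 2 days after their dose.
staff = 100
schedule_days = 10
recovery_days = 2

daily_cohort = staff / schedule_days            # 10 people dosed per day
out_on_any_day = daily_cohort * recovery_days   # two cohorts recovering at once
print(out_on_any_day / staff)                   # 0.2 -> the 20% capacity hit
```

Compressing the schedule raises the daily cohort size, which raises the fraction of staff out on any given day, which is the trade-off being described.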


Elon was quoted as saying "you have a forest of redwoods and the little trees can't grow." While he was pointing his finger at California, I can't help but think this really reflects back on him.

Elon Musk's success, and that of several others, began with PayPal.

To my knowledge, none of Elon's companies since has produced a group like the PayPal mafia. The PayPal folks are smart people, but the Bay Area is full to the gills with smart visionaries lacking the capital to take on ambitious visions. Many have worked at Elon's companies. To really see the little trees grow, people like Elon would need to change their equity structures to let the next group of innovators thrive.


