tomthe's comments

I wonder if you could implement it with only static hosting?

We would need to split the index into many smaller files that browsers can practically download, maybe 20 MB each. The user types in a search query, the browser hashes the query and downloads the corresponding index file, which contains only the results for queries that hash to it. Then the browser quickly sifts through that file and gives you the result.
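A minimal sketch of that lookup, assuming 512 shards and an /index/ URL scheme (both invented here; tune the shard count so each file lands near the 20 MB target):

  // Hash the query term to a shard and fetch only that file.
  async function searchShard(term: string) {
    const bytes = new TextEncoder().encode(term.toLowerCase());
    const digest = await crypto.subtle.digest("SHA-256", bytes);
    const shard = new DataView(digest).getUint32(0) % 512; // hypothetical shard count
    const resp = await fetch(`/index/shard-${shard}.json`);
    const postings = await resp.json();
    return postings[term] ?? []; // sift through the shard client-side
  }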

Hosting this would be cheap, but the main barriers remain.


I've done something similar with a statically hosted site I'm working on. I opted not to reinvent the wheel and just use WASM SQLite in the browser. SQLite already splits the database into fixed-size pages, so a driver using HTTP Range Requests can download only the required pages. You just have to make good indexes.

I can even use SQLite's full-text search capabilities!
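For reference, a sketch of the pattern assuming phiresky's sql.js-httpvfs (one library that implements it; the database path, worker/wasm URLs, and the docs_fts table are placeholders):

  import { createDbWorker } from "sql.js-httpvfs";

  // Point the worker at a statically hosted SQLite file; pages are fetched
  // lazily via HTTP Range Requests as the query plan touches them.
  const worker = await createDbWorker(
    [{ from: "inline",
       config: { serverMode: "full", url: "/data/site.sqlite3", requestChunkSize: 4096 } }],
    "/sqlite.worker.js",
    "/sql-wasm.wasm"
  );

  // FTS5 query: only the index pages needed for the answer are downloaded.
  const rows = await worker.db.query(
    "SELECT title FROM docs_fts WHERE docs_fts MATCH ? LIMIT 20", ["search term"]
  );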


How would that scale to 10 TB+ of plain text, though? Presumably the indexes would be many gigabytes, especially with full-text search.

The client only needs to fetch the index entries for its specific search. If the index is just a list of TF-IDF term scores per document (which gets you a very reasonable start on search relevance), some extremely back-of-the-envelope math leads me to guess at an upper bound in the low tens of megabytes per (non-stopword) term, which seems doable for a client to download on demand.
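Spelling out one version of that envelope math (every number below is an assumption, not a measurement):

  // ~10 TB of text at ~5 KB per document -> ~2e9 documents.
  const docCount = 10e12 / 5_000;
  // Assume a typical non-stopword term appears in ~0.1% of documents.
  const postings = docCount * 0.001;      // ~2e6 postings
  // Each posting: 8-byte doc id + 4-byte float TF-IDF score.
  const shardBytes = postings * (8 + 4);
  console.log(`${(shardBytes / 1e6).toFixed(0)} MB per term`); // ~24 MB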

I wonder if you could take this one step further and have opaque queries using homomorphic encryption on the index, and then somehow extract ranges around the document(s) you're interested in.

Inspired by: "Show HN: Read Wikipedia privately using homomorphic encryption" https://news.ycombinator.com/item?id=31668814
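For intuition, here is the PIR idea behind that demo in plaintext (illustrative only; the real scheme runs the same dot product over homomorphically encrypted values, so the server never learns which row was selected):

  // Client asks for row i by sending a one-hot selection vector; the server
  // replies with a dot product over the whole database. Encrypting the vector
  // homomorphically hides i while preserving the arithmetic.
  function pirQuery(db: number[][], select: number[]): number[] {
    const out = new Array(db[0].length).fill(0);
    db.forEach((row, i) => row.forEach((v, j) => (out[j] += select[i] * v)));
    return out; // equals db[i] when select is one-hot at position i
  }

  console.log(pirQuery([[1, 2], [3, 4], [5, 6]], [0, 1, 0])); // [3, 4]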


Super interesting.

Very cool open release. Impressive that a 27B model can be as good as much bigger state-of-the-art models (according to their Chatbot Arena table, tied with o1-preview and above Sonnet 3.7).

But the example image shows that this model still makes dumb errors, or has poor common sense, even though it read every piece of information correctly.


It seems to have been heavily benchmark-tuned for LMArena. In my own experiments it was roughly in line with other comparably sized models for factual knowledge (like Mistral Small 3), and worse than Mistral Small 3 and Phi-4 at STEM problems and logic. In reality it's much worse than Llama 3.3 70B or Mistral Large 2411 in knowledge or intelligence, even though LMArena ranks it as better than those.


Looking at every other benchmark, it's significantly behind typical big models from a year ago (Claude 3, Gemini 1.5, GPT-4). I think Google must do extensive LMArena-focused RLHF tuning on their models to juice their scores.


I was thinking the same thing about the receipt calculation: a warning that only tourists tip 18% in Switzerland would no doubt have been appreciated!


It is not sufficient to understand the content very well; you also have to understand the state of mind of your pupils very well.


I think there are pretty strong traces of pro-China manipulation on TikTok.

See https://www.nytimes.com/2024/04/24/briefing/tiktok-ban-bill-...

Content about things like the Hong Kong protests or Tiananmen Square is suppressed.


That NCRI "study" doesn't even explain a clear methodology (where/when/how the accounts were created, etc.).

Even the data is often self-contradictory.

All we see is that there's much less political discourse on TikTok than on IG.

If there's manipulation, the only thing that this study shows is that the manipulation goes towards avoiding politics in general.

Not only that, but the study itself may show the reverse to be true: that IG pushes some narratives more than others.


There is no evidence whatsoever presented in that paid-for NCRI study.

In fact, the same result can be interpreted as showing "strong traces of anti-China manipulation on American social media".


Can you elaborate? I had no problem with DuckDB and a few TB of data. But it depends on what exactly you do with it, of course.


A few hundred TBs here - no issues :)


Wait, you eat only beef? Or beef as the only source of meat?


Only beef; it's a variant of the carnivore diet that excludes common autoimmune triggers, like eggs & dairy.


  > Only beef; it's a variant of the carnivore diet that excludes common autoimmune triggers, like eggs & dairy.

What about chicken or fish? Are those considered autoimmune triggers?


There is a lot of low-quality chicken, and both fish and chicken have very little fat; their fatty acid profile is not ideal.


Won't that give you gout?


No, there are people following a leaner version of the diet who have no such issues. I eat moderate protein though (80-100g).


anymore??


Sorry, I was thirsty.

There were no "big" rivers, ever. More like springs. We have lots of subterranean water, so out of the 18 rivers we have in the city, 16 have their sources here [0]. They were used to power mills during the industrialization of the 19th and 20th centuries. Many of the rivers that used to go through the city center now flow underground.

I live close to the river Olechówka [1], which flows into a regulated reservoir that used to feed a mill - so the area is called Młynek, "Little Mill" :)

[0] https://podwodnalodz.blogspot.com/2013/09/o-wodzie-po-ktorej...

[1] https://i.imgur.com/SIp8CxN.jpeg


I made a similar map, but with tiles that only load if you zoom in far enough: tomthe.github.io/hackmap/ (sorry for posting my link so often). That way it only has to load a few megabytes for the first view.
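The gating itself is simple; a sketch with an invented zoom cutoff and tile URL pattern:

  const MIN_ZOOM = 6; // hypothetical cutoff below which no tiles load

  async function loadTile(z: number, x: number, y: number) {
    if (z < MIN_ZOOM) return null; // zoomed-out first view stays a few MB
    const resp = await fetch(`/tiles/${z}/${x}/${y}.json`);
    return resp.ok ? resp.json() : null;
  }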


Nice introduction, but I think that ranking the models purely by their input token limits is not a useful exercise. Looking at the MTEB leaderboard is better (although a lot of the models are probably overfitting to their test set).

This is a good time to shill my visualization of 5 million embeddings of HN posts, users, and comments: https://tomthe.github.io/hackmap/


Thanks, a couple other people gave me this same feedback in another comment thread and it definitely makes sense not to overindex on input token size. Will update that section in a bit.


Thank you, I really like the default tutorial showing how one can play with it. Is it possible to visualize data with this?


Depending on the data, maybe? SDFs aren't great at rendering large numbers of enumerated objects -- something like a point cloud would be prohibitively expensive, so I wouldn't think to use them for traditional graphing.
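To make the cost concrete: a naive SDF union over N points evaluates N distances at every ray-march step, for every pixel (names below are illustrative):

  type Vec3 = [number, number, number];

  const sphereSDF = (p: Vec3, c: Vec3, r: number) =>
    Math.hypot(p[0] - c[0], p[1] - c[1], p[2] - c[2]) - r;

  // Union of N point-spheres: O(N) work per sample, so a large point
  // cloud multiplies the cost of every step of every ray.
  function pointCloudSDF(p: Vec3, points: Vec3[], r: number): number {
    let d = Infinity;
    for (const c of points) d = Math.min(d, sphereSDF(p, c, r));
    return d;
  }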

