
Let's say you have 100000 documents in your index that match your query but only 10 of them the user has access to:

A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them. Most of the time, you've now eliminated all of your search results.

Your search must be access aware to do a reasonable job of pre-filtering the content to documents the user has access to, at which point you then can apply post-filtering with the "100% sure" access check.
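A rough sketch of that two-phase flow (all names here are invented for illustration, not a real API): a cheap, access-aware pre-filter narrows the candidate set, and the expensive authoritative check runs only on the survivors.

```python
def search(query_matches, acl_index, authoritative_check, user, limit=10):
    """query_matches: doc ids already ranked by relevance.
    acl_index: cheap, possibly-stale map of user -> set of readable doc ids.
    authoritative_check: the expensive but exact "100% sure" permission test."""
    # Pre-filter: discard docs the (approximate) ACL index says are invisible.
    candidates = (doc for doc in query_matches
                  if doc in acl_index.get(user, set()))
    results = []
    for doc in candidates:
        # Post-filter: exact check, but only on pre-filtered candidates.
        if authoritative_check(user, doc):
            results.append(doc)
            if len(results) == limit:
                break
    return results
```

The pre-filter can be stale or slightly too permissive without breaking correctness, since the authoritative check still gates every returned result.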


Yes. But this is still an incredibly well-known and solved problem. As an example, Google's internal structured search engines did this decades ago at scale.


Which solutions are you referring to? With access that is highly diverse and changing, this is still an unsolved problem to my knowledge.


Probably Google Zanzibar (and the various non-Google systems that were created as a result of the paper describing Zanzibar).


Just use a database that supports both filtering and vector search, such as postgres with pgvector (or any other, I think all are adding vector search nowadays).


Agree...as simple as:

    @pxt.query
    def search_documents(query_text: str, user_id: str):
        sim = chunks.text.similarity(query_text)
        return (
            chunks.where(
                (chunks.user_id == user_id)        # Metadata filtering
                & (sim > 0.5)                      # Filter by similarity threshold
                & (pxt_str.len(chunks.text) > 30)  # Additional filter/transformation
            )
            .order_by(sim, asc=False)
            .select(
                chunks.text,
                source_doc=chunks.document,  # Ref to the original document
                sim=sim,
                title=chunks.title,
                heading=chunks.heading,
                page_number=chunks.page,
            )
            .limit(20)
        )

For instance in https://github.com/pixeltable/pixeltable


The thing about a user needing access to only 10 documents is that creating a new index from scratch on those ten documents takes basically zero time.

Vector databases intended for this purpose filter this way by default for exactly this reason. It doesn't matter how many documents are in the master index; it could be 100000 or 100000000. Once you filter down to the 10 that your user is allowed to see, it takes the same tenth of a second or whatever to whip up a new bespoke index just for them for this query.

Pre-search filtering is only a problem when your filter captures a large portion of the original corpus, which is rare. How often are you querying "all documents that Joe Schmoe isn't allowed to view"?
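For what it's worth, when the allowed set is that small you don't even need an approximate index for the final step; a toy sketch (hand-rolled cosine similarity, invented names) of "filter first, then score the survivors exhaustively":

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filtered_search(query_vec, embeddings, allowed_ids, k=10):
    """Restrict to the user's documents first, then rank them exactly.
    With only a handful of allowed docs, exhaustive scoring is cheap,
    and there are no empty-cluster or recall problems to worry about."""
    scored = [(cosine(query_vec, embeddings[d]), d) for d in allowed_ids]
    scored.sort(reverse=True)
    return [d for _, d in scored[:k]]
```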


If you can move your access check to the DB layer, you skip a lot of this trouble.

Index your ACLs, index your users, index your docs. Your database can handle it.


Apache Accumulo solved the access-aware querying a while ago.


"Fun" Fact: ServiceNow simply passes this problem on to its users.

I've seen a list of what was supposed to be 20 items of something; it only showed 2, plus a comment: "18 results were omitted due to insufficient permissions".

(ServiceNow has at least three different ways to do permissions; I don't know if this applies to all of them.)


I'm not sure if enumerating the hidden results is a great idea :0


At the least, it's a terrible user experience to have to click the "more" button several times to see the number of items you actually wanted to see.

But yes, one could probably also construct a series of queries that reveal properties of hidden objects.


> Let's say you have 100000 documents in your index that match your query

If the docs were indexed by groups/roles and you had some form of RBAC then this wouldn't happen.
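A minimal sketch of what that might look like (illustrative names only): tag each indexed document with the roles allowed to read it, and intersect with the querying user's roles before any search happens.

```python
def visible_docs(doc_roles, user_roles):
    """doc_roles: doc id -> set of roles allowed to read it, stored as
    indexed metadata alongside the document at ingest time.
    A doc is visible when the user holds at least one of its roles."""
    return {d for d, roles in doc_roles.items() if roles & user_roles}
```

In a real system this membership test would be an indexed metadata filter pushed down into the search engine, not a Python loop, but the shape is the same.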


If you take this approach, you have to reindex when groups/roles changes - not always a feasible choice


You only have to update the metadata, not do a full reindex.


You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.


> You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.

Right, but compare this to the original proposal:

> A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them

Using an index is much better than that.

And it should be possible to update the index without a substantial cost, since most of the 100000 documents likely aren't changing their role access very often. You only have to reindex a document's metadata when that changes.

This is also far less costly than updating the actual content index (the vector embeddings) when the document content changes, which you have to do regardless of your permissions model.


I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index.

If you use your index to get search results, then you will have a mix of roles that you then have to filter.

If you want to filter first, then you need to make a whole new search index from scratch with the documents that came out of the filter.

You can't use the same indexing information from the full corpus to search a subset, your classical search will have undefined IDF terms and your vector search will find empty clusters.

If you want quality search results and a filter, you have to commit to reindexing your data live at query time after the filter step and before the search step.

I don't think Elastic supports this (last time I used it it was being managed in a bizarre way, so I may be wrong). Azure AI Search does this by default. I don't know about others.


> I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index

It's a separate index.

You store document access rules in the metadata. These metadata fields can be indexed and then use as a pre-filter before the vector search.

> I don't think Elastic supports this

https://www.elastic.co/docs/solutions/search/vector/knn#knn-...
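Concretely, a pre-filtered kNN request in Elasticsearch's query DSL looks roughly like this (sketched as a Python dict; the field names `embedding` and `allowed_roles` are invented, see the linked docs for the real parameters). The `filter` clause restricts candidates by the indexed ACL metadata during the vector search, rather than discarding results afterwards:

```python
# Hypothetical pre-filtered kNN search body for Elasticsearch.
knn_query = {
    "knn": {
        "field": "embedding",              # dense_vector field (assumed name)
        "query_vector": [0.1, 0.2, 0.3],   # placeholder embedding
        "k": 10,
        "num_candidates": 100,
        "filter": {
            # Indexed ACL metadata: only docs readable by these roles
            "terms": {"allowed_roles": ["eng", "hr"]}
        },
    }
}
```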


Do you know that the big expensive thing is what your customers actually need? Do you actually know what your customers need?

That’s basically the only important context. If you can’t deliver that, it doesn’t matter how well thought through, extensible, or scalable it is.


Do the customers know what they actually need?


Well if you don’t have a pretty good idea of the problem you’re solving for the customer, you’re much better off trying as many things as you can, quickly and cheaply, to figure out.


Did they remove it? It says the researchers provided a notebook that they used to verify the attack.


All this setting does is make the feature available to you in the frontend -- it's up to you to use it or not!


How do you know that? You can't because Dropbox doesn't say


Or do I…


In adtech a first price auction is also a single round.


You gotta save that optimization for when performance ends up on a top level OKR!


… ten minutes later.


I'm not defending this at all, but one of the reasons why there are no (or few) humans that can be contacted is that they* said that it was tried before and it caused a lot more issues with mistakes/takeovers due to social engineering.

* Can't remember who said it but it was at a town hall this year


> one of the reasons why there are no (or few) humans that can be contacted is that they* said that it was tried before

This just sounds like yet another excuse for holding payroll down as much as possible.

If I am a customer of Amazon, Apple, Netflix, Walmart or any number of the other companies with a similar market cap, I can get access to real live human beings who provide customer support.


A ticketing company was experimenting with BLE beacons to trigger things like seat upgrades and coupons when people walked by certain things in a venue… or at least that’s what they said it would be used for.

Instead they covered LA Live and surrounding area with them and then just sold that data to… well I’m not sure who since I left shortly after they did that.

The justification was “but we put it in the TOS and Privacy Policy”.


How did the BLE beacons track people? A phone app?


Yes the ticketing app that people were required to use to get into the events.


>Yes the ticketing app that people were required to use to get into the events.

Well, at least Android 12 has granular Bluetooth permissions.

While the majority of tickets at LA Live are sold by one company currently, there are others, so it's impossible to know which specific company you're talking about.

On a completely unrelated note, there was a big thing surrounding privacy concerns with AXS' app a couple years ago.[0][1]

Clearly those claims were overblown. Surely AXS would never blanket an area with Bluetooth Low Energy (BLE) beacons and invade users' privacy like that. I refuse to believe AXS would make an app virtually mandatory and then violate users' privacy using physical Bluetooth beacons. Say it ain't so.

Oh, almost forgot: fuck AXS.

[0] https://us.forums.blizzard.com/en/wow/t/axs-spyware-claim-de...

[1] https://us.forums.blizzard.com/en/overwatch/t/axs-isnt-spywa...


This was maybe 8 or so years ago so I'm not sure what else this unnamed ticketing company added since... but permissions were pretty lax at that point and had JUST started to tighten up.

Oh I forgot to mention that Apple rejected the iPhone version of the app at first because we didn't make it clear enough that we were tracking their locations like this. Our head of product at the time just called someone up at Apple and it got approved with no changes. It all stunk.


Nice. Hold on a sec while I uninstall all the random apps I have left over on my phone...


We're at the point where the lights in a store can be used to track you: https://www.usa.lighting.philips.com/systems/lighting-system...


On Android, the apps won't track you through bluetooth if they don't ask for the location permission.


They cut non-business critical travel which many employees for some reason felt was a perk of the job. I heard second hand that some employees would do "office tours" just to try the lunches at the different offices (under the guise of doing some in person meetings that certainly could have been done online). Cutting that kind of wasteful travel is necessary in an economic slowdown (and even necessary before that).


Perhaps necessary in an economic slowdown when you're a paper company with 2% margins... Let's be honest, Google does not "have" to do anything. They're just trying to grab some pennies.

The interesting part is that they felt the pennies nearest at hand were in their employee perks, rather than in expanding their business. That is always a bad sign even if the company we're talking about is as profitable as Google.

Not every business needs the best people working there, nor does every business need highly motivated employees. In the extreme, it's a valid move for Google to transition to being like SAP, but I can't believe how many people in this thread defend their right to do it without asking what their executives know that makes it seem so appealing.


That is pretty decadent actually. Financials aside, I do believe frivolous activity like that creates a culture of unseriousness

