Let's say you have 100000 documents in your index that match your query, but the user has access to only 10 of them:
A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them. Most of the time, you've now eliminated all of your search results.
Your search must be access aware to do a reasonable job of pre-filtering the content to documents the user has access to, at which point you then can apply post-filtering with the "100% sure" access check.
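To make the failure mode concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `group` and `readers` fields, the `exact_access_check` stand-in for the expensive check, and the toy ranking. It only shows that taking the top 1000 and then checking access can return nothing, while a coarse access-aware pre-filter followed by the exact check finds all 10 documents.

```python
def exact_access_check(user, doc):
    # Hypothetical stand-in for the expensive, "100% sure" access check.
    return user in doc["readers"]

def naive_search(docs_by_rank, user, top_k=1000):
    # Take the top_k most relevant documents, then check access on each.
    return [d for d in docs_by_rank[:top_k] if exact_access_check(user, d)]

def access_aware_search(docs_by_rank, user, user_groups, top_k=1000):
    # Cheap, coarse pre-filter on indexed metadata first...
    candidates = [d for d in docs_by_rank if d["group"] in user_groups][:top_k]
    # ...then the exact check only on what survived.
    return [d for d in candidates if exact_access_check(user, d)]

# 100000 matching documents, already sorted by relevance (toy data).
docs = [{"id": i, "group": "other", "readers": set()} for i in range(100_000)]
# 20 documents deep in the ranking belong to the user's group,
# and the user can actually read 10 of them.
for n, i in enumerate(range(50_000, 50_020)):
    docs[i]["group"] = "team-x"
    if n < 10:
        docs[i]["readers"] = {"alice"}

naive_hits = naive_search(docs, "alice")
aware_hits = access_aware_search(docs, "alice", {"team-x"})
print(len(naive_hits), len(aware_hits))
```

The coarse filter is allowed to be slightly wrong (it lets through 20 documents here); the exact check afterwards is what guarantees correctness.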
Yes.
But this is still an incredibly well known and solved problem.
As an example - Google's internal structured search engines did this decades ago at scale.
Just use a database that supports both filtering and vector search, such as postgres with pgvector (or any other, I think all are adding vector search nowadays).
The thing about a user needing access to only 10 documents is that creating a new index from scratch on those ten documents takes basically zero time.
Vector databases intended for this purpose filter this way by default for exactly this reason. It doesn't matter how many documents are in the master index: it could be 100000 or 100000000. Once you filter down to the 10 that your user is allowed to see, it takes the same tenth of a second or whatever to whip up a new bespoke index just for them for this query.
Pre-search filtering is only a problem when your filter captures a large portion of the original corpus, which is rare. How often are you querying "all documents that Joe Schmoe isn't allowed to view"?
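A rough sketch of the filter-then-search idea in plain Python, under some assumptions: the corpus is a dict of toy 3-dimensional vectors, the allowed IDs are invented, and a brute-force cosine scan over the filtered subset stands in for the "bespoke index" (a real vector database would do something more elaborate). The point is only that the search cost depends on the size of the filtered set, not of the master index.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_allowed(query_vec, corpus, allowed_ids, k=5):
    # Filter first: only the documents this user may see are ever scored.
    subset = [(doc_id, corpus[doc_id]) for doc_id in allowed_ids if doc_id in corpus]
    # With ~10 documents, an exact scan is effectively free.
    subset.sort(key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [doc_id for doc_id, _ in subset[:k]]

# Toy master index: 10000 documents with made-up embeddings.
corpus = {i: [math.sin(i), math.cos(i), 1.0] for i in range(10_000)}
# Hypothetical set of the 10 documents the user has access to.
allowed_ids = {3, 57, 412, 999, 1500, 2048, 4096, 7777, 8123, 9001}

results = search_allowed([0.0, 1.0, 1.0], corpus, allowed_ids, k=5)
print(results)
```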
"Fun" Fact: ServiceNow simply passes this problem on to its users.
I've seen a list of what was supposed to be 20 items of something; it showed only 2, plus a comment "18 results were omitted due to insufficient permissions".
(ServiceNow has at least three different ways to do permissions, I don't know if this applies to all of them).
> You'd have to reindex the metadata (roles access), which may be substantial if you have a complex enough schema with enough users/roles.
Right, but compare this to the original proposal:
> A basic implementation will return the top, let's say 1000, documents and then do the more expensive access check on each of them
Using an index is much better than that.
And it should be possible to update the index without a substantial cost, since most of the 100000 documents likely aren't changing their role access very often. You only have to reindex a document's metadata when that changes.
This is also far less costly than updating the actual content index (the vector embeddings) when the document content changes, which you have to do regardless of your permissions model.
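As a sketch of what "only reindex a document's metadata when it changes" could look like, here is a small Python class. The `AccessIndex` name and its methods are hypothetical, not any particular product's API. It keeps a role-to-documents mapping that is updated independently of the embeddings, so a role change touches a couple of sets and never the content index.

```python
class AccessIndex:
    """Hypothetical role -> document-id index, kept separate from embeddings."""

    def __init__(self):
        self.role_to_docs = {}  # role name -> set of doc ids
        self.doc_to_roles = {}  # doc id -> set of role names

    def set_roles(self, doc_id, roles):
        # Update only the entries that actually changed for this document.
        new_roles = set(roles)
        old_roles = self.doc_to_roles.get(doc_id, set())
        for role in old_roles - new_roles:
            self.role_to_docs[role].discard(doc_id)
        for role in new_roles - old_roles:
            self.role_to_docs.setdefault(role, set()).add(doc_id)
        self.doc_to_roles[doc_id] = new_roles

    def visible_docs(self, user_roles):
        # Union of the documents readable by any of the user's roles.
        visible = set()
        for role in user_roles:
            visible |= self.role_to_docs.get(role, set())
        return visible

acl = AccessIndex()
acl.set_roles("doc-1", ["engineering"])
acl.set_roles("doc-2", ["engineering", "finance"])
acl.set_roles("doc-3", ["finance"])

before = acl.visible_docs(["engineering"])
acl.set_roles("doc-2", ["finance"])  # one document's access changes
after = acl.visible_docs(["engineering"])
print(before, after)
```

The resulting set of visible IDs is what you would hand to the search step as its filter.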
I don't understand how "using an index" is a solution to this problem. If you're doing search, then you already have an index.
If you use your index to get search results, then you will have a mix of roles that you then have to filter.
If you want to filter first, then you need to make a whole new search index from scratch with the documents that came out of the filter.
You can't use the same indexing information from the full corpus to search a subset, your classical search will have undefined IDF terms and your vector search will find empty clusters.
If you want quality search results and a filter, you have to commit to reindexing your data live at query time after the filter step and before the search step.
I don't think Elastic supports this (last time I used it it was being managed in a bizarre way, so I may be wrong). Azure AI Search does this by default. I don't know about others.
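One way to read the IDF point, as a toy Python example. It assumes the plain, unsmoothed formula idf = log(N / df) and an invented four-document corpus, so treat it as an illustration of the claim rather than how any specific engine behaves. The same term gets a different weight on the filtered subset than on the full corpus, and a term that no longer occurs in the subset has no defined IDF at all.

```python
import math

def idf(term, docs):
    # Unsmoothed inverse document frequency: log(N / df).
    df = sum(1 for words in docs if term in words)
    if df == 0:
        return None  # log(N / 0) is undefined
    return math.log(len(docs) / df)

# Invented corpus: each document is just a set of words.
corpus = [
    {"quarterly", "report"},
    {"quarterly", "invoice"},
    {"invoice", "template"},
    {"report", "draft"},
]
# Hypothetical access filter: the user can see only two of the documents.
allowed = [corpus[0], corpus[3]]

print(idf("report", corpus))    # weight using full-corpus statistics
print(idf("report", allowed))   # weight using statistics of the filtered subset
print(idf("invoice", allowed))  # term is absent from the subset
```

In the full corpus "report" is moderately rare, but inside the subset it appears in every document, so it stops discriminating between results.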
Well, if you don’t have a pretty good idea of the problem you’re solving for the customer, you’re much better off trying as many things as you can, quickly and cheaply, to figure it out.
I'm not defending this at all, but one of the reasons why there are no (or few) humans that can be contacted is that they* said that it was tried before and it caused a lot more issues with mistakes/takeovers due to social engineering.
* Can't remember who said it but it was at a town hall this year
> one of the reasons why there are no (or few) humans that can be contacted is that they* said that it was tried before
This just sounds like yet another excuse for holding payroll down as much as possible.
If I am a customer of Amazon, Apple, Netflix, Walmart or any number of the other companies with a similar market cap, I can get access to real live human beings who provide customer support.
A ticketing company was experimenting with BLE beacons to trigger things like seat upgrades and coupons when people walked by certain things in a venue… or at least that’s what they said it would be used for.
Instead they covered LA Live and surrounding area with them and then just sold that data to… well I’m not sure who since I left shortly after they did that.
The justification was “but we put it in the TOS and Privacy Policy”.
>Yes the ticketing app that people were required to use to get into the events.
Well, at least Android 12 has granular Bluetooth permissions.
While the majority of tickets at LA Live are sold by one company currently, there are others, so it's impossible to know which specific company you're talking about.
On a completely unrelated note, there was a big thing surrounding privacy concerns with AXS' app a couple years ago.[0][1]
Clearly those claims were overblown. Surely AXS would never blanket an area with Bluetooth Low Energy (BLE) beacons and invade users' privacy like that. I refuse to believe AXS would make an app virtually mandatory and then violate users' privacy using physical Bluetooth beacons. Say it ain't so.
This was maybe 8 or so years ago so I'm not sure what else this unnamed ticketing company added since... but permissions were pretty lax at that point and had JUST started to tighten up.
Oh I forgot to mention that Apple rejected the iPhone version of the app at first because we didn't make it clear enough that we were tracking their locations like this. Our head of product at the time just called someone up at Apple and it got approved with no changes. It all stunk.
They cut non-business-critical travel, which many employees for some reason felt was a perk of the job. I heard second hand that some employees would do "office tours" just to try the lunches at the different offices (under the guise of doing some in-person meetings that certainly could have been done online). Cutting that kind of wasteful travel is necessary in an economic slowdown (and even necessary before that).
Perhaps necessary in an economic slowdown when you're a paper company with 2% margins... Let's be honest, Google does not "have" to do anything. They're just trying to grab some pennies.
The interesting part is that they felt the pennies nearest at hand were in their employee perks, rather than in expanding their business. That is always a bad sign even if the company we're talking about is as profitable as Google.
Not every business needs the best people working there, nor does every business need its employees to be highly motivated - it's a valid move in the extreme for Google to transition to being like SAP, but I can't believe all of the people in this thread that are defending that they have a right to do it without asking what their executives know that makes it seem so appealing.