Hacker News
Is Datalog a good language for authorization? (neilmadden.blog)
68 points by triska on Feb 19, 2022 | 21 comments



(biscuit author here) I like how that post points out a crucial issue in authorization systems: how powerful and flexible should we make them?

If you go towards flexibility, you get more complexity and possibly unexpected results, and policies become hard to audit and explain.

You can choose simple, single-purpose systems, like an RBAC-only authorization layer backed by a SQL database. You write tight, easy-to-understand policies. But systems always grow beyond the bounds of those rules, and you end up with a lot of ad hoc rules, or role explosion.

Datalog was a good tradeoff for Biscuit (https://www.biscuitsec.org): powerful enough to cover a lot of use cases, simple enough to be implemented the same way in multiple languages.


Well this was fun to see! I'm the CTO of Oso, where we're building Polar (the second of the links mentioned https://docs.osohq.com/).

I have a few really minor nitpicks, so I'll try to make up for it by adding to the discussion :)

First of all, it doesn't really make sense to talk about Datalog as a good language for authorization, because, much like with Prolog, there doesn't really exist a single implementation of it. OPA's language Rego is a Datalog variant, and Polar started out as a Prolog variant (although it's not really recognisable as one any more).

And that's an important point, because otherwise it would be pretty reasonable to conclude: logic programming is good for authorization => you should go find the most battle-tested logic language out there and use that. For example, there's SWI-Prolog [1] and Scryer Prolog [2], two of my favourites.

To me, the thing that is mind-blowing about logic programming is (a) how powerful the paradigm is, and (b) how concisely you can implement a logic programming language. Take miniKanren [3], which is a full-blown logic language in a few hundred lines of code.
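To make the "few hundred lines" point concrete, here is a sketch of the unification core such a language is built around, in Python. This is illustrative only: the "?" prefix for logic variables is my own convention, not miniKanren's actual API.

```python
def walk(term, subst):
    """Follow variable bindings in the substitution."""
    while isinstance(term, str) and term.startswith("?") and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst):
    """Return an extended substitution if a and b unify, else None."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if isinstance(a, str) and a.startswith("?"):
        return {**subst, a: b}
    if isinstance(b, str) and b.startswith("?"):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

# Unifying owner(?r, alice) with owner(doc1, ?u) binds both variables:
s = unify(("owner", "?r", "alice"), ("owner", "doc1", "?u"), {})
```

Everything else in a logic language (goals, conjunction, disjunction, the search strategy) is layered on top of this one operation, which is why the whole thing stays so small.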

In my mind, the original article makes a decent case that logic programming is a good fit for authorization. And just generally I love anyone bringing attention to that :)

But in my opinion the reason logic programming is such a solid foundation for authorization logic is the pieces you can build on top of it. For Polar, we've added:

- Types! So you can write authorization logic over your data types and help structure your logic. We implemented this by adding an additional operator to the language that checks types

- Authorization primitives like roles and relationships (RBAC and ReBAC). These are all backed by the logical rules, so can be easily extended with custom logic

- And last but not least... the ability to convert authorization logic into SQL [4], which is done by having the language return constraints over any unbound (free) variables.
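As a hypothetical sketch of that last point (the table, column, and function names here are invented for illustration, not Polar's actual output): partially evaluate the policy with a concrete user but the resource left unbound, and render the residual constraints as a WHERE clause.

```python
def authorized_resources_sql(user):
    # Residual constraints a policy engine might derive from rules like:
    #   allow if resource.owner = user
    #   allow if resource.public
    constraints = [
        f"owner_id = {user['id']}",
        "is_public = TRUE",
    ]
    return "SELECT * FROM documents WHERE " + " OR ".join(constraints)

print(authorized_resources_sql({"id": 42}))
# SELECT * FROM documents WHERE owner_id = 42 OR is_public = TRUE
```

The payoff is that "list everything this user can see" becomes one database query instead of a filter applied record-by-record after fetching.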

To me this is what makes logic programming exciting for authorization. It gives you this small kernel of declarative programming, and gives you a ton of freedom to build on top.

[1] https://www.swi-prolog.org/

[2] https://github.com/mthom/scryer-prolog

[3] http://minikanren.org/

[4] https://www.osohq.com/post/authorization-logic-into-sql


As a practical matter in an enterprise use case, you really want to be able to tell the user why their request is denied and what they can do about it. One thing I really appreciate at my employer is that any "access denied" message provides an on-ramp to the process of either amending my request to something that's within my permissions, or else getting the requisite approvals.

How do you get this kind of explainability when the authorization decision involves such advanced magic as logic programming languages?


>> How do you get this kind of explainability when the authorization decision involves such advanced magic as logic programming languages?

You could do this by generating a trace of the execution of the logic program, as you run the program. For example you can write:

  allow(update, Resource, User, Reason):-
      Reason = owner(Resource, User),
      Reason.

Although that is probably not very idiomatic datalog. That is, I don't know if you can "call" a literal bound to a variable, as I do in the last line of the program by writing "Reason" as a goal after binding it to "owner(Resource, User)". If you do that in Prolog, the Prolog engine will prove the literal bound to "Reason" and execute the "owner" program. I don't know if that works in datalog. You can of course do all this without call-magic:

  allow(update, Resource, User, Reason):-
      owner(Resource, User),
      Reason = owner(Resource, User).
Historically, mechanisms of explanation very similar to the one above were common in expert systems of old, at least the ones implemented in Prolog. There used to be a nice free e-book describing how to build expert systems in Prolog in a quick-and-dirty fashion, but I can't find it online anymore. There's this book by Amzi Prolog though:

https://amzi.com/distribution/files/xsip_book.pdf

And it has a section on "Explanation" that describes a tracing mechanism based on the classic box-model debugger of Prolog engines. The technique can be applied to datalog, and the program doesn't have to be an expert system, to be clear.
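As a toy illustration of the same trace idea outside of Prolog (all fact and rule names here are invented), an evaluator can return the rule that fired alongside the decision:

```python
# A tiny rule base: each rule is a human-readable reason plus a test.
facts = {("owner", "doc1", "alice"), ("member", "folder1", "bob")}

rules = [
    ("you own the resource",
     lambda action, res, user: ("owner", res, user) in facts),
    ("you are a member of the folder",
     lambda action, res, user: action == "read"
         and ("member", "folder1", user) in facts),
]

def allow(action, resource, user):
    """Return (decision, reason) instead of a bare boolean."""
    for reason, test in rules:
        if test(action, resource, user):
            return True, reason
    return False, "no rule granted access"
```

The point is the shape of the return value: the decision carries its own explanation, which is exactly what the expert-system tracers did.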


I think security systems tend to tell as little as possible in these cases because to do otherwise reveals something of the internal structure of a system, potentially indicating vulnerable points.

IIRC there was a post on naming servers where someone said that if they knew one server was called Bilbo, then they had a pretty good idea what the other servers might be named after. Seemed a good point.


This seems misguided. If you are taking principle-of-least-privilege seriously (and it seems that's what people are trying to do with very sophisticated authz systems) then legitimate users will bounce off the edges of their privileges very frequently. If they can't figure out what those edges are or how to expand the frontier when necessary, they're not going to be able to accomplish necessary tasks.

A policy of "stay the hell away, don't even ask" only works if you're drawing the boundary very far from the user's legitimate sphere of influence.


Dunno, just reporting what I understood to be the case. A distinction can perhaps be made between company-internal and company-external attempts at access, but that is pretty risky.

Any solution to the valid point you have made would have to be organisational.

But the saying goes something like "security is annoying, good security is very annoying" - anyone know the original?


This is definitely a consideration for authorization. To do it well you need to be able to distinguish when you want to reveal information.

For example, our APIs push people towards returning a 404 if the user doesn't have read access (you shouldn't even know it exists!), but a 403 if you can read but not edit.

You would probably want to do something similar here -- only return a reason if the user is allowed to know the resource exists but doesn't have access. (E.g. it might be "read_exist" vs "read_details" -> "you cannot read the details of this document because you are not a member of this folder".)
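A minimal sketch of that convention (hypothetical permission names, not Oso's actual API):

```python
def status_for(user_perms, action):
    """Map a permission check to an HTTP status, hiding existence
    from users who lack read access."""
    if "read" not in user_perms:
        return 404          # pretend the resource doesn't exist
    if action not in user_perms:
        return 403          # it exists, but this action is denied
    return 200

status_for({"read"}, "edit")   # 403: may read, but not edit
status_for(set(), "read")      # 404: existence hidden entirely
```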


Puzzled: why is the conversion to SQL either interesting (SQL being a logic language, so, disregarding recursion[1], they should map straightforwardly[2]) or of business value?

[1] yeah, CTEs, I know

[2] yeah, words are cheap from my end, I know

Edit: additionally, and out of curiosity: the article covers the use of datalog but no underlying security model (e.g. consistency of access to resources, so that you don't have one route blocked to you but can get around it via another; I know of one homebrew system where you could do that). How do you deal with that? I understand lattices can be used for this (TBH I have only done a little reading on these).


- potential to push policy execution down to the db, e.g. for authorizing bulk-oriented queries vs record-level checks

- potential for use with RLS

- potential for analysis tools

In practice it seems boring, but in principle, interesting. I'd love to see actual staged compilation/generation & integration (so no runtime service) and, for analysis, conversion to Z3/SMT. We went with Casbin because of the former, and it seems a simpler leap for the latter.


RLS?


Row-level security, I assume.



No mention of XACML?


>> The other factor we need to consider is that the polynomial data complexity of Datalog is based on the most basic form, which doesn’t have function symbols (ie datatypes like lists) or arithmetic. Once you start adding such features (which almost all real-world Datalogs do), the data complexity inevitably goes up.

The informal definition of Datalog is that it is a decidable subset of Prolog, or a pure subset of Prolog, etc., but I think it's more accurate to say that datalog clauses are a subset of the definite clauses, such that a definite clause C is datalog if and only if a) C has no functions other than constants, and b) every variable in the head literal of C is shared with a literal in the body of C.

For example, below, C is datalog, D and E are not:

  C = P(x,y) ← Q(x,z), R(z,y)    % datalog
  D = P(x,y) ← Q(x,z)            % z is not shared with head(D)
  E = P(x,y) ← Q(f(x),z), R(z,y) % f(x) is not a constant function.
Note that Prolog programs are sets of definite clauses, but without the restrictions (a) and (b) of datalog.

As the article points out, lists need functions, so for example p([H|T], H) is not datalog because [H|T] is a function, and for arithmetic n(s(0)) is not datalog because s(0) is a function.

So what's up with functions? In logic programming a function is a function symbol followed by n terms, where n is the arity of the function symbol, so if f(x,y) is a function then f is a function symbol with arity 2. Unlike in imperative programming, functions in logic programming _are not replaced by their values_; instead the truth of a formula containing functions is calculated by substituting the variables in the function. So f(x) can become f(a), f(b) etc. where a and b are constants.

Problem is, variables in logic programming can be substituted for arbitrary terms and functions themselves are terms, so given the function symbol f of arity 1 we can make an infinite number of functions: f(x), f(f(x)), f(f(f(x))), f(f(f(f(x)))), etc. That's one important reason for the undecidability of First Order Logic (FOL). Definite logic, the logic of definite clauses, which is a restriction of FOL, is _semi_ decidable, meaning that if a theorem doesn't have a proof, there is no way to know in finite time.

Datalog doesn't have functions, so it removes this particular source of undecidability. If, in addition to not having functions, the Herbrand Universe (the set of all constants) is finite, then datalog is decidable, informally because every proof can only replace the variables in a program with a finite set of constants, which means there's a finite set of proofs for every program.

What about the requirement that the head literal shares variables with the body literals? That's to ensure that the entire datalog program can be ground. Datalog is evaluated "bottom up" meaning that literals in the body are ground before literals in the head, so in a clause like D above, even if Q(x,z) is grounded during evaluation, the variable y in the head of the clause cannot be grounded because it's not shared with the body. Which means that we can't prove that P(x,y) is a consequence of Q(x,z). In a clause like C, on the other hand, we can ground every variable by grounding the literals in the body and so bottom-up evaluation is possible. Bottom-up evaluation is desired because it avoids left-recursions and it makes the ordering of clauses irrelevant for the proof of a program, unlike in Prolog. This is an attractive property of datalog.
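A toy illustration of bottom-up evaluation for the clause C above, P(x,y) ← Q(x,z), R(z,y), in Python (real engines use semi-naive evaluation and indexing, so treat this only as a sketch of the fixed-point idea):

```python
# Ground facts for Q and R, as sets of tuples.
Q = {("a", "b"), ("c", "d")}
R = {("b", "e"), ("d", "f")}

def bottom_up():
    """Derive all P facts by joining ground Q and R facts on z,
    repeating until no new fact appears (the fixed point)."""
    P = set()
    changed = True
    while changed:
        changed = False
        for (x, z1) in Q:
            for (z2, y) in R:
                if z1 == z2 and (x, y) not in P:
                    P.add((x, y))
                    changed = True
    return P
```

Because every variable in the head is shared with the body, grounding the body literals fully grounds the head, which is what makes this body-first strategy possible.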

Now, about functions, the Inductive Logic Programming community knows a few tricks to retain the decidability of datalog programs even with functions. The sneakiest such trick is "flattening" where functions are "hidden" away from the datalog clauses. For instance, instead of writing:

  member(X, [X|Xs]) ← ... 
With flattening we can write:

  member(X, Xs) ← head(Xs,X), ... 
Where "head" is defined as head([X|Xs],X). With flattening we can keep some parts of the program datalog and the rest definite, and so have our cake and eat it.


Interesting post, will spend a little time reading it later. In the meantime, a quick Q: this

D = P(x,y) ← Q(x,z) % z is not shared with head(D)

throws me a bit, isn't it equivalent (as z is unused) to

D = P(x,y) ← Q(x)

which makes me wonder what y is doing. Would that leave y unbound, and if so is that why it's disallowed?

TIA


Yes, sorry, the example is not as simple as Einstein would have advised. Yours is simpler. The rule is that every variable in the head must be shared with the body, so that the head can be ground as the body is ground. I make this clearer in the rest of the comment above, at least I hope :)


Sure it is, because the language of Open Policy Agent, called Rego, is a Datalog variant. I find a researcher’s failure to mention this concerning.


The second sentence of the post links to OPA as an example of a system using Datalog for authz.

Also, "X uses Y for Z" doesn't imply that Y is good at Z. It can be the start of an argument about how that choice is turning out for X, but it's in no way an argument by itself.


My assertion is glib, but your response is wholly insubstantive. Can you point out how this piece treats Rego as an actual example of a real-world Datalog? I don't see any link when it mentions real-world Datalogs.

The fact that this piece makes no mention of it but provides a drive-by link to it suggests that the author is not actually interested in evaluating the problem in a realistic scenario. The piece is the literary equivalent of a reminder that one can harm oneself with a gun, without ever actually discussing how one should implement gun safety.




