01 What permission-aware RAG means
Permission-aware RAG is retrieval-augmented generation that enforces each user's access rights as part of retrieval. Before the prompt is assembled, every candidate record and field is checked against that user's identity and clearance. Authorized data passes through; data the user is not entitled to see is withheld before it ever reaches the language model.
Standard RAG has one job during retrieval: find the most relevant passages and put them in the context window. That is an information-retrieval objective, and it is indifferent to who is asking. A vector index does not know that a support agent should not see a patient's diagnosis, or that a junior analyst should not see an unredacted account number. Relevance ranks; it does not authorize. Permission-aware RAG adds the missing question to retrieval: not just "is this relevant," but "is this person allowed to see it."
The distinction matters because the model is downstream of retrieval. Whatever the retriever places in context, the model can read, reason over, quote, summarize, and surface. If retrieval is blind to permissions, then a well-phrased question becomes an access-control bypass, and the model becomes a confused deputy acting with more privilege than the person at the keyboard.
02 Relevance is not permission
A retriever's notion of "best match" has nothing to do with authorization. The chunk most relevant to a question about a customer might be the exact field that the customer's contract, or the law, says this particular employee cannot see. Ranking by similarity surfaces what is useful; it is silent on what is allowed.
This is the thesis that separates permission-aware RAG from ordinary RAG. Treating the two as the same is how sensitive data ends up in answers it should never have reached. The fix is not a better ranker. It is a separate, explicit authorization step that sits between retrieval and the prompt and has the power to remove records and redact fields regardless of how relevant they are.
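To make the separation concrete, here is a minimal sketch in which ranking and authorization are two distinct functions. The `Chunk` type, the `min_clearance` label, and the integer clearance values are assumptions made for illustration; they are not taken from any particular retriever or from Custosa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float        # similarity score assigned by the retriever
    min_clearance: int  # hypothetical label: lowest clearance allowed to read it


def rank(chunks, k):
    """Ordinary retrieval: order by relevance only."""
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:k]


def authorize(chunks, actor_clearance):
    """Separate step: drop what the actor may not see, however relevant."""
    return [c for c in chunks if actor_clearance >= c.min_clearance]


def build_context(chunks, actor_clearance, k=3):
    # Authorization sits between retrieval and prompt assembly.
    return [c.text for c in authorize(rank(chunks, k), actor_clearance)]


corpus = [
    Chunk("Diagnosis: type 2 diabetes", score=0.93, min_clearance=3),
    Chunk("Next appointment: Tuesday 10:00", score=0.71, min_clearance=1),
    Chunk("Clinic opening hours", score=0.40, min_clearance=0),
]

# The top-ranked chunk is removed for a low-clearance actor.
print(build_context(corpus, actor_clearance=1))
# ['Next appointment: Tuesday 10:00', 'Clinic opening hours']
```

The highest-scoring chunk is exactly the one that gets dropped, which is the point: the ranker did its job, and the authorization step overruled it. In this sketch, filtering after the top-k cut can return fewer than `k` chunks; a real pipeline might instead over-fetch candidates before filtering.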
03 ACL-aware retrieval vs post-hoc output filtering
There are two places you can try to enforce access in a RAG pipeline: at retrieval, before the model sees anything, or at the output, after the model has answered. ACL-aware retrieval filters candidate records and redacts sensitive fields according to the requesting actor's permissions before the prompt is built. Post-hoc output filtering lets the model read everything, then scans the generated text and tries to scrub anything that should not have appeared.
The difference is not cosmetic. Output filtering is detection after the fact. By the time it runs, the model has already ingested the sensitive data, and that data may persist in the context window, in conversation memory, in logs, and in traces. A model cannot reliably un-see what it has read, and a post-hoc scrubber has to recognize every form a leak can take. ACL-aware retrieval removes the data at the source, so there is nothing for the model to leak.
| Property | ACL-aware retrieval | Post-hoc output filtering |
|---|---|---|
| When it acts | Before the prompt is built | After the model answers |
| Does the model see sensitive data? | No, it is withheld | Yes, then scrubbed |
| Leak in context, logs, traces? | Prevented at source | Still possible |
| Granularity | Per record and per field | Whatever the scrubber catches |
| Failure mode | Fail-closed, withhold on doubt | Misses produce silent leaks |
| What it is | Prevention | Detection |
Output filtering is not worthless; it can be a useful second layer. But it cannot be the primary control, because its best case is catching a leak that has already happened inside the system. Prevention belongs upstream of the model. For the broader pattern across a RAG system, see RAG security.
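The weakness of a scrubber is easy to show with a toy example. The sketch below assumes a hypothetical account-number format and a simple regex scrubber; real output filters are more sophisticated, but they face the same problem of having to anticipate every form a leak can take.

```python
import re

# Hypothetical account-number format, used only for this illustration.
ACCOUNT_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}\b")


def scrub_output(text):
    """Post-hoc filter: mask anything that looks like an account number."""
    return ACCOUNT_RE.sub("[REDACTED]", text)


def withhold_at_retrieval(record, allowed_fields):
    """ACL-aware retrieval: the model only ever receives the allowed fields."""
    return {k: v for k, v in record.items() if k in allowed_fields}


record = {"name": "A. Rivera", "account_number": "4417-2290-0031"}

# Two answers a model could produce after reading the full record.
verbatim = "The account number is 4417-2290-0031."
reworded = "The account starts with 4417, then 2290, and ends in 0031."

print(scrub_output(verbatim))   # the pattern matches and is masked
print(scrub_output(reworded))   # no match: the digits pass through
print(withhold_at_retrieval(record, allowed_fields={"name"}))
```

The reworded answer contains every digit of the account number and passes the scrubber unchanged. With the field withheld at retrieval, the model has nothing to reword.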
04 Per-actor identity and clearance resolution
Enforcing permissions at retrieval requires knowing who is asking. The first step is to resolve the requesting actor's identity, not the application's service account. In many RAG deployments the retriever runs as a single high-privilege identity, which is exactly how a low-privilege user inherits access they should not have. Permission-aware RAG carries the end user's identity through to the authorization decision.
Custosa maps each verified identity to a five-level clearance lattice, a small ordered set of levels that says what an actor at each level may see. Roles from your identity provider resolve to a clearance level, and that level governs every field decision for that request. The lattice is intentionally coarse and ordered, which makes verdicts predictable and easy to reason about: a higher level can see everything a lower level can, plus more.
Identity resolution also has to be honest about uncertainty. If the actor's clearance cannot be established, the safe behavior is to withhold rather than guess. Authorization that defaults to "allow" when it is unsure is not authorization.
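A small ordered lattice with fail-closed resolution can be sketched as follows. The level names, the role names, and the choice to take the highest level among an actor's known roles are all assumptions for illustration; the article does not specify how Custosa names its levels or combines roles.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Clearance(IntEnum):
    # Five ordered levels. The names are placeholders, not Custosa's.
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


# Hypothetical mapping from identity-provider roles to clearance levels.
ROLE_TO_CLEARANCE = {
    "support_agent": Clearance.L1,
    "junior_analyst": Clearance.L2,
    "compliance_officer": Clearance.L4,
}


def resolve_clearance(roles: Iterable[str]) -> Optional[Clearance]:
    """Return the highest level among known roles, or None if none is known."""
    levels = [ROLE_TO_CLEARANCE[r] for r in roles if r in ROLE_TO_CLEARANCE]
    return max(levels) if levels else None


def may_see(actor: Optional[Clearance], required: Clearance) -> bool:
    """Fail closed: an unresolved clearance is allowed to see nothing."""
    return actor is not None and actor >= required


print(may_see(resolve_clearance(["support_agent"]), Clearance.L1))  # True
print(may_see(resolve_clearance(["contractor"]), Clearance.L0))     # False
```

Because the levels are ordered integers, "a higher level can see everything a lower level can" reduces to a single comparison. An unknown role resolves to `None` rather than to the lowest level, so uncertainty withholds even the least sensitive fields.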
05 Field-level redaction before the prompt
Records are rarely all-or-nothing. A single patient record might contain a name the agent may see, an appointment time they may see, and a diagnosis they may not. Coarse, record-level filtering forces a bad trade: drop the whole record and lose useful context, or include it and leak the sensitive field. Permission-aware RAG resolves this at the field level.
The unit of decision is a per-field verdict: for this actor, each field is either PASS or REDACT. Authorized fields flow into the prompt; sensitive fields are removed or replaced before the prompt is assembled. The model receives a record it can use for the legitimate parts of the task, with the protected fields simply absent. Because the redaction happens before inference, the model cannot leak what it never received. This is the operational meaning of authorization before augmentation: clear the data, then augment the prompt.
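The per-field verdict can be sketched in a few lines. The field names, the per-field clearance thresholds, and the prompt layout below are hypothetical; the sketch only illustrates the order of operations, verdicts first and prompt assembly second.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    REDACT = "REDACT"


# Hypothetical minimum clearance per field of a patient record.
FIELD_MIN_CLEARANCE = {"name": 1, "appointment_time": 1, "diagnosis": 3}


def field_verdicts(record, actor_clearance):
    """Return a PASS or REDACT verdict for every field in the record."""
    verdicts = {}
    for field in record:
        required = FIELD_MIN_CLEARANCE.get(field)
        # Fields with no known threshold are redacted: doubt means withhold.
        allowed = required is not None and actor_clearance >= required
        verdicts[field] = Verdict.PASS if allowed else Verdict.REDACT
    return verdicts


def redact(record, verdicts):
    """Keep only the fields that received PASS."""
    return {f: v for f, v in record.items() if verdicts[f] is Verdict.PASS}


def build_prompt(question, record):
    facts = "\n".join(f"{k}: {v}" for k, v in record.items())
    return f"Context:\n{facts}\n\nQuestion: {question}"


patient = {
    "name": "J. Doe",
    "appointment_time": "Tue 10:00",
    "diagnosis": "asthma",
    "insurer_id": "X-19",  # not listed in the thresholds above
}
verdicts = field_verdicts(patient, actor_clearance=1)
prompt = build_prompt("When is the next visit?", redact(patient, verdicts))
print(prompt)
```

The prompt still carries the name and appointment time, so the model can answer the scheduling question, while the diagnosis and the unclassified `insurer_id` never enter it.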
06 Deterministic policy, so decisions are explainable
Who decides PASS or REDACT, and on what basis? In permission-aware RAG the decision must be reproducible and auditable, which rules out using a model to judge sensitivity. Custosa evaluates every field with a deterministic formal policy engine, Cedar. Given the same actor, the same field, and the same policy, it always returns the same verdict. There is no sampling, no temperature, no drift between two identical requests.
Determinism buys two things that matter for access control. First, explainability: a verdict traces to a specific rule, so you can answer why a field was redacted rather than shrug at a model's opinion. Second, reproducibility: an auditor can replay a decision and get the same result, which is the difference between a control and a guess. The behavior is fail-closed; if the engine cannot reach a verdict, it blocks rather than letting the request through.
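The properties described above can be illustrated with a plain rule table. This is a stand-in written in Python, not Cedar and not Custosa's engine; the rule identifiers and thresholds are hypothetical. It shows only the shape of a deterministic, explainable, fail-closed evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Decision:
    verdict: str            # "PASS" or "REDACT"
    rule_id: Optional[str]  # the rule that produced the verdict, for audit


@dataclass(frozen=True)
class Rule:
    rule_id: str
    field: str
    permits: Callable[[int], bool]  # pure function of the actor's clearance


# Hypothetical rules; a real deployment would express these as policy.
RULES = (
    Rule("allow-name", "name", lambda c: c >= 1),
    Rule("allow-diagnosis", "diagnosis", lambda c: c >= 3),
)


def evaluate(clearance, field, rules=RULES):
    """Same inputs and same rules give the same verdict; nothing is sampled."""
    try:
        for rule in rules:
            if rule.field == field and rule.permits(clearance):
                return Decision("PASS", rule.rule_id)
        return Decision("REDACT", "default-deny")
    except Exception:
        # Fail closed: an evaluation error never lets a field through.
        return Decision("REDACT", "engine-error")


print(evaluate(3, "diagnosis"))
print(evaluate(1, "diagnosis"))
print(evaluate(None, "name"))  # unresolved clearance triggers the error path
```

Every decision names the rule that produced it, so "why was this redacted?" has a concrete answer, and replaying the call yields an identical `Decision`. In this sketch an engine error withholds the field; the behavior described above blocks the request, which is the stricter variant of the same fail-closed idea.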
07 Content-free evidence of every retrieval decision
Enforcing permissions is necessary; being able to prove it is what makes the control trustworthy. Every retrieval decision in Custosa is recorded as evidence: which actor, which fields passed, which were redacted, under which policy. Each entry is signed with HMAC-SHA256 and hash-chained to the previous one, producing an append-only, tamper-evident ledger.
The evidence is content-free. It records that a field was redacted and under what rule, never the field's value. The data plane that performs inspection runs inside your environment, so records never cross your boundary; the ledger carries verdicts, hashes, and signatures, not content. The chain can be verified independently and offline, without contacting Custosa, so an auditor can confirm that retrieval enforced the policy you configured without anyone re-exposing the protected data to check.
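A signed, hash-chained ledger of this kind can be sketched with Python's standard `hmac` and `hashlib` modules. The entry layout, field names, and genesis value below are assumptions for illustration and are not Custosa's actual format. Note that verifying an HMAC requires the signing key, so in this sketch the verifier is assumed to hold it.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder for the first entry's prev_hash


def canonical(body):
    """Serialize deterministically so hashes are reproducible."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def append_entry(ledger, key, actor, policy_id, passed, redacted):
    body = {
        "actor": actor,
        "policy_id": policy_id,
        "passed": sorted(passed),      # field names only, never values
        "redacted": sorted(redacted),
        "prev_hash": ledger[-1]["hash"] if ledger else GENESIS,
    }
    payload = canonical(body)
    entry = dict(body)
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    entry["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    ledger.append(entry)
    return entry


def verify(ledger, key):
    """Recompute every hash, signature, and chain link; no network needed."""
    prev = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k not in ("hash", "signature")}
        payload = canonical(body)
        expected_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != hashlib.sha256(payload).hexdigest():
            return False
        if not hmac.compare_digest(entry["signature"], expected_sig):
            return False
        prev = entry["hash"]
    return True


key = b"demo-key-not-for-production"
ledger = []
append_entry(ledger, key, "user:42", "policy-v1", ["name"], ["diagnosis"])
append_entry(ledger, key, "user:7", "policy-v1", ["name", "diagnosis"], [])
print(verify(ledger, key))     # True

ledger[0]["redacted"] = []     # tamper with a recorded decision
print(verify(ledger, key))     # False
```

Each entry records which fields passed and which were redacted, but no field values, so the ledger can be handed to an auditor without re-exposing protected data. Changing any recorded decision invalidates that entry's hash and signature, and removing or reordering entries breaks the `prev_hash` links.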
08 Implementation patterns and a checklist
Permission-aware RAG is an architecture choice, not a single feature you bolt on at the end. A few patterns make it durable.
Put the control between the data and the model. The authorization step belongs inline in retrieval, after candidate selection and before prompt assembly, so no path reaches the model without passing through it. A control that can be skipped is not a control.
Carry the end-user identity end to end. Do not let the retriever collapse every request into one service account. The clearance that governs redaction must be the requesting user's.
Decide at the field level and fail closed. Per-field PASS or REDACT preserves useful context while protecting the sensitive parts, and a fail-closed default means doubt resolves to withholding.
- Resolve the end user's identity, not the app's, on every retrieval.
- Map identity to a clearance level that governs field decisions.
- Enforce authorization before the prompt is built, not on the output.
- Decide per field: PASS or REDACT, never all-or-nothing per record.
- Use a deterministic policy so verdicts are explainable and reproducible.
- Fail closed when clearance or a verdict cannot be established.
- Record content-free, tamper-evident evidence of every decision.
See permission-aware retrieval in practice
Custosa inspects every record and field at runtime, before the model sees it, and signs every decision into a content-free evidence ledger.
Frequently asked questions
What is permission-aware RAG?
Permission-aware RAG is retrieval-augmented generation that enforces each user's access rights as part of retrieval, so the model is only ever given records and fields that user is authorized to see. Access is resolved per actor and applied before the prompt is built, not after the answer is generated.
What is ACL-aware retrieval?
ACL-aware retrieval is retrieval that respects the access control list, or permission model, of the source data. Candidate records are filtered and sensitive fields are redacted according to the requesting actor's identity and clearance before any content reaches the language model, so relevance never overrides permission.
Why is filtering the output not enough?
Filtering the output is detection after the fact. By the time the model has generated text, it has already read the sensitive data, and that data may persist in context, logs, or traces. A model cannot reliably scrub what it has already absorbed. Withholding the data before inference removes the leak at its source.
How do you enforce permissions at retrieval time?
Resolve the requesting actor's identity and clearance, then evaluate each candidate record and field against a deterministic policy before the prompt is assembled. Authorized fields pass through; sensitive fields are redacted; if no verdict can be reached, the request fails closed. Every decision is recorded as content-free evidence.
Does this slow RAG down?
The added cost is small relative to model inference, because decisions are deterministic policy evaluations rather than model calls. Custosa targets a p99 added latency of 50 to 110 ms. The check runs inline during retrieval, so it adds a bounded, predictable step rather than a second round trip to a model.