SR 11-7 for LLMs: Model Risk Management for GenAI

SR 11-7 was the supervisory guidance on model risk management issued by the Federal Reserve and the Office of the Comptroller of the Currency in 2011 (OCC Bulletin 2011-12). It defined model risk as the potential for adverse consequences from decisions based on incorrect or misused model output, and it laid out how a supervised institution should manage that risk across a model's life cycle through three pillars: sound development, implementation, and use; effective independent validation; and governance, policies, and controls. Its definition of a model was broad enough that a large language model producing a score, classification, or decision in a banking process looked like a model under that framework. In April 2026, SR 26-2 (OCC Bulletin 2026-13) rescinded and replaced SR 11-7 and explicitly excluded generative and agentic AI from its formal scope, so an LLM is no longer formally inside the perimeter SR 11-7 once drew.

That exclusion is a deliberate guidance gap, not a green light. Supervisors still expect banks to apply model-risk principles to consequential AI, and the agencies plan a request for information with further AI-specific guidance to follow. The instinct to treat a generative assistant as a mere productivity tool, which leaves it outside the inventory, outside validation, and outside governance, is exactly the instinct examiners push back on. The work of applying model-risk discipline to AI is mostly the work of closing that gap now rather than waiting.

01 What SR 11-7 was

SR 11-7 was issued jointly by the Federal Reserve and the OCC as supervisory guidance on model risk management. It built on earlier OCC guidance and became the reference standard for how banks identify, measure, and control the risk that a model creates. Its central idea was that models are useful but fallible, and that the fallibility itself is a risk to be managed, not merely a technical detail.

The guidance defined model risk as the potential for adverse consequences from decisions based on incorrect or misused model output. It then attributed that risk to two sources: a model may have fundamental errors and produce inaccurate output, or a model may be used incorrectly or inappropriately even if it is sound. Both sources still matter for AI, and the second is easy to underweight.

SR 11-7 was principles-based rather than prescriptive. It did not hand institutions a checklist of approved methods; it set expectations and left each institution to meet them in proportion to the risk a model carried. That proportionality is why a high-stakes credit model and a low-stakes drafting assistant were never held to the same depth of validation.

One development needs to be stated precisely: in April 2026 the OCC, Federal Reserve, and FDIC issued revised interagency model risk management guidance, SR 26-2 (OCC Bulletin 2026-13), that rescinded and replaced SR 11-7 for covered banks. Traditional quantitative models, such as credit scoring, market risk, and regulatory capital models, remain in scope. The new guidance explicitly excludes generative and agentic AI, stating that those systems are novel and rapidly evolving and are not within its scope, and the agencies plan a request for information on model risk management and banks' use of AI. The SR 11-7 way of thinking about model risk still informs the new framework, and the supervisory expectation to apply model-risk principles to consequential AI does not disappear; it now sits in a deliberate gap rather than in the formal rulebook.

02 Why an LLM looked like a model under SR 11-7

SR 11-7 defined a model as a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates. The definition was deliberately broad, and it described a model as having three components: an information input, a processing component, and a reporting component that turns estimates into useful output.

A large language model fit this on its face. It takes input, applies a statistical method learned from data, and produces output. When that output is a credit recommendation, a fraud score, a suitability classification, an alert disposition, or a figure that feeds financial reporting, the LLM is doing exactly what the definition described: turning inputs into quantitative or decision-relevant estimates that a bank acts on. The 2026 guidance ended the formal version of that classification by excluding generative and agentic AI, but the function-based logic is why model-risk principles still fit these systems.

The places an LLM most often crosses into model territory include:

Credit and underwriting, where a model summarizes or scores an applicant's file and the output shapes a lending decision.
Fraud and financial-crime surveillance, where a model triages alerts, classifies transactions, or drafts suspicious-activity narratives.
Financial reporting, where a model produces or supports a number that lands in a control over financial reporting, which pulls ICFR into view.
Customer-facing decisions, where a model influences pricing, eligibility, or treatment in ways that touch fair-lending and consumer-protection law.

The 2026 guidance carves generative and agentic AI out of the formal model-risk scope, describing those systems as novel and rapidly evolving, with a request for information and further AI-specific guidance expected. Even so, supervisors expect banks to apply model-risk principles to consequential AI. The lesson for practitioners is the same either way: if an LLM influences a decision a regulator cares about, treat it with model-grade discipline now rather than wait for the gap to close. This is the heart of financial services AI compliance.

03 The three components examiners expect

SR 11-7 organized model risk management around three reinforcing components, and those principles carry forward as the model-risk discipline supervisors expect banks to apply to consequential AI. An examiner reviewing an AI deployment will still look for all three, and a weakness in any one undermines the others.

Robust development, implementation, and use. The institution should be able to show how the model was built, what data and assumptions it rests on, what its known limitations are, and how it is actually used in production. For an LLM this includes the data the model can reach, the prompts and guardrails around it, and the controls that govern what it is permitted to see and do.

Effective validation by an independent party. Validation is the set of activities that confirm a model works as intended and is suitable for its purpose, performed by people independent of those who built it. It includes evaluation of conceptual soundness, ongoing monitoring, and outcomes analysis. SR 11-7 also recognized effective challenge, the critical review by competent and independent parties, as the engine that makes validation real rather than ceremonial.

Governance, policies, and controls. The framework that ties the first two together: clear ownership, board and senior-management oversight, policies that define how models are approved and reviewed, and controls that enforce them. Without governance, good development and validation are unrepeatable accidents.

The table below maps each expectation to a concrete practice for an LLM deployment.

SR 11-7 expectation	What to implement for an LLM
Robust development and use	Document the model's purpose, data access, prompts, and guardrails; define and constrain what the model is allowed to see and do.
Independent validation	Independent testing of behavior across benchmark sets, probing for hallucination and data exposure, with documented effective challenge.
Ongoing monitoring	Track output quality, drift, and control performance in production; re-validate on a defined cadence and after material change.
Governance and controls	Named model owner, approval and review policy, board and management oversight, and enforced access and use controls.
Model inventory	Register the LLM and every AI-driven decision point, with owner, inputs, outputs, limitations, and validation status.
Documentation and audit	Maintain development and validation records plus a verifiable trail of what the system decided and why.

04 The model inventory and why it is the top finding

SR 11-7 expected an institution to maintain a comprehensive inventory of all models in use, in development, or recently retired, and a complete inventory remains a core model-risk principle under the 2026 guidance. The inventory is not a formality. It is the control that makes every other control possible, because a system that is not on the list is one that no one validates, monitors, or owns. Extending it to consequential AI is sound practice even though generative AI sits outside the formal 2026 scope.

An incomplete model inventory is the single most common examination finding in model risk management, and AI makes the problem worse for a specific reason: LLM-based capability rarely arrives through a formal model-approval process. It shows up as a feature inside a vendor product, a copilot bolted onto a workflow, or a script a team wrote to summarize documents. None of those announce themselves as models, so none of them land in the inventory by default, and each becomes an ungoverned decision point.

A defensible AI inventory entry captures, at minimum:

The model's purpose and the decisions it influences.
Its owner, the accountable individual or function.
Its inputs and outputs, including what data it can reach.
Its known limitations, such as a tendency to hallucinate or to over-collect data.
Its validation status and the date of last review.
Whether it is vendor-supplied or embedded, since third-party models are in scope and the bank remains responsible for them.

The practical discipline is to inventory AI at the level of the decision, not the tool. One vendor platform may host several distinct AI-driven decisions, and each is a separate item to govern. Treating the platform as a single line hides exactly the risk the inventory exists to surface.
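As an illustration of inventorying at the level of the decision, here is a minimal Python sketch of an inventory record. The class name, field names, platform name, and sample values are all hypothetical; neither SR 11-7 nor the 2026 guidance prescribes a schema. The fields simply mirror the minimum list above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class InventoryEntry:
    """One AI-driven decision point. Hypothetical schema, for illustration only."""
    decision: str                  # the decision the model influences
    platform: str                  # the tool or vendor product hosting it
    owner: str                     # accountable individual or function
    inputs: tuple[str, ...]        # data the model can reach
    outputs: tuple[str, ...]
    limitations: tuple[str, ...]
    validation_status: str         # e.g. "validated" or "pending"
    last_review: Optional[date]
    vendor_supplied: bool


def governance_gaps(entries: list[InventoryEntry]) -> list[str]:
    """Return the decisions that lack an owner, a validation, or a review date."""
    return [
        e.decision
        for e in entries
        if not e.owner or e.validation_status != "validated" or e.last_review is None
    ]


# Two decisions hosted on the same made-up vendor platform: two entries, not one.
entries = [
    InventoryEntry(
        decision="alert triage",
        platform="ExampleVendor Copilot",
        owner="Financial Crime Operations",
        inputs=("transaction history",),
        outputs=("alert priority",),
        limitations=("may produce unsupported rationale",),
        validation_status="validated",
        last_review=date(2026, 1, 15),
        vendor_supplied=True,
    ),
    InventoryEntry(
        decision="suspicious-activity narrative drafting",
        platform="ExampleVendor Copilot",
        owner="",
        inputs=("case notes", "customer profile"),
        outputs=("draft narrative",),
        limitations=("may hallucinate facts",),
        validation_status="pending",
        last_review=None,
        vendor_supplied=True,
    ),
]

print(governance_gaps(entries))  # ['suspicious-activity narrative drafting']
```

Recording the platform as one line would have hidden the second, unowned decision; splitting by decision is what lets the gap check find it.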

05 Validating a non-deterministic model

The hardest part of applying model-risk discipline to LLMs is validation, because validation was designed around reproducibility and a generative model is not reproducible in the usual sense. The same prompt can yield different output on two runs. Classic validation, which compares a model's output to an expected result, does not map cleanly onto a system whose output is a distribution rather than a point. This is one reason the 2026 guidance excludes generative AI from its formal scope, and one reason supervisors still expect validation-equivalent controls around it.

Validation adapts in a few ways. You test the distribution of behavior rather than a single answer, evaluating the model against benchmark sets and measuring how often it produces acceptable output. You probe deliberately for the failure modes that matter for banks: hallucinated or unsupported claims, exposure of customer data in a prompt or response, and the use of output in a decision without adequate human review. And you bound the system so its behavior is constrained, which shrinks the space a validator has to cover.
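Testing a distribution rather than a single answer can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `generate` stands in for a call to the model, `is_acceptable` stands in for whatever acceptance check the validator defines, and the stub model and threshold are invented for the example.

```python
from itertools import cycle
from typing import Callable, Dict, Sequence


def acceptance_rates(
    generate: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    prompts: Sequence[str],
    runs_per_prompt: int = 20,
) -> Dict[str, float]:
    """Run each benchmark prompt several times; return the share of acceptable outputs."""
    rates = {}
    for prompt in prompts:
        accepted = sum(
            bool(is_acceptable(prompt, generate(prompt)))
            for _ in range(runs_per_prompt)
        )
        rates[prompt] = accepted / runs_per_prompt
    return rates


def meets_threshold(rates: Dict[str, float], threshold: float) -> bool:
    """Pass only if the worst-performing prompt clears the threshold."""
    return min(rates.values()) >= threshold


# Stand-in for a stochastic model: alternates a grounded and an unsupported answer.
_answers = cycle(["grounded answer", "unsupported claim"])


def stub_model(prompt: str) -> str:
    return next(_answers)


def grounded(prompt: str, output: str) -> bool:
    return output.startswith("grounded")


rates = acceptance_rates(stub_model, grounded, ["p1", "p2"], runs_per_prompt=4)
print(rates)                         # {'p1': 0.5, 'p2': 0.5}
print(meets_threshold(rates, 0.95))  # False
```

Judging by the worst prompt rather than the average is one possible design choice; it keeps a weak spot in the benchmark from being masked by strong results elsewhere.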

The most effective move is structural. Wrap the non-deterministic model in deterministic controls, so the parts of the system that actually govern risk are reproducible even though the model is not. A formal policy engine that decides what data the model is allowed to see is deterministic by construction: the same inputs produce the same verdict every time. That property is what makes a control validatable. You can test it exhaustively, document its behavior, and demonstrate to an examiner that it does what it claims, in a way you can never quite do for the stochastic core.

You cannot make a generative model deterministic. You can make the controls around it deterministic, and validate those.

This reframes the validation problem from "prove the model is correct," which is intractable for an LLM, to "prove the controls that constrain the model are correct," which is tractable. The model still needs evaluation and monitoring, but the institution's defensible line of control runs through the deterministic layer, not the model's internals. That layer is where AI data governance meets model risk management.
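To make "prove the controls are correct" concrete, here is a small Python sketch of a deterministic field-level control and an exhaustive check of it. It is a plain-Python stand-in, not a real policy language or any product's API; the clearance levels, field names, and classifications are hypothetical.

```python
from enum import IntEnum
from itertools import product


class Clearance(IntEnum):
    """Hypothetical ordered clearance levels."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical classification of record fields.
FIELD_LEVEL = {
    "branch": Clearance.PUBLIC,
    "balance": Clearance.CONFIDENTIAL,
    "ssn": Clearance.RESTRICTED,
}


def verdict(role: Clearance, field: str) -> str:
    """Pure function of its inputs: the same (role, field) always gives the same verdict."""
    required = FIELD_LEVEL.get(field)
    if required is None:
        return "Redact"  # fail closed: unclassified fields are withheld
    return "Pass" if role >= required else "Redact"


def filter_record(role: Clearance, record: dict) -> dict:
    """Keep only the fields the role is cleared to send to the model."""
    return {k: v for k, v in record.items() if verdict(role, k) == "Pass"}


def decision_table() -> dict:
    """Enumerate every (role, field) pair and check that clearance is monotonic."""
    fields = list(FIELD_LEVEL) + ["unclassified_field"]
    table = {(r, f): verdict(r, f) for r, f in product(Clearance, fields)}
    for f in fields:
        passed = False
        for r in sorted(Clearance):
            if table[(r, f)] == "Pass":
                passed = True
            elif passed:
                raise AssertionError(f"higher clearance denied for {f}")
    return table


table = decision_table()
print(len(table))  # 16
print(filter_record(Clearance.CONFIDENTIAL, {"branch": "X", "balance": 10, "ssn": "000"}))
# {'branch': 'X', 'balance': 10}
```

Because the input space is finite and the function is pure, the whole decision table can be generated, reviewed, and kept as validation evidence, which is not possible for the model's own output.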

06 Ongoing monitoring and documentation

SR 11-7 treated validation as continuous, not a one-time gate, and ongoing monitoring remains a model-risk principle supervisors expect applied to consequential AI. A model that passed review at launch can degrade as the world it models shifts, as its inputs drift, or as its use expands beyond what was validated. For LLMs the drift can come from a model-version update, a change in the data the system can reach, or a quiet expansion of the decisions it touches.

Effective monitoring for an AI deployment watches several signals at once.

Output quality, sampled and reviewed against expectations, to catch degradation before it reaches a decision.
Control performance, confirming that the guardrails and access controls are firing as designed and failing closed when they should.
Scope creep, checking that the model is still used only for the purpose it was validated for, not quietly repurposed.
Version and configuration change, so that a swap in the underlying model or its settings triggers re-validation rather than slipping through.
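The version-and-configuration signal lends itself to a simple mechanical check. One possible approach, sketched here in Python, is to fingerprint the configuration that was validated and compare it with what is running. The configuration keys and values are hypothetical, and a real deployment would decide for itself which settings count as material.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form so that key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_revalidation(validated_fingerprint: str, current_config: dict) -> bool:
    """True when the running configuration differs from the one that was validated."""
    return config_fingerprint(current_config) != validated_fingerprint


# Hypothetical configuration captured at validation time.
validated = {"model_version": "v1", "temperature": 0.2, "data_sources": ["crm"]}
baseline = config_fingerprint(validated)

# Same settings in a different key order: no change detected.
same = {"temperature": 0.2, "data_sources": ["crm"], "model_version": "v1"}
print(needs_revalidation(baseline, same))      # False

# The underlying model was swapped: re-validation is triggered.
upgraded = dict(validated, model_version="v2")
print(needs_revalidation(baseline, upgraded))  # True
```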

Documentation is the connective tissue. SR 11-7 expected records detailed enough that a knowledgeable third party can understand how a model works, what its limitations are, and how it is controlled, without relying on the people who built it, and the 2026 guidance preserves that expectation for in-scope models. For AI, that documentation has to cover the deterministic control layer as much as the model, because the controls are where the institution's defensible risk management actually lives, and they are the auditable evidence supervisors look for in the gap.

07 Evidence and the audit trail

Documentation describes how a system is supposed to behave. Evidence proves how it behaved on a given request. Model-risk governance and audit expectations, and any examiner reviewing the deployment, turn on the second. The question is concrete: for this decision, what policy applied, what verdict was reached, and which data was allowed or withheld?

An audit trail built to answer that should satisfy several properties at once.

What the evidence has to do

Be content-free, recording the policy, the verdict, the fields acted on, and a hash, not the customer data itself, so the log is not a second copy of regulated information to protect.
Be signed, so each entry's integrity can be verified; with HMAC-SHA256, a single altered byte invalidates the signature.
Be hash-chained and append-only, so that altering or deleting any entry breaks the chain and the tampering is visible.
Be verifiable offline, so an examiner can confirm the record independently, without trusting the system that produced it.
Tie each decision to a deterministic policy, so the same inputs always yield the same verdict and the record is reproducible and therefore defensible.

Evidence with these properties does something documentation cannot: it lets the institution demonstrate, request by request, that its controls operated. For a stochastic model that is the difference between asserting a control exists and proving it worked, and it is the form of proof an examiner can act on.
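The properties listed above can be illustrated with a short Python sketch that uses only the standard library. It is a simplified illustration, not a description of any particular product's ledger format: the entry layout, function names, and sample key are assumptions, and key management is out of scope.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry
BODY_KEYS = ("policy", "verdict", "fields", "content_hash", "prev_hash")


def _canonical(body: dict) -> bytes:
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_entry(ledger, key, policy, verdict, fields, content_hash):
    """Append a content-free entry: metadata and a hash, never the record itself."""
    body = {
        "policy": policy,
        "verdict": verdict,
        "fields": sorted(fields),
        "content_hash": content_hash,
        "prev_hash": ledger[-1]["entry_hash"] if ledger else GENESIS,
    }
    payload = _canonical(body)
    entry = dict(body)
    entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
    entry["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    ledger.append(entry)


def verify(ledger, key) -> bool:
    """Offline check: every link, hash, and HMAC must match."""
    prev = GENESIS
    for entry in ledger:
        payload = _canonical({k: entry[k] for k in BODY_KEYS})
        expected_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if entry["prev_hash"] != prev:
            return False
        if hashlib.sha256(payload).hexdigest() != entry["entry_hash"]:
            return False
        if not hmac.compare_digest(expected_sig, entry["signature"]):
            return False
        prev = entry["entry_hash"]
    return True


key = b"example-key-for-illustration-only"
ledger = []
for i in range(3):
    record_hash = hashlib.sha256(f"record-{i}".encode()).hexdigest()
    append_entry(ledger, key, "field-clearance", "Redact", ["ssn"], record_hash)

print(verify(ledger, key))       # True
ledger[1]["verdict"] = "Pass"    # tamper with one entry
print(verify(ledger, key))       # False
```

A chain like this exposes edits and deletions in the middle of the log. Detecting removal of the most recent entries additionally requires anchoring the latest hash somewhere outside the ledger.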

08 How Custosa supports model risk management for LLMs

Custosa is the runtime data-control plane for enterprise AI. It sits between your data and the LLM and inspects every record and field at runtime, before the model sees it. For a bank applying model risk management to generative AI, Custosa is the deterministic control layer that sits around the non-deterministic model and is, by design, the part you can validate and evidence.

Decisions are made by a deterministic formal policy engine, Cedar, not a model, so the same inputs always produce the same verdict and the system fails closed. That determinism is what makes the control validatable in the model-risk sense: it can be tested exhaustively and documented, where the model's own behavior cannot. Verdicts are per-field Pass or Redact, by role, through a five-level clearance lattice, and fields a role is not cleared for are withheld before they enter the prompt, which directly addresses the data-exposure failure mode examiners probe for.

The data plane runs inside your environment, so records never leave your boundary; self-managed, on-premises, and air-gapped options are available, along with a FIPS build, and the control plane receives only content-free verdict evidence. Every decision is signed with HMAC-SHA256 and hash-chained into an append-only, tamper-evident, content-free evidence ledger that can be verified offline, which is the audit trail an examiner can act on. Custosa ships SOC 2 and SOC 1 packs and maps its controls to SR 11-7, FFIEC, and ICFR expectations. Data is protected with TLS in transit and AES-256-GCM at rest, with BYOK on request, and the p99 added-latency target is ≤ 50 ms. Custosa is early-stage and in production with design partners.

To be precise about scope: Custosa provides the controls and the evidence. It does not validate your model, sign your model risk management framework, or stand in for your independent validation function. Model risk governance remains the institution's responsibility; Custosa makes the control layer around the model deterministic, enforceable, and provable.

Put a validatable control layer around your LLM

See how Custosa makes the controls around a generative model deterministic, enforces them at runtime, and records signed evidence an examiner can verify offline.


Frequently asked questions

What is SR 11-7?

SR 11-7 was supervisory guidance on model risk management issued by the Federal Reserve and the Office of the Comptroller of the Currency in 2011 (OCC Bulletin 2011-12). It defined model risk as the potential for adverse consequences from decisions based on incorrect or misused model output, and it set out how supervised institutions should manage that risk across a model's life cycle. Its core expectations were robust model development, implementation, and use; effective validation by an independent party; and sound governance, policies, and controls. It was the foundational US bank model-risk framework for over a decade. In April 2026 the OCC, Federal Reserve, and FDIC issued revised interagency guidance, SR 26-2 (OCC Bulletin 2026-13), that rescinded and replaced SR 11-7 for covered banks. Its principles still shape how examiners think about model risk.

Does SR 11-7 apply to AI and LLMs?

Not anymore, at least not directly. The SR 11-7 definition of a model was broad enough to reach a large language model used to produce a number, score, classification, or decision, but SR 11-7 was rescinded and replaced in April 2026 by SR 26-2 (OCC Bulletin 2026-13). The 2026 guidance explicitly excludes generative and agentic AI from its formal scope, describing those systems as novel and rapidly evolving, and the agencies plan a request for information on banks' use of AI. So there is currently a deliberate guidance gap for LLMs. Even so, supervisors expect banks to apply model-risk principles to consequential AI: an LLM that affects a credit, fraud, surveillance, or financial-reporting decision should still have documented governance, inventory, validation or effective challenge, monitoring, and auditable evidence. The prudent posture is to apply that discipline now rather than wait for further AI-specific guidance.

What is a model inventory?

A model inventory is a complete, maintained record of every model an institution uses, including models that are vendor-supplied or embedded in other systems. SR 11-7 expected the inventory to capture each model's purpose, owner, inputs and outputs, limitations, validation status, and use, and that expectation carries forward as a core model-risk principle under the 2026 guidance. As a matter of sound practice, the inventory should also extend to consequential LLM-based tools, which are easy to overlook because they are often introduced through a product feature or a workflow rather than a formal model approval. An incomplete inventory is the most common examination finding in model risk management, because a system the institution did not record is a system it is not governing.

How do you validate a non-deterministic model?

A generative model is stochastic, so the same input can produce different outputs, which makes the reproducibility that classic validation relies on hard to achieve. Validation adapts by testing distributions of behavior rather than single answers, evaluating against benchmark sets, probing for failure modes such as hallucination and data exposure, and constraining the system so its behavior is bounded. A powerful technique is to wrap the model in deterministic controls, such as a formal policy engine that decides what data the model may see, so that the parts of the system that govern risk are reproducible and can be validated and documented even though the model itself is not.

What evidence do examiners want for AI models?

Examiners want to see that controls exist and that they operated. For an AI system that means a complete model inventory entry, development and validation documentation, evidence of ongoing monitoring, and a record, for each request, of what policy was applied, what decision was reached, and which data was allowed or withheld. The strongest form is a tamper-evident log that an examiner can verify independently. A content-free, signed, hash-chained ledger records verdict metadata, hashes, and signatures rather than customer data, so it demonstrates control without moving regulated information.
