Custosa vs synthetic data: Gretel and Mostly.ai

Updated June 2026 · 6 min read

Synthetic data tools such as Gretel and Mostly.ai generate artificial data that mimics real records, mainly to protect sensitive data in development, testing, and training. Custosa governs access to real data at production inference time. This page sets out what synthetic data is good for, the gap for production AI on real records, and why the two are complementary.

The short version: synthetic data replaces real sensitive records with artificial ones so teams can build and test without exposing production data; Custosa governs access to the real data your AI reads in production. Gretel and Mostly.ai train generative models on a dataset and emit new data that resembles it statistically, which is valuable for development, testing, and training pipelines. But once a retrieval-augmented assistant or agent answers a real question over real customer records, synthetic data is no longer in the loop. You still need field-level access control, redaction by role, and evidence of each real-data decision. The two address different points in the lifecycle and pair naturally.

If you are looking for an alternative to synthetic data for keeping sensitive data out of your AI, the honest framing is that synthetic data and runtime data control are not competing tools. Synthetic data protects non-production environments; Custosa protects real data at runtime. The table below maps the two categories against the capabilities buyers tend to conflate.

| Capability | Synthetic data | Custosa |
| --- | --- | --- |
| Generate artificial data for dev and testing | ✓ | Not its job |
| Preserve statistical structure for model training | ✓ | Not its job |
| Privacy mechanisms (e.g. differential privacy) | ✓ On the generated set | Different control point |
| Govern access to real data at inference time | No; works on a copy | ✓ |
| Per-actor, per-field PASS or REDACT on real records | No | ✓ |
| Deterministic formal policy engine | No; generative model | ✓ Cedar, fail-closed |
| Withhold real fields before the prompt is built | No | ✓ |
| Signed, hash-chained, content-free evidence per decision | No | ✓ |
| Guaranteed fidelity to the real record | No; approximate by design | ✓ Operates on the real record |
| Runs inside your environment; records never leave | Varies by deployment | ✓ Data plane in your boundary |

01 What synthetic data is good for

Synthetic data tools generate artificial datasets that are statistically similar to real ones without containing the real records. Gretel and Mostly.ai both train generative models on a source dataset, learn its distributions and relationships, and emit new data that behaves like the original for analysis and modeling. Used well, they remove a real and recurring source of risk: the need to copy sensitive production data into places it does not belong.
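
As a toy illustration of the fit-then-sample idea, the sketch below fits a simple linear-Gaussian model to two numeric columns and samples new rows from it. This is not how Gretel or Mostly.ai work internally; they train far richer generative models. The column meanings, the sample data, and the `fit` and `sample` helpers are hypothetical.

```python
import random
import statistics

def fit(rows):
    """Learn a tiny model of (age, claim_amount): the distribution of age
    and a linear relationship from age to claim_amount plus noise."""
    ages = [r[0] for r in rows]
    amounts = [r[1] for r in rows]
    mean_age, sd_age = statistics.mean(ages), statistics.stdev(ages)
    mean_amt = statistics.mean(amounts)
    # Least-squares slope and intercept for amount as a function of age.
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (m - mean_amt) for a, m in zip(ages, amounts))
    slope = sxy / sxx
    intercept = mean_amt - slope * mean_age
    residuals = [m - (intercept + slope * a) for a, m in zip(ages, amounts)]
    return {
        "mean_age": mean_age, "sd_age": sd_age,
        "slope": slope, "intercept": intercept,
        "sd_noise": statistics.pstdev(residuals),
    }

def sample(model, n, seed=0):
    """Draw n artificial rows from the fitted model (seeded for repeatability)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        age = rng.gauss(model["mean_age"], model["sd_age"])
        noise = rng.gauss(0, model["sd_noise"])
        amount = model["intercept"] + model["slope"] * age + noise
        out.append((round(age), round(amount, 2)))
    return out

# Hypothetical "real" rows: (age, claim_amount).
real_rows = [(34, 1200.0), (51, 2300.0), (45, 1900.0), (29, 900.0), (62, 3100.0)]
model = fit(real_rows)
synthetic_rows = sample(model, 5)
```

The sampled rows follow the same broad trend as the source rows without copying any of them directly. A model this simple captures very little real-world structure and provides no privacy guarantee on its own.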

  • Development and testing. Engineers can build and test against realistic data without provisioning real customer records into lower environments, which shrinks the exposure surface in CI, staging, and local development.
  • Model training. Synthetic data can stand in for sensitive data when training or fine-tuning, and can help augment or rebalance datasets where real data is scarce or skewed.
  • Sharing and collaboration. A synthetic dataset can be shared across teams, vendors, or partners with less of the sensitivity that attaches to the original records.
  • Privacy mechanisms. Both platforms offer privacy controls, and Gretel supports differential privacy for certain workflows, adding a measurable bound on what the generated data reveals about any individual in the source.
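
For reference, the "measurable bound" that differential privacy provides is usually stated as follows. This is the standard textbook definition, not a description of any vendor's implementation, and the symbols are introduced here for illustration.

```latex
% A randomized mechanism M is (\varepsilon, \delta)-differentially private if,
% for all datasets D and D' that differ in one individual's record
% and for every set of outputs S:
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
```

Smaller values of ε and δ mean the output reveals less about whether any one individual's record was in the source.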

For these jobs, synthetic data is the right instrument. It addresses the problem of real data being in the wrong place by manufacturing a safer stand-in. The question is what happens when your AI has to read the real data, not a stand-in.

02 The gap for production AI on real records

Synthetic data is generated ahead of time and consumed in place of the original. That model works cleanly for development and training, where a realistic copy is exactly what you want. It does not address production inference, where a retrieval-augmented assistant or an agent must read the actual customer records to answer the actual question. At that point there is no synthetic substitute in the path, and the controls synthetic data provides simply do not apply.

  • It does not govern access to real data at runtime. Synthetic data protects a copy used for building and testing. It says nothing about which real records a given user may read when the system is live and answering questions over production data.
  • It does not give per-actor, field-level verdicts. A generated dataset has no notion of who is asking. It cannot pass a field to one role and withhold it from another, because that decision belongs to the moment of access on the real record, not to the generation of an offline copy.
  • It does not produce evidence of each real-data decision. Quality and privacy scores describe the synthetic set. They are not a per-request, signed, tamper-evident record of who was allowed to see which real field and why.
  • Fidelity and re-identification considerations remain. Synthetic data is approximate by design, so it can lose fidelity for edge cases, and without sufficient privacy controls a generated dataset can in some circumstances carry re-identification risk. It is a strong tool for its purpose, but it is not a guarantee, and it is not a substitute for governing the real data your AI uses in production.

The core point is that synthetic data solves an upstream, offline problem, while production RAG and agents have a live, real-data problem. You cannot answer a customer's question about their own account with a statistically similar fabrication. This is the same access-at-runtime concern that sits at the center of RAG security.

Synthetic data can give your developers a safe copy of a claims table to build against. It cannot decide, when a support agent asks the live assistant about a specific member, that the diagnosis field must be withheld while the claim status is shown. That decision is about real data, a real actor, and a real moment of access.
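
The decision in that example can be sketched as a per-field check made at the moment of access. This is a minimal Python illustration of the idea, not Custosa's policy engine; the role names, clearance numbers, field classifications, and the `verdicts` and `redact` helpers are all hypothetical.

```python
PASS, REDACT = "PASS", "REDACT"

# Hypothetical clearance per role and minimum clearance required per field.
ROLE_CLEARANCE = {"support_agent": 2, "claims_nurse": 4}
FIELD_REQUIRED = {"member_id": 1, "claim_status": 2, "diagnosis": 4}

def verdicts(role, record):
    """Return a PASS or REDACT verdict for every field in the record.

    Fail-closed: an unknown role or an unclassified field yields REDACT.
    """
    clearance = ROLE_CLEARANCE.get(role)
    result = {}
    for field in record:
        required = FIELD_REQUIRED.get(field)
        if clearance is None or required is None:
            result[field] = REDACT
        else:
            result[field] = PASS if clearance >= required else REDACT
    return result

def redact(record, field_verdicts):
    """Keep only PASS fields, so withheld values never reach the prompt."""
    return {k: v for k, v in record.items() if field_verdicts[k] == PASS}

record = {
    "member_id": "M-1001",
    "claim_status": "approved",
    "diagnosis": "example diagnosis",
    "notes": "unclassified free text",
}
decision = verdicts("support_agent", record)
prompt_context = redact(record, decision)
```

Here the support agent's context keeps `member_id` and `claim_status`, while `diagnosis` (insufficient clearance) and `notes` (no classification) are withheld. Because the check is a plain lookup and comparison, the same role and record always produce the same verdicts.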

03 What Custosa does, and where the two overlap

Custosa is a runtime data-control plane for enterprise AI. Its data plane runs inside the customer's environment and inspects every record and field at runtime, before the model, so records never leave the boundary. A deterministic formal policy engine, built on Cedar rather than a model, evaluates each field against the actor's role using a five-level clearance lattice and issues a per-field PASS or REDACT verdict, withholding prohibited fields before the prompt is built. Because the engine is deterministic and fail-closed, the same inputs always produce the same decision, and an unresolved decision blocks rather than leaks. Every decision is signed with HMAC-SHA256 and hash-chained into an append-only, tamper-evident, content-free evidence ledger that is verifiable offline; the control plane receives only content-free verdict evidence.
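
To make the evidence-ledger idea concrete, here is a minimal sketch of an HMAC-SHA256-signed, hash-chained, content-free log using only Python's standard library. It illustrates the general technique, not Custosa's actual ledger format; the entry fields, the `append_entry` and `verify` helpers, and the hard-coded demo key are assumptions for illustration.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _canonical(body):
    # Canonical JSON so the same entry always serializes to the same bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def append_entry(ledger, key, actor_role, field, verdict):
    """Append a content-free entry: field name and verdict, never the value."""
    prev_hash = ledger[-1]["entry_hash"] if ledger else GENESIS
    body = {"actor_role": actor_role, "field": field,
            "verdict": verdict, "prev_hash": prev_hash}
    payload = _canonical(body)
    entry = dict(body)
    entry["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
    ledger.append(entry)
    return entry

def verify(ledger, key):
    """Recompute every signature and chain link; an edited or reordered
    entry makes verification fail."""
    prev_hash = GENESIS
    for entry in ledger:
        body = {k: entry[k] for k in ("actor_role", "field", "verdict", "prev_hash")}
        payload = _canonical(body)
        expected_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if entry["prev_hash"] != prev_hash:
            return False
        if not hmac.compare_digest(entry["signature"], expected_sig):
            return False
        if entry["entry_hash"] != hashlib.sha256(payload).hexdigest():
            return False
        prev_hash = entry["entry_hash"]
    return True

key = b"demo-key-not-for-production"  # assumption: real keys come from a secret store
ledger = []
append_entry(ledger, key, "support_agent", "claim_status", "PASS")
append_entry(ledger, key, "support_agent", "diagnosis", "REDACT")
```

Verification needs only the ledger and the key, so it can run offline. Each entry records which field was decided and how, but never the field's value, which is what keeps the evidence content-free in this sketch.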

The overlap with synthetic data is narrower than buyers sometimes assume, but it exists. Both are ways to keep sensitive data from being exposed where it should not be. Both can appear under a heading like "data protection for AI" on a checklist. The difference is the control point and the data. Synthetic data acts before use, on an offline copy, mainly for development and training; Custosa acts during use, on the real records, per actor and per field, with signed evidence. One manufactures a safe substitute; the other governs the genuine article at the moment it is read.

04 Why they are complementary

Because synthetic data and Custosa operate at different stages of the lifecycle, the practical answer is usually both, each owning what it is built for.

  • Use synthetic data for development, testing, and training: generate realistic, artificial datasets so teams can build and validate without provisioning real sensitive records into lower environments, with privacy mechanisms on the generated set where needed.
  • Use Custosa for production: govern access to the real data your AI reads at inference time, with per-actor, per-field verdicts, withholding before the model, and signed, tamper-evident evidence of each decision, all inside your environment.

A clean separation of duties follows naturally. Synthetic data protects the development and training pipeline, so real records are not scattered across environments that do not need them. Custosa protects the production path, so when the live system reads real data to serve a real request, the right fields reach the model for the right actor and every decision is provable. The latency added by Custosa inspection typically has a p99 of 50 to 110 ms, which fits inside an interactive AI request. The two are not in tension; they cover adjacent stretches of the same lifecycle.

Custosa is early-stage and in production with design partners. It does not generate synthetic data and is not a substitute for it. Synthetic data is for dev, test, and training; Custosa is for governing real data at runtime. The same logic underpins broader AI data governance, where each stage of the data lifecycle needs the control suited to it.

Govern real data when the model actually reads it

Custosa inspects every record and field at runtime, redacts by role inside your environment, and signs content-free evidence of each decision. It runs alongside the synthetic-data tooling you use for development and training.

05 Frequently asked questions

What is synthetic data?

Synthetic data is artificial data generated to be statistically similar to a real dataset without being the real records. Tools such as Gretel and Mostly.ai train generative models on production data, learn its statistical properties and relationships, and emit new datasets that resemble the original. Some support privacy mechanisms such as differential privacy. The goal is to give teams realistic data for training, testing, and development without exposing the underlying sensitive records.

Is Custosa an alternative to synthetic data?

No; they solve different problems and are usually complementary. Synthetic data replaces real sensitive records with artificial ones, mainly so teams can build and test without touching production data. Custosa governs access to real data at production inference time: it inspects records and fields at runtime, applies deterministic role-based policy, withholds prohibited fields before the model, and signs evidence of each decision. Synthetic data is for development and training; Custosa is for governing real data in production.

Does synthetic data make AI compliant?

It helps in development and testing, but it does not by itself make a production system compliant. The moment a retrieval-augmented assistant or agent reads real customer records to answer a real question, synthetic data is no longer in the loop, and you still need access control, redaction by role, and an evidence trail for those real-data decisions. Synthetic data reduces exposure in non-production environments; it does not govern the real data your AI uses in production, and it can carry residual fidelity and re-identification considerations.

What is the difference between synthetic data and runtime redaction?

Synthetic data is generated ahead of time to replace a real dataset, so downstream consumers work with artificial records. Runtime redaction acts at the moment of access on the real records: Custosa evaluates each field against the actor's role and withholds prohibited fields before the prompt is built, on live production data. One substitutes the dataset before use, mainly for dev and test; the other governs the real dataset during use, per actor and per field, with signed evidence of each decision.

Can you use synthetic data and Custosa together?

Yes, and the combination is natural. Use synthetic data from a tool such as Gretel or Mostly.ai to build, test, and train without exposing real records in non-production environments. Then use Custosa in production to govern access to the real data your AI reads at inference time: field-level verdicts by role, withholding before the model, and signed, tamper-evident evidence of each decision. Synthetic data protects the development pipeline; Custosa protects real data at runtime.