Building AI on Sensitive Data: Privacy-Preserving Approaches

AI development is moving closer to sensitive data. Models once built on public or loosely governed datasets are now being trained and fine-tuned on customer data, internal documents, and regulated information. This changes the risk profile: data cannot be freely shared, moved, or even accessed in the same way. At the same time, model performance still depends on the scale, diversity, and realism of the data.

These constraints are already visible in how organisations are struggling to operationalise AI.

The constraint is not only data availability, but how data can be used without exposing it.

A set of practical approaches is emerging to work within these constraints. Each addresses a different part of the problem (access, collaboration, or leakage), and together they make it possible to build and improve AI systems without relying on the movement of raw data.

Using Synthetic Data Where Real Data Cannot Be Used

Synthetic data is used to address constraints associated with real-world data, including privacy risks, regulatory restrictions, and limited availability of representative datasets. Even when anonymised, real data can remain sensitive or difficult to share across organisations and jurisdictions. Synthetic data enables the training, testing, and validation of AI models on datasets that are statistically realistic but artificially generated. These datasets can replicate complex behaviours, support rare or edge scenarios, and allow controlled experimentation.
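
To make "statistically realistic but artificially generated" concrete, the following is a minimal sketch rather than a production method. It fits a mean and standard deviation per column of a small table and samples new rows from those distributions. The records, column meanings, and function names are all hypothetical, and the approach is deliberately naive: it treats columns as independent and offers no formal privacy guarantee on its own. Real generators, such as agent-based simulations or generative models, are considerably richer.

```python
import random
import statistics

def fit_column_stats(rows):
    """Return (mean, stdev) for each numeric column of the real table."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(stats, n, seed=0):
    """Draw n artificial rows from independent Gaussians matching the stats.

    Assumption: columns are treated as independent, so correlations
    present in the real data are NOT preserved by this sketch.
    """
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in stats] for _ in range(n)]

# Hypothetical "real" records: (transaction amount, account age in months).
real_rows = [(120.0, 14), (80.5, 30), (310.0, 7), (45.0, 52), (150.0, 21)]

stats = fit_column_stats(real_rows)
synthetic_rows = sample_synthetic(stats, n=1000)
```

No synthetic row here is copied from a real one, but the summary statistics are still derived from real records, which is why synthetic data is often paired with the other techniques discussed in this article.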

Banking & Financial Services. Banks and regulators have been early adopters of synthetic data to improve fraud detection and enable cross-industry collaboration without exposing sensitive information. The Financial Conduct Authority (FCA) in the UK, working with The Alan Turing Institute and leading banks, has developed synthetic datasets for industry use. One example is the Authorised Push Payment fraud dataset, built using agent-based simulations of 20,000 synthetic individuals to replicate two years of transactions, communications, and scam scenarios. More recently, the FCA developed a dataset with realistic money laundering scenarios, available through its Digital Sandbox, to model how criminal behaviour may evolve in response to new detection systems.

Healthcare. Synthetic data is widely used in healthcare, particularly in non-clinical settings such as software testing, tool demonstration, and bias analysis. Its use is now extending into clinical research and patient care. The SYNTHIA project, under the Innovative Health Initiative (IHI), is developing and validating synthetic datasets across clinical notes, genomics, imaging, and laboratory data. It focuses on six disease areas: lung cancer, breast cancer, multiple myeloma, diffuse large B-cell lymphoma, Alzheimer’s disease, and type 2 diabetes. The aim is to address challenges such as small patient cohorts, privacy constraints, high trial costs, and long follow-up periods.

Retail & Consumer Goods. Retailers are deploying computer vision for use cases such as shelf monitoring and planogram compliance. Training these models on real images is constrained by labelling costs, frequent product changes, and privacy concerns from incidental capture of customer faces. Technology companies such as Asseco and Neurolabs are using synthetic images to supplement datasets, improving efficiency while reducing privacy risk.

Energy & Utilities. The National Renewable Energy Laboratory (NREL) is using synthetic datasets to provide detailed infrastructure insights without exposing real asset locations. Through projects such as SMART-DS (Synthetic Models for Advanced, Realistic Testing), it has developed distribution system models that replicate the physical and electrical characteristics of U.S. cities. These are built using Reference Network Models (RNM), converting public data such as building footprints and road networks into functional electrical circuits. In addition, NREL’s End-Use Load Profiles provide hourly synthetic energy consumption data for nearly 900,000 buildings, enabling simulation of technologies such as electric vehicle charging and heat pumps, which is difficult using fragmented real-world data.

Federated Learning to Enable Collaboration Without Data Sharing

Federated learning addresses scenarios where AI models must be trained on real-world data that cannot be easily shared. In many cases, no single organisation has access to sufficiently diverse or complete datasets, while pooling data is constrained by privacy concerns and competitive sensitivities. Federated learning enables organisations to collaboratively train models while keeping data local, sharing model updates rather than raw data.
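
The idea of sharing model updates rather than raw data can be illustrated with a small sketch of federated averaging, one common aggregation scheme. This is an assumption chosen for illustration and not a description of any specific deployment. Each hypothetical client trains a one-parameter model on its own records, and the server only ever sees the resulting weights. The datasets, learning rate, and function names are invented.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """One client's training: gradient descent on its own (x, y) pairs
    for the model y = w * x. The raw data never leaves this function."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, client_datasets):
    """Server step: average the clients' weights, weighted by how many
    local samples each client trained on."""
    total = sum(len(d) for d in client_datasets)
    updates = [(local_update(global_w, d), len(d)) for d in client_datasets]
    return sum(w * n for w, n in updates) / total

# Hypothetical clients, each holding private samples of y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
```

After enough rounds the shared weight approaches the value that fits both clients' data, even though neither dataset was pooled. Model updates can themselves leak information, which is one reason federated learning is often combined with differential privacy.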

A common use case comes from banking, where fraud and money laundering patterns span multiple institutions, but data sharing is restricted. This creates gaps in visibility that criminals exploit, as each bank only sees part of the network. In the Nordic region, Handelsbanken and Swedbank collaborated with AI Sweden to test federated learning for anti-money laundering. The experiment showed that models trained on distributed datasets detected cross-bank laundering patterns more effectively than those developed in isolation.

Differential Privacy to Limit Data Leakage from Models

Differential privacy addresses the risk that AI models, particularly fine-tuned LLMs, may memorise and unintentionally expose sensitive training data. This risk is increasing as organisations move from general-purpose models to fine-tuning on proprietary datasets such as customer interactions, internal documents, and regulated data. Attack vectors, including adversarial prompting, have shown that models can reveal whether specific records were part of training or reproduce fragments of underlying data.

Differential privacy mitigates this by limiting the influence of individual training examples and introducing controlled noise into model updates, reducing the likelihood of leakage. A key advantage is that it provides a measurable privacy guarantee, allowing organisations to manage the trade-off between model performance and privacy risk.
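
The two mechanisms described above, bounding each example's influence and adding controlled noise, can be sketched as a single training step in the style of DP-SGD. This is a simplified illustration under stated assumptions: the gradients are made-up numbers, the parameter names (max_norm, noise_multiplier) are chosen for readability, and the sketch does not compute the resulting privacy budget, which in practice requires a separate privacy accountant.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy update: clip each example's gradient, sum the clipped
    gradients, add Gaussian noise scaled to the clipping bound, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch = len(per_example_grads)
    return [w - lr * n / batch for w, n in zip(weights, noisy)]

rng = random.Random(0)
weights = [0.0, 0.0]
# Hypothetical per-example gradients; the second is an outlier that
# would dominate an ordinary (unclipped) update.
grads = [[0.3, -0.4], [30.0, 40.0]]
new_weights = dp_sgd_step(weights, grads, lr=0.1, max_norm=1.0,
                          noise_multiplier=1.1, rng=rng)
```

Clipping caps how far any single record can move the model, and the noise masks whatever contribution remains. A larger noise_multiplier strengthens the privacy guarantee at the cost of accuracy, which is the performance and privacy trade-off described above.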

Recent developments such as Google’s VaultGemma show that LLMs can be trained with formal differential privacy guarantees. While still at a research stage, VaultGemma demonstrates that privacy-preserving training is technically viable, with noise applied during training to limit memorisation and exposure of sensitive data.

Ecosystm Opinion

Privacy-preserving techniques are becoming a core design requirement for AI systems as organisations increasingly rely on sensitive, regulated, and proprietary data. No single approach is sufficient; synthetic data, federated learning, and differential privacy address distinct constraints around data availability, collaboration, and leakage risk, and are often combined within the same workflow.

The shift towards fine-tuning models on internal datasets materially increases exposure to privacy risks, elevating the importance of techniques that provide measurable guarantees and limit data leakage. At the same time, cross-organisation collaboration, particularly in sectors such as banking and healthcare, is driving demand for approaches that enable shared model development without centralising data.

This introduces a deliberate trade-off between model performance, data utility, and privacy, requiring stronger governance alongside technical controls. While maturity varies, with synthetic data more established and others still evolving, these techniques collectively expand the universe of data available to organisations.