500,000 Volunteers Breached Through Authorised Access: A Controlled Access Buyer's Guide
In April 2026, an approved researcher exfiltrated the health records, genetic data, and medical histories of 500,000 UK Biobank volunteers through authorised access channels, then listed the data for sale on Alibaba. The breach was not caused by a hack. It was caused by a model that assumes licence agreements can prevent data theft. This guide covers why that model fails and what physical controls replace it.

Mark Fermor
Director & Co-Founder, Firevault

Why This Guide Exists
500,000 people volunteered their genetic data, medical histories, and lifestyle records for research. That data was sold on Alibaba by an approved researcher using authorised access.
The researcher had signed a data use agreement. They had been vetted through an approved-researcher pathway. They had passed ethics board review. They accessed the data through the same channels that every other approved researcher uses. None of it mattered. The genetic sequences of 500,000 volunteers, their linked NHS records, their mental health data, and their cardiovascular histories were listed for sale on a public marketplace.
This guide exists because the Biobank breach was not an anomaly. It was the logical outcome of a model that treats licence agreements as access controls. Whether you hold health records, genetic data, financial datasets, legal discovery materials, or intellectual property, the question is the same: once an authorised user can see your data, what physically prevents them from taking it?
Your Role and Your Data
This guide is written for data owners, compliance leads, research governance teams, CISOs, and anyone responsible for controlling access to datasets that carry regulatory, reputational, or commercial risk.
The datasets in scope include health records governed by the UK GDPR and the Data Protection Act 2018, genetic and genomic data subject to the common law duty of confidentiality, research datasets held in trusted research environments, financial records regulated under the FCA handbook and the Digital Operational Resilience Act (DORA), legal discovery and litigation hold materials, and intellectual property including trade secrets, source code, and proprietary algorithms.
There is a compliance paradox at the heart of modern data governance. You can license access to a dataset, enforce a data use agreement, require ethics board approval, restrict access to a trusted research environment, maintain a full audit trail of every query, and still lose the data entirely. Compliance with the process does not guarantee control of the outcome. 500,000 Biobank volunteers discovered this when their genetic sequences appeared on Alibaba.
How Dataset Access Is Typically Controlled
The standard model for controlling access to sensitive datasets relies on multiple administrative and technical layers. Each layer is designed to reduce risk, but none of them physically prevents data from being copied once access has been granted.
Licence agreements and data use agreements (DUAs) define what a user may and may not do with the data. They are legal instruments. They create liability after a breach but provide no mechanism to prevent one.
Approved researcher lists restrict who can apply for access. In the Biobank model, researchers submitted applications that were reviewed against eligibility criteria. Approval was binary: you were either on the list or you were not. Once on the list, a researcher approved to study cardiovascular risk could access genetic sequences, mental health records, and linked NHS data belonging to all 500,000 volunteers.
Ethics board approvals assess the purpose of the research, not the security of the access pathway. An ethics board does not evaluate whether a researcher can download, copy, or export 500,000 genetic profiles after approval.
Data enclaves and trusted research environments (TREs) provide a controlled compute environment where researchers can run queries without downloading raw data. However, most TREs permit some form of output export (statistical results, derived datasets, or visualisations), and the boundary between permitted output and raw data exfiltration is enforced by software, not physics.
Audit trails record who accessed what, when, and from where. They are retrospective instruments. An audit trail tells you what happened after the data has already been accessed. It does not prevent the access from occurring.
Every one of these layers was in place at UK Biobank. Every one of them failed to prevent the exfiltration of 500,000 volunteers' genetic sequences, medical histories, and NHS-linked records.
The UK Biobank Breach: What Happened
UK Biobank holds genetic sequences, medical histories, mental health records, cardiovascular data, lifestyle surveys, and linked NHS records for 500,000 volunteers recruited between 2006 and 2010. It operates as a research resource: approved researchers apply for access, sign a material transfer agreement (MTA) and data use agreement, and are granted access to datasets through Biobank's data access platform.
The access model was designed to be open. Biobank's stated mission was to make data available to as many bona fide researchers as possible. Over 30,000 researchers across more than 90 countries had been granted access by 2025. Each of those 30,000 researchers could access the genetic data, mental health records, and medical histories of all 500,000 volunteers.
In early 2026, UK Biobank was alerted that the genetic sequences, medical histories, mental health data, cardiovascular records, and linked NHS records of 500,000 volunteers had been listed for sale on Alibaba's cloud marketplace. The data had not been obtained through a network intrusion, a vulnerability exploit, or a social engineering attack. It had been accessed through the standard approved-researcher pathway, downloaded through permitted export channels, and then sold.
The breach timeline illustrates the gap between detection and response. The data (500,000 people's genetic profiles, NHS records, and mental health histories) was listed for sale before Biobank was aware of the exfiltration. Revocation of the researcher's access occurred days after the listing was identified. By that point, the genetic data, the medical histories, and the NHS-linked records had already been copied to infrastructure outside Biobank's control. Source: BBC News, 23 April 2026.
Five Systemic Failures the Breach Exposed
The Biobank breach was not caused by a single point of failure. It exposed five systemic weaknesses in the licensed-access model that apply to any organisation governing access to sensitive datasets.
1. Access scope was too broad. A researcher approved to study cardiovascular risk had access to genetic sequences, mental health records, lifestyle data, and linked NHS records for all 500,000 volunteers. None of the genetic data, none of the mental health records, and none of the NHS-linked histories were relevant to their approved project. Broad access scope meant that a single compromised or malicious account could exfiltrate the entire dataset: every genetic sequence, every medical history, every NHS record.
2. Download and export were permitted by design. The access model allowed researchers to download data to their own infrastructure. This was a feature, not a bug: Biobank intended for researchers to be able to work with data locally. But once the genetic sequences, medical histories, and NHS records of 500,000 volunteers had been downloaded, the data owner had no technical mechanism to control what happened to them. The data was now a copy, governed only by the licence agreement.
3. Licence terms were unenforceable in real time. The data use agreement prohibited sharing, selling, or redistributing the data. These terms are legally binding but technically unenforceable. There is no software mechanism that can prevent a user from copying 500,000 genetic profiles, uploading them to Alibaba, or sharing them with an unauthorised third party. The licence creates liability; it does not create a barrier.
4. Revocation lagged behind the breach. Once the exfiltration was detected, Biobank revoked the researcher's access credentials. But revocation only affects future access. It does not recall genetic sequences, medical histories, or NHS records that have already been downloaded. The gap between exfiltration and revocation, measured in days or weeks, meant the data of 500,000 volunteers was already replicated and distributed before any response was possible.
5. Audit trails recorded but did not prevent. Biobank maintained logs of researcher access, including queries run, datasets accessed, and files downloaded. These logs confirmed that 500,000 volunteers' genetic sequences and medical histories had been exfiltrated. But they did not trigger any real-time intervention. No alert was raised when a researcher downloaded the genetic data of 500,000 people. No block was applied when NHS-linked records were accessed in bulk. The audit trail was a witness, not a guard.
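The difference between a log that witnesses and a control that intervenes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class name, the per-user ceiling, and the user identifier are all hypothetical, and nothing here reflects how Biobank's platform was built. It is also still a software control, so it shows the "guard" behaviour only, not the physical enforcement this guide argues for.

```python
from dataclasses import dataclass, field


@dataclass
class BulkAccessGuard:
    """Tracks cumulative records read per user and blocks reads past a ceiling.

    max_records_per_user is a hypothetical policy value chosen by the data owner.
    """
    max_records_per_user: int
    _totals: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def request(self, user: str, record_count: int) -> bool:
        """Return True if the read may proceed, False if it is blocked."""
        if record_count < 0:
            raise ValueError("record_count must be non-negative")
        running = self._totals.get(user, 0) + record_count
        if running > self.max_records_per_user:
            # Intervene before the read happens, and keep the evidence.
            self.alerts.append((user, running))
            return False
        self._totals[user] = running
        return True


guard = BulkAccessGuard(max_records_per_user=10_000)
print(guard.request("researcher-a", 6_000))  # allowed: running total 6,000
print(guard.request("researcher-a", 6_000))  # blocked: would reach 12,000
```

Because the counter is cumulative rather than per request, splitting a large download into many small ones does not get past it. A purely retrospective audit trail would have recorded both requests and allowed both.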
Why Licence Agreements Cannot Replace Physical Controls
A data use agreement is a contract. It defines obligations and creates legal liability for non-compliance. It is enforceable in court, after the breach, through litigation. It is not enforceable at the point of access, in real time, through technology.
The distinction is fundamental. A DUA says "you must not share this data." It does not prevent you from sharing 500,000 people's genetic sequences on Alibaba. It is the difference between a locked room and a sign that says "do not enter." Both communicate a restriction. Only one enforces it.
This is not a criticism of licence agreements. They serve a necessary legal function. But they are post-breach instruments being used as pre-breach controls, and the Biobank case, in which the genetic data and NHS records of 500,000 volunteers were sold on a public marketplace, demonstrates what happens when that substitution fails.
Consider the comparison across five dimensions:
- Enforceability: A licence agreement is enforceable through litigation after a breach. A physical control is enforced at the point of access, before any data leaves the environment.
- Revocation speed: Revoking a licence requires identifying the breach, initiating a legal or administrative process, and notifying the user. Revoking physical access requires disconnecting a drive. One takes weeks. The other takes seconds.
- Export prevention: A licence prohibits export through contractual language. A physical control prevents export by eliminating the export pathway entirely: no mechanism exists to download 500,000 genetic profiles.
- Real-time audit: A licence relies on retrospective log analysis. A physical control provides real-time visibility into whether the storage medium is connected, who authenticated, and whether any data has been read.
- Third-party governance: A licence relies on the user not to share data with unauthorised parties. A physical control ensures that only the authenticated user can access the storage medium in the first place. There is no file to forward, no dataset to upload, no copy to distribute.
What Controlled Access Actually Requires
Controlled access is not a policy. It is a physical state. A dataset is under controlled access when the data owner retains the ability to grant, restrict, and revoke access through mechanisms that cannot be bypassed by the user.
This requires five capabilities:
1. Physical disconnection. The storage medium must be capable of being physically disconnected from all networks, including the internet, local area networks, and direct-attached compute. When the drive is disconnected, the data does not exist on any accessible infrastructure. The genetic sequences, NHS records, and mental health data of 500,000 Biobank volunteers would have been unreachable: not behind a firewall, not encrypted at rest, but physically absent from every network.
2. Identity-locked access. The storage medium must require multi-factor authentication before any data can be read. Authentication must be bound to a specific individual, not a shared credential, a role-based group, or an API key. If the authenticated user is not physically present with the correct credentials, the drive remains locked. At Biobank, 30,000 researchers shared the same access pathway. Under identity-locked access, each researcher authenticates individually, and the data owner knows exactly who is viewing what.
3. Time-bound sessions. Access must be granted for a defined duration. When the session expires, the connection is severed and the drive returns to a disconnected state. There is no persistent access. At Biobank, approved researchers had ongoing access: days, weeks, or months of uninterrupted connectivity to 500,000 volunteers' data. Under time-bound sessions, a researcher views data for an approved window and then the drive disconnects.
4. View-only access with no export path. The user must be able to view, query, or analyse the data without any mechanism to download, copy, screenshot, or export it. This is not a software restriction that can be bypassed with a screen recorder or a clipboard tool. It is an architectural constraint: the data is rendered in a controlled environment, and no raw data ever reaches the user's device. At Biobank, the researcher downloaded 500,000 genetic profiles to their own infrastructure. Under view-only access, that download pathway does not exist.
5. Dataset-level isolation. Each dataset must be stored separately, so that a grant covers only the data the approved purpose requires. At Biobank, a researcher approved to study cardiovascular risk had access to genetic sequences, mental health records, and linked NHS data for all 500,000 volunteers. Under dataset-level isolation, each of those datasets sits on a separate drive. Access to one grants zero access to the others. The cardiovascular data is on one drive. The genetic sequences are on another. The mental health records are on a third. The NHS-linked data is on a fourth. Compromise of one dataset does not cascade. The researcher who sold data on Alibaba would have accessed one dataset, not all of them.
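Three of the five capabilities (identity-locked access, time-bound sessions, and dataset-level isolation) can be expressed as a single access decision. The Python sketch below is illustrative only: AccessGrant and may_view are hypothetical names, not part of any Firevault interface, and the dataset labels and times are invented for the example. Physical disconnection and view-only rendering are properties of hardware and architecture, so they are deliberately not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    user_id: str          # one named individual, never a group or shared key
    dataset_id: str       # exactly one dataset per grant
    expires_at: datetime  # end of the approved viewing window


def may_view(grant: AccessGrant, user_id: str, dataset_id: str,
             now: datetime) -> bool:
    """Allow viewing only for the named user, the named dataset, and an unexpired window."""
    return (
        grant.user_id == user_id
        and grant.dataset_id == dataset_id
        and now < grant.expires_at
    )


# Hypothetical two-hour grant for cardiovascular data only.
start = datetime(2026, 1, 5, 9, 0, tzinfo=timezone.utc)
grant = AccessGrant("researcher-a", "cardiovascular", start + timedelta(hours=2))

print(may_view(grant, "researcher-a", "cardiovascular", start + timedelta(hours=1)))  # True
print(may_view(grant, "researcher-a", "genetic", start + timedelta(hours=1)))         # False
print(may_view(grant, "researcher-a", "cardiovascular", start + timedelta(hours=3)))  # False
```

The grant names one person, one dataset, and one expiry. There is no field that could express "all datasets" or "until revoked", so a broad or open-ended grant cannot be written down in this model at all.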
Applying Controlled Access to Research Data
In the Biobank context, controlled access through offline secure storage would fundamentally change the relationship between the data owner and the researcher.
Under the current model, a researcher applies for access, is approved, signs a DUA, and is granted the ability to download 500,000 volunteers' genetic sequences, medical histories, and NHS records to their own infrastructure. The data owner loses physical control at the point of download. Everything that follows, whether the researcher complies with the licence, shares the data, or lists it on Alibaba, depends entirely on trust and contractual obligation.
Under a controlled-access model using Firevault Offline Secure Storage (OSS), the researcher would apply for access and be approved through the same governance pathway. But instead of downloading data, the researcher would authenticate to a specific Firevault drive containing only the dataset relevant to their approved research purpose: cardiovascular data only, not genetic sequences, not mental health records, not NHS-linked histories. The session would be time-bound. The data would be rendered in a view-only environment. No download, copy, or export pathway would exist. When the session ends, the drive disconnects. The 500,000 volunteers' data remains under the data owner's physical control throughout.
This does not replace ethics board review, approved researcher lists, or data use agreements. It adds a physical enforcement layer that makes the contractual terms technically enforceable. The DUA says "you must not share this data." The Firevault drive ensures you cannot.
Applying Controlled Access to Regulated Datasets
The same controlled-access model applies to any dataset where the data owner needs to grant temporary, auditable, non-exportable access to authorised users.
Financial data (FCA/DORA): The Digital Operational Resilience Act requires financial institutions to demonstrate control over critical data assets, including third-party access governance. Offline secure storage can support DORA's ICT risk management requirements by ensuring that sensitive financial data (transaction records, customer data, algorithmic trading strategies) is physically disconnected when not in use and accessible only through authenticated, time-bound sessions.
Health records (UK GDPR/HIPAA): Patient health records require the highest level of access control under both UK and international data protection frameworks. The Biobank breach demonstrated what happens when 500,000 health records are governed by licence agreements instead of physical controls. Offline secure storage provides physical separation as one of the "appropriate technical and organisational measures" that the UK GDPR requires. Each patient cohort or dataset can be isolated on a separate drive, ensuring that a breach of one dataset does not expose others.
Legal discovery: Litigation hold and legal discovery datasets are subject to strict chain-of-custody requirements. Data must be preserved in its original form, accessible only to authorised legal teams, and protected against tampering or unauthorised access. Offline secure storage provides a verifiable chain of custody: the drive is either connected and audited, or disconnected and inaccessible. There is no ambiguity about whether the data was accessed, by whom, or when.
Intellectual property: Trade secrets, proprietary algorithms, source code, and R&D data represent existential risk if exfiltrated. Unlike personal data breaches, IP theft often goes undetected for months or years. Offline secure storage eliminates the persistent-access model that makes IP theft possible. The data exists on a disconnected drive. It is accessible only during authenticated sessions. It cannot be copied, downloaded, or transmitted. When the session ends, the data returns to a state of physical isolation.
Software-Defined vs Physically Controlled Access
The following comparison illustrates the difference between software-defined access controls (the current standard) and physically controlled access (the offline secure storage model).
- Network isolation: Software-defined controls use firewalls, VPNs, and network segmentation, all of which can be bypassed through misconfiguration, privilege escalation, or insider action. Physical controls disconnect the storage medium from all networks entirely. No network path exists to reach 500,000 volunteers' genetic data.
- Authentication: Software-defined controls use passwords, SSO tokens, or API keys that can be shared, stolen, or replayed. Physical controls require multi-factor authentication bound to a specific individual and a specific device.
- Session management: Software-defined controls use session timeouts that can be extended, refreshed, or bypassed through token manipulation. Physical controls enforce time-bound sessions at the hardware level. When the session expires, the drive disconnects.
- Export prevention: Software-defined controls use DLP policies that scan for known patterns and miss everything else. Physical controls eliminate the export pathway entirely. No genetic sequence, no NHS record, no medical history reaches the user's device in a form that can be copied.
- Dataset-level isolation: Software-defined controls use role-based access control (RBAC) to restrict which tables or files a user can query. RBAC can be misconfigured, and at Biobank the approved access scope itself spanned the genetic data, medical histories, and NHS records of 500,000 people. Physical controls store each dataset on a separate drive. Access to one drive grants zero access to any other.
- Licence enforcement: Software-defined controls rely on the user to comply with the licence terms voluntarily. Physical controls make the licence terms architecturally enforceable: you cannot violate a prohibition on data export if no export mechanism exists.
- Bulk export prevention: Software-defined controls may rate-limit queries or flag large downloads. A determined user can exfiltrate data in small increments over time. Physical controls prevent any data from leaving the controlled environment, regardless of volume or method.
- Third-party governance: Software-defined controls have no mechanism to prevent an authorised user from sharing 500,000 genetic profiles with an unauthorised third party after download. Physical controls ensure the data never reaches the user's own infrastructure. There is nothing to share.
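The point about incremental exfiltration in the bulk export comparison above comes down to simple arithmetic. The sketch below uses invented limits (500 records per request, 20 requests per day) purely for illustration. They are assumptions, not figures from Biobank or from any real rate-limiting policy.

```python
import math


def days_to_exfiltrate(total_records: int, records_per_request: int,
                       requests_per_day: int) -> int:
    """Days needed to copy a dataset while staying under per-request and daily limits."""
    if min(total_records, records_per_request, requests_per_day) <= 0:
        raise ValueError("all arguments must be positive")
    per_day = records_per_request * requests_per_day
    return math.ceil(total_records / per_day)


# Hypothetical limits: 500 records per request, 20 requests per day.
print(days_to_exfiltrate(500_000, 500, 20))  # 50
```

Under these assumed limits, a user who never triggers a single large-download flag could still copy all 500,000 records in 50 days. Rate limits stretch the timeline. They do not close the pathway.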
Where Firevault Fits
If Biobank had stored each dataset on a separate Firevault drive, the approved researcher could have viewed cardiovascular data in a time-bound session with no export path. The genetic data, 500,000 volunteers' full genomic sequences, would have remained on a separate, physically disconnected drive. The mental health records would have been on another. The NHS-linked histories on a fourth. The researcher would have seen only the data relevant to their approved purpose, for a defined duration, with no mechanism to download, copy, or list any of it on Alibaba.
Firevault Offline Secure Storage (OSS) delivers all five controlled-access capabilities described in this guide. The drive is physically disconnected from all networks when not in use. Access requires multi-factor authentication bound to a named individual. Sessions are time-bound and enforced at the hardware level. Data is rendered in a view-only environment with no download, copy, or export path. Each dataset is stored on a separate drive, providing dataset-level isolation that prevents cross-contamination in the event of a single-point compromise.
This changes the fundamental relationship between the data owner and the data user. Under the current model, the data owner delegates trust: "I trust you to comply with the licence agreement." Under the Firevault model, the data owner delegates access within physical constraints: "You can view this specific dataset, for this specific duration, under these specific conditions, and the hardware enforces every one of those constraints."
Trust is no longer the control. Physics is.
Next Step: Evaluate Your Access Controls
Before selecting a storage solution, assess your current access controls against the five requirements outlined in this guide. The following questions will identify the gaps:
- Can the genetic data, medical histories, or NHS records you hold be accessed over a network connection right now?
- Is access to those datasets governed by a shared credential, a role-based group, or an individual identity?
- Can an authorised user download, copy, or export the data they are permitted to view?
- If you revoked a user's access today, how long before the revocation takes effect, and what happens to data they have already downloaded?
- Can a licensed user bulk-export the dataset they are authorised to view?
- Is each dataset physically isolated from every other dataset, or does a single access credential unlock multiple data stores?
- Are your audit trails retrospective records, or do they trigger real-time intervention?
If the answer to any of these questions concerns you, your access controls are not controlled. They are conditional, and conditions can be bypassed. 500,000 Biobank volunteers learned that the hard way.
To see how Firevault OSS applies to your specific datasets and regulatory requirements, request a walkthrough. We will map your current access model against the five physical controls and identify exactly where the gaps are.
Suggested Reading
- What is Offline Secure Storage: The foundation of physical disconnection
- Why Offline Secure Storage: The case for physical control
- Ransomware Defence: Hold gold copies offline
- Firevault Control: Physical path control for IT and OT
- Knowledge Vault: All articles, guides and whitepapers
- Book a Demo: See Firevault in action




