AI innovation is moving in-house. As you shift from leveraging cloud AI platforms to building powerful "AI factories" in your own data centers, you're increasing both opportunity and risk.
AI factories promise greater control, customization, and cost savings. However, they also introduce challenges for AI governance, IT, and security teams: autonomous AI agents, AI identity management complexities, unsanctioned model deployments, and excessive access to sensitive data.
If you’re not accustomed to securing highly specialized, high-performance environments, understanding the controls and resources necessary to reduce risk is more important now than ever.
This blog explores how to mitigate the risks of AI factories using three pillars: NIST's HPC security guidance, NVIDIA's secure-by-design AI factory architecture, and identity-centric security controls.
AI factories are on-premise or hybrid data centers purpose-built to develop, train, and deploy AI at scale.
Unlike traditional cloud-based AI services, AI factories give you full control over your data, infrastructure, and model pipelines. They typically combine high-performance GPU clusters, ultra-fast networking, parallel storage, and orchestration frameworks to industrialize the AI lifecycle, from raw data ingestion to model deployment, much like a factory produces goods from raw materials.
Unlike managed cloud services (where a third-party provider secures much of the stack), the organization running an AI factory is fully responsible for locking down these advanced environments.
Without effective identity management and privileged access controls, the AI engines driving innovation can become vectors for insider threats, data breaches, and compliance failures. In fact, insider misuse is a significant concern in high-performance computing (HPC)/AI clusters. Even authorized users might abuse high-performance systems, or malicious hackers might steal credentials to gain unauthorized access.
To address the unique nature of AI factories, NIST has developed the draft SP 800-234, High-Performance Computing (HPC) Security Overlay. This framework recognizes that large-scale AI training and simulation environments have requirements beyond traditional IT.
The goal is practical, performance-conscious security that can safeguard AI models and sensitive data without hindering the mission. This emphasis on performance-conscious security is particularly significant for organizations that tend to perceive security as a bottleneck to progress.
NIST's guidance provides a crucial roadmap, allowing you to confidently accelerate your AI initiatives, knowing security is built in rather than an afterthought. By offering a structured framework, NIST reduces the uncertainties and ad-hoc security decisions that can impede fast-paced AI development.
This approach transforms security from a reactive impediment into a proactive enabler of AI progress.
NIST's guidance emphasizes a zone-based architecture (isolating access, management, compute, and storage zones), with role-based access and least privilege enforced across those zones. It also highlights strong authentication, software governance, and comprehensive auditing, tailored to the performance and scale of AI environments. The table below highlights the most important controls in NIST SP 800-234.
| NIST Tailored Control | Risk Addressed in AI Factory |
| --- | --- |
| AC-2 Account Management (zone-based roles) | Prevents unmanaged or excessive accounts by tying every identity to an authorized role and zone access. |
| AC-6 Least Privilege (separate admin and user roles) | Avoids over-privileged machine learning (ML) workloads by enforcing clear separation of duties (admins shouldn't run jobs with root privileges). |
| CM-11 User-Installed Software (isolation and monitoring) | Blocks unsanctioned or "shadow AI" deployments by restricting unauthorized software installation and requiring oversight for user-developed code. |
| IA-2 / SC-8 Strong Authentication (Kerberos) | Prevents impersonation and spoofing by requiring multi-factor/Kerberos authentication for all users and using secure, non-routable or encrypted channels for sensitive data. |
| AU-2 Comprehensive Audit Logging | Ensures visibility into AI operations and helps detect unauthorized actions or anomalies, while balancing HPC performance (e.g., prioritizing critical logs in management zones). |
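To make the zone and role concepts concrete, here is a minimal, illustrative Python sketch (not taken from the NIST overlay itself) of how an AI factory might tie each identity to a role, limit that role to specific zones, and reserve privileged actions for administrators. The role names, zones, and actions are hypothetical examples.

```python
from dataclasses import dataclass

# Hypothetical zone/role model for illustration only; a real deployment would
# source these mappings from a directory service, not hard-coded dictionaries.
ZONES = {"access", "management", "compute", "storage"}

# Each role is allowed into a subset of zones (AC-2: accounts tied to roles/zones).
ROLE_ZONES = {
    "ml-engineer": {"access", "compute"},
    "data-steward": {"access", "storage"},
    "cluster-admin": {"access", "management"},
}

# Privileged actions are restricted to admin roles (AC-6: least privilege).
PRIVILEGED_ACTIONS = {"install_software", "modify_scheduler", "rotate_keytab"}

@dataclass
class Request:
    user: str
    role: str
    zone: str
    action: str

def is_allowed(req: Request) -> bool:
    """Return True only if the role may enter the zone and perform the action."""
    if req.zone not in ROLE_ZONES.get(req.role, set()):
        return False  # zone isolation: this role has no business in this zone
    if req.action in PRIVILEGED_ACTIONS and req.role != "cluster-admin":
        return False  # least privilege: only admins perform privileged actions
    return True

if __name__ == "__main__":
    print(is_allowed(Request("alice", "ml-engineer", "compute", "submit_job")))      # True
    print(is_allowed(Request("alice", "ml-engineer", "management", "submit_job")))   # False
    print(is_allowed(Request("bob", "data-steward", "storage", "install_software"))) # False
```

In a production environment, these mappings would come from a central directory and policy engine rather than in-memory dictionaries, but the decision logic is the same.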
NIST's HPC overlay makes it clear that identity-centric controls are non-negotiable in high-performance AI environments. Following these tailored controls allows you to create zone-isolated, least-privilege ecosystems where every account, process, and dataset is governed. The overlay effectively puts an "umbrella" of best practices over all your AI factories, ensuring that fundamentals like account management, authentication, and auditing are never sidelined even at extreme scale and speed.
NVIDIA's AI factory concept aligns seamlessly with NIST's guidance. It's essentially a secure-by-design blueprint for AI data centers. These AI-centric facilities integrate large GPU clusters with ultra-fast networks (for example, NVLink, Spectrum-X, and InfiniBand for RDMA), shared high-performance storage (often NFS or a parallel file system accelerated by RDMA), and AI-focused orchestration software (such as the NVIDIA AI Enterprise suite).
In this industrialized approach, data is the raw material, GPUs are the machinery, and AI models are the products. Organizations are increasingly adopting this on-premise AI factory model to regain control over sensitive data, manage costs, and customize AI development pipelines to their needs.
However, all is not rosy.
While NVIDIA's AI factory represents a revolutionary step for innovation, its immense power and flexibility paradoxically amplify security challenges. The very features that make these environments powerful—massive GPU clusters, complex software stacks, and operations requiring elevated privileges for performance—also create unique attack surfaces and identity management complexities that traditional security models often struggle to address.
In short, the AI factory can become a wild west of identities, credentials, and code if left unchecked.
NIST's zone-based approach directly helps here.
However, you need strong identity security tooling alongside the architecture to implement these practices effectively. This is where solutions like Delinea identity security come in.
Delinea pioneered security for large-scale computing clusters: Hadoop environments such as Cloudera, Hortonworks (since merged with Cloudera), and MapR (acquired by HPE), all of which share many traits with AI factories. The same identity and privilege controls that kept big data platforms in check apply directly to today's AI data centers.
Delinea integrates Linux-based AI clusters with Active Directory (AD), providing unified identities and Kerberos single sign-on across all nodes. A unified identity service is often required so that job processes on any node run as the user who submitted the job.
Users and services authenticate via centrally managed Kerberos tickets, eliminating hard-coded passwords and ensuring trust between components. This also enables strong authentication from the cluster to network-attached storage via NFSv4 where Kerberos and unified identity control access to the remote file systems. This addresses AC-2 and IA-2 controls by tying accounts to a single directory and enabling strong, token-based authentication.
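As a rough illustration of what "no hard-coded passwords" means in practice, the Python sketch below (not Delinea code) verifies that the submitting user already holds a valid Kerberos ticket before a job touches Kerberized NFSv4 storage, relying on the standard MIT Kerberos klist utility rather than embedded credentials. The user and workflow are placeholders.

```python
import getpass
import subprocess
import sys

def has_valid_kerberos_ticket() -> bool:
    """Return True if the invoking user holds a valid Kerberos ticket.

    `klist -s` (MIT Kerberos) exits 0 when the credential cache contains a
    valid, unexpired ticket; no passwords are read or stored by this script.
    """
    try:
        return subprocess.run(["klist", "-s"], check=False).returncode == 0
    except FileNotFoundError:
        return False  # Kerberos client tools are not installed on this node

def main() -> int:
    user = getpass.getuser()
    if not has_valid_kerberos_ticket():
        print(f"{user}: no valid Kerberos ticket; authenticate via SSO or kinit "
              "before accessing Kerberized NFSv4 storage.", file=sys.stderr)
        return 1
    # At this point the job can run as the submitting user, and access to the
    # remote file system is gated by that user's Kerberos identity rather than
    # a shared or hard-coded credential.
    print(f"{user}: ticket present; proceeding with job submission.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```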
AI factories' unique scale and automation introduce a critical new dimension to identity management—the proliferation of machine and service identities.
High-performance systems often rely on service accounts for schedulers, data movers, AI microservices, or Model Context Protocol (MCP) servers. These often-overlooked accounts, running with elevated privileges for performance, become significant attack vectors if they are not centrally managed and secured.
Delinea's Server Suite automates account provisioning, credential rotation, and lifecycle management of these accounts (including Kerberos keytab distribution). This prevents credential sprawl and human error—no orphaned or default passwords—mitigating the risk of unmanaged identities or forgotten backdoor accounts.
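The following simplified Python sketch illustrates the lifecycle idea in the abstract; it does not call any Delinea API. It reviews a hypothetical inventory of service accounts, flags orphaned accounts, and triggers a placeholder rotation step for credentials older than the rotation window.

```python
from datetime import datetime, timedelta, timezone

ROTATION_WINDOW = timedelta(days=30)

# Hypothetical inventory for illustration; a real deployment would pull this
# from a central identity platform rather than a literal list.
SERVICE_ACCOUNTS = [
    {"name": "svc-scheduler", "owner": "batch-system",
     "last_rotated": datetime(2025, 1, 5, tzinfo=timezone.utc)},
    {"name": "svc-datamover", "owner": "storage-team",
     "last_rotated": datetime(2025, 3, 20, tzinfo=timezone.utc)},
    {"name": "svc-legacy", "owner": None,
     "last_rotated": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]

def rotate_credential(account_name: str) -> None:
    """Placeholder for the actual rotation step (e.g., reissuing a keytab)."""
    print(f"rotating credential for {account_name}")

def review_service_accounts(now: datetime) -> None:
    for acct in SERVICE_ACCOUNTS:
        if acct["owner"] is None:
            # Orphaned accounts are prime backdoor candidates; flag for removal.
            print(f"ORPHANED: {acct['name']} has no owning service; disable and investigate")
            continue
        if now - acct["last_rotated"] > ROTATION_WINDOW:
            rotate_credential(acct["name"])

if __name__ == "__main__":
    review_service_accounts(datetime.now(timezone.utc))
```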
By mapping AD Groups to role-based privileges on cluster resources, Delinea enforces least privilege (fulfilling AC-6). Administrators have just-in-time privileged access: for example, an admin can elevate to root on a specific node for maintenance with MFA and audit logging enforced, but cannot use that privilege to run regular AI workloads. Unprivileged users cannot escalate their rights.
This separation and control closes the door on root or sudo rights abuse, tackling the insider threat head-on. Multi-factor authentication (MFA) is applied to all sensitive actions, adding an extra hurdle against compromised credentials.
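For illustration, here is a conceptual Python sketch of the just-in-time pattern described above; it is not Delinea's implementation. It grants a short-lived, node-scoped elevation only after an MFA check, refuses to let that grant run ordinary AI workload commands, and logs every decision. The MFA check, node names, and command list are placeholders.

```python
from datetime import datetime, timedelta, timezone

ELEVATION_TTL = timedelta(minutes=30)
WORKLOAD_COMMANDS = {"python", "torchrun", "mpirun"}  # jobs must not run as root

def mfa_verified(user: str) -> bool:
    """Stand-in for a real MFA challenge (push, OTP, or hardware token)."""
    return True  # assume the challenge succeeded for this illustration

def request_elevation(user: str, node: str, reason: str) -> dict | None:
    """Issue a short-lived, node-scoped elevation grant, or None if refused."""
    if not mfa_verified(user):
        return None
    grant = {
        "user": user,
        "node": node,
        "expires": datetime.now(timezone.utc) + ELEVATION_TTL,
        "reason": reason,
    }
    print(f"AUDIT: elevation granted to {user} on {node} for '{reason}'")
    return grant

def run_as_root(grant: dict, node: str, command: str) -> bool:
    """Allow a privileged command only within the grant's node and lifetime."""
    now = datetime.now(timezone.utc)
    if grant["node"] != node or now > grant["expires"]:
        print(f"AUDIT: denied root on {node}: grant out of scope or expired")
        return False
    if command.split()[0] in WORKLOAD_COMMANDS:
        print("AUDIT: denied: elevation may not be used to run AI workloads")
        return False
    print(f"AUDIT: {grant['user']} ran '{command}' as root on {node}")
    return True

if __name__ == "__main__":
    g = request_elevation("admin1", "gpu-node-07", "replace failed NIC driver")
    if g:
        run_as_root(g, "gpu-node-07", "modprobe -r old_nic_driver")   # allowed
        run_as_root(g, "gpu-node-07", "torchrun train.py")            # refused
```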
With Delinea, every access and administrative action on your AI factory infrastructure is recorded.
Delinea captures session logs, command histories, and security events across the environment (addressing AU-2 and related audit controls). In an AI context, this means the ability to trace which user or service accessed a training dataset, who modified a configuration, or which process initiated an unusual data transfer.
Comprehensive audit trails support compliance and are essential for forensic investigations in the event of an incident.
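As a simple illustration of the kinds of questions such an audit trail answers ("who touched this dataset, and did anything move somewhere unexpected?"), the Python sketch below filters a few hypothetical session events. Real deployments would query centrally collected logs; the event fields, destinations, and values here are placeholders.

```python
# Hypothetical audit events for illustration; field names are placeholders.
EVENTS = [
    {"ts": "2025-05-01T09:12:00Z", "actor": "alice", "action": "read",
     "target": "/data/train/medical-v3"},
    {"ts": "2025-05-01T09:40:00Z", "actor": "svc-datamover", "action": "copy",
     "target": "/data/train/medical-v3", "dest": "nfs://archive01/backups"},
    {"ts": "2025-05-01T22:03:00Z", "actor": "bob", "action": "copy",
     "target": "/data/train/medical-v3", "dest": "s3://personal-bucket/export"},
]

APPROVED_DESTINATIONS = ("nfs://archive01/",)

def who_accessed(dataset: str):
    """Return every actor, action, and timestamp recorded against a dataset."""
    return [(e["actor"], e["action"], e["ts"]) for e in EVENTS if e["target"] == dataset]

def flag_unusual_transfers() -> None:
    """Flag copies of sensitive data to destinations outside the approved list."""
    for e in EVENTS:
        dest = e.get("dest")
        if e["action"] == "copy" and dest and not dest.startswith(APPROVED_DESTINATIONS):
            print(f"ALERT: {e['actor']} copied {e['target']} to {dest} at {e['ts']}")

if __name__ == "__main__":
    print(who_accessed("/data/train/medical-v3"))
    flag_unusual_transfers()
```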
In practice, Delinea's Privilege Control for Servers (PCS) and Server Suite bring all these capabilities into an AI factory deployment:
1. Seamless AD/Kerberos identity integration across AI compute nodes, storage servers, and orchestration tools, ensuring one source of truth for user identities and credentials.
2. Automated provisioning, credential rotation, and lifecycle management for service and machine accounts, including Kerberos keytab distribution.
3. Role-based least privilege with just-in-time, MFA-protected elevation and clear separation between administrative and workload duties.
4. Session recording and comprehensive audit logging across the environment.

By leveraging these capabilities, you can map directly to the NIST SP 800-234 controls and mitigate the risks unique to AI factories. A scalable, centralized platform handles the heavy lifting of identity security that is often overlooked in high-performance environments.
The rapid growth of enterprise AI is indeed creating new security blind spots. As you build AI factories, you are effectively establishing mini-supercomputers on-premise, and the responsibility for securing this dynamic, high-performance infrastructure rests entirely with you.
NIST SP 800-234 provides a much-needed roadmap to harden these AI factories, encompassing both technical controls and process guidance. NVIDIA's reference architectures demonstrate that performance and security can coexist, particularly when adhering to a zone-based, secure-by-design philosophy.
It is the identity and access layer, however, that truly ties these components together.
Delinea brings the identity security rigor these environments demand, ensuring that only authorized individuals and services have the appropriate access at the right time, and that all actions are fully accountable.
Ultimately, robust, identity-centric security is not merely a defensive measure but also a strategic enabler. Organizations that proactively integrate security into their AI factories from Day One will mitigate risks and build a foundation of trust and compliance that accelerates their AI initiatives, turning security into a competitive differentiator in the race for AI leadership.
By treating identity as the connective tissue of your AI security strategy, you can confidently embrace the AI factory model to accelerate insights while safeguarding critical assets. The organizations that succeed with on-premise AI will embed security and identity into the fabric of their AI factories from the outset, rather than treating them as an afterthought.
This outcome is achievable by adhering to frameworks like NIST SP 800-234 and using purpose-built identity security tools from Delinea.