
Guardrails for Physical AI

There are quite a few hard problems in securing AI systems. We need to answer difficult questions just to model the threats, let alone to come up with solutions. What are the real-world assets in an agentic AI system? How do we define an objectively “bad” outcome, anyway? Still, AI is here and we need to cope with what we’ve got now, even if it’s by using imperfect shortcuts. One area where I believe shortcuts can be made relatively effective is Physical AI. This area, which covers AI used in robots, vehicles, industrial systems, and the like, may or may not be easier to model (I don’t know), but it does allow for easier specification of guardrails, which may help address the overall Physical AI security problem.

Let us discuss what guardrails are, and how they can be implemented in Physical AI systems to help address AI security challenges.


Intro to guardrails

What are guardrails?

Guardrails, in themselves, are a workaround. They are a way to connect vaguely understood threats to effective defenses, while assuming imperfection in everything from the threat modeling to the implementation of specific mitigation. Guardrails are a countermeasure that follows the broad brush approach of: “I don’t know what the exact threats are, nor who the threat actors are. I also don’t know what all attack vectors are, nor their precise impacts. But I do know that this one particular resulting situation, regardless of how it came to be, is definitely horrible, extremely unlikely to be benign, and so it shall never be allowed to happen.”

Some see guardrails as falling under the category of response, rather than that of prevention. Which of the two applies depends on whether the main impact of the attack has already occurred. Storing backups in case ransomware wipes your storage is ‘response’; having kill-switches that disconnect the storage devices if they receive an unreasonable number of deletion requests counts as ‘prevention’. In this case, as with guardrails in general, there is no guarantee that no damage is caused, just that enough of it can be averted. In this particular case our guardrail acts like a circuit-breaker. In other cases, guardrails impose operational margins that keep the system from causing (too much) harm.
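To make the circuit-breaker idea a bit more tangible, here is a minimal Python sketch of the storage kill-switch described above. Everything in it is an assumption made for illustration: the class name `DeletionCircuitBreaker`, the sliding-window policy, and the thresholds are hypothetical, and a real storage system would need its own definition of "unreasonable".

```python
from collections import deque


class DeletionCircuitBreaker:
    """Hypothetical kill-switch: trips when too many deletions arrive too quickly."""

    def __init__(self, max_deletions, window_seconds):
        self.max_deletions = max_deletions    # assumed threshold, not from the post
        self.window_seconds = window_seconds  # assumed sliding-window length
        self._timestamps = deque()            # times of recent deletion requests
        self.tripped = False

    def allow_deletion(self, now):
        """Return True if the deletion request arriving at time `now` may proceed."""
        if self.tripped:
            return False  # stays disconnected until someone resets it
        # Forget requests that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        self._timestamps.append(now)
        if len(self._timestamps) > self.max_deletions:
            # The guardrail does not ask why this happened, only that it did.
            self.tripped = True
            return False
        return True

    def reset(self):
        """Manual re-arm, for example after an operator has investigated."""
        self._timestamps.clear()
        self.tripped = False
```

Note that the breaker never tries to classify the requester as malicious or benign. It only recognizes the resulting situation, which is the whole point of a guardrail, and it is also why a legitimate bulk deletion would trip it.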

Pros and cons of guardrails

Guardrails are appealing because they require little modeling compared to other approaches, where one has to list assets, attack types or vectors, the attack agents by their capabilities, impact, etc. All threat modeling methodologies require explicitness. To deploy guardrails you do not need much of that, just a list of situations that shall be avoided however they come about.

Guardrails are not perfect, however. Deploying them requires certain conditions:

  • You need to be able to list those “horrible” situations to put guardrails in front of them. In almost every system you can probably think of one or two, but your security model cannot settle for defending only against those. Listing even just the extreme scenarios may require rigor, and the extreme scenarios are probably not all you need to defend against.
  • Not all threat scenarios, including the horrific ones, necessarily result in situations that are both:
    • identifiable,
    • impossible to occur as part of benign operation, and
    • stoppable (or preventable) effectively.

The example given above, of the storage being disconnected by a kill-switch, may also trigger false alarms, such as during system maintenance. The likelihood and cost of those false alarms need to be accounted for.

To sum it up, guardrails are not a panacea against all security concerns, but they are certainly an effective tool in places where they can be deployed, particularly against attacks that the threat model did not foresee or when other mitigations have somehow failed.


Guardrails in Physical AI

There are a few factors that make guardrails particularly interesting for securing Physical AI systems, in my opinion at least.

The motivation

First, Physical AI, like AI in general, is a discipline where security modeling is in its infancy. Security engineering practitioners are just now trying to wrap their heads around threat modeling and mitigation in AI systems, where AI inference behaves almost like a black box. A security mechanism that starts with the assumption that modeling is imperfect is thus very welcome.

Second, Physical AI systems, at least when compared to AI systems that verbally guide humans rather than machines, may lead to undesirable conditions that are (hopefully) less diverse, because their output is of a narrower span. An agentic assistant can trick the user into everything from approving a change to a system configuration, through redirecting a bank transaction, all the way to giving away their digital identity (all of which, it is worth noting, may result in outcomes of similar severity). Physical AI that drives a robot, or a car, can also be tricked into carrying out plenty of malicious actions, but at least its actuators are finite, and hence so are its possible resulting actions (and consequently the subset consisting of disastrous actions). Fortunately for Physical AI, it does not connect to a human actuator that can do too much with just a little convincing.

This does not imply that Physical AI systems have a lower potential for harm. I would still (marginally) prefer my bank account wiped clean over my car throwing me off a cliff. It is just that, from a pure engineering perspective, an AI agent has more actuation diversity than, say, a car, so hopefully it is simpler to define guardrails for the car.

The challenges

Two challenges come to mind when implementing guardrails for a Physical AI system. The first is that guardrail logic that is sufficiently capable of addressing real-world threats often needs to be AI-based itself, and is hence just as susceptible to the unknown consequences of well-crafted malicious input. The second is the more traditional challenge of requiring the guardrail enforcement mechanism to be systematically privileged over the rest of the system whose compromise it tries to contain.

The following subsections detail each of those two challenges.

It takes AI to watch AI

Guardrails that are based on trivial conditions can be implemented outside the boundaries of AI. Those are limited, however. Most of the contextual information that is needed to enforce guardrails on the actions of Physical AI is available only within the AI process. A robot arm can be taught to never enter a certain fixed area, but it cannot be taught to never approach a standing person at a speed high enough to cause injury, because the location of objects and their classification may only be available within the AI process. For complex guardrails to be effective, they need to be made part of an AI inference model, so they can utilize its judgment, or at least its contextual awareness.

This is hard to implement securely. The AI-based logic that we need to trust to enforce a guardrail is the very same AI logic that we are trying to secure, knowing it could be compromised into producing dangerous judgment calls.
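The difference between the two robot-arm guardrails can be sketched in a few lines of Python. This is only an illustration under assumed names and units: `KeepOutBox`, `target_allowed`, and `speed_allowed` are hypothetical, and the zones and thresholds are made up. The first check needs nothing but fixed geometry; the second cannot even be evaluated without `person_positions`, which in a real system would come from perception, that is, from the AI process itself.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple

Point = Tuple[float, float, float]  # x, y, z in meters (assumed convention)


@dataclass(frozen=True)
class KeepOutBox:
    """Axis-aligned region the arm must never enter (fixed at design time)."""
    min_corner: Point
    max_corner: Point

    def contains(self, point: Point) -> bool:
        return all(
            lo <= p <= hi
            for lo, p, hi in zip(self.min_corner, point, self.max_corner)
        )


def target_allowed(target: Point, zones: Iterable[KeepOutBox]) -> bool:
    """Trivial guardrail: pure geometry, no AI-derived input required."""
    return not any(zone.contains(target) for zone in zones)


def speed_allowed(
    target: Point,
    speed_m_s: float,
    person_positions: Iterable[Point],  # produced by perception, i.e. by AI
    safe_distance_m: float,
    max_speed_near_person_m_s: float,
) -> bool:
    """Contextual guardrail: only as trustworthy as the perception feeding it."""
    near_person = any(
        math.dist(target, person) < safe_distance_m for person in person_positions
    )
    return not (near_person and speed_m_s > max_speed_near_person_m_s)
```

If an attacker can make the perception model miss the person, `speed_allowed` happily returns True for a dangerous motion, even though its own logic is only a handful of lines and free of bugs.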

Execution separation

In security engineering, it is common to have one piece of logic (say, a program) that imposes restrictions on another, or that is expected to operate benignly even when other parts of the system are compromised. An example of the former is anti-virus software that monitors another program and its actions and prevents it from running (or even removes it altogether) if deemed malicious. An example of the latter is a TPM (Trusted Platform Module), which is trusted to authentically measure aspects of the system even when other parts of it may have been compromised. In each of those cases, one of two conditions holds:

  • either the enforcing logic runs in a separate execution environment and is systematically privileged over the other, or
  • it does not, that is, it runs in the same environment or in one that is not necessarily more secure or privileged than the other.

The TPM example above, smart-cards used for authentication, TrustZone™, and similar trusted execution environments, are all examples of the first case. The anti-virus is often an example of the second case; in many deployments the anti-virus software has no privilege over the malware it tries to protect against. Whatever the anti-virus program can do to the virus can also be done the other way around, depending on which piece of software is more clever or acts first.

The defender (us, given I reached my intended audience) is much better off when the defense logic is systematically privileged over the rest of the system it protects. Of course, the defense logic can still be tricked into performing unintended operations by anything from exploitation of API logic bugs to physical intrusion, but the overall integrity of the executed logic can be (carefully) assumed. When the defense logic is not systematically separated and privileged, we get the type of security that resembles an endless cat-and-mouse game, as we see with anti-malware, DRM on open platforms, and software license checks.

We put guardrails on a system because an attacker may find a way to affect its behavior. If we deploy our guardrails in the same execution space, or using the same potentially-vulnerable AI as the rest of the system, we may be forced to repeatedly outsmart the attacker. This barely works. For example, guardrails in agentic AI that instruct it to ask the user for confirmation before performing a dangerous operation can sometimes be tricked by an attacker injecting more convincing instructions into the prompt.

Just like in the real world, sustainable enforcement requires a systemic privilege in the system. Courts would have been less effective if you could put the judge in jail as easily as the judge can put you.

Approaching a solution

A reasonable approach is to run the guardrail logic on a separate AI-enabled component. It should use a separate AI model, one that cannot be readily abused under the same circumstances in which the primary one could, and it should potentially run in an execution environment that can be isolated from where the primary AI logic runs. Whether such isolation is needed depends on whether execution integrity is perceived to be threatened by attacks against the AI model (it is not necessarily so).

When using a separate AI model for implementing guardrails, the following few design principles shall be considered:

  1. Execution independence: The guardrail AI should run in a separate execution environment from the rest of the system. The type of separation will depend on the threat model of the platform. For example, if the risk we perceive is constrained to the contamination of AI model results, and those results cannot affect execution integrity, then a separate execution level may not be required, only independence between the two AI models.
  2. Model simplicity: The guardrail AI is chartered only with sanity-checking results of the main model. It requires only the capabilities needed for observing the output of the main model, establishing the physical context, and matching the output against strict “no go” conditions (our guardrails). It needs to do so with high consistency and with very little “creativity” (i.e., low temperature).
  3. Model and inference hygiene: To protect the guardrail AI from some of the threats that may affect the main AI logic, it shall be one that is not trained on, nor infers using, information that might not be trustworthy or that may be controlled by an attacker (according to the threat model of the use-case). This is in contrast to the main AI, which operates on as much information as it can glean from its environment. The role of the guardrail AI is to process the output of the main AI and to test it against coarsely-defined extreme scenarios, not to judge the quality of that output, so the less direct exposure it has to un-sanitized input that could adversely mislead it, the better.
  4. Constrained output: The output of the guardrail AI should ideally be binary, indicating whether or not a given output of the main AI hits a guardrail. The results should be gated to simple non-AI circuit breakers or enforcers.
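The fourth principle, and the way it ties to the others, can be sketched as follows. This is a hypothetical Python illustration and not a reference design: `Action`, `Enforcer`, and the callables passed to it are invented names, and the guardrail model is represented by any callable that returns a boolean. The one design choice worth pointing out is that the non-AI gate fails closed, so anything other than a clean "no guardrail hit" verdict leads to a safe stop.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Action:
    """A proposed output of the main AI (fields are illustrative assumptions)."""
    name: str
    speed_m_s: float


# The guardrail AI is reduced to its contract: (action, limited context) -> bool,
# where True means "a guardrail is hit". How it decides is out of scope here.
GuardrailModel = Callable[[Action, Mapping[str, float]], bool]


class Enforcer:
    """Simple non-AI gate sitting between the main AI and the actuators."""

    def __init__(
        self,
        guardrail: GuardrailModel,
        actuate: Callable[[Action], None],
        safe_stop: Callable[[], None],
    ) -> None:
        self._guardrail = guardrail
        self._actuate = actuate
        self._safe_stop = safe_stop

    def submit(self, action: Action, context: Mapping[str, float]) -> bool:
        """Forward `action` only on an explicit 'no guardrail hit' verdict."""
        try:
            hit = self._guardrail(action, context)
        except Exception:
            hit = True  # fail closed: a crashing guardrail counts as a hit
        if hit is not False:
            self._safe_stop()  # circuit-breaker behavior, no negotiation
            return False
        self._actuate(action)
        return True
```

In a deployment that follows the first principle, `Enforcer` and the guardrail model would live in an environment that the main AI cannot influence; placing them in the same process, as in this sketch, only shows the data flow.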

This approach yields a guardrail AI that runs against the main AI of a Physical AI system. The guardrail AI has limited creativity by design and is built to run without the full set of (potentially malicious) input vectors that the main AI is subject to, and with limited context data. In return, it is expected to only process the output actions of the main AI (along with limited context) and examine them only against a limited set of well-identified, definitely-malicious scenarios. Among the many uses of AI, all of which beg for proper security assurances, the Physical AI domain is probably unique in allowing this type of mitigation to be feasible and effective.

