Security
Containers don’t isolate themselves; you isolate them. Perry exposes the
standard OCI security knobs on both ContainerSpec (single-container)
and ComposeService (orchestrated stacks), plus first-party support
for Sigstore / cosign image verification and a workload-graph policy
tier API for declarative isolation levels.
Per-container security knobs
The same set of fields works on run(), create(), and any service in a
compose up():
| Field | Type | Effect | Cross-backend |
|---|---|---|---|
| read_only | boolean | Mount the root filesystem as read-only. Forces all writable state to be in declared volumes. | All backends |
| privileged | boolean | Run privileged: grants ALL Linux capabilities + access to host devices. Avoid unless absolutely necessary. | Docker / Podman / Lima only; apple/container has no such concept and drops the field with a warning |
| user | string | UID, username, or "UID:GID". Runs the container’s processes as that identity. An image whose CMD does its own user-switching overrides this, but most properly built images respect it. | All backends |
| workdir | string | Working directory inside the container. | All backends |
| cap_add | string[] | Linux capabilities to add. Specific (e.g. ["NET_BIND_SERVICE"]), not blanket. | All backends |
| cap_drop | string[] | Capabilities to drop. ["ALL"] is the canonical “drop everything” starting point. | All backends |
| seccomp | string | Seccomp profile path or "default" (uses the runtime’s default profile). | Docker / Podman / Lima only; apple/container has no equivalent and drops the field with a warning |
⚠️ Cross-backend security caveat.
privileged, seccomp, --security-opt no-new-privileges, IPC/PID namespace sharing, and SELinux mount labels are not honored on apple/container; its Apple-VM model means those concepts don’t translate. Perry’s normalization pass drops the fields and emits a tracing::warn! rather than silently downgrading the security policy. For production deployments that demand cross-backend parity, set EnforcementMode::Strict on the engine: any unsupported security field then becomes a hard up() failure rather than a drop with a warning. Full matrix at Cross-Backend Determinism.
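The drop-with-warning versus strict-failure behavior can be modeled in a few lines. This is only a sketch of the idea, not Perry’s implementation: the normalize function, the Backend and Mode types, and the "warn" mode name are all assumptions made for illustration, and only two of the unsupported fields are covered.

```typescript
// Hypothetical model of the normalization pass. Type and function names
// are illustrative assumptions, not Perry's actual internals.
type Backend = "docker" | "podman" | "lima" | "apple/container";
type Mode = "warn" | "strict"; // "strict" mirrors EnforcementMode::Strict

interface SecurityFields {
  read_only?: boolean;
  privileged?: boolean;
  seccomp?: string;
  cap_drop?: string[];
}

// Subset of the fields the caveat lists as not honored on apple/container.
const UNSUPPORTED_ON_APPLE: Array<keyof SecurityFields> = ["privileged", "seccomp"];

function normalize(
  spec: SecurityFields,
  backend: Backend,
  mode: Mode,
): { spec: SecurityFields; warnings: string[] } {
  const out: SecurityFields = { ...spec };
  const warnings: string[] = [];
  if (backend !== "apple/container") {
    return { spec: out, warnings }; // these backends honor every field modeled here
  }
  for (const field of UNSUPPORTED_ON_APPLE) {
    if (out[field] === undefined) continue;
    if (mode === "strict") {
      // Strict mode: refuse to start instead of weakening the policy.
      throw new Error(`security field "${field}" is not supported on ${backend}`);
    }
    delete out[field];
    warnings.push(`dropped "${field}": not supported on ${backend}`);
  }
  return { spec: out, warnings };
}
```

In this model a spec with privileged: true passes through untouched on Docker, loses the field with one warning on apple/container in "warn" mode, and throws in "strict" mode.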
Recommended baseline
Start with maximum isolation and add back only what the workload needs:
import { run as runSecure } from "perry/container";
// Maximum-isolation single-container run for an untrusted workload:
// - read-only root filesystem
// - no Linux capabilities at all
// - non-root user
// - working directory pinned
// - default seccomp profile
async function runUntrustedWorkload(): Promise<void> {
  await runSecure({
    image: "alpine:3.19",
    cmd: ["sh", "-c", "echo isolated && exit 0"],
    read_only: true,
    cap_drop: ["ALL"],
    user: "nobody",
    workdir: "/tmp",
    seccomp: "default",
  });
}
Field-by-field rationale:
- read_only: true. Even an exploit that lands code execution can’t persist to the image’s filesystem. Anything mutable goes into a declared volume.
- cap_drop: ["ALL"]. Removes Linux capabilities the workload didn’t explicitly ask for. Most apps need none.
- user: "nobody". Non-root inside the container. If the image doesn’t have a nobody user, replace with "65534:65534" (the numeric UID/GID of nobody on most distros).
- workdir: "/tmp". The only writable location under read_only: true is /tmp (which is tmpfs-backed by default).
- seccomp: "default". Uses Docker’s default seccomp profile (~50 syscalls blocked).
Capability addition patterns
Combine cap_drop: ["ALL"] with a targeted cap_add:
| Workload | Capabilities |
|---|---|
| Web server binding to port 80/443 | cap_add: ["NET_BIND_SERVICE"] |
| Network namespace manipulation | cap_add: ["NET_ADMIN"] |
| Kernel time setting | cap_add: ["SYS_TIME"] |
| chown to other users (rare) | cap_add: ["CHOWN"] |
| Bind-mount filesystems inside | cap_add: ["SYS_ADMIN"] (still avoid if possible) |
The full capability list is in the capabilities(7) man page. Always start
with cap_drop: ["ALL"] and add back only the capabilities whose absence
breaks the workload; most applications need zero capabilities.
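One way to keep the "drop everything, add back the minimum" rule from drifting is to generate the two fields together. The helper below is a hypothetical illustration: leastPrivilege is not part of Perry, and only the cap_drop and cap_add field names come from this page.

```typescript
// Hypothetical helper. Only the cap_drop / cap_add field names come from
// the table above; the function itself is an illustration.
interface CapabilityFields {
  cap_drop: string[];
  cap_add: string[];
}

function leastPrivilege(needed: string[] = []): CapabilityFields {
  // Accept "net_bind_service" or "CAP_NET_BIND_SERVICE" and emit the
  // short upper-case form used in the table.
  const names = needed.map((c) => c.toUpperCase().replace(/^CAP_/, ""));
  if (names.some((c) => c === "ALL")) {
    throw new Error('adding "ALL" back would undo cap_drop: ["ALL"]');
  }
  // De-duplicate while keeping first-seen order.
  const unique = names.filter((c, i) => names.indexOf(c) === i);
  return { cap_drop: ["ALL"], cap_add: unique };
}

// A proxy that only needs to bind ports below 1024.
const proxySpec = {
  image: "myapp/proxy",
  ...leastPrivilege(["NET_BIND_SERVICE"]),
};
```

Spreading the result into a spec keeps cap_drop and cap_add next to each other, so a review can see at a glance which capabilities a workload was granted.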
Image verification
Set PERRY_CONTAINER_VERIFY_IMAGES=1 to enable cosign keyless
verification on every run(), create(), and pullImage() call:
export PERRY_CONTAINER_VERIFY_IMAGES=1
./my-app
Perry’s verifier:
- Resolves the image tag to its digest via inspect_image.
- Looks up the digest in an in-memory VERIFICATION_CACHE, so subsequent runs against the same digest are free.
- Runs cosign verify --certificate-identity ${CHAINGUARD_IDENTITY} --certificate-oidc-issuer ${CHAINGUARD_ISSUER} <ref>@<digest> and caches pass/fail.
- On fail, the FFI rejects with a verification failed error (the container is never created).
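The steps above amount to a digest-keyed cache in front of a cosign call. The sketch below models that flow synchronously; it is not Perry’s verifier. The ImageVerifier class and its two callbacks are assumptions: resolveDigest stands in for inspect_image and runCosign stands in for the cosign verify invocation.

```typescript
// Simplified synchronous model of the verifier steps. The two callbacks
// are stand-ins (assumptions): resolveDigest for inspect_image, runCosign
// for the `cosign verify ...` invocation.
type ResolveDigest = (ref: string) => string;
type RunCosign = (refAtDigest: string) => boolean;

class ImageVerifier {
  // digest -> pass/fail, playing the role of VERIFICATION_CACHE.
  private cache = new Map<string, boolean>();
  private resolveDigest: ResolveDigest;
  private runCosign: RunCosign;

  constructor(resolveDigest: ResolveDigest, runCosign: RunCosign) {
    this.resolveDigest = resolveDigest;
    this.runCosign = runCosign;
  }

  // Throws if the image fails verification; returns the digest otherwise.
  verify(ref: string): string {
    const digest = this.resolveDigest(ref);
    let passed = this.cache.get(digest);
    if (passed === undefined) {
      // Cache miss: run cosign once and remember the outcome.
      passed = this.runCosign(`${ref}@${digest}`);
      this.cache.set(digest, passed);
    }
    if (!passed) {
      throw new Error(`verification failed for ${ref} (${digest})`);
    }
    return digest;
  }
}
```

Because both passes and failures are cached, a known-bad digest is rejected again without a second cosign call, and because the cache lives in memory it starts empty in every new process.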
The default identity and issuer point at Chainguard’s keyless signing flow:
| Const | Value |
|---|---|
| CHAINGUARD_IDENTITY | https://github.com/chainguard-images/images/.github/workflows/sign.yaml@refs/heads/main |
| CHAINGUARD_ISSUER | https://token.actions.githubusercontent.com |
For your own org’s images, override these via the (planned) per-call
verification options. For now, using Chainguard-signed base images is
the path of least resistance — cgr.dev/chainguard/<tool> is signed.
Cosign required. Set PERRY_CONTAINER_VERIFY_IMAGES=1 only when cosign is installed and on PATH. Verification is OFF by default so that bare-metal ./my-app execution doesn’t depend on a separate cosign install.
Capability sandbox helper
For one-off command execution against an untrusted image (CI helper,
build tool, code-evaluation sandbox), use the run_capability pattern,
which wraps run() with the maximum-isolation defaults:
- read_only: true
- cap_drop: ["ALL"]
- No network attached
- user: "nobody"
- Image verified via cosign before pull
This is the same path the internal perry-stdlib::container::capability
module uses for shell-command sandboxing in plugin systems.
Workload-graph policy tiers (perry/workloads)
For multi-node deployments where different workloads have different
trust levels, the workload-graph engine accepts a per-node policy:
import { graph, runGraph, runtime, policy } from "perry/workloads";
const g = graph("my-app", {
  trusted_db: {
    image: "postgres:16-alpine",
    runtime: runtime.oci(),
    policy: policy.default(), // no extra hardening
  },
  isolated_api: {
    image: "myapp/api",
    runtime: runtime.oci(),
    policy: policy.isolated(), // no_network=true
  },
  hardened_proxy: {
    image: "myapp/proxy",
    runtime: runtime.oci(),
    policy: policy.hardened(), // read_only_root + seccomp
  },
  untrusted_eval: {
    image: "myapp/sandbox",
    runtime: runtime.microvm(), // required by the untrusted tier
    policy: policy.untrusted(), // microVM-only, all hardening on
  },
});
await runGraph(g);
The four PolicyTier levels and what they enforce:
| Tier | no_network | read_only_root | seccomp | microvm |
|---|---|---|---|---|
| default() | - | - | - | - |
| isolated() | ✅ | - | - | - |
| hardened() | - | ✅ | ✅ | - |
| untrusted() | ✅ | ✅ | ✅ | required |
untrusted requires kernel-level isolation (i.e. a microVM, not a
shared-kernel container). When the active backend doesn’t expose a
microVM runtime (such as apple/container’s VM mode, Lima, or Firecracker),
the engine returns BackendNotAvailable rather than silently dropping the
isolation guarantee. Setting PERRY_ALLOW_UNTRUSTED_SHARED_KERNEL=1 opts
out of this check, which is not recommended for actually-untrusted code.
User-explicit per-flag overrides on top of a tier are honored: setting
policy.tier = "default" and no_network: true produces an
isolated-network default-tier node.
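The tier table, the per-flag overrides, and the microVM check can be read as one small resolution step. The sketch below is a model built from the table, not the engine’s code: resolvePolicy, PolicyFlags, and BackendInfo are hypothetical names, and letting an override replace any tier flag is an assumption about how overrides combine.

```typescript
type Tier = "default" | "isolated" | "hardened" | "untrusted";

interface PolicyFlags {
  no_network: boolean;
  read_only_root: boolean;
  seccomp: boolean;
  microvm: boolean; // true means a microVM runtime is required
}

// One row per tier, copied from the table above.
const TIER_FLAGS: Record<Tier, PolicyFlags> = {
  default: { no_network: false, read_only_root: false, seccomp: false, microvm: false },
  isolated: { no_network: true, read_only_root: false, seccomp: false, microvm: false },
  hardened: { no_network: false, read_only_root: true, seccomp: true, microvm: false },
  untrusted: { no_network: true, read_only_root: true, seccomp: true, microvm: true },
};

// Hypothetical description of what the active backend offers.
interface BackendInfo {
  hasMicrovm: boolean;
  allowSharedKernel: boolean; // models PERRY_ALLOW_UNTRUSTED_SHARED_KERNEL=1
}

function resolvePolicy(
  tier: Tier,
  overrides: Partial<PolicyFlags>,
  backend: BackendInfo,
): PolicyFlags {
  // Explicit per-flag overrides are applied on top of the tier defaults.
  const flags: PolicyFlags = { ...TIER_FLAGS[tier], ...overrides };
  if (flags.microvm && !backend.hasMicrovm) {
    if (!backend.allowSharedKernel) {
      throw new Error("BackendNotAvailable: tier requires a microVM runtime");
    }
    flags.microvm = false; // explicit opt-out: fall back to a shared kernel
  }
  return flags;
}
```

Under these assumptions, resolvePolicy("default", { no_network: true }, ...) yields the isolated-network default-tier node described above, while "untrusted" on a backend without a microVM runtime throws unless the opt-out is set.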
Defense in depth
Stacking patterns for production:
- Verify images (PERRY_CONTAINER_VERIFY_IMAGES=1).
- Run as non-root (user: "nobody" or numeric UID).
- Drop all capabilities, add specific ones back (cap_drop: ["ALL"] + minimal cap_add).
- Read-only root filesystem (read_only: true).
- Internal networks for the database side (internal: true on the db’s network; see Networking).
- No published ports for private services (omit ports: on internal-only services).
- Resource limits (planned: mem_limit, cpu_limit on Service).
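Several items on this checklist are mechanical enough to check before a stack is brought up. The function below is a hypothetical pre-flight audit, not a Perry API: auditService and the ServiceSpec shape are assumptions (in particular, ports as a string array), and it only covers the items that can be read off a single service spec.

```typescript
// Hypothetical spec shape; field names follow this page, but the ports
// type is an assumption.
interface ServiceSpec {
  image: string;
  user?: string;
  read_only?: boolean;
  privileged?: boolean;
  cap_drop?: string[];
  ports?: string[];
}

// Returns one finding per checklist item the spec does not satisfy.
function auditService(spec: ServiceSpec, internalOnly: boolean): string[] {
  const findings: string[] = [];
  const user = spec.user ?? "";
  if (user === "" || user === "root" || user === "0" || user.indexOf("0:") === 0) {
    findings.push("user is unset or root: set a non-root user");
  }
  if (!(spec.cap_drop ?? []).some((c) => c === "ALL")) {
    findings.push('capabilities not dropped: start from cap_drop: ["ALL"]');
  }
  if (spec.read_only !== true) {
    findings.push("root filesystem is writable: set read_only: true");
  }
  if (spec.privileged === true) {
    findings.push("privileged is set: remove it unless absolutely necessary");
  }
  if (internalOnly && (spec.ports ?? []).length > 0) {
    findings.push("internal-only service publishes ports: omit ports");
  }
  return findings;
}
```

An empty result means the spec passes these five checks; image verification, internal networks, and resource limits still have to be confirmed separately.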
See also
- Compose orchestration — applying these knobs in a stack spec.
- Production patterns — Forgejo example uses several of these (internal-only db net, published web port, USER_UID/GID).
- Networking — internal-only networks for database isolation.