17 changes: 17 additions & 0 deletions .github/workflows/branch-checks.yml
Original file line number Diff line number Diff line change
Expand Up @@ -100,3 +100,20 @@ jobs:

- name: Test
run: mise run test:python

markdown:
name: Markdown
runs-on: build-amd64
container:
image: ghcr.io/nvidia/openshell/ci:latest
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
steps:
- uses: actions/checkout@v4

- name: Install tools
run: mise install

- name: Lint
run: mise run markdown:lint
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -202,3 +202,6 @@ architecture/plans
rfc.md
.worktrees
.z3-trace

# Markdown/mermaid lint tooling deps
scripts/lint-mermaid/node_modules/
31 changes: 31 additions & 0 deletions .markdownlint-cli2.jsonc
@@ -0,0 +1,31 @@
{
"globs": [
"**/*.md",
"**/*.mdx"
],
"gitignore": true,
"ignores": [
".agents/**",
".claude/**",
".opencode/**",
".github/**",
"THIRD-PARTY-NOTICES/**",
"CLAUDE.md"
],
"config": {
"default": true,
// Allow long lines — prose paragraphs are single-line per project style.
"MD013": false,
// Allow GitHub-rendered HTML commonly used in READMEs (centered logos,
// collapsible sections, keyboard hints). Regular prose HTML still flagged.
"MD033": { "allowed_elements": ["p", "img", "br", "a", "div", "details", "summary", "kbd", "sub", "sup"] },
// Allow duplicate headings in different sections.
"MD024": { "siblings_only": true },
// Bare URLs are fine in changelogs and tables.
"MD034": false,
// First line does not need to be a heading.
"MD002": false,
// Repo uses padded table pipes (`| foo | bar |`); rule default is "compact".
"MD060": { "style": "padded" }
}
}
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -1 +1 @@
@AGENTS.md
@AGENTS.md
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
Expand Up @@ -258,7 +258,7 @@ See [docs/CONTRIBUTING.mdx](docs/CONTRIBUTING.mdx) for the current docs authorin

This project uses [Conventional Commits](https://www.conventionalcommits.org/). All commit messages must follow the format:

```
```text
<type>(<scope>): <description>

[optional body]
Expand All @@ -279,7 +279,7 @@ This project uses [Conventional Commits](https://www.conventionalcommits.org/).

**Examples:**

```
```text
feat(cli): add --verbose flag to openshell run
fix(sandbox): handle timeout errors gracefully
docs: update installation instructions
Expand Down
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -1,4 +1,4 @@
## Security
# Security

NVIDIA is dedicated to the security and trust of our software products and services, including all source code repositories managed through our organization.

Expand Down
2 changes: 1 addition & 1 deletion TESTING.md
Expand Up @@ -10,7 +10,7 @@ mise run ci # Everything: lint, compile checks, and tests

## Test Layout

```
```text
crates/*/src/ # Inline #[cfg(test)] modules
crates/*/tests/ # Rust integration tests
python/openshell/ # Python unit tests (*_test.py suffix)
Expand Down
2 changes: 1 addition & 1 deletion architecture/README.md
Expand Up @@ -142,6 +142,7 @@ The connection flow works as follows:
5. The CLI and sandbox exchange SSH traffic bidirectionally through the tunnel.

This design provides several benefits:

- Sandbox pods are never directly accessible from outside the cluster.
- All access is authenticated and auditable through the gateway.
- Session tokens can be revoked to immediately cut off access.
Expand Down Expand Up @@ -198,7 +199,6 @@ The inference routing system transparently intercepts AI inference API calls fro
| Gateway inference service | `crates/openshell-server/src/inference.rs` | Stores cluster inference config, resolves bundles with credentials from provider records |
| Proto definitions | `proto/inference.proto` | `ClusterInferenceConfig`, `ResolvedRoute`, bundle RPCs |


### Container and Build System

The platform produces three container images:
Expand Down
2 changes: 1 addition & 1 deletion architecture/custom-vm-runtime.md
Expand Up @@ -61,7 +61,7 @@ never binds a host-side TCP listener.
`openshell-driver-vm` embeds the VM runtime libraries and the sandbox rootfs as
zstd-compressed byte arrays, extracting on demand:

```
```text
~/.local/share/openshell/vm-runtime/<version>/ # libkrun / libkrunfw / gvproxy
├── libkrun.{dylib,so}
├── libkrunfw.{5.dylib,so.5}
Expand Down
2 changes: 1 addition & 1 deletion architecture/gateway-deploy-connect.md
Expand Up @@ -99,7 +99,7 @@ This stores `auth_mode = "plaintext"`, skips mTLS certificate extraction, and by

All connection artifacts are stored under `$XDG_CONFIG_HOME/openshell/` (default `~/.config/openshell/`):

```
```text
openshell/
active_gateway # plain text: active gateway name
gateways/
Expand Down
12 changes: 7 additions & 5 deletions architecture/gateway-security.md
Expand Up @@ -51,7 +51,7 @@ graph TD

The PKI is a single-tier CA hierarchy generated by the `openshell-bootstrap` crate using `rcgen`. All certificates are created in a single pass at cluster bootstrap time.

```
```text
openshell-ca (Self-signed Root CA, O=openshell, CN=openshell-ca)
├── openshell-server (Leaf cert, CN=openshell-server)
│ SANs: openshell, openshell.openshell.svc,
Expand Down Expand Up @@ -94,7 +94,7 @@ The Helm StatefulSet (`deploy/helm/openshell/templates/statefulset.yaml`) mounts

Environment variables point the gateway binary to these paths:

```
```text
OPENSHELL_TLS_CERT=/etc/openshell-tls/server/tls.crt
OPENSHELL_TLS_KEY=/etc/openshell-tls/server/tls.key
OPENSHELL_TLS_CLIENT_CA=/etc/openshell-tls/client-ca/ca.crt
Expand All @@ -108,7 +108,7 @@ When the gateway creates a sandbox pod (`crates/openshell-server/src/sandbox/mod
- A read-only mount at `/etc/openshell-tls/client/` on the agent container.
- Environment variables for the sandbox gRPC client:

```
```text
OPENSHELL_TLS_CA=/etc/openshell-tls/client/ca.crt
OPENSHELL_TLS_CERT=/etc/openshell-tls/client/tls.crt
OPENSHELL_TLS_KEY=/etc/openshell-tls/client/tls.key
Expand All @@ -119,7 +119,7 @@ OPENSHELL_ENDPOINT=https://openshell.openshell.svc.cluster.local:8080

The CLI's copy of the client certificate bundle is written to:

```
```text
$XDG_CONFIG_HOME/openshell/gateways/<gateway-name>/mtls/
├── ca.crt
├── tls.crt
Expand Down Expand Up @@ -183,7 +183,7 @@ The gateway supports three transport modes:

### Connection Flow

```
```text
TCP accept
→ TLS handshake (mandatory client cert in mTLS mode, optional in dual-auth mode)
→ hyper auto-negotiates HTTP/1.1 or HTTP/2 via ALPN
Expand Down Expand Up @@ -225,10 +225,12 @@ Sandbox pods connect back to the gateway at startup to fetch their policy and pr
| `OPENSHELL_TLS_KEY` | `/etc/openshell-tls/client/tls.key` |

These are used to build a `tonic::transport::ClientTlsConfig` with:

- `ca_certificate()` -- verifies the server's certificate against the cluster CA.
- `identity()` -- presents the shared client certificate for mTLS.

The sandbox calls two RPCs over this authenticated channel:

- `GetSandboxSettings` -- fetches the YAML policy that governs the sandbox's behavior.
- `GetSandboxProviderEnvironment` -- fetches provider credentials as environment variables.
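
As a rough illustration of the client-side setup described above, the sketch below reads the three sandbox variables and builds a TLS context that verifies the server against the cluster CA and presents the client certificate. It uses Python's standard `ssl` module as a stand-in for `tonic::transport::ClientTlsConfig`. The `TlsPaths` type and both function names are hypothetical and do not come from the codebase.

```python
import os
import ssl
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TlsPaths:
    """Hypothetical holder for the three paths read from the sandbox environment."""
    ca: str
    cert: str
    key: str


def load_tls_paths(env: Mapping[str, str] = os.environ) -> TlsPaths:
    """Read the OPENSHELL_TLS_* variables; raises KeyError if one is missing."""
    return TlsPaths(
        ca=env["OPENSHELL_TLS_CA"],
        cert=env["OPENSHELL_TLS_CERT"],
        key=env["OPENSHELL_TLS_KEY"],
    )


def build_client_context(paths: TlsPaths) -> ssl.SSLContext:
    """Python analogue of ca_certificate() plus identity(); the files must exist."""
    # Trust only the cluster CA when verifying the server certificate.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=paths.ca)
    # Present the shared client certificate for mTLS.
    ctx.load_cert_chain(certfile=paths.cert, keyfile=paths.key)
    return ctx
```

`load_tls_paths` can be exercised with a plain dictionary, while `build_client_context` only works where the mounted certificate files are actually present.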

Expand Down
5 changes: 4 additions & 1 deletion architecture/gateway-settings.md
Expand Up @@ -45,6 +45,7 @@ pub const REGISTERED_SETTINGS: &[RegisteredSetting] = &[
The reserved key `policy` is excluded from the registry. It is handled by dedicated policy commands and stored as a hex-encoded protobuf `SandboxPolicy` in the global settings' `Bytes` variant. Attempts to set or delete the `policy` key through settings commands are rejected.

Helper functions:

- `setting_for_key(key)` -- look up a `RegisteredSetting` by name, returns `None` for unknown keys
- `registered_keys_csv()` -- comma-separated list of valid keys for error messages
- `parse_bool_like(raw)` -- flexible bool parsing from CLI string input
Expand Down Expand Up @@ -83,6 +84,7 @@ The `UpdateSettings` RPC multiplexes policy and setting mutations through a sing
| `global` | `bool` | Target gateway-global scope instead of sandbox scope |

Validation rules:

- `policy` and `setting_key` cannot both be present
- At least one of `policy` or `setting_key` must be present
- `delete_setting` cannot be combined with a `policy` payload
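
The three rules visible in this hunk can be sketched as a small check, shown here in Python for brevity. The function name, the parameter types, and the error strings are assumptions for illustration only; the gateway's real validation lives in the Rust server and may cover further rules not shown in this hunk.

```python
from typing import Optional


def validate_update_settings(
    policy: Optional[bytes],
    setting_key: Optional[str],
    delete_setting: bool,
) -> Optional[str]:
    """Return a message for the first violated rule, or None if the request is valid."""
    has_policy = policy is not None
    has_key = bool(setting_key)
    if has_policy and has_key:
        return "policy and setting_key cannot both be present"
    if not has_policy and not has_key:
        return "one of policy or setting_key must be present"
    if delete_setting and has_policy:
        return "delete_setting cannot be combined with a policy payload"
    return None
```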
Expand Down Expand Up @@ -266,7 +268,7 @@ This prevents conflicting values at different scopes. An operator must delete a

When a global policy is set, sandbox-scoped policy updates via `UpdateSettings` are rejected with `FailedPrecondition`:

```
```text
policy is managed globally; delete global policy before sandbox policy update
```

Expand Down Expand Up @@ -442,6 +444,7 @@ openshell policy get --global --full
All `--global` mutations require human-in-the-loop confirmation via an interactive prompt. The `--yes` flag bypasses the prompt for scripted/CI usage. In non-interactive mode (no TTY), `--yes` is required -- otherwise the command fails with an error.
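
The decision described above can be sketched as follows. This is a minimal Python illustration, not the CLI's Rust implementation: `confirm_global_mutation`, its parameters, and the error text are hypothetical, and `ask` stands in for whatever interactive prompt the CLI shows.

```python
from typing import Callable


def confirm_global_mutation(yes: bool, is_tty: bool, ask: Callable[[], bool]) -> bool:
    """Return True if the mutation may proceed, False if the user declined."""
    if yes:
        # --yes bypasses the prompt for scripted/CI usage.
        return True
    if not is_tty:
        # Without a TTY there is nobody to answer the prompt, so fail instead.
        raise RuntimeError("--yes is required in non-interactive mode")
    return ask()
```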

The confirmation message varies:

- **Global setting set**: warns that this will override sandbox-level values for the key
- **Global setting delete**: warns that this re-enables sandbox-level management
- **Global policy set**: warns that this overrides all sandbox policies
Expand Down
18 changes: 10 additions & 8 deletions architecture/gateway-single-node.md
Expand Up @@ -168,7 +168,7 @@ For the target daemon (local or remote):
- k3s server command: `server --disable=traefik --tls-san=127.0.0.1 --tls-san=localhost --tls-san=host.docker.internal` plus computed extra SANs.
- Privileged mode.
- Volume bind mount: `openshell-cluster-{name}:/var/lib/rancher/k3s`.
- Network: `openshell-cluster-{name}` (per-gateway bridge network).
- Network: `openshell-cluster-{name}` (per-gateway bridge network).
- Extra host: `host.docker.internal:host-gateway`.
- The cluster entrypoint prefers the resolved IPv4 for `host.docker.internal` when populating sandbox pod `hostAliases`, then falls back to the container default gateway. This keeps sandbox host aliases working on Docker Desktop, where the host-reachable IP differs from the bridge gateway.
- Port mappings:
Expand Down Expand Up @@ -230,7 +230,7 @@ After deploy, the CLI calls `save_active_gateway(name)`, writing the gateway nam

The cluster image is defined by target `cluster` in `deploy/docker/Dockerfile.images`:

```
```text
Base: rancher/k3s:v1.35.2-k3s1
```

Expand All @@ -242,6 +242,7 @@ Layers added:
4. Kubernetes manifests: `deploy/kube/manifests/*.yaml` -> `/opt/openshell/manifests/`

Bundled manifests include:

- `openshell-helmchart.yaml` (OpenShell Helm chart auto-deploy)
- `envoy-gateway-helmchart.yaml` (Envoy Gateway for Gateway API)
- `agent-sandbox.yaml`
Expand Down Expand Up @@ -363,9 +364,10 @@ flowchart LR
4. Force-remove the per-gateway network via `force_remove_network()`, disconnecting any stale endpoints first.

**CLI layer** (`gateway_destroy()` in `run.rs` additionally):

6. Remove the metadata JSON file via `remove_gateway_metadata()`.
7. Clear the active gateway reference if it matches the destroyed gateway.
<!-- markdownlint-disable MD029 -->
5. Remove the metadata JSON file via `remove_gateway_metadata()`.
6. Clear the active gateway reference if it matches the destroyed gateway.
<!-- markdownlint-enable MD029 -->

## Idempotency and Error Behavior

Expand All @@ -378,8 +380,8 @@ flowchart LR
- Docker API failures from inspect/create/start/remove.
- SSH connection failures when creating the remote Docker client.
- Health check timeout (6 min) with recent container logs.
- Container exit during any polling phase (health, mTLS) with diagnostic information (exit code, OOM status, recent logs).
- mTLS secret polling timeout (3 min).
- Container exit during any polling phase (health, mTLS) with diagnostic information (exit code, OOM status, recent logs).
- mTLS secret polling timeout (3 min).
- Local image ref without registry prefix: clear error with build instructions rather than a failed Docker Hub pull.

## Auto-Bootstrap from `sandbox create`
Expand Down Expand Up @@ -436,7 +438,7 @@ Environment variables that affect bootstrap behavior when set on the host:

Artifacts stored under `$XDG_CONFIG_HOME/openshell/` (default `~/.config/openshell/`):

```
```text
openshell/
active_gateway # plain text: active gateway name
gateways/
Expand Down
1 change: 1 addition & 0 deletions architecture/gateway.md
Expand Up @@ -510,6 +510,7 @@ graph LR
```

All buses use `tokio::sync::broadcast` channels keyed by sandbox ID. Buffer sizes:

- `SandboxWatchBus`: 128 (signals only, no payload -- just `()`)
- `TracingLogBus`: 1024 (full `SandboxStreamEvent` payloads)
- `PlatformEventBus`: 1024 (full `SandboxStreamEvent` payloads)
Expand Down
2 changes: 1 addition & 1 deletion architecture/inference-routing.md
Expand Up @@ -214,7 +214,7 @@ This eliminates full-body buffering for streaming responses (SSE). Time-to-first

When the proxy truncates a streaming response, it injects an SSE error event via `format_sse_error()` (in `crates/openshell-sandbox/src/l7/inference.rs`) before sending the HTTP chunked terminator:

```
```text
data: {"error":{"message":"<reason>","type":"proxy_stream_error"}}
```
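
For readers unfamiliar with the wire format, here is a minimal Python sketch that produces the event shown above. It mirrors the name of the Rust `format_sse_error()` helper but is not that implementation. The trailing blank line is an assumption based on the Server-Sent Events format, where an empty line ends an event.

```python
import json


def format_sse_error(reason: str) -> str:
    """Build a single SSE `data:` event carrying a proxy stream error."""
    payload = {"error": {"message": reason, "type": "proxy_stream_error"}}
    # Compact separators match the spacing-free JSON shown in the example above.
    body = json.dumps(payload, separators=(",", ":"))
    # An empty line terminates the event (assumed; see the lead-in).
    return f"data: {body}\n\n"
```

Serializing with `json.dumps` rather than string formatting means quotes or newlines in `reason` are escaped and cannot break the event framing.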

Expand Down
1 change: 1 addition & 0 deletions architecture/policy-advisor.md
Expand Up @@ -208,6 +208,7 @@ The TUI sandbox screen includes a "Network Rules" panel accessible via `[r]` fro
- Expanded detail popup with full binary path, rationale, security notes, and proposed rule

Keybindings are state-aware:

- **Pending** → `[a]` approve, `[x]` reject, `[A]` approve all
- **Approved** → `[x]` revoke
- **Rejected** → `[a]` approve
Expand Down
2 changes: 1 addition & 1 deletion architecture/sandbox-connect.md
Expand Up @@ -178,7 +178,7 @@ sequenceDiagram
CLI->>GW: CreateSshSession(sandbox_id)
GW-->>CLI: token, gateway_host, gateway_port, scheme, connect_path

Note over CLI: Builds ProxyCommand string; exec()s ssh
Note over CLI: Builds ProxyCommand string: exec()s ssh

User->>CLI: ssh spawns ssh-proxy subprocess
CLI->>GW: CONNECT /connect/ssh<br/>X-Sandbox-Id, X-Sandbox-Token
Expand Down
2 changes: 1 addition & 1 deletion architecture/sandbox-providers.md
Expand Up @@ -333,7 +333,7 @@ the full implementation details, encoding rules, and security properties.

### End-to-End Flow

```
```text
CLI: openshell sandbox create -- claude
|
+-- detect_provider_from_command(["claude"]) -> "claude"
Expand Down