Trusted-peer, membership-gated, crash-safe, write-authorized KB sync + attention bus (ADR-017/020/022/023/024)#69

Draft
cuttlefisch wants to merge 193 commits into main from feat/crdt-collab-validation

cuttlefisch (Owner) commented Jun 22, 2026

Ready for review. The full T1–T7 matrix, Step 8 (B-19 epoch fence), and
Step 9 (ADR-024 notification bus + the B-20→B-23 modal/security arc) are GREEN
across two real machines, MCP-driven. The client is assumed hostile; all write
enforcement is daemon-side. Remaining before merge is housekeeping only (crdt_doc
flush-on-write + tracked non-security follow-ups), listed at the bottom.

Brings MAE's collaborative editing from "text buffers over a trusted LAN" to
trusted-peer, membership-gated, crash-safe, write-authorized replicated knowledge
bases, with a first-class attention/notification surface for the resolution UX,
validated across two real machines. The substance is four arcs:

1. Trusted-peer collaboration security core (ADR-017)

  • mTLS-as-identity: each peer is an Ed25519 self-signed cert; the daemon checks
    the client cert ∈ authorized keys, the editor TOFU-pins the daemon (known_hosts).
    shared/mcp/src/tls.rs (rustls/ring), ClientTransport::{Plain,KeyJson,KeyTls}.
  • Per-KB membership (ADR-018): kb/join + kb/node_update gated on creator-or-member;
    owner-only kb/add_member/remove_member/approve. Strict identity binding — an
    authenticated peer's label/saved_by is its verified identity, not self-claimed.
  • Interactive TOFU first-connect UX; PSK + key modes. Validated by
    collab-mtls-e2e.sh + collab-membership-e2e.sh (both in CI).
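
The membership gate described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's code: `KbCollection`, `Denied`, and the method names are hypothetical stand-ins, and the real collection state lives in a CRDT document rather than a plain set.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the per-KB collection state (assumption, not MAE's type).
struct KbCollection {
    creator: String,
    members: BTreeSet<String>,
}

#[derive(Debug, PartialEq)]
enum Denied {
    NotAMember,
    NotTheOwner,
}

impl KbCollection {
    fn new(creator: &str) -> Self {
        let mut members = BTreeSet::new();
        members.insert(creator.to_string());
        Self { creator: creator.to_string(), members }
    }

    /// Gate for kb/join and kb/node_update: creator-or-member.
    /// `peer` is the verified identity; `None` models an anonymous (psk/none)
    /// session, which keeps connection-level trust.
    fn check_access(&self, peer: Option<&str>) -> Result<(), Denied> {
        match peer {
            None => Ok(()),
            Some(label) if label == self.creator || self.members.contains(label) => Ok(()),
            Some(_) => Err(Denied::NotAMember),
        }
    }

    /// Owner-only: only the creator may add a member.
    fn add_member(&mut self, caller: &str, member: &str) -> Result<(), Denied> {
        if caller != self.creator {
            return Err(Denied::NotTheOwner);
        }
        self.members.insert(member.to_string());
        Ok(())
    }

    /// Owner-only: only the creator may remove a member.
    fn remove_member(&mut self, caller: &str, member: &str) -> Result<(), Denied> {
        if caller != self.creator {
            return Err(Denied::NotTheOwner);
        }
        self.members.remove(member);
        Ok(())
    }
}
```

In this sketch the caller label is always the verified identity taken from the session, never a value read from the request body, which is the point of strict identity binding.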

2. Crash-safe convergent KB sync (ADR-020 → ADR-022)

Replicated KB nodes as per-node yrs CRDTs through the daemon hub. Live two-machine
testing drove this from broken to green and surfaced a chain of bugs no test caught
because the tests used stand-in values / hand-rolled serialization the production path
never produced:

  • B-8 kb/node_update emitted without an id → daemon dropped it as a notification.
    Fixed by a single shared wire builder (mae_sync::wire) used by editor + daemon + tests.
  • B-12 owner re-share clobbered daemon-side membership · B-13 joiner never live-subscribed
    · B-14/B-15 divergent same-id lineages / ignored field edits · B-16 hardcoded
    client_id=1 collision · B-17 derive_kb_client_id returned a full u64 but yrs
    ClientID is 53-bit · B-18 node tags (a yrs YArray) did not CRDT-sync (only
    title/body did) — added KbNodeDoc::set_tags + wired through emit.
  • ADR-022, the crash-safety mechanism: (re)join does a bidirectional state-vector
    reconcile (KnowledgeBase::reconcile_remote_node) instead of a blind full-snapshot adopt.
    A durable-but-unsynced edit is re-derived from the durable crdt_doc on reconnect —
    independent of the pending-queue row surviving a crash. Never replaces an existing node.
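
To make the bidirectional state-vector reconcile concrete, here is a toy model under stated assumptions: a replica is just a per-client append-only op log, and a state vector is the log length per client. `Replica`, `diff_since`, and `reconcile` are hypothetical names for illustration; the real path goes through yrs documents and `KnowledgeBase::reconcile_remote_node`.

```rust
use std::collections::BTreeMap;

type ClientId = u64;
/// client -> number of ops already seen from that client
type StateVector = BTreeMap<ClientId, usize>;

/// Toy replica: one append-only op log per authoring client (assumption).
#[derive(Clone, Debug, Default, PartialEq)]
struct Replica {
    ops: BTreeMap<ClientId, Vec<String>>,
}

impl Replica {
    fn author(&mut self, client: ClientId, op: &str) {
        self.ops.entry(client).or_default().push(op.to_string());
    }

    fn state_vector(&self) -> StateVector {
        self.ops.iter().map(|(c, log)| (*c, log.len())).collect()
    }

    /// Ops this replica holds that a peer with state vector `remote` lacks.
    fn diff_since(&self, remote: &StateVector) -> Vec<(ClientId, usize, String)> {
        let mut out = Vec::new();
        for (client, log) in &self.ops {
            let seen = remote.get(client).copied().unwrap_or(0);
            for (clock, op) in log.iter().enumerate().skip(seen) {
                out.push((*client, clock, op.clone()));
            }
        }
        out
    }

    /// Append missing ops; anything already present is skipped, nothing is replaced.
    fn apply(&mut self, diff: Vec<(ClientId, usize, String)>) {
        for (client, clock, op) in diff {
            let log = self.ops.entry(client).or_default();
            if clock == log.len() {
                log.push(op);
            }
        }
    }
}

/// Each side sends only what the other is missing, in both directions.
fn reconcile(local: &mut Replica, remote: &mut Replica) {
    let to_local = remote.diff_since(&local.state_vector());
    let to_remote = local.diff_since(&remote.state_vector());
    local.apply(to_local);
    remote.apply(to_remote);
}
```

Because the outgoing diff is derived from the replica's own durable state, an edit that was written locally but never sent is recovered on reconnect without any separate pending-queue record.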

3. Write-authorization: epoch-fenced rebase (ADR-023, B-19 + B-20) — security

Reasoning through the live T7 role test surfaced B-19: the daemon gated writes on the
member's current role but merged opaque, client-authored CRDT updates with no
per-op attribution. So a viewer's locally-applied-but-denied edits stayed local-ahead and
would silently cascade to everyone once they were later granted editor — deferred
privilege escalation. MAE is open-source ⇒ the client is assumed hostile, so client-side
revert is theatre; enforcement is daemon-side.

  • Mechanism: a per-member authorization epoch on the collection doc (daemon-authored
    ⇒ unforgeable), bumped when an existing member's role changes; the KB client_id is
    epoch-rotated (derive_kb_client_id(fp, epoch)); the daemon decodes each update and
    rejects any op authored under a stale-epoch client_id (rebase required). A
    continuously-authorized editor's epoch is stable ⇒ full CRDT merge + offline preserved
    (no T4/T5 regression).
  • B-20 (found live in Step 9c, fixed): the fence attributed new ops via
    yrs::Update::state_vector(), which omits a contiguous-clock continuation of a client
    already in the canonical base. A member demoted→re-promoted (whose editor kept authoring
    under a still-canonical client) could append a post-demotion edit that slipped the fence —
    a real bypass of the B-19 guarantee on the demote→re-promote path. Fixed by attributing ops
    via apply-and-diff against the authoritative node state (catches continuations), unioned
    with the legacy signal so divergent lineages stay caught. Daemon + unit regressions, both red
    pre-fix; validated live (9c): the stale continuation now fences, no cascade.
  • Server-authoritative, chosen over capability-signed ops (a malicious client backdates the
    grant-stamp — only a causal-hash DAG defeats that, deferred) and over re-stamping (LWW) /
    hosted-edit (no offline). Adversarial exploit-path review in docs/adr/023-*.md.
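
A rough sketch of the two ideas in this arc, epoch-rotated client ids and apply-and-diff attribution, under loud assumptions: the FNV-1a hash is a stand-in (the PR does not say how `derive_kb_client_id` hashes), the canonical node state is reduced to a per-client clock map, and `Op`, `FenceError`, and `fence_and_apply` are hypothetical names. It also assumes an update may only advance the sender's own current-epoch client.

```rust
use std::collections::BTreeMap;

/// yrs client ids must fit in 53 bits (B-17), so the hash is masked.
const CLIENT_ID_MASK: u64 = (1 << 53) - 1;

/// Stand-in derivation: FNV-1a over fingerprint + epoch (the hash choice is an assumption).
fn derive_kb_client_id(fingerprint: &str, epoch: u64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in fingerprint.bytes().chain(epoch.to_le_bytes()) {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h & CLIENT_ID_MASK
}

/// client_id -> next expected clock in the authoritative node state
type StateVector = BTreeMap<u64, u64>;

struct Op {
    client_id: u64,
    clock: u64,
}

#[derive(Debug, PartialEq)]
enum FenceError {
    RebaseRequired { stale_client: u64 },
}

/// Apply the update to a scratch copy, diff against the authoritative state,
/// and reject if any client other than the sender's current-epoch id advanced.
fn fence_and_apply(
    canonical: &mut StateVector,
    update: &[Op],
    fingerprint: &str,
    current_epoch: u64,
) -> Result<(), FenceError> {
    let allowed = derive_kb_client_id(fingerprint, current_epoch);
    let mut scratch = canonical.clone();
    for op in update {
        let next = scratch.entry(op.client_id).or_insert(0);
        // Only contiguous ops apply in this toy model; gaps are ignored.
        if op.clock == *next {
            *next += 1;
        }
    }
    for (client, after) in &scratch {
        let before = canonical.get(client).copied().unwrap_or(0);
        if *after > before && *client != allowed {
            return Err(FenceError::RebaseRequired { stale_client: *client });
        }
    }
    *canonical = scratch; // commit only after the whole update passes the fence
    Ok(())
}
```

Diffing after a trial apply is what catches the B-20 shape: a continuation under a client id that is already in the canonical base still shows up as an advanced clock, even though it introduces no new client.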

4. Attention/notification bus + the resolution UX (ADR-024) — and its hardening

The B-19 fence needs a user-facing resolution path (a fenced editor must learn their edit
was rejected and adopt/re-author), and the only surfaces were a clobberable status line and a
buried *Messages* log. ADR-024 adds a real attention bus + the host-key TOFU modal it
generalizes:

  • NotificationCenter (crates/core/src/notifications.rs): severity→surface routing
    (OptionRegistry-backed, Scheme-accessible), dedup-by-key, a non-clobberable mode-line
    attention badge, and a magit-style *Notifications* buffer.
  • Collab resolution round-trip: kb/node_fetch RPC + async adopt-and-re-author
    (R1, fixes the "fenced editor is stuck" gap); a fenced edit raises an ActionRequired
    notification with Accept-remote / Keep-mine / Stash actions (R2); MCP notifications_list
    + notify_resolve {id, action} for headless/agent parity (R3); no silent overwrite of
    divergent local work on (re)join (R5).
  • Generalized blocking modal (R4): the bespoke host-key prompt becomes one consumer of a
    generic BlockingReply modal — answerable by keypress or bus action.
  • Live-hardening found by driving the TOFU modal on two machines (Step 9d):
    • B-21 — runtime :set collab-host-key-policy wasn't honored (the verifier was built once
      at task setup and cached). Now reads a live policy cell, honored on the next connect.
    • B-22a/b/c — the GUI TOFU modal didn't render (a single-threaded bridge runtime was
      starved by the synchronous host-key wait → multi-thread pool; and the render pass
      skipped the overlay), didn't capture input (routed only in command-palette mode + an
      AI-input-lock stole Esc), and wasn't answerable over the bus (added NotifCommand::Reply
      Accept/Reject actions).
    • B-23 — the modal didn't size to content, clipping the host-key fingerprint (which
      must be fully readable for the out-of-band trust compare). Fixed with content-adaptive
      sizing + wrapping.
  • Architectural through-line: B-22a and B-23 were the same shape — overlay-priority and
    dialog-geometry logic duplicated per backend (GUI vs TUI) that had drifted. Both are now
    single shared computations in render_commonoverlay::active_overlay() and
    dialog::mini_dialog_layout() — each unit-tested, so that whole "the two backends diverge"
    class of bug is structurally closed.
  • Also: a required/core module tier (required = true manifest flag) so cross-cutting
    modules like notifications (whose buffers can be raised by background events) auto-enable
    regardless of the (mae!) block — Doom's core/ analog.
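
As an illustration of severity→surface routing with dedup-by-key, here is a minimal sketch. The type shapes and the specific routing table are assumptions made for the example; the actual `NotificationCenter` in crates/core/src/notifications.rs is OptionRegistry-backed and may look quite different.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Severity {
    Info,
    Warning,
    ActionRequired,
}

/// Where a notification is shown (surface names are illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Surface {
    MessagesLog,
    NotificationsBuffer,
    ModeLineBadge,
}

#[derive(Debug)]
struct Notification {
    key: String,
    severity: Severity,
    text: String,
    count: u32,
}

struct NotificationCenter {
    routes: HashMap<Severity, Surface>,
    items: Vec<Notification>,
}

impl NotificationCenter {
    fn new() -> Self {
        // Assumed default routing; in the real editor this would be configurable.
        let routes = HashMap::from([
            (Severity::Info, Surface::MessagesLog),
            (Severity::Warning, Surface::NotificationsBuffer),
            (Severity::ActionRequired, Surface::ModeLineBadge),
        ]);
        Self { routes, items: Vec::new() }
    }

    /// Raise a notification. A repeat with the same key updates the existing
    /// entry instead of stacking a duplicate.
    fn notify(&mut self, key: &str, severity: Severity, text: &str) -> Surface {
        if let Some(i) = self.items.iter().position(|n| n.key == key) {
            let existing = &mut self.items[i];
            existing.count += 1;
            existing.severity = severity;
            existing.text = text.to_string();
        } else {
            self.items.push(Notification {
                key: key.to_string(),
                severity,
                text: text.to_string(),
                count: 1,
            });
        }
        self.routes[&severity]
    }

    /// Number shown on the attention badge: unresolved ActionRequired items.
    fn badge_count(&self) -> usize {
        self.items.iter().filter(|n| n.severity == Severity::ActionRequired).count()
    }

    /// Resolve (remove) a notification by key; returns whether it existed.
    fn resolve(&mut self, key: &str) -> bool {
        let before = self.items.len();
        self.items.retain(|n| n.key != key);
        self.items.len() != before
    }
}
```

Keying on something stable such as the node id plus the event kind keeps a repeatedly fenced node from flooding the buffer while the badge still reflects that something needs attention.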

Live validation (two machines, MCP-driven) — GREEN

  • T1–T7 (membership/restart/offline-merge/kill -9 stress/concurrent-edit/WAL-recovery/role
    enforcement) — all PASS, both directions corroborated.
  • Step 8 (B-19 epoch fence): viewer-era edits denied → promote → re-push fenced; daemon
    viewer_era_edits_do_not_cascade_on_grant e2e (red without the fence).
  • Step 9 (ADR-024 + B-20): 9a/9b fence-notification + Keep-mine converge; 9c
    the B-20 continuation now fences (no cascade, canonical unchanged throughout — proven from the
    daemon WAL); resolution coverage complete (Keep-mine + Accept-remote).
  • Step 9d (TOFU/R4): reject = fast abort + no pin, accept = auth + join + pin, with
    the full fingerprint visible — through a modal that renders (B-22a), captures input (B-22b),
    sizes to content (B-23), and is bus-answerable (B-22c). FULL PASS.

Test rigor

  • N-peer editor-logic harness (crates/core/tests/kb_sync_n_peer_e2e.rs, N∈{2,3,5}) driving the
    real CRDT path with production-derived client_ids — caught B-17 on its first run.
  • Real-daemon SV-reconcile + role + B-19/B-20 e2e; kb_node_tags_round_trip (B-18) +
    kb_node_update_survives_daemon_restart (T6) production-protocol e2e. New unit suites for the
    overlay-priority resolver, the dialog-layout (fingerprint-not-clipped + narrow-screen wrap),
    the host-key live-policy cell, and the bus reply action. Methodology in
    docs/collab-kb-sync-testing-lessons.md.
  • Fixed a config-precedence bug (env/CLI now override init.scm) found during the live run.

ADRs / docs

ADR-017 (trusted-peer auth), ADR-020 (replicated KB CRDT), ADR-021 (membership/policy compliance
direction), ADR-022 (crash-safe convergent sync), ADR-023 (secure write-access — epoch-fenced
rebase), ADR-024 (notification/attention bus). Two-machine procedures + the full live log in
docs/collab-testing-plan.md (Step 8 = B-19, Step 9 = ADR-024/B-20→B-23) and
docs/collab-test-notes-bob.md.

Still to land before merge / tracked follow-ups (non-security)

  • crdt_doc flush-on-write (durability hardening) · daemon SQLite WAL power-loss durability.
  • Broadcaster→consumer sweep: migrate the remaining clobberable set_status/*Messages* callers
    onto the bus.
  • Deferred security hardening (documented, not blocking): unpredictable daemon-issued epoch token
    (pre-rotation attack); monotonic epoch across remove/re-add; ADR-021 append-only audit log.

Update — testing-gap closure + non-UX fixes + event-driven triggers (pre-UX pass)

Closes the automation gaps and non-UX issues surfaced by the live two-machine run, before the
planned KB-sharing UX review. Every item ships with a RED-before/GREEN-after guard (CLAUDE.md #9).

Automated the manual tests (Arc A): daemon fence no-cascade oracle (canonical node stays
byte-identical across a fenced push); editor notify-resolution unit test (3 actions; Keep-mine
records pending_reauthor, Accept-remote doesn't); collab_bridge KbNodeAdopted round-trip
(keep-mine re-authors over authoritative / accept-remote discards); real-daemon two-peer concurrent
convergence (byte-identical merge over TCP); MITM changed-host-key rejection without overwriting
the pin + unauthorized-peer scenario in collab-mtls-e2e.sh.

Non-UX fixes (Arc B): split-window mouse-click coordinates fixed in the shared
handle_mouse_click_inner (both GUI fallback and TUI passed absolute screen coords) via a pure
window_relative layout-origin translation; CozoKbStore::load_all degrades a query-bind failure
to Ok(empty) instead of an Err that aborted kb_join and tripped the 10s main-thread stall
watchdog (B-5). B-2/B-3/B-6 verified already-correct + locked with regression tests (config-key
kebab-alias invariant, joined-instance surfacing, primary-KB-store XDG-first contract).

Event-driven triggers (Arc C):

  • C1 (security-gated): the editor now relearns its KB authorization epoch from a live kbc:
    membership broadcast (previously ignored as an "unknown buffer"), so a promote/demote takes
    effect with no manual reconnect. A local CRDT replica of the collection doc
    (CollabState.kb_collection_state) is applied as a delta and epoch_of(fingerprint) re-derived.
    The daemon remains the sole authority (re-derives each member's epoch from its own collection when
    fencing), so a tampered replica can only mislead the client about its own epoch — never
    self-elevate. No-weakening gate: the daemon viewer_era_* / stale_epoch_continuation_* fence
    tests stay GREEN.
  • C2 connect-critical config (server address, auth mode/PSK) verified read-live; C3 embeds
    the git build SHA (build.rs → MAE_BUILD_SHA) in the editor + daemon startup log, --version,
    and $/debug, and collab-doctor warns on an editor↔daemon build mismatch.

A3 (live two-editor fence-resolve e2e) — documented, not fabricated: C1 removes the
deterministic online fence trigger (honest clients now relearn and aren't fenced), and there is no
validated scheme recipe for editing a shared KB node to force a fenced update — so rather than ship
an unverifiable e2e, docs/collab-testing-plan.md gains an automated-coverage map (each manual flow →
its guarding test) and flags the residual full-sequence run as Tier-2 manual (deterministic trigger =
the offline edit). Its constituent pieces are all unit-covered.

Gates: cargo fmt + clippy -D warnings clean (both workspaces); mae-core 2292, mae-kb 212,
mae-mcp 127, mae bins 283, daemon 152, n-peer e2e 12 — all green.

🤖 Generated with Claude Code

cuttlefisch and others added 30 commits June 15, 2026 18:16
The 0.13.11 and 0.13.12 version bumps updated Cargo.toml but not the
workspace member versions in Cargo.lock (earlier bumps had explicit
'sync Cargo.lock' chores; these two missed it). A plain cargo build
regenerates these, dirtying the tree — sync them once so both dev
machines start from a clean working tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add shared/mcp/src/keystore.rs: a permission-guarded trusted_keys file
(default $XDG_DATA_HOME/mae/collab/trusted_keys, 0600) holding symmetric
PSKs out of config.toml. Format: '[name] <secret>' per line, # comments.
Both editor and daemon read it via mae-mcp so path + format live in one place.

Extend PskAuth to be multi-key on the server side: it can trust a SET of
named keys (a keystore) and select the one a client advertises via a new
optional key_id in the auth hello. Backward compatible — unnamed clients
use the server's default (first) key; serde ignores the absent/extra field
so old and new peers interoperate. Proof verification now uses constant-time
Mac::verify_slice instead of string compare.

Foundation only; daemon + editor wiring follow. mae-mcp: 100 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
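
A possible parser for the keystore line format described in the commit above, as a sketch only. It assumes the brackets in '[name] <secret>' mean the name is optional and that fields are whitespace-separated; `KeyEntry` and `parse_trusted_keys` are hypothetical names, and the 0600 permission check is left out.

```rust
/// One keystore line: an optional name plus the secret (shape is an assumption).
#[derive(Debug, PartialEq)]
struct KeyEntry {
    name: Option<String>,
    secret: String,
}

/// Parse keystore contents: one entry per line, blank lines and `#` comments skipped.
fn parse_trusted_keys(contents: &str) -> Vec<KeyEntry> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let first = parts.next()?;
            Some(match parts.next() {
                // Two fields: "<name> <secret>"
                Some(secret) => KeyEntry {
                    name: Some(first.to_string()),
                    secret: secret.to_string(),
                },
                // One field: an unnamed secret
                None => KeyEntry { name: None, secret: first.to_string() },
            })
        })
        .collect()
}
```

With this shape, the first entry can serve as the default key for clients that send no key_id, which matches the backward-compatibility rule in the commit message.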
Daemon: in psk mode, build the trusted set from the keystore (every entry
is a peer credential) plus legacy psk/psk_command (one unnamed key), and
construct a single shared multi-key PskAuth. Add 'mae-daemon keygen [name]'
(random 0600 key, printed for copying to peers) and 'mae-daemon keys'
(names + fingerprints, never secrets). check-config/doctor now report the
keystore path + key count and warn on loose perms.

Editor: resolve the client credential via resolve_client_credential() —
precedence psk_command > psk > keystore primary key — and advertise the
key's name as the wire key_id so the daemon selects it. Pure resolver is
unit-tested; the keystore lookup no longer makes the empty-psk test flaky.

Verified end-to-end: a client with only a keystore key connects to a psk
daemon — 'PSK auth succeeded key=client-cli'. Closes the gap where a PSK
could only come from config.toml (which is being retired). mae-mcp 100,
collab_bridge 84, daemon config tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
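
The credential precedence from the commit above (psk_command > psk > keystore primary key) could be expressed as a pure function along these lines. The signature and the `Credential` type are assumptions for illustration, as is treating an empty string as "unset"; only the ordering itself comes from the commit message.

```rust
#[derive(Debug, PartialEq)]
struct Credential {
    secret: String,
    /// Advertised on the wire so the daemon can select the matching key.
    key_id: Option<String>,
}

/// Treat empty or whitespace-only values as unset (assumption).
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|s| !s.is_empty())
}

/// Pure resolver: no file or process I/O, so it is easy to unit-test.
/// `keystore_primary` is `(name, secret)` of the keystore's primary key.
fn resolve_client_credential(
    psk_command_output: Option<&str>,
    psk: Option<&str>,
    keystore_primary: Option<(&str, &str)>,
) -> Option<Credential> {
    if let Some(secret) = non_empty(psk_command_output) {
        return Some(Credential { secret: secret.to_string(), key_id: None });
    }
    if let Some(secret) = non_empty(psk) {
        return Some(Credential { secret: secret.to_string(), key_id: None });
    }
    keystore_primary.map(|(name, secret)| Credential {
        secret: secret.to_string(),
        key_id: Some(name.to_string()),
    })
}
```

Keeping the I/O (running psk_command, reading the keystore) outside the resolver is what lets the precedence be tested without touching the filesystem.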
Design for an asymmetric collab auth mode ('key') alongside none/psk:
Ed25519 keypairs, known_hosts (client pins daemon) + authorized_keys
(daemon trusts clients), mutual signed-challenge handshake, client TOFU
policy (prompt/accept-new/strict), daemon pending-approval + admin CLI
(identity/authorized/pending/authorize/revoke). Enables trust-on-first-use
and per-peer revocation without shared-secret rotation. Symmetric keystore
(this branch) remains as 'psk' mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ADR-017 phase 1)

shared/mcp/src/identity.rs: Ed25519 Identity (load_or_generate, 0600 private
key), PublicKey (base64 wire form, SSH-style 'mae-ed25519 <b64> <label>' lines,
SHA256: fingerprints), KnownHosts (client pins daemon keys), AuthorizedKeys
(daemon trusts client keys, add/authorize/revoke), and a HostKeyVerifier
abstraction with a known_hosts-backed FileHostKeyVerifier implementing the
accept-new / strict / prompt TOFU policies (pins on first use, aborts on a
changed host key).

shared/mcp/src/auth.rs: KeyAuth AuthProvider — a mutual signed-challenge
handshake binding both pubkeys + nonces into a domain-separated transcript.
Server verifies the client signature and checks authorized_keys; client
verifies the server signature and applies the host-key policy before proving
its own key. Adds ed25519-dalek + base64 deps.

Crypto core only; daemon + editor wiring + TOFU UI follow. mae-mcp: 113 tests
pass (13 new for identity/keyauth), clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
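
The three TOFU policies named in the commit above can be pictured as a small decision table. This is a sketch with an in-memory pin map; `KnownHosts` here is a simplified stand-in for the file-backed store, and `Verdict` is a hypothetical type.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum HostKeyPolicy {
    Prompt,
    AcceptNew,
    Strict,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Trusted,
    PinnedOnFirstUse,
    AskUser,
    Abort(&'static str),
}

/// In-memory stand-in for known_hosts: host -> pinned fingerprint.
#[derive(Default)]
struct KnownHosts {
    pins: HashMap<String, String>,
}

impl KnownHosts {
    fn verify(&mut self, host: &str, fingerprint: &str, policy: HostKeyPolicy) -> Verdict {
        match self.pins.get(host).cloned() {
            // A matching pin is accepted silently under every policy.
            Some(pinned) if pinned == fingerprint => Verdict::Trusted,
            // A changed key always aborts: possible MITM.
            Some(_) => Verdict::Abort("host key changed"),
            None => match policy {
                HostKeyPolicy::AcceptNew => {
                    self.pins.insert(host.to_string(), fingerprint.to_string());
                    Verdict::PinnedOnFirstUse
                }
                HostKeyPolicy::Strict => Verdict::Abort("unknown host key"),
                HostKeyPolicy::Prompt => Verdict::AskUser,
            },
        }
    }
}
```

Note that the policy only matters for an unknown host; once a key is pinned, a mismatch aborts regardless of policy.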
Add CollabAuth enum (None/Psk/Key); in 'key' mode the daemon loads/generates
its Ed25519 identity and an authorized_keys trust store, and runs KeyAuth::server
per connection. check_collab accepts 'key' and flags an empty authorized_keys.

New admin CLI:
  mae-daemon identity              show the daemon pubkey line + fingerprint
  mae-daemon authorized            list trusted client keys (label + fingerprint)
  mae-daemon authorize <pubkey>    add a client pubkey line to authorized_keys
  mae-daemon revoke <label>        remove client key(s) by label

check-config/doctor report the identity fingerprint + authorized key count.
Verified: identity → authorize → check-config OK. clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…phase 0)

Add shared/mcp/src/tls.rs: mutual TLS where each peer presents a self-signed
X.509 cert whose SPKI is its existing Ed25519 Identity key. TLS 1.3 gives
confidentiality + proof-of-possession; peer trust moves into custom verifiers:
the daemon checks the client cert's pubkey against AuthorizedKeys, the editor
TOFU-pins the daemon cert's pubkey via HostKeyVerifier. This unifies encryption
+ mutual auth + pinning on the identities we already manage, superseding the
JSON KeyAuth handshake on the TLS path.

- ring crypto backend with an explicit CryptoProvider (avoids clashing with the
  editor's reqwest aws-lc-rs default). Daemon gains rustls for the first time and
  builds cleanly (ring only, no aws-lc-rs/cmake conflict with cozo).
- ed25519_pubkey_from_cert (x509-parser, OID 1.3.101.112) is the trust-critical
  extraction — round-trip tested against our own cert.
- PeerIdentity {label,fingerprint,pubkey} added to identity.rs (authoritative
  identity for strict binding); Identity::pkcs8_der; Debug on HostKeyVerifier.

mae-mcp: 119 tests (6 new incl. full in-process mTLS handshake: authorized
client succeeds + identity recovered, unauthorized rejected, untrusted host
rejected). clippy clean; both workspaces build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…phase 1)

Daemon: add CollabAuth::KeyTls (built for mode=key + tls=true, default).
The accept loop wraps the whole TcpStream with the rustls TlsAcceptor (not
pre-split), recovers the verified PeerIdentity via peer_identity_from_tls, then
splits the TlsStream and runs the session. Plaintext psk/legacy-key/none paths
unchanged. AuthConfig gains tls: bool (default true); check-config shows it.

Session plumbing: ClientSession gains peer_identity + with_identity() +
authenticated_label(); collab_handler refactored so handle_client (anon) and
the new handle_client_authenticated(peer) share run_session(). handle_client_with_auth
(psk/legacy-key) now synthesizes a PeerIdentity from the auth label and routes
through it — the authenticated label finally reaches the session instead of being
dropped. mae-mcp re-exports tokio_rustls::{TlsAcceptor,TlsConnector}.

Regression-safe: 36 daemon collab_e2e tests pass; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ase 2a)

Add three Scheme-configurable options (OptionRegistry + get/set + validation):
- collab_auth_mode (none|psk|key) — selects the handshake; key = Ed25519
  trusted-peer identity over mTLS.
- collab_host_key_policy (prompt|accept-new|strict) — TOFU policy for an unknown
  daemon identity.
- collab_tls (default true) — mTLS vs plaintext JSON KeyAuth fallback.
CollabState gains the fields (defaults psk/prompt/true). config.toml wiring
intentionally omitted (config.toml is retiring; set via init.scm / :set).

Add 'mae --collab-identity': prints this editor's Ed25519 peer identity
(generating it on first use) + the exact 'mae-daemon authorize' line, so an
admin can authorize the peer. Label = hostname.

Transport wiring to actually use key mode follows in 2b. mae-core option tests
+2; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hase 2b)

Refactor the editor connection so it can speak mTLS. run_collab_task is now
generic over the stream: a ClientTransport{Plain,KeyJson,KeyTls} enum (resolved
once from collab_auth_mode/tls/host_key_policy) drives a single
establish_connection() helper; read/write halves are type-erased (Box<dyn
AsyncBufRead/AsyncWrite>) so TCP and TLS share one loop. spawn_reader_task is
generic; the three connect sites (Connect/StartServer/reconnect) route through
establish_connection, skipping the PSK handshake on the TLS path. In key mode
the editor loads its Ed25519 identity + a known_hosts FileHostKeyVerifier
(TOFU policy) and connects via tls::client_config; KeyJson is the tls=false
fallback. mae-mcp re-exports ServerName.

E2E: scripts/collab-mtls-e2e.sh (make test-collab-mtls-e2e) spins up a real
key+tls daemon, authorizes the editor identity, and runs a real editor over
mTLS — connect, share a buffer, daemon confirms the share. Verified 7/7 green;
daemon authenticates the peer 'framework' by cert (strict binding visible).

collab_bridge 84 tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The authenticated peer identity (from mTLS, or the JSON-handshake label) is now
authoritative for attribution. Thread auth_label through run_session into the
doc handlers (handle_doc_*_inner; thin #[cfg(test)] wrappers keep the 28 existing
handler tests untouched) and enforce:
- kb/share: a key/TLS-authenticated peer that claims a creator other than its
  verified identity is REJECTED ('creator mismatch'); the authenticated label is
  the authoritative creator. Anonymous (psk/none) sessions keep self-claimed
  values (backward compatible).
- sync/awareness: broadcast user_name (cursor label) overridden with the
  authenticated label — cursor labels can't be spoofed.
- docs/save_committed: saved_by overridden with the authenticated label.

Closes the spoofable-creator gap. 3 new unit tests (spoofed rejected, matching
allowed, anonymous preserved); daemon 76 lib + 36 e2e tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
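
The strict-binding rule for kb/share in the commit above reduces to a three-way decision. A minimal sketch, assuming the authenticated label is available as an `Option` (absent for psk/none sessions); `bind_creator` is a hypothetical helper name.

```rust
/// Decide the authoritative creator for a kb/share request.
/// `authenticated` is the verified label from mTLS (None for anonymous sessions);
/// `claimed` is whatever the client put in the request.
fn bind_creator(authenticated: Option<&str>, claimed: &str) -> Result<String, &'static str> {
    match authenticated {
        // An authenticated peer may not claim someone else's identity.
        Some(label) if label != claimed => Err("creator mismatch"),
        // Matching claim: the verified label is authoritative.
        Some(label) => Ok(label.to_string()),
        // Anonymous sessions keep the self-claimed value (backward compatible).
        None => Ok(claimed.to_string()),
    }
}
```

The same pattern, with an override instead of a rejection, would cover the cursor-label and saved_by cases listed in the commit.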
Least-privilege access to shared KBs among trusted peers. An authenticated
(key/TLS) peer may join/update a KB only if it is the creator or in the KB's
KbCollectionDoc.members(); anonymous (psk/none) sessions keep connection-level
trust (backward compatible).

- kb_membership_check gates kb/join and kb/node_update.
- New owner-only methods kb/add_member / kb/remove_member {kb_id, member}:
  verify the caller is the collection creator, apply add/remove via the
  collection CRDT, persist + broadcast the update.
- Residual limitation (documented): a member could still smuggle membership
  edits through a raw kbc: sync/update; server-side CRDT field ACLs are future
  work. The sanctioned path is the owner-only methods.

4 well-designed unit tests: creator joins / non-member denied; owner add→join
→update, remove→denied; only-owner-manages; anonymous-not-gated. daemon 80 lib
+ 36 e2e tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire :kb-member-add / :kb-member-remove <kb-id> <member> end to end:
CollabIntent::KbAddMember/KbRemoveMember (dispatch_collab parses args from the
ex-command line) → CollabCommand::KbMember → run_collab_task sends kb/add_member
/kb/remove_member RPC (PendingResponseKind::KbMember) → response becomes a status
line, or a CollabEvent::Error on denial (e.g. 'only the owner can manage
members'). Disconnected handler reports not-connected.

3 dispatch unit tests (args→intent, both add/remove, missing-args→no-intent);
editor 90 collab + core collab tests pass; clippy clean. The daemon enforcement
this drives is covered by the 4 membership unit tests in phase 4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or membership e2e

Bug caught by the two-editor e2e: strict binding overrode the kb/share creator
VARIABLE but not the collection doc's internal creator()/members() (set by the
client from its user_name). So the owner-check in kb/add_member failed
(coll.creator() != authenticated label), the add was silently rejected, and a
newly-added member was still denied.

Fix: KbCollectionDoc::set_creator re-stamps the creator + seeds it as a member;
the daemon calls it on kb/share for authenticated sessions, binding the shared
collection to the verified peer identity.

scripts/collab-membership-e2e.sh (make test-collab-membership-e2e): two real
editors over mTLS — alice shares, bob denied (not a member), alice adds bob, bob
joins. Oracle = daemon log. VERIFIED PASS. + set_creator unit test. mae-sync 144,
daemon 80 tests; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When collab_host_key_policy=prompt and the editor meets an unknown daemon
identity, a PromptingHostKeyVerifier emits CollabEvent::HostKeyPrompt and BLOCKS
the connection task on a std reply channel; the main (UI) thread shows a
'Trust Daemon Key? <fingerprint> [y/N]' MiniDialog (MiniDialogContext::PeerKeyAccept),
and the y/n answer is routed back (Editor.pending_host_key_reply) to pin (accept)
or abort (reject). The collab task runs on a separate thread from the winit/TUI
loop, so the block is safe; a 120s timeout rejects if unanswered. A previously
pinned key that matches is accepted silently; a CHANGED key aborts (MITM).
accept-new/strict keep the non-interactive file verifier (headless default).

4 unit tests cover the channel round-trip + pinning. mae 88 collab + mae-core 15
dialog tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
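
The blocking prompt handshake described above can be approximated with two std channels: one carrying the prompt event to the UI thread, one carrying the y/n answer back. This is a sketch under assumptions; `HostKeyPrompt` and `ask_user` are illustrative names, and the real event type is `CollabEvent::HostKeyPrompt`.

```rust
use std::sync::mpsc::{self, Sender};
use std::time::Duration;

/// Event sent to the UI thread; it carries the channel for the answer.
struct HostKeyPrompt {
    fingerprint: String,
    reply: Sender<bool>,
}

/// Connection-task side: emit the prompt, then block until the UI answers.
/// An unanswered prompt (timeout) or a dropped reply channel counts as reject.
fn ask_user(events: &Sender<HostKeyPrompt>, fingerprint: &str, timeout: Duration) -> bool {
    let (reply_tx, reply_rx) = mpsc::channel();
    let prompt = HostKeyPrompt {
        fingerprint: fingerprint.to_string(),
        reply: reply_tx,
    };
    if events.send(prompt).is_err() {
        return false; // no UI is listening
    }
    reply_rx.recv_timeout(timeout).unwrap_or(false)
}
```

The UI side receives the `HostKeyPrompt`, shows the fingerprint, and calls `prompt.reply.send(true)` or `send(false)`. Blocking here is only safe because the connection task runs on its own thread, as the commit notes; the 120s value from the commit would be passed as `timeout`.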
Fix confirmed inaccuracies that misled users:
- daemon CLI: removed nonexistent --unix-socket/--db/--wal-threshold; documented
  the real flags (--bind/--config/--data-dir/--check-config) + the keygen/keys/
  identity/authorized/authorize/revoke subcommands.
- env var MAE_COLLAB_ADDR → MAE_COLLAB_SERVER.
- editor options: point at init.scm (config.toml retiring), correct defaults
  (collab-server-address 127.0.0.1:9473, collab-user-name not -username, backoff 2),
  add collab-auth-mode/host-key-policy/tls.
- WAL recovery path collab.db → collab/state.db.

Add §10 Trusted-Peer Mode: Ed25519 mTLS setup end to end (daemon identity →
authorize peer → editor key mode + TOFU → per-KB membership commands), and
update §8 Security for the three auth modes (none/psk/key) + mTLS shipped.
ADR-017 → Accepted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The e2e job now also pulls the mae-daemon release artifact (needs: [check, daemon])
and runs scripts/collab-mtls-e2e.sh + scripts/collab-membership-e2e.sh against the
real release binaries — exercising the full trusted-peer stack (Ed25519 mTLS
handshake, TOFU, strict identity binding, per-KB membership) headlessly. Adds
iproute2 (the scripts use ss for port readiness). Verified both pass with release
binaries locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…live)

Consolidated step-by-step plan: Tier 0 automated (unit + e2e + CI commands),
Tier 1 single-host CLI smoke, Tier 2 the two-machine live run (daemon+editor on
D, editor on E) covering identity exchange/authorize, TOFU connect, buffer
convergence + authenticated cursor labels, KB membership (deny→add→allow→remove),
and security/negative checks (unauthorized peer, changed host key, tcpdump
confidentiality). Results checklist + troubleshooting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prepend a Setup section: Rust >=1.95 (MSRV), iproute2 for the e2e scripts,
optional GUI build deps; build both workspaces (make build-tui + build-daemon);
get binaries on PATH (install targets or copy). Plus a key-setup table clarifying
that the automated e2e scripts generate+authorize their own keys, while the
manual tiers need you to exchange identities + mae-daemon authorize (Tier 2
Step 3), and where identities live + how to reset them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One-command, idempotent key-mode setup: `mae setup-collab [--server <addr>]`
generates the peer identity (if absent), persists collab-auth-mode=key + server +
auto-connect to init.scm (via the existing save_option_to_init), and prints the
exact `mae-daemon authorize` line. Re-running updates in place (no duplicates).

SSH integration (opt-in key reuse for an SSH-like purpose):
- Editor: `mae setup-collab --ssh-key ~/.ssh/id_ed25519` imports an unencrypted
  OpenSSH Ed25519 PRIVATE key as the collab identity (Identity::import_ssh_private_key
  via the ssh-key crate; from_seed/save helpers).
- Daemon: `mae-daemon authorize --from-ssh-pub <file> <label>` imports the SSH
  PUBLIC key (PublicKey::from_ssh_line — manual SSH wire parse, no dep).
- Verified consistent end-to-end: the editor's imported MAE fingerprint EQUALS the
  daemon's authorized fingerprint, so the editor presents exactly the key the
  daemon trusts. Errors clearly on encrypted/non-ed25519 keys.

Note: reusing one key across SSH + MAE couples their compromise; a dedicated MAE
identity (default) keeps them separate — documented in COLLABORATION.md §10.

3 new mae-mcp tests (ssh pubkey roundtrip, ssh private import matches pubkey, +
existing). mae-mcp 121, daemon 80; clippy clean. Docs + testing plan updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…against 0.0.0.0

The two-machine testing plan used the default collab port 9473, which collides
with an already-running personal daemon (binds 127.0.0.1:9473; a test daemon on
0.0.0.0:9473 overlaps loopback). Switch Tier 2 to a non-default port (9480) with
an explicit "check it's free first" note, and document bind-vs-connect (0.0.0.0
is a bind address, never a connect target).

- scripts: collab-mtls-e2e.sh / collab-membership-e2e.sh now auto-select the
  first free port (scan upward from 9476/9477 via `ss`) unless MAE_E2E_PORT is
  set explicitly — so a running daemon or a concurrent test run never triggers
  "address already in use". Loopback-bound, so they never touched 9473 anyway;
  this just makes them robust against any busy port.
- mae setup-collab: reject `--server 0.0.0.0:…` with a clear message — that's
  the daemon's bind address, not a reachable connect target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
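
The bind-vs-connect check for `--server` might look like the following sketch. It assumes the address is given as a literal IP and port (hostnames would need resolution first, which is omitted), and `validate_connect_target` is a hypothetical name.

```rust
use std::net::SocketAddr;

/// Reject unspecified addresses (0.0.0.0, [::]) as connect targets:
/// they are valid to bind to, but not reachable destinations.
fn validate_connect_target(server: &str) -> Result<SocketAddr, String> {
    let addr: SocketAddr = server
        .parse()
        .map_err(|e| format!("invalid server address {server:?}: {e}"))?;
    if addr.ip().is_unspecified() {
        return Err(format!(
            "{addr} is a bind address, not a reachable connect target"
        ));
    }
    Ok(addr)
}
```

`is_unspecified()` covers both the IPv4 and IPv6 wildcard forms, so the same check rejects `[::]:9480`.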
The collab e2e harness was Linux-only and silently mis-ran on macOS,
forcing stop-and-go debugging across our two dev machines. Three issues,
one root theme — platform-divergent path/tool resolution:

1. Daemon dir resolution ignored XDG on macOS.
   `daemon/src/config.rs` resolved config + data dirs via bare
   `dirs::config_dir()` / `dirs::data_dir()`, which follow Apple
   conventions on macOS (`~/Library/Application Support`) and ignore
   `XDG_CONFIG_HOME` / `XDG_DATA_HOME`. The e2e scripts isolate each peer
   via those env vars, so on macOS the daemon never found its generated
   `daemon.toml`, fell back to all defaults (default bind :9473, default
   `$TMPDIR/mae-daemon.sock`), and collided with the developer's personal
   daemon — "daemon failed to listen". Meanwhile the *identity*/*keystore*
   code (mae-mcp) already resolves XDG-first, so identities landed in the
   isolated dir while config/data landed in the real Library dir (split
   brain). Fix: resolve config + data dirs XDG-first on all platforms
   (env when set, else `dirs::*`), matching `mae-mcp::identity` /
   `keystore`. Pure extension — macOS users without XDG set are unchanged.

2. Port-readiness probe used `ss` (Linux iproute2), absent on macOS, so
   the daemon-listening check always failed even when it was up. Add a
   portable `port_listening` helper: prefer `ss` (Linux/CI unchanged),
   then `lsof`, then `netstat`.

3. The editor run was wrapped in `timeout`, absent on stock macOS. Use a
   `${TIMEOUT_BIN:+...}` prefix resolving `timeout` → `gtimeout` →
   omitted (bash 3.2-safe, `set -u`-safe).
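
The XDG-first rule from item 1 can be sketched as a pure function. This is an illustration under stated assumptions, not the code in `daemon/src/config.rs`: the name `resolve_base_dir` is hypothetical, and treating an empty env value as unset is an assumption.

```rust
use std::path::PathBuf;

/// Resolve a base directory XDG-first: use the env value when it is set
/// and non-empty (assumption), otherwise fall back to the platform default.
pub fn resolve_base_dir(
    xdg_value: Option<&str>,
    platform_default: Option<PathBuf>,
) -> Option<PathBuf> {
    match xdg_value {
        Some(v) if !v.is_empty() => Some(PathBuf::from(v)),
        _ => platform_default,
    }
}
```

A caller would pass `std::env::var("XDG_CONFIG_HOME").ok().as_deref()` and `dirs::config_dir()`. Keeping the env lookup outside the function makes the rule testable without mutating the process environment.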

Codify the lesson as CLAUDE.md principle #13 (cross-platform parity):
XDG-first dirs everywhere, portable shell tooling, CI on both OSes — a
fix that only works on one machine is not a fix.

Verified on macOS (was failing, now passing):
- scripts/collab-mtls-e2e.sh ............ 7/7, mTLS peer authenticated
- scripts/collab-membership-e2e.sh ...... 7/7 + 7/7, deny→add→allow
- cargo test -p mae-mcp ................. 121 passed
- cd daemon && cargo test / clippy ...... 9 passed / clean
- cargo test -p mae --bins collab ....... 94 passed
Linux behavior unchanged (ss/timeout still preferred; XDG already worked).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a shared scratchpad to the testing plan so D (driver) and E (mac)
can start the Tier 2 live run the moment D is up — no round-trips.

Captures the concrete session state instead of the reference topology:
- E (bob, mac, 192.168.1.132) is READY: built from a8ac842, personal
  daemon stopped (9473 clear), identity generated — fingerprint +
  pre-formatted `mae-daemon authorize ... bob` line for D to paste.
- D's row is a fill-in (IP, fingerprint, status) the driver commits back.
- Test port 9480 (avoids the personal-daemon :9473 collision).
- mDNS returned nothing on this LAN → connect by explicit host:port.
- D's unblock checklist (pull a8ac842, bind 0.0.0.0:9480 key-mode,
  authorize bob, publish fingerprint, open firewall).

Each machine edits its own row, commits, pushes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…th machines (#66)

The live two-machine run surfaced issue #66: the interactive TOFU `prompt`
policy (what `setup-collab` writes by default) is unwired and freezes the
editor — hard-freezes the TUI, silently fails the GUI. It bites EVERY editor,
including D's own "alice" (which connects to D's daemon too), so it's a
coordination hazard, not just a local quirk.

Update the testing plan so the other machine doesn't trip on it:
- Prominent #66 callout: every editor must set
  collab_host_key_policy = "accept-new" in init.scm (non-blocking, auto-pins)
  until #66 is fixed; verify the daemon fingerprint OUT-OF-BAND against the
  pinned known_hosts entry instead of via the (broken) prompt.
- Board: min build bumped to b947a52; added an `accept-new set` column; D's
  checklist now says rebuild BOTH binaries (branch moved past the first harness
  build) and configure accept-new before launching alice.
- Step 4 rewritten for accept-new + the out-of-band pin check; the interactive
  prompt path is marked deferred to #66.
- Results checklist: T0 marked green (macOS), row 4 split into accept-new (now)
  and prompt-TOFU (deferred).

No code change — config/docs only. Tier 0 already validated on macOS at b947a52.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Track machine-E observations during the two-machine ADR-017 validation so we
surface + fix issues, and D sees our findings. Logged so far:
- resolved: cross-platform Tier 0 fix (a8ac842)
- filed: #66 (TUI TOFU prompt deadlock)
- open/HIGH: alice rope panic crash (D-side, suspect shared/sync rope bridge)
- open: bob local edits to a joined buffer not visible on read-back (2x; cause TBD)
- open: connection flapping (peer closed w/o TLS close_notify) — correlated w/ alice crash

Convergence so far: alice->bob receive confirmed; round-trip not yet validated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…test step

Per the review-the-process feedback: each entry now carries its tier/step
(T0, T2.4, T2.5, …), Action → Expected → Actual → Status → Repro, so issues
are pinpointed to the code path under stress and are reproducible.

- Run 1 chronological table (10 rows) mapping each success/failure to a step.
- Issue details I-1 (alice rope panic @ T2.5, task #18), I-2 (bob edit not
  visible @ T2.5), I-7 (connection flapping @ T2.4/5), #66 (TOFU @ T2.4).
- Convergence scorecard by direction+step; next-run-from-scratch checklist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ef bob

Aligns naming with collab-test-notes-bob.md. Logs the run-1 progress (cross-machine
mTLS auth ✅, alice→bob receive ✅) and the I-1 alice rope panic at T2.5:
Rope::char(138) OOB on bob's remote edit of an em-dash line. Scopes it to the
editor-side apply-remote path in crates/core (text.rs bridge + local cursor adjust
are already clamped). D owns the backtrace + fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ackward (I-1)

The two-machine collab run crashed alice's GUI with a ropey panic
("index past end of Rope: char index 138, Rope char length 34"). Backtrace:
  Rope::char  <-  word::word_start_backward  <-  mouse_ops::handle_mouse_click_inner

Not a CRDT bug (headless convergence never crashed) — a mouse bug. Clicking the
right pane of a vertical split registers as a double-click word-select, and the
screen column (~138) far overruns the short line. The double-click path passed an
unclamped text_col to char_offset_at (the single-click path already clamps), and
word_start_backward guarded pos==0 but not pos>len_chars (word_end_forward
already guards), so rope.char(138) on a 34-char rope panicked.

Fixes:
- word::word_start_backward clamps pos.min(len_chars()) (defense in depth).
- mouse_ops double-click path clamps text_col to the clicked line length before
  char_offset_at (also guards the link-follow branch).
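
A sketch of the two clamps, using a `&[char]` slice as a stand-in for the rope so the example needs no external crate. The whitespace-only word boundary and the helper name `clamp_col` are simplifying assumptions, not the real mae-core logic.

```rust
/// Start index of the word at or before `pos`. `pos` is clamped to the
/// text length first, so an out-of-range position cannot index past the end.
pub fn word_start_backward(text: &[char], pos: usize) -> usize {
    let mut i = pos.min(text.len()); // the defense-in-depth clamp
    // Skip whitespace immediately to the left of the position.
    while i > 0 && text[i - 1].is_whitespace() {
        i -= 1;
    }
    // Walk back over the word itself.
    while i > 0 && !text[i - 1].is_whitespace() {
        i -= 1;
    }
    i
}

/// Clamp a clicked screen column to the line length before it is turned
/// into a char offset (hypothetical helper name).
pub fn clamp_col(text_col: usize, line_len: usize) -> usize {
    text_col.min(line_len)
}
```

Both guards are cheap, and either one alone would have prevented the panic; applying both means a future caller that forgets the column clamp still cannot crash the word motion.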

Tests: word_motions_clamp_out_of_bounds_pos, word_start_backward_out_of_bounds_on_empty_rope,
mouse_double_click_past_line_end_does_not_panic. Full mae-core suite 2237/2237.

Follow-up (I-3): the fallback handle_mouse_click uses raw (non-window-relative)
coords in a split — now safe (clamped) but cursor lands at line end; make it
window-relative later.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…resolved

After fix a57455f, clean run from scratch (T2.4→T2.5):
- alice→bob and bob→alice convergence both confirmed over mTLS, two machines.
- I-1 (rope panic) FIXED + verified live — root cause was double-click
  word-select in a split pane passing an OOB offset to word_start_backward,
  NOT the CRDT path (multibyte was a red herring). No crash in Run 2.
- I-2 (bob edit "not visible") RESOLVED as a driving artifact: MCP active
  buffer was *AI:claude*; switch-to-buffer must be its own verified step.
- I-7 (flapping) RESOLVED — it was a symptom of alice crashing (I-1), gone now.

Next: simultaneous-edit, then T2.6 KB membership, T2.7 security checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Records the post-fix live run: bidirectional CRDT convergence confirmed (bob's
line + alice's seed + alice's typed line all merged; 52 session-7 + 1 session-8
updates), and the I-1 fix verified live (double-click @ col 138 in a split no
longer crashes). Reattributes bob's I-2 to an MCP eval_scheme artifact (buffer-
insert via eval skips the event-loop post-edit collab flush; real keystrokes
sync fine).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cuttlefisch and others added 22 commits June 23, 2026 18:02
… connect path

Bob's 9d run surfaced B-21: a runtime `:set collab-host-key-policy prompt` (or
`(set-option! …)`) updated the option (get_option reflected "prompt") but the
connect still auto-pinned under the init.scm value (accept-new) — so the TOFU
modal never appeared. Root cause: `resolve_client_transport` builds the host-key
verifier ONCE in `setup_collab_channels` (startup) and caches it in
`CollabSpawn.transport`; every `:collab-connect` reuses that cached verifier, so a
runtime policy change never reached it. Same class as the auto-connect env gap
fixed in 91a5201.

Fix (editor-side): the verifier now reads a LIVE policy cell at verify-time.
- CollabState gains `host_key_policy_live: Arc<Mutex<String>>`, a cross-thread
  mirror of `host_key_policy`; set_option keeps it in sync; resolve_client_transport
  seeds it from the current value at setup.
- The editor now ALWAYS uses the prompting verifier (the only one that *can* prompt),
  made policy-dynamic: it reads the live cell each verify and dispatches
  accept-new → pin, strict → reject, prompt → ask. So a runtime switch to/from
  prompt takes effect on the NEXT connect with no relaunch.
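
The live-cell idea can be sketched as below. Only the `Arc<Mutex<String>>` cell and the three policy names come from this change; `LiveVerifier`, `Decision`, and the fail-closed handling of unrecognised values are assumptions made for illustration.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Pin,
    Reject,
    Ask,
}

/// Hypothetical verifier that holds the shared policy cell instead of a
/// policy value captured at construction time.
pub struct LiveVerifier {
    policy: Arc<Mutex<String>>,
}

impl LiveVerifier {
    pub fn new(policy: Arc<Mutex<String>>) -> Self {
        Self { policy }
    }

    /// Decide what to do with an unknown host key, reading the policy at
    /// verify-time so a runtime change applies to the next connect.
    pub fn decide_unknown_host(&self) -> Decision {
        let policy = self.policy.lock().unwrap();
        match policy.as_str() {
            "accept-new" => Decision::Pin,
            "prompt" => Decision::Ask,
            // "strict" and anything unrecognised fail closed (assumption).
            _ => Decision::Reject,
        }
    }
}
```

The option setter writes to the same cell, so one verifier instance built at startup keeps tracking later changes.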

Regression: `host_key_policy_change_honored_at_verify_time_b21` — one verifier
instance pins silently under accept-new, then (live cell flipped to prompt) ASKS on
a new host instead of auto-pinning. 4 existing verifier tests updated for the new
field. mae-core 2274 + mae collab_bridge 95 green; clippy -D warnings clean.

Unblocks 9d: bob can now `:set collab-host-key-policy prompt` at runtime + connect →
the R4 TOFU modal (GUI under prompt — the #66 deadlock path) — no init.scm relaunch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(rebuild + runtime :set)

Path B chosen. Notes bob: this fix is editor-side so he MUST rebuild/reinstall/relaunch
(unlike B-20). Then run 9d via runtime (set-option! collab_host_key_policy "prompt")
— now honored — clear the pin, connect → expect the R4 TOFU modal (GUI under prompt,
the #66 deadlock path). n-then-y, OOB fingerprint SHA256:07aW…7Ls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…22 GUI TOFU modal render/focus bug

B-21 closed: runtime set-option collab_host_key_policy=prompt now honored — connect BLOCKED
on the prompt (no auto-pin) and raised a bus notification with the correct fingerprint
07aW…7Ls (OOB match). Reject path correct: notify_resolve(dismiss) -> handshake aborted,
status off, known_hosts NOT pinned.

B-22 (new, GUI): the R4 TOFU modal is invisible AND unresponsive — (1) no repaint on raise
(GUI only redraws on keypress, ~2-key lag); (2) no input-focus capture (keys leak to the
underlying buffer — Esc triggered Claude commands with the AI buffer focused). GUI sibling of
#66; R4 fixed the plumbing but not the GUI render/focus path. Round-2 accept UN-TESTABLE: bus
notification exposed actions:[] (only dismiss=reject), no MCP accept lever, and the modal y/Enter
can't be delivered through the broken GUI.

Fix dirs: BlockingReply raise must request redraw/damage; modal must grab input focus; add
explicit bus actions (Accept&pin / Reject) for headless/Notifications parity. bob restored via
accept-new -> auto-pin (07aW…7Ls) -> connected + reconcile-joined; temp backups removed.

9d verdict: B-21 + fingerprint + reject ✅; accept-via-UI blocked by B-22.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e + focus)

Bob's 9d run on the B-21 build proved the prompt PLUMBING works (correct
fingerprint, reject logic, no-pin-on-reject) but the GUI modal surface was broken:
it never drew (GUI froze until a keypress, ~2 keys behind) and keys leaked to the
underlying buffer. Two independent defects, both fixed here:

B-22a — runtime starvation (no repaint on raise): the GUI bridge ran on a
`new_current_thread` tokio runtime hosting BOTH the collab connection task and the
`bridge_task` proxy forwarder (+ AI/LSP/DAP/MCP). The host-key verifier is called
synchronously by rustls mid-handshake and blocks (up to 120s) on
`reply_rx.recv_timeout` waiting for the prompt answer — starving that one worker so
the `HostKeyPrompt` event never reached the GUI and `mark_full_redraw` never ran
(the GUI twin of the #66 TUI deadlock). Fix: give the bridge runtime a worker pool
(`new_multi_thread().worker_threads(4)`, + the `rt-multi-thread` tokio feature) so
the forwarder keeps running while a connect blocks on the prompt. This also fixes
the same starvation for MCP-driven flows.

B-22b — modal didn't capture input: `handle_key` only routed to the mini-dialog via
the command-palette path, so an async-raised modal (notify() Modal arm sets
`mini_dialog` but no palette mode) was unanswerable in Normal/Insert/AI mode; and
the GUI's AI-input-lock branch stole Esc/Ctrl-C before `handle_key` ran (Esc hit
AI-cancel, not the dialog). Fix: `handle_key` now routes to the dialog whenever
`mini_dialog.is_some()` (all modes), and the GUI keyboard dispatch checks the modal
before the AI-input-lock/shell branches.

Together: the host-key TOFU modal now paints immediately and answers to y/Enter
(accept+pin) / n/Esc (reject) regardless of focus. GUI compiles, mae key_handling
tests green, clippy --features gui -D warnings clean.

Deferred (B-22c, follow-up): the trust notification exposes no bus actions, so it's
answerable only by the modal keypress — add Accept/Reject actions so notify_resolve
+ the *Notifications* row can answer it (headless/agent parity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…; B-22c deferred

Notes bob: rebuild (editor-side fix), then re-run 9d via runtime :set — the TOFU
modal now paints immediately and captures input even with the *AI* buffer focused.
Accept-by-keypress (y) now reachable. B-22c (MCP/bus accept action) deferred as a
small optional follow-up for headless parity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er) STILL BROKEN

On build 5337cb5: runtime prompt honored (B-21), correct fp surfaced, reject aborts w/o pin.
B-22b confirmed fixed — modal now captures input even with *AI* buffer focused.
B-22a STILL broken — bob-user: "frozen again for a second, modal capturing input but not
rendering." Multi-thread runtime SHORTENED the freeze (was ~120s verifier-timeout starvation,
now ~1s) but the modal still never PAINTS; user was blind, pressed keys without seeing.

Consequence: accept->pin only "worked" via blind keypress (net: bob connected+pinned to the
correct OOB key 07aW…7Ls), but can't be cleanly validated — a user cannot SEE the fingerprint
before trusting, defeating TOFU. 9d accept path NOT validated as a usable flow.

Proven: B-21, correct-fp, reject-no-pin, B-22b focus. Open: B-22a render (dialog paint not
triggered when prompt raised off the handshake thread; runtime fix shortened freeze but didn't
wire the MiniDialog redraw). Recommend: finish render fix and/or land B-22c (bus accept action)
so accept is verifiable+answerable without depending on GUI paint. bob restored: connected,
pinned, policy=accept-new.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…indow (paint gated on resolution)

bob-user: UI unfreezes once the (unrendered) modal is gone. So the non-render is scoped to the
interval the host-key prompt is outstanding; GUI recovers fully on resolution. Diagnosis: the GUI
render loop isn't pumping a repaint while the synchronous rustls verifier blocks waiting for the
answer — multi-thread runtime shortened the block but the paint path is still gated on prompt
resolution, so the MiniDialog overlay never gets a frame during the window it needs to be visible
(input is serviced enough to capture the key, but no full redraw runs). Fix: get a redraw to run
WHILE the prompt is pending (raise prompt + request paint on GUI thread; let the verifier await
async rather than blocking a thread the paint pass depends on).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…g is repaint-scheduling

Answer to bob-user's question (principle #7): the host-key TOFU modal reuses the verified
MiniDialog — collab_bridge.rs:823-841 raises a blocking action_required notification +
mark_full_redraw; it becomes MiniDialogContext::Notification -> MiniDialogKind::Confirm
("Action Required"), answered in apply_mini_dialog. Not ad-hoc.

So the render bug isn't "didn't reuse MiniDialog" nor a missing dirty flag: the wiring looks
correct on paper — user_event(CollabEvent) sets self.dirty=true (main.rs:2051), about_to_wait
(2475) gates renderer.request_redraw() on self.dirty (2688-2705), and handle_collab_event calls
mark_full_redraw. Yet live it paints only on keypress + recovers when the modal is gone.

Candidate roots for alice (GUI owner) to instrument: (1) HostKeyPrompt CollabEvent not delivered
to user_event until a later input event (residual forwarder/proxy-wakeup starvation while the
rustls verifier blocks; best fits the symptom); (2) about_to_wait WaitUntil wakeup never fires
while blocked; (3) overlay draw skipped for the Notification confirm context. Disambiguate with
one log line on user_event(HostKeyPrompt) entry: appears only post-keypress => #1; immediate but
no frame => #2/#3. Orthogonal unblock: land B-22c (bus Accept/Reject actions) so 9d accept is
verifiable via notify_resolve regardless of GUI paint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…prompt

The trust notification was answerable only by the modal y/n keypress; `notify_resolve`
could merely dismiss (which does NOT send the reply — so an MCP "reject" actually hung
until the 120s verifier timeout). Add a `NotifCommand::Reply(bool)` that genuinely
answers a BlockingReply: it sends on the parked reply channel, tears down the modal if
it's this notification, and resolves. The host-key prompt now carries explicit
"Accept & pin" (action 0) / "Reject" (action 1) bus actions.

Effect: the prompt is answerable over MCP (`notify_resolve {id, action:0|1}`) and via
the *Notifications* row — headless/agent parity, and a working answer path independent
of the GUI modal paint (B-22a, still open). Both routes send on the same channel; first
answer wins (pending_notif_reply is taken once).
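
"First answer wins" can be sketched with an `Option` that is taken exactly once. `PendingPrompt` and `answer` are hypothetical names standing in for the parked reply channel, and a std `mpsc` channel is assumed for the reply path.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical holder for the parked reply channel of a blocking prompt.
pub struct PendingPrompt {
    reply: Option<Sender<bool>>,
}

impl PendingPrompt {
    pub fn new() -> (Self, Receiver<bool>) {
        let (tx, rx) = channel();
        (Self { reply: Some(tx) }, rx)
    }

    /// Deliver an answer. Returns true only for the first caller; later
    /// callers find the sender already taken and are ignored.
    pub fn answer(&mut self, accept: bool) -> bool {
        match self.reply.take() {
            Some(tx) => tx.send(accept).is_ok(),
            None => false,
        }
    }
}
```

Whichever route calls `answer` first (modal keypress or bus action) is the one the blocked verifier receives; the second call is a no-op rather than a second send.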

Regression: `reply_action_answers_blocking_notification_over_bus` — Accept action sends
true, closes the modal, resolves, decrements outstanding. mae-core 2275 green; clippy
--features gui -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…one; B-22a tracked

Daemon log confirms 9d: fast reject (no pin) + accept→pin auth/join via bob's
captured keypresses (B-22b). B-22c (bus Accept/Reject actions, 7fe4f93) lets the
prompt be answered over MCP/notify_resolve regardless of GUI paint. B-22a (modal
doesn't paint while verifier blocks) remains as a tracked GUI-paint polish bug with
bob's disambiguating log experiment as the next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r→delivery→paint

Temporary diagnostic (target "b22a") to pinpoint why the GUI TOFU modal doesn't paint
while the verifier blocks. Six timestamped checkpoints:
- 1/2 bridge_task: HostKeyPrompt taken off collab_rx + proxy.send_event returned Ok
- 3   handle_collab_event: prompt RECEIVED on the main thread (proxy→user_event)
- 4/5 about_to_wait: dirty-with-modal-pending + request_redraw() issued
- 6   RedrawRequested: a real frame painted with the modal up

Reading the sequence vs when the prompt is raised (and when a keypress arrives)
disambiguates bob's candidates: 1/2 but no 3 until a keypress ⇒ winit proxy wakeup
not firing while blocked; no 1/2 until keypress ⇒ residual forwarder starvation;
3+5 but no 6 ⇒ redraw requested but not serviced; 6 fires but modal invisible ⇒
render pass skips the overlay. Enable with MAE_LOG=b22a=info (or RUST_LOG). Reverted
once the root is fixed.
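
The reading rules above amount to a small decision table. The sketch below encodes it; the type and field names are hypothetical, every flag means "observed before any keypress", and combinations the message does not cover fall into an explicit `Unclassified` case.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Diagnosis {
    ForwarderStarved,   // no 1/2 until a keypress
    ProxyWakeupMissing, // 1/2 but no 3 until a keypress
    RedrawNotServiced,  // 3 + 5 but no 6
    OverlaySkipped,     // 6 fires but the modal is invisible
    Healthy,
    Unclassified, // not covered by the rules above
}

/// What the log (and the user) showed before any keypress arrived.
pub struct Observed {
    pub forwarded: bool,        // checkpoints 1/2
    pub received: bool,         // checkpoint 3
    pub redraw_requested: bool, // checkpoints 4/5
    pub painted: bool,          // checkpoint 6
    pub modal_visible: bool,    // what the user actually saw
}

pub fn diagnose(o: &Observed) -> Diagnosis {
    if !o.forwarded {
        Diagnosis::ForwarderStarved
    } else if !o.received {
        Diagnosis::ProxyWakeupMissing
    } else if o.redraw_requested && !o.painted {
        Diagnosis::RedrawNotServiced
    } else if o.painted && !o.modal_visible {
        Diagnosis::OverlaySkipped
    } else if o.painted && o.modal_visible {
        Diagnosis::Healthy
    } else {
        Diagnosis::Unclassified
    }
}
```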

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…umentation c7a4bc4)

Bob: rebuild, launch with MAE_LOG=b22a=info 2>logfile, run set-prompt→clear-pin→connect,
then WAIT 10s without touching kbd/mouse (the bridge sends IdleTick every 100ms via the
same proxy, so it should paint on its own if the wakeup works), then press n + paste the
b22a lines. The 1/2→3→4/5→6 checkpoint sequence pinpoints delivery-wakeup vs
redraw-scheduling vs render-pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…AY skipped (render-path bug)

Ran alice's b22a instrumented experiment, no kbd/mouse touched. First clean connect cycle: all
6 checkpoints fire within ~24ms of connect with NO keypress (input_dirty=false) and repeat every
~150ms: 1/2 forwarder->proxy, 3 received +6.7ms, 4/5 request_redraw, 6 PAINTING a frame with
modal pending +23ms. So proxy DOES wake winit and a frame IS painted unprompted — delivery/wakeup
(my earlier hypothesis #1) REFUTED; request_redraw fine too.

=> alice's last matrix case: paint runs but modal invisible => the render pass paints the frame
but SKIPS the MiniDialog overlay for the Notification-confirm context. "Freeze" was perceptual
(frames paint every ~150ms but the modal isn't in them).

Likely root: render-side twin of B-22b — the GUI draws the mini-dialog overlay only in
command-palette/command mode, not whenever mini_dialog.is_some(); the async Notification-confirm
modal sets mini_dialog but not palette mode -> overlay skipped. Fix: draw overlay whenever
mini_dialog.is_some() (any mode). Handed to alice w/ exact log + fix dir.

Also: B-22c confirmed (trust notifs carry Accept&pin/Reject actions). 9d still functional PASS.
Process note: agent set-option!->immediate connect races the apply-drain (verify get_option
first). bob restored: connected, re-pinned 07aW…7Ls, accept-new.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…both backends)

Bob's b22a instrumentation was decisive: all six checkpoints fire within ~24ms of the
prompt with NO keypress (delivery + redraw + paint all healthy), yet the modal was
invisible — the render pass paints frames but SKIPS the mini-dialog overlay. Root cause:
the overlay was drawn only inside the `command_palette.is_some()` branch (via
`render_command_palette`, which draws `mini_dialog` internally), so an async-raised modal
that set `mini_dialog` without `command_palette` (the host-key TOFU prompt) never drew.
This is the render-side twin of the B-22b input bug.

Fix: both render chains now check `mini_dialog.is_some()` FIRST (top-priority modal),
matching the input dispatch (B-22b). A sweep found the TUI renderer
(crates/renderer/src/lib.rs) had the identical bug — fixed here too, not just the GUI.

Follow-up (next commit): unify the overlay PRIORITY into a single `Editor::active_overlay()`
so the GUI + TUI render chains can't diverge again (the root architectural cause — the
priority order was duplicated per backend).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t diverge (B-22a root cause)

The B-22a sweep confirmed the architectural root cause: the fullscreen-overlay
PRIORITY ORDER was hard-coded as an independent if/else chain in EACH backend, and
they drifted — the GUI drew the blocking mini-dialog at top priority while the TUI
only drew it nested under the command palette, so an async-raised modal (host-key
TOFU prompt) painted no dialog in the TUI. Same class as the B-22b input bug.

Fix: a single source of truth — `render_common::overlay::active_overlay(&Editor)
-> ActiveOverlay` — defines the canonical priority (MiniDialog > FilePicker >
FileBrowser > CommandPalette > WhichKey > Splash > None), unit-tested. Both the GUI
(crates/gui/src/lib.rs) and TUI (crates/renderer/src/lib.rs) render chains now
DERIVE their dispatch from it (`overlay == ActiveOverlay::X`) instead of duplicating
the checks, so they stay in lock-step and a future overlay/reorder changes one place.
A blocking modal is always highest priority, matching the input dispatch.
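
A minimal sketch of the single priority resolver. The variant order is the canonical priority stated above; the `OverlayState` flags are a hypothetical stand-in for the real `&Editor` argument.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveOverlay {
    MiniDialog,
    FilePicker,
    FileBrowser,
    CommandPalette,
    WhichKey,
    Splash,
    None,
}

/// Hypothetical stand-in for the editor state the real resolver inspects.
#[derive(Default)]
pub struct OverlayState {
    pub mini_dialog: bool,
    pub file_picker: bool,
    pub file_browser: bool,
    pub command_palette: bool,
    pub which_key: bool,
    pub splash: bool,
}

/// One priority chain consumed by every backend.
pub fn active_overlay(s: &OverlayState) -> ActiveOverlay {
    if s.mini_dialog {
        ActiveOverlay::MiniDialog
    } else if s.file_picker {
        ActiveOverlay::FilePicker
    } else if s.file_browser {
        ActiveOverlay::FileBrowser
    } else if s.command_palette {
        ActiveOverlay::CommandPalette
    } else if s.which_key {
        ActiveOverlay::WhichKey
    } else if s.splash {
        ActiveOverlay::Splash
    } else {
        ActiveOverlay::None
    }
}
```

Each render chain then compares against the result (`overlay == ActiveOverlay::MiniDialog`, and so on), so a dialog raised without a command palette still resolves to `MiniDialog`.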

Behavior-preserving (the per-branch render bodies are unchanged; GUI splash was a
`pub use` of the same render_common::splash::should_show_splash). mae-core overlay
priority test + clippy --features gui -D warnings clean on mae-core/gui/renderer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l overlay unification

Bob's b22a experiment decisive: frame paints but overlay skipped. Fixed GUI+TUI to
draw mini_dialog top-priority; sweep found the TUI had the identical bug. Architectural
fix 65c2281: single render_common::overlay::active_overlay() priority source consumed
by both render chains so they can't diverge. Bob: rebuild → modal should paint; then
I rip out the instrumentation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…'t size to contents, fp truncated)

Rebuilt on f526aef (unified active_overlay resolver). Re-ran prompt (get_option-verified before
connect -> no apply-race -> single prompt). B-22a CONFIRMED FIXED: modal painted on its own
(b22a 1-6 ~3.5ms, input_dirty=false) and is VISIBLE now; bob-user accepted the key; accept->pin
->connect end to end (connected, joined, known_hosts re-pinned correct key Ck5Um…=07aW…7Ls,
OOB-verified). B-22 trilogy functionally complete.

NEW B-23 (security-relevant UX): the modal doesn't adapt to content size — fingerprint text cut
off, user couldn't see the ENTIRE key. TOFU requires reading the full fingerprint OOB before
trusting; truncation undermines that (pinned key was correct by independent check this run, but
UX can't guarantee it). Fix dir: size the dialog box to content (grow to fit within screen) and/or
wrap the fingerprint full-width; wrap not clip. Likely in MiniDialog/overlay render geometry
(render_common::overlay / backend dialog draw).

9d PASS (accept->pin now with a VISIBLE modal; reject + B-21 + correct-fp previously proven).
Remaining polish: B-23 sizing. bob restored: connected, pinned, accept-new.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iniDialog sizing, shared backends

Code-located the truncation: both backends hard-code the dialog box and don't measure title/body.
GUI popup_render.rs:928-929 and TUI popup_render.rs:628-629 use the SAME duplicated formula
(width=50.min(cols-4), height=4+fields.len()); the Confirm/Notification body (the ~70-char SHA256
fingerprint) isn't measured in width or height -> clipped. render_common::overlay unified PRIORITY
(active_overlay) but not GEOMETRY.

Recommended (principle #8, geometry twin of the priority unification): add
render_common::overlay::mini_dialog_layout(dialog, max_cols, max_rows) -> DialogLayout that computes
width/height from wrapped title+body+fields+actions clamped to screen, WRAPS long content (fingerprint)
instead of clipping; both GUI+TUI render_mini_dialog consume it (drop the local 50/4+fields constants);
unit-test the layout. Covers ALL MiniDialog kinds so nothing truncates again.

Security: host-key TOFU requires the FULL fingerprint be visible before accept (OOB compare); adaptive
sizing guarantees it. 9d still PASS (accept->pin with visible modal); B-23 is the readability/sizing fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l fingerprint readable

Bob's B-23: the host-key TOFU prompt's modal renders (B-22a) but its box is hard-coded
width=50 / height=4+fields.len() and content-blind, so the ~55-char SHA256 host-key
fingerprint overflowed and was CLIPPED. confirm() jams the whole multi-line question
into a single field label, drawn as one truncated `label: value` row. Both backends
duplicated the same formula — the geometry twin of the overlay-priority duplication.
Security-relevant: the full fingerprint MUST be readable for the out-of-band compare.

Fix (the geometry twin of active_overlay): a single shared
render_common::dialog::mini_dialog_layout(dialog, max_cols, max_rows) -> DialogLayout
that measures title/body/fields, grows the box to fit, WRAPS long content
(word-wrap + hard-break for space-less tokens like a fingerprint), and clamps to the
screen. Both GUI and TUI render_mini_dialog now consume it (drop the local 50/4+fields
constants), so they can't diverge and EVERY dialog kind sizes to its content.
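
A sketch of content-sized layout with wrapping. It is simplified on purpose: the real `mini_dialog_layout` takes the dialog itself, while this version takes a title and body string, ignores fields and hints, and assumes 4 cells of border and padding per axis (`CHROME`). All of those are assumptions for illustration.

```rust
/// Wrap `text` to `width` columns: break on whitespace where possible and
/// hard-break any token longer than a whole line, so no character is lost.
pub fn wrap_hard(text: &str, width: usize) -> Vec<String> {
    let width = width.max(1);
    let mut lines: Vec<String> = Vec::new();
    let mut cur = String::new();
    let mut cur_len = 0usize; // chars currently in `cur`
    for word in text.split_whitespace() {
        let chars: Vec<char> = word.chars().collect();
        // The word fits on the current line after a single space.
        if cur_len > 0 && cur_len + 1 + chars.len() <= width {
            cur.push(' ');
            cur.extend(chars.iter());
            cur_len += 1 + chars.len();
            continue;
        }
        // Otherwise flush the current line and start a new one.
        if cur_len > 0 {
            lines.push(std::mem::take(&mut cur));
        }
        // Hard-break a space-less token that exceeds one line.
        let mut rest = &chars[..];
        while rest.len() > width {
            lines.push(rest[..width].iter().collect());
            rest = &rest[width..];
        }
        cur = rest.iter().collect();
        cur_len = rest.len();
    }
    if cur_len > 0 {
        lines.push(cur);
    }
    lines
}

pub struct DialogLayout {
    pub width: usize,
    pub height: usize,
    pub body: Vec<String>,
}

/// Size the box to its wrapped content, clamped to the screen.
pub fn mini_dialog_layout(title: &str, body: &str, max_cols: usize, max_rows: usize) -> DialogLayout {
    const CHROME: usize = 4; // assumed border + padding per axis
    let inner_max = max_cols.saturating_sub(CHROME).max(1);
    let lines = wrap_hard(body, inner_max);
    let widest = lines
        .iter()
        .map(|l| l.chars().count())
        .chain(std::iter::once(title.chars().count()))
        .max()
        .unwrap_or(0);
    DialogLayout {
        width: (widest + CHROME).min(max_cols),
        height: (lines.len() + CHROME).min(max_rows),
        body: lines,
    }
}
```

On a wide screen the box grows past the old fixed width to hold a long token on one line; on a narrow one the token is split across lines instead of being clipped.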

Tests: fingerprint fully visible (was clipped) + box grows past 50; narrow-screen
hard-wrap keeps every char; wrap_hard token-break; input dialogs keep field rows + hint.
clippy --features gui -D warnings clean on mae-core/gui/renderer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) + rebuild-to-confirm

Geometry twin of active_overlay: render_common::dialog::mini_dialog_layout consumed by
both backends; full host-key fingerprint now visible + wrapped. Bob: rebuild → confirm
the entire SHA256 shows (no clip) + wraps on a narrow window. Next: convert b22a
instrumentation to clean collab-target debug tracing + drop the per-frame render probes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l arc CLOSED

Rebuilt a66449f (shared render_common::dialog::mini_dialog_layout). Both paths through the
fully-rendered modal: B-23 — bob-user saw the FULL dialog contents (entire SHA256 fingerprint
readable, no clipping) -> OOB compare trustworthy. Reject (n) -> ApplicationVerificationFailure,
aborted, no pin. Accept (y) -> collab connected + KB join complete; known_hosts re-pinned correct
key Ck5Um…=07aW…7Ls.

9d/TOFU/R4 = FULL PASS via a modal that renders (B-22a) + captures input (B-22b) + sizes to
content (B-23) + is bus-answerable (B-22c); reject-no-pin and accept-pin both proven with the
full fingerprint visible.

Security arc validated live B-19->B-23 (epoch fence, continuation fence, runtime policy, modal
render/focus/bus/sizing). ADR-024 bus + ADR-018/023 membership-gated write access validated end
to end on two machines. Step-9 complete. Remaining: instrumentation cleanup (b22a -> permanent
collab tracing + drop render probes) + the collab/config-UX polish theme. bob: connected,
pinned, accept-new.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-key lifecycle tracing

The b22a diagnostic (c7a4bc4) did its job — it proved the modal painted but the
overlay was skipped (fixed in b09becd/65c22813). Per plan, convert the lasting-value
parts to permanent tracing and drop the throwaway scaffolding:

- REMOVED the per-frame render probes (about_to_wait + RedrawRequested) — hot-path,
  and the question they answered is settled + now guarded by the active_overlay/dialog
  unit tests.
- REMOVED the bridge_task forward probes + the one-off "b22a" target.
- KEPT, as clean `debug!(target: "collab")`, the host-key TOFU lifecycle: prompt raised
  (handle_collab_event) and the trust decision in the verifier (pinned / rejected /
  timed out). So `MAE_LOG=collab=debug` tells the whole trust-handshake story without
  any render-loop spam.

clippy --features gui -D warnings clean; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cuttlefisch cuttlefisch changed the title Trusted-peer, membership-gated, crash-safe + write-authorized KB sync (ADR-017/020/022/023) Trusted-peer, membership-gated, crash-safe, write-authorized KB sync + attention bus (ADR-017/020/022/023/024) Jun 23, 2026
cuttlefisch and others added 7 commits June 23, 2026 21:28
… bugs (Wave 1)

Closes testing gaps and non-UX correctness issues surfaced by the live
two-machine CRDT validation, ahead of the UX pass. Each item ships with a
RED-before/GREEN-after guard (CLAUDE.md #9).

A1  daemon: fence no-cascade oracle — assert the canonical node stays
    byte-identical across a fenced push (not just the error string).
A2a editor: notify_ops resolution unit test — R2 fence notification →
    3 actions; Keep-mine records pending_reauthor + enqueues KbAdoptNode;
    Accept-remote adopts without reauthor.
B1  fix I-3 split-window click coords: both GUI fallback and TUI passed
    ABSOLUTE screen cells to the window-local click handler, so clicks in a
    non-primary split landed at the wrong column. Translate once in the shared
    handle_mouse_click_inner via the focused pane's layout origin (#8 — one
    source of truth, fixes both backends) + pure window_relative() helper.
B2  config-key invariant guard (every snake_case option exposes its kebab
    alias; every collab_* has alias + config_key) + extract
    is_epoch_fence_rejection() so the editor↔daemon "rebase required" contract
    is centralized and tested; clearer user-facing fence wording (#7).
B3  verify joined-KB instance surfaces — federated get/search attribute the
    node to its instance + it appears in *KB Instances* (regression guard).
B4  B-5 malformed-row robustness: a short-arity stored row makes the whole
    load query fail at bind time before the row-skip loop; degrade load_all to
    an empty Ok (logged at ERROR) instead of an Err that aborted kb_join and
    tripped the main-thread stall watchdog (#1). Off-thread KB I/O deferred.
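
The B1 translation can be sketched as a pure function. This is a minimal illustration, not the helper as it exists in the tree: the `PaneOrigin` type, the `u16` cell units, and the `Option` return are assumptions made for the example; only the name `window_relative` comes from the commit.

```rust
/// Top-left corner of a pane in absolute screen cells.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PaneOrigin {
    pub col: u16,
    pub row: u16,
}

/// Translate an absolute screen cell into coordinates local to the pane
/// whose top-left corner is `origin`.
///
/// Returns `None` when the cell lies above or to the left of the pane,
/// so a caller never receives an underflowed coordinate.
pub fn window_relative(abs_col: u16, abs_row: u16, origin: PaneOrigin) -> Option<(u16, u16)> {
    let col = abs_col.checked_sub(origin.col)?;
    let row = abs_row.checked_sub(origin.row)?;
    Some((col, row))
}
```

With a right-hand split whose origin is column 40, a click at absolute cell (45, 3) maps to local (5, 3). Doing the subtraction once in the shared click path is what lets a single fix cover both backends.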

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y (Wave 1: C2, C3)

C2 — verify + regression-test that connect-critical config is read live from the
single OptionRegistry source, with no read-site cache (so the apply-drain race the
live test hit no longer needs a manual get-option wait):
  - server_address: read live at connect dispatch (set_option writes it
    synchronously); test in dispatch/collab.rs.
  - resolve_client_transport reads auth_mode/psk/tls live; test in collab_bridge.
  The transport is still built once at task setup and cached — the security-
  critical runtime field (host-key policy) is already kept live via
  host_key_policy_live; a full per-connect transport rebuild on a runtime
  auth_mode/tls change is a documented, deferred follow-up.

C3 — embed the git SHA (build.rs → MAE_BUILD_SHA, "-dirty"/"unknown" fallbacks,
cross-platform per #13) in editor + daemon. Reported in the startup log,
--version, and the daemon $/debug response; collab-doctor now prints the daemon
build and warns on an editor↔daemon mismatch — the "are both machines on the
same commit?" check the live two-machine test ran by hand. Smoke + mismatch
tests on both sides.
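
A rough sketch of the C3 labelling and mismatch logic, kept free of any git invocation so it can be tested directly. The function names and the choice to treat an "unknown" build as "cannot prove a mismatch" are assumptions for illustration; the commit only states the "-dirty"/"unknown" fallbacks and the editor↔daemon warning.

```rust
/// Build the label a build script might export as MAE_BUILD_SHA.
///
/// `sha` stands for the short commit hash if git was available at build
/// time; `dirty` says whether the working tree had uncommitted changes.
/// Hypothetical helper: the real build.rs may be structured differently.
pub fn build_sha_label(sha: Option<&str>, dirty: bool) -> String {
    match sha.map(str::trim).filter(|s| !s.is_empty()) {
        Some(s) if dirty => format!("{s}-dirty"),
        Some(s) => s.to_string(),
        // No git, or an empty answer: fall back rather than fail the build.
        None => "unknown".to_string(),
    }
}

/// True when both sides report a known build and the labels differ.
/// Assumption: an "unknown" side is not treated as a proven mismatch.
pub fn builds_mismatch(editor: &str, daemon: &str) -> bool {
    editor != "unknown" && daemon != "unknown" && editor != daemon
}
```

Keeping the label computation pure means the fallbacks can be unit-tested on any platform, independent of whether git is installed on the build host.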

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gence (Wave 2: A2b, A5)

A2b — drive handle_collab_event's KbNodeAdopted (the kb/node_fetch reply) through
both fence-resolution paths: keep-mine re-authors the captured edit over the
authoritative state and consumes pending_reauthor; accept-remote takes the
authoritative value and discards local. Closes the bridge half of the R1
adopt-and-re-author round-trip the manual Step-9 run exercised by hand.

A5 — real-daemon convergence: two peers concurrently edit DISJOINT fields of the
same KB node from the same base; the daemon merges both into its authoritative
per-node doc and two fresh joiners read back BYTE-IDENTICAL state carrying both
edits — the CRDT guarantee (#11) end-to-end over TCP + base64, not just an
in-process KnowledgeBase merge. MAE_TCP_E2E-gated (CI e2e job; the no-auth
daemon skips the epoch fence, so the joiner write is accepted). Also adds a
doc-comment cross-referencing the manual T1–T7 matrix, mapping each in-process
kb_sync_n_peer_e2e test to its live two-machine step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ave 2: C1)

The daemon broadcasts a KB's collection doc (kbc:) on every membership/role
change, but the editor ignored it ("remote update for unknown buffer") and only
relearned its authorization epoch on a full re-join — forcing the manual
reconnect the live two-machine test kept performing by hand.

C1 keeps a local CRDT replica of each joined KB's collection doc
(CollabState::kb_collection_state), seeded from the join snapshot. A live kbc:
RemoteUpdate is now intercepted before the buffer lookup, applied to the replica
(#11), and epoch_of(local_fingerprint) re-derived: kb_epochs updates in place so
the next node edit authors under the rotated, current-epoch client_id — no
reconnect. The user is notified and a `kb-epoch-changed` hook fires
(runtime-redefinable, #7). Replica + epoch are dropped on KbLeft.
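
The routing described above can be sketched roughly as follows. This is a simplification under stated assumptions: the real replica is a CRDT document, whereas here it is a plain fingerprint-to-epoch map, the text after the `kbc:` prefix is assumed to be the KB id, and the `Routed` enum and method names are hypothetical. Only `kb_collection_state`, `kb_epochs`, and `local_fingerprint` are names taken from the commit.

```rust
use std::collections::HashMap;

/// What the editor did with an incoming remote update (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum Routed {
    /// Not a collection doc: fall through to the normal buffer lookup.
    Buffer,
    /// A `kbc:` update for a KB this editor has not joined: ignored.
    UnjoinedKb,
    /// Replica updated, but this peer's own epoch is unchanged.
    CollectionUpdated,
    /// This peer's epoch rotated; the next edit authors under `new`.
    EpochChanged { old: u64, new: u64 },
}

pub struct CollabState {
    local_fingerprint: String,
    /// KB id -> simplified replica (member fingerprint -> epoch).
    kb_collection_state: HashMap<String, HashMap<String, u64>>,
    /// KB id -> this peer's current epoch.
    kb_epochs: HashMap<String, u64>,
}

impl CollabState {
    pub fn new(local_fingerprint: &str) -> Self {
        Self {
            local_fingerprint: local_fingerprint.to_string(),
            kb_collection_state: HashMap::new(),
            kb_epochs: HashMap::new(),
        }
    }

    /// Seed the replica and the local epoch from the join snapshot.
    pub fn join(&mut self, kb: &str, snapshot: HashMap<String, u64>) {
        let epoch = snapshot.get(&self.local_fingerprint).copied().unwrap_or(0);
        self.kb_collection_state.insert(kb.to_string(), snapshot);
        self.kb_epochs.insert(kb.to_string(), epoch);
    }

    /// Drop replica and epoch when the KB is left.
    pub fn leave(&mut self, kb: &str) {
        self.kb_collection_state.remove(kb);
        self.kb_epochs.remove(kb);
    }

    /// Intercept collection-doc updates before any buffer lookup happens.
    pub fn on_remote_update(&mut self, doc_name: &str, changes: &[(String, u64)]) -> Routed {
        let Some(kb) = doc_name.strip_prefix("kbc:") else {
            return Routed::Buffer;
        };
        let Some(replica) = self.kb_collection_state.get_mut(kb) else {
            return Routed::UnjoinedKb;
        };
        for (fingerprint, epoch) in changes {
            replica.insert(fingerprint.clone(), *epoch);
        }
        // Re-derive only this peer's epoch from the updated replica.
        let new = replica.get(&self.local_fingerprint).copied().unwrap_or(0);
        let old = self.kb_epochs.insert(kb.to_string(), new).unwrap_or(0);
        if new != old {
            Routed::EpochChanged { old, new }
        } else {
            Routed::CollectionUpdated
        }
    }
}
```

In this sketch a change to another member's entry updates the replica but leaves the local epoch untouched, which mirrors the "another member's change cannot bump this peer's epoch" test case.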

Security (#10): the daemon stays the sole authority — it re-derives each member's
epoch from its OWN authoritative collection when fencing, so the relearn is pure
client convenience. A tampered/stale replica can only mislead this client about
its own epoch, never self-elevate; a client that ignores the relearn and authors
under a stale epoch is still fenced. Tests cover the live relearn, that another
member's change cannot bump this peer's epoch, and the unjoined-KB no-op. The
daemon viewer_era_* / stale_epoch_continuation_* fence tests stay GREEN — the
no-weakening gate.
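
To make the security argument concrete, here is a minimal sketch of a daemon-side fence that trusts only its own collection. All names (`Collection`, `FenceDecision`, `fence`) are hypothetical, and the wire form of the epoch (the commit mentions a rotated client_id) is reduced to a bare integer for the example.

```rust
use std::collections::HashMap;

/// Stand-in for the daemon's authoritative collection doc:
/// member fingerprint -> current authorization epoch. Hypothetical type.
pub struct Collection {
    epochs: HashMap<String, u64>,
}

impl Collection {
    pub fn new() -> Self {
        Self { epochs: HashMap::new() }
    }

    pub fn set_epoch(&mut self, fingerprint: &str, epoch: u64) {
        self.epochs.insert(fingerprint.to_string(), epoch);
    }

    pub fn epoch_of(&self, fingerprint: &str) -> Option<u64> {
        self.epochs.get(fingerprint).copied()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum FenceDecision {
    Accept,
    RebaseRequired,
    NotAMember,
}

/// The current epoch is looked up from the daemon's own state. The value
/// the client claims is only compared against it, never adopted, so a
/// stale claim and a forged higher claim are both fenced.
pub fn fence(authoritative: &Collection, fingerprint: &str, claimed_epoch: u64) -> FenceDecision {
    match authoritative.epoch_of(fingerprint) {
        None => FenceDecision::NotAMember,
        Some(current) if current == claimed_epoch => FenceDecision::Accept,
        Some(_) => FenceDecision::RebaseRequired,
    }
}
```

Because nothing the client sends feeds back into `authoritative`, a tampered local replica can change only what the client claims, not what the daemon accepts.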

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…B-6)

B-6 (primary KB data dir is XDG-first, not dirs::data_dir() under ~/Library) was fixed
in cf673b7; verification confirms the primary.cozo store path derives from
editor.mae_data_dir() (XDG_DATA_HOME → ~/.local/share/mae) with the same XDG-first
fallback, and the only residual dirs::data_dir() uses are deliberate read-only
module *search* paths. Add a regression test (#13) asserting mae_data_dir()
honors XDG_DATA_HOME and falls back to ~/.local/share/mae — never the macOS
platform-native dir — so a future change can't silently reintroduce dirs::data_dir
and re-split the KB store from the ADR-019 registry markers (restart-survival).
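
The XDG-first order can be sketched as a pure function that takes the environment as arguments, which is one way to make the regression test deterministic. The function name and signature are assumptions (the real accessor is editor.mae_data_dir()), as is the detail that `mae` is appended under XDG_DATA_HOME.

```rust
use std::path::PathBuf;

/// Resolve the data directory XDG-first.
///
/// `xdg_data_home` is the value of XDG_DATA_HOME if set; `home` is the
/// user's home directory. An empty XDG_DATA_HOME is treated as unset.
/// The platform-native directory is deliberately never consulted.
pub fn mae_data_dir_from(xdg_data_home: Option<&str>, home: &str) -> PathBuf {
    match xdg_data_home.filter(|value| !value.is_empty()) {
        Some(xdg) => PathBuf::from(xdg).join("mae"),
        None => PathBuf::from(home).join(".local").join("share").join("mae"),
    }
}
```

A caller would pass `std::env::var("XDG_DATA_HOME").ok().as_deref()` and the home directory; the test then needs no mutation of process-wide environment variables.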

A cross-location ~/Library→XDG migration for pre-fix macOS dev builds is
intentionally NOT added (highest-risk, marginal early-alpha benefit; the fix
already landed without orphaning concerns for XDG-isolated installs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…orized-peer e2e (Wave 3: A4)

Fills the security-negative gaps left by the existing mTLS unit tests
(mtls_unauthorized_client_rejected / mtls_client_rejects_untrusted_host):

- FileHostKeyVerifier TOFU integrity: a previously-pinned daemon host key that
  CHANGES (MITM / key substitution) is rejected AND the trusted pin is NOT
  overwritten, so an attacker can't silently re-pin; the genuine key still
  verifies afterward. Plus a strict-policy-rejects-unknown-host test. Runnable
  unit tests in shared/mcp/src/identity.rs.
- collab-mtls-e2e.sh: added an unauthorized-peer negative scenario — a second
  editor whose identity is NOT in the daemon's authorized_keys attempts to
  connect; the daemon's authenticated-peer count must not increase (robust to the
  exact rustls rejection string). The e2e counterpart to the unit-level
  unauthorized-client rejection.
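
The pin-integrity property in the first bullet can be illustrated with an in-memory stand-in. This is a sketch, not FileHostKeyVerifier itself: the real verifier is file-backed and can prompt interactively, and the `PinStore`, `Policy`, and `Verdict` names here are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    /// Pin an unknown host on first contact.
    AcceptNew,
    /// Reject any host that is not already pinned.
    Strict,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Pinned,
    Trusted,
    RejectedUnknown,
    RejectedChanged,
}

/// In-memory pin store: host -> pinned key fingerprint.
pub struct PinStore {
    pins: HashMap<String, String>,
}

impl PinStore {
    pub fn new() -> Self {
        Self { pins: HashMap::new() }
    }

    pub fn verify(&mut self, host: &str, fingerprint: &str, policy: Policy) -> Verdict {
        match self.pins.get(host).cloned() {
            Some(pinned) if pinned.as_str() == fingerprint => Verdict::Trusted,
            // A different key for a pinned host is rejected under every
            // policy, and the stored pin is left untouched.
            Some(_) => Verdict::RejectedChanged,
            None if policy == Policy::AcceptNew => {
                self.pins.insert(host.to_string(), fingerprint.to_string());
                Verdict::Pinned
            }
            None => Verdict::RejectedUnknown,
        }
    }
}
```

The key point is that the only write to `pins` sits in the "no existing pin" arm, so a substituted key has no code path that could overwrite a trusted one.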

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 3: A3)

Rather than ship an unverifiable ~150-line two-editor scheme fence-resolve e2e,
document the closure precisely. That e2e can't be authored correct-by-construction
without a runnable two-machine environment: its deterministic trigger is now in
tension with C1's honest-path epoch relearn, and there is no validated scheme
recipe for editing a *shared* KB node to force a fenced kb/node_update. The
documentation covers:

- Tier-0 "Automated coverage map": each manually-run flow (Step 8 fence safety,
  Step 9 resolution UX, rebase-required contract, epoch relearn, two-peer
  convergence, unauthorized peer, MITM no-overwrite, TOFU prompt) → the exact
  unit/e2e test that now guards it.
- Step 8 / Step 9 NOTE callouts: the fence *safety* and resolution *logic* are now
  unit-automated (A1/A2a/A2b) and the manual reconnect-to-relearn is automatic
  in-product (C1); the live two-machine run remains for badge/pixel + cross-editor
  convergence, with the offline edit as the deterministic fence trigger.

This serves the "clear success criteria + coverage" goal: the residual fence
end-to-end is explicitly the Tier-2 manual run, with every constituent piece
unit-covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>