Internal · SpaceMusic Engineering

The Engine, the Exe, and Three Reaches

SpaceMusic's only UI is served by a local executable, talks to whoever serves it, and treats the internet as the afterthought — because the same machine has to feel instant before anything else does.

The constraint that picks the architecture

The requirement is a control surface that feels instant on the machine it runs on, keeps working with the network cable pulled out, and, from the very same code, lights up on a tablet across the room and in a browser across the world. Pick any one of those and the architecture is obvious. The difficulty is that SpaceMusic needs all three, and they pull in opposite directions.

The engine renders frames that are a few milliseconds old. A UI that shows them 150 ms late feels broken when you are sitting at the machine, yet that same 150 ms is invisible over the public internet, where the alternative is no picture at all. So the latency budget is a function of reach: what counts as "good" changes depending on where the viewer is standing relative to the engine. An architecture that ignores this either slows the local case down to keep the remote case simple, or makes the remote case impossible in order to keep the local case fast. We want neither.

This document is the resolution: a single UI codebase, three transports, and one rule that decides which transport is used without the UI ever having to ask.

How streaming UIs usually get built

The default modern shape is the cloud-hosted single-page app: the UI is served from a CDN, talks to a backend over HTTPS, and any live video flows through a media server. It is a clean model — one origin, one auth system, one place to deploy — and for a product whose users are always somewhere else, it is correct.

It is also the wrong default here, and it is worth being honest about why, because it was the first thing I reached for. Routing everything through a server makes the architecture uniform by making the common case (operator at the machine) pay the cost of the rare case (someone watching remotely). For SpaceMusic that trade is backwards.

Cloud-first. One hosted UI, all traffic through the server/SFU. Uniform and simple, but every viewer eats the round-trip to a datacenter, and nothing works offline. Examples: most SaaS dashboards, cloud creative tools.
Local-only. A native app bolted to the engine. Instant, but it is a second codebase, and it cannot reach a phone or a remote browser without reinventing a transport. Examples: classic desktop control surfaces.
Local-first, server-optional (where we land). One web UI, primarily served and fed by a local executable, with the server bolted on only for the reach that genuinely needs the internet.

The thing that rejects the cloud-first default is a single sentence about the product: this is the only UI we have, it must feel instant on the same machine, and it must run with no internet. Once that is non-negotiable, the server cannot be on the local critical path, and the whole shape follows.

What we have already proven

We are not designing on a blank page; the highest-priority reach is also the one with the least risk left in it. A measurement spike (plan 037) put the engine's Stride output into a browser on the same machine over a localhost WebSocket, decoded with WebCodecs, at roughly 9.5 ms glass-to-glass — indistinguishable from a native window. That is the same-machine transport, already built and measured.

Three more pieces are in place. The engine is already headless: it shares its textures over Spout and its parameters over a WebSocket, and carries zero streaming load — the encoding lives in a separate process. The encoder was just moved to Main 4:2:0 H.264, which is exactly the format a WebRTC path needs, so the remote reach is unblocked at the codec level. And the streamer now does demand-driven downscale — it encodes each texture at the size and frame rate the viewer actually asks for, which turned out to be what made the picture stable on a weaker client at 30 fps. The pieces exist; what is missing is the architecture that arranges them.
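The demand-driven downscale can be illustrated with a small sizing function. This is a sketch under assumptions, not the streamer's actual code: the names `Size` and `encodeSize` are hypothetical, and the policy shown (preserve aspect ratio, never upscale, snap to even dimensions because 4:2:0 chroma is subsampled by two in each direction) is one reasonable reading of "the size the viewer actually asks for".

```typescript
// Hypothetical names: Size and encodeSize are illustrations only.
interface Size {
  width: number;
  height: number;
}

// Pick the encode size for one texture given what the viewer asked for.
// Assumed policy: keep the source aspect ratio, never upscale, and snap to
// the nearest even dimensions, which 4:2:0 encoders require.
function encodeSize(source: Size, requested: Size): Size {
  const scale = Math.min(
    1,
    requested.width / source.width,
    requested.height / source.height,
  );
  const toEven = (n: number): number => Math.max(2, Math.round((n * scale) / 2) * 2);
  return { width: toEven(source.width), height: toEven(source.height) };
}
```

A viewer tile that is half the source size in each direction gets a stream with a quarter of the pixels, and a viewer asking for more than the source gets the source size unchanged.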

Transport follows the serving origin

The load-bearing idea is small enough to state in one line: do not choose a transport globally — let the UI talk back to whoever served it.

If the page was served from localhost, the thing that served it is the local exe sitting next to the engine, so the UI streams from it directly over a WebSocket: instant, offline, no server in sight. If the page was served from a LAN address, same story, one hop further. If the page was served from the public server, the engine is unreachable directly, so the UI subscribes to the relayed stream through LiveKit. The serving origin is the reach, so binding the transport to it makes the right choice automatically, with no configuration and no second codebase.
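The origin rule is small enough to sketch as a single function. Everything named here (`Reach`, `TransportPlan`, `planTransport`) is hypothetical, and the way a "LAN address" is recognised (RFC 1918 private IPv4 ranges plus `.local` names) is an assumption made for illustration; the real exe could equally tell the page its reach directly.

```typescript
// Hypothetical names throughout; not identifiers from the SpaceMusic codebase.
type Reach = "same-machine" | "lan" | "wan";

interface TransportPlan {
  reach: Reach;
  video: "websocket+webcodecs" | "livekit";
  host: string; // the host that served the page
}

// Host names that mean "this machine". IPv6 loopback appears in brackets,
// which is how location.hostname reports it.
const LOOPBACK = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Assumption: "a LAN address" means an RFC 1918 private IPv4 address.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Derive the transport from the serving host alone; no configuration input.
function planTransport(servingHost: string): TransportPlan {
  const host = servingHost.toLowerCase();
  if (LOOPBACK.has(host)) {
    return { reach: "same-machine", video: "websocket+webcodecs", host };
  }
  if (isPrivateIPv4(host) || host.endsWith(".local")) {
    return { reach: "lan", video: "websocket+webcodecs", host };
  }
  return { reach: "wan", video: "livekit", host };
}
```

In a page this would be called once at startup as `planTransport(location.hostname)`, and nothing else in the UI would need to know which reach it is on.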

Figure 1 · System topology: one engine, one exe, three reaches

The engine on the left never knows the server exists. The local exe is the only thing that bridges local and internet, and it only ever makes outbound connections — so there is no inbound hole to punch in the studio firewall. Two of the three arrows leaving the exe stay on the local network and never touch the dashed box; only the WAN reach descends into the server, and only when someone actually opens the page from the outside.

Three reaches, three transports

With the origin rule in place, each reach gets the transport it deserves rather than the one that is easiest to share. The differences are not cosmetic — the video path, the parameter path, the latency, and even the authentication model change from column to column.

Figure 2 · The three reaches, side by side

The two local reaches

Same-machine and LAN share a transport (H.264 over a WebSocket, decoded by WebCodecs) and differ only in the network hop and a browser technicality. localhost is a secure context, so WebCodecs works over plain HTTP with no certificate. The same page on a LAN IP is not a secure context, and WebCodecs is restricted to secure contexts, a restriction Chrome enforces and Safari does not. That single browser quirk is why the LAN reach is "Safari today, a dedicated app tomorrow" rather than "ship it everywhere now", and why neither local reach needs the certificate machinery that a naive reading of "we need HTTPS" would imply.
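A page can detect which side of this quirk it landed on before trying to decode. The sketch below reads two real browser globals, `isSecureContext` and `VideoDecoder`, through a hypothetical `RuntimeLike` interface so the logic can also run outside a browser; `checkWebCodecs` and its result type are illustrative names, not existing code.

```typescript
// Minimal view of the browser globals this check reads. RuntimeLike is a
// hypothetical helper type so the function can be exercised without a browser.
interface RuntimeLike {
  isSecureContext?: boolean;
  VideoDecoder?: unknown;
}

type DecodeCheck =
  | { usable: true }
  | { usable: false; reason: "insecure-context" | "no-webcodecs" };

// Feature-detect rather than sniff the browser: if VideoDecoder is exposed,
// the local WebSocket + WebCodecs path can be used, whatever the origin.
function checkWebCodecs(rt: RuntimeLike): DecodeCheck {
  if (rt.VideoDecoder !== undefined) return { usable: true };
  // VideoDecoder is missing. On an insecure origin the likely cause is the
  // secure-context gate; otherwise the browser has no WebCodecs at all.
  const reason = rt.isSecureContext === false ? "insecure-context" : "no-webcodecs";
  return { usable: false, reason };
}
```

In a page the call would be `checkWebCodecs(globalThis as RuntimeLike)`. An `insecure-context` result on a LAN IP is the case where the UI can tell the operator to use Safari or the dedicated app instead of failing without explanation.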

The remote reach, and the auth split

WAN is the only column that touches the server, and it is the only column with real authentication. The local exe WHIP-publishes each active texture into a LiveKit room; a remote browser, having loaded the UI from the server behind single sign-on, subscribes to that room with a short-lived token. Three credentials, each with one job: the human signs in once with SSO; the browser is handed a scoped JWT that only lets it subscribe; the headless exe authenticates with a long-lived API key it trades for a publish token. The local reaches need none of this — at most a pairing token so a stray device on the LAN cannot connect.
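The difference between the browser's credential and the exe's comes down to the claims inside the token. The sketch below builds only the claim payloads and leaves signing out. The `video` grant field names follow LiveKit's token format as we understand it and should be checked against the LiveKit documentation; the function names and the five-minute default lifetime are assumptions for illustration.

```typescript
// Claim shapes modelled on LiveKit's access-token grants (assumed; verify
// against the LiveKit docs). Signing with the API secret is out of scope here.
interface VideoGrant {
  room: string;
  roomJoin: boolean;
  canPublish: boolean;
  canSubscribe: boolean;
}

interface RoomTokenClaims {
  sub: string; // participant identity
  nbf: number; // not valid before (seconds since epoch)
  exp: number; // expiry (seconds since epoch)
  video: VideoGrant;
}

// Browser credential: may join and subscribe, may never publish.
function viewerClaims(identity: string, room: string, nowSec: number, ttlSec = 300): RoomTokenClaims {
  return {
    sub: identity,
    nbf: nowSec,
    exp: nowSec + ttlSec,
    video: { room, roomJoin: true, canPublish: false, canSubscribe: true },
  };
}

// Exe credential: may join and publish, has no reason to subscribe.
function publisherClaims(identity: string, room: string, nowSec: number, ttlSec = 300): RoomTokenClaims {
  return {
    sub: identity,
    nbf: nowSec,
    exp: nowSec + ttlSec,
    video: { room, roomJoin: true, canPublish: true, canSubscribe: false },
  };
}

function isExpired(claims: RoomTokenClaims, nowSec: number): boolean {
  return nowSec >= claims.exp;
}
```

Keeping the two grants disjoint means a leaked viewer token cannot be used to inject video into the room, and a leaked publish token cannot be used to watch it.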

One sharp edge here, learned the expensive way: the browser must mint its token from its own origin, never by fetching cross-origin from the API, because the SSO layer answers a cross-origin preflight with a redirect and no CORS header — which fails silently in a fresh session and passes with a warm cookie, the worst possible failure mode to debug.
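One way to keep this failure from being silent is to make the mint call structurally same-origin and to treat a redirect as "session missing" rather than as an opaque error. The sketch below covers the two pure pieces of that. It relies on standard Fetch behaviour: a request made with `redirect: "manual"` that hits a redirect yields a response whose `type` is `"opaqueredirect"`. The helper names and the outcome labels are hypothetical.

```typescript
// Hypothetical helpers; only the "opaqueredirect" response type is standard.
interface MintResponseLike {
  ok: boolean;
  status: number;
  type: string; // Response.type in the Fetch API
}

type MintOutcome = "token" | "login-required" | "error";

// Accept only paths that resolve against the page's own origin. Absolute
// URLs and protocol-relative forms ("//host", "/\host") are rejected.
function sameOriginPath(path: string): string {
  const second = path.charAt(1);
  if (!path.startsWith("/") || second === "/" || second === "\\") {
    throw new Error(`not a same-origin path: ${path}`);
  }
  return path;
}

// Interpret the response of a fetch made with { redirect: "manual" }.
// A redirect here is assumed to be the SSO layer asking for a sign-in.
function classifyMintResponse(res: MintResponseLike): MintOutcome {
  if (res.type === "opaqueredirect") return "login-required";
  return res.ok ? "token" : "error";
}
```

With these in place, a fresh session and a warm one take visibly different code paths, so the cold-cookie case surfaces as a sign-in prompt instead of a request that simply never resolves.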

The hard edges

Three things at the margins are worth naming so they are deliberate choices rather than surprises.

The LAN certificate. The clean way to get Chrome (not just Safari) onto the LAN is a trusted certificate on the local exe — but a private LAN IP cannot get a public certificate, so that means split-horizon DNS and a provisioned cert, the one genuinely awkward piece of infrastructure in the whole design. We sidestep it: a dedicated native app for the tablet is not bound by browser secure-context rules at all, so the app removes the cert problem instead of solving it. Until the app exists, Safari covers the LAN.

The demand signal has to span the wire. The encoder only does work a viewer needs — which channels, at what size, at what frame rate. Locally that signal is a direct WebSocket message; over WAN it rides the same relay as the parameters. The mechanism is identical either way (it is the per-tile downscale already built); only the wire gets longer. This is what keeps the WAN uplink from having to carry full-resolution video nobody is looking at.
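The demand signal can be pictured as a small message per viewer and a merge step on the encoder side. The message shape (`Demand`) and the merge policy are assumptions for illustration: when several viewers want the same channel, this sketch takes the largest size and highest frame rate requested, and any channel nobody asks for is simply absent from the result.

```typescript
// Hypothetical wire shape for one viewer's request on one channel.
interface Demand {
  channel: string;
  width: number;
  height: number;
  fps: number;
}

// Merge all viewers' requests into one encode target per channel.
// Assumed policy: per-channel maximum of width, height and fps.
function mergeDemand(requests: Demand[]): Map<string, Demand> {
  const merged = new Map<string, Demand>();
  for (const r of requests) {
    // A non-positive request is treated as "not watching this channel".
    if (r.width <= 0 || r.height <= 0 || r.fps <= 0) continue;
    const cur = merged.get(r.channel);
    merged.set(
      r.channel,
      cur
        ? {
            channel: r.channel,
            width: Math.max(cur.width, r.width),
            height: Math.max(cur.height, r.height),
            fps: Math.max(cur.fps, r.fps),
          }
        : { ...r },
    );
  }
  return merged;
}
```

Because the merge runs next to the encoder, it does not matter whether a request arrived over a localhost WebSocket or through the relay. The encoder sees one target per channel either way.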

WAN is deliberately under-built. Two viewers, one room, a fan-out server that could handle thousands. That is not a mistake — it is correct sizing for a capability that is, for now, a demonstration. It becomes load-bearing the day we run cloud instances of the engine itself, and the architecture is ready for that day without being over-engineered for it today.

Why local-first is the whole game

"The server is a capability, not a dependency — pull the network and the product is exactly as good as it was a second ago."

The reason to invert the usual model and serve the UI from a local executable is not performance for its own sake. It is that the UI is the product — there is no other one — and a product whose only interface goes dark when the Wi-Fi drops, or lags when a datacenter in another country is busy, is a fragile product. Local-first makes the common case the fast case and the offline case the normal case, and lets the internet be a bonus that extends reach rather than a single point of failure the whole thing hangs from.

The discipline that buys all of this is one seam: a transport abstraction designed in from the first commit, so the same UI components do not know or care whether their pixels arrive from a localhost socket nine milliseconds ago or a LiveKit room a hundred and fifty milliseconds ago. Get that seam right early and the three reaches are three implementations behind one interface. Get it wrong and we are writing the UI twice. Everything else in this document is downstream of that one decision.
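A minimal sketch of what that seam could look like follows. The interface names `VideoSource` and `ParamSource` come from the next-step plan; every method signature, the `FrameInfo` shape, and the in-memory implementation are assumptions made to show the idea, not a proposed final API.

```typescript
type Unsubscribe = () => void;

// Assumed frame descriptor; the real seam would carry pixels or a handle.
interface FrameInfo {
  channel: string;
  width: number;
  height: number;
}

interface VideoSource {
  subscribe(channel: string, onFrame: (f: FrameInfo) => void): Unsubscribe;
}

interface ParamSource {
  set(name: string, value: number): void;
  onChange(listener: (name: string, value: number) => void): Unsubscribe;
}

// Stand-in transport used for tests and demos. A localhost WebSocket source
// and a LiveKit source would implement the same two interfaces.
class InMemorySource implements VideoSource, ParamSource {
  private frameListeners = new Map<string, Set<(f: FrameInfo) => void>>();
  private paramListeners = new Set<(name: string, value: number) => void>();

  subscribe(channel: string, onFrame: (f: FrameInfo) => void): Unsubscribe {
    const listeners = this.frameListeners.get(channel) ?? new Set<(f: FrameInfo) => void>();
    listeners.add(onFrame);
    this.frameListeners.set(channel, listeners);
    return () => { listeners.delete(onFrame); };
  }

  set(name: string, value: number): void {
    this.paramListeners.forEach((l) => l(name, value));
  }

  onChange(listener: (name: string, value: number) => void): Unsubscribe {
    this.paramListeners.add(listener);
    return () => { this.paramListeners.delete(listener); };
  }

  // Test hook: simulate a frame arriving from whatever the transport is.
  pushFrame(f: FrameInfo): void {
    this.frameListeners.get(f.channel)?.forEach((l) => l(f));
  }
}

// A UI component depends only on the interface, never on the transport.
function mountTile(video: VideoSource, channel: string, log: string[]): Unsubscribe {
  return video.subscribe(channel, (f) => log.push(`${f.channel} ${f.width}x${f.height}`));
}
```

`mountTile` is the part that matters: it compiles against `VideoSource` alone, so swapping the in-memory source for a WebSocket or LiveKit implementation changes no UI code.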

Glossary

Terms and acronyms used in this document, in plain language.

Spout: A Windows mechanism for sharing a GPU texture between processes on the same machine with no copy. How the engine hands frames to the local exe.
WebCodecs: A browser API that exposes the hardware video decoder directly to JavaScript, bypassing the usual <video> buffering — the key to low-latency decode in a page.
secure context: A browser security state (HTTPS, or the special-cased localhost) that several APIs, WebCodecs included, refuse to run outside of.
NVENC: NVIDIA's hardware H.264/H.265 encoder. The local exe uses it to turn textures into a video stream without taxing the engine.
Main 4:2:0: An H.264 profile + chroma format that is broadly decodable (browsers and WebRTC alike) and carries a quarter of the chroma samples of 4:4:4, which halves the raw data per frame. The format we settled the encoder on.
WHIP: WebRTC-HTTP Ingestion Protocol. A simple HTTP handshake for pushing a WebRTC stream into a media server — how the exe publishes video to the server.
LiveKit: The open-source WebRTC media server (an SFU) deployed on our server; it ingests the exe's WHIP stream and fans it out to remote viewers.
SFU: Selective Forwarding Unit. A media server that receives one copy of a stream and forwards it to many subscribers, so the publisher uploads once.
Centrifugo: The real-time message relay on our server; carries parameters (and the demand signal) between the exe and a remote UI.
Authentik: Our single-sign-on provider. Gates the server-hosted UI; issues the session the token-mint step checks.
devpush: The self-hosted platform-as-a-service on our server that builds and hosts web apps from a git push — where the WAN UI is deployed.
JWT: JSON Web Token. A signed credential carrying claims and an expiry; here short-lived and scoped to "subscribe to room X" for a browser or "publish to room X" for the exe.
glass-to-glass: End-to-end latency measured from the frame rendered on the engine to that frame visible in the viewer — the number that actually matters.

Settled

The same-machine pipe

localhost WS + WebCodecs, ~9.5 ms, offline, no cert. Proven in plan 037 and the lowest-risk reach.

Next step

The real UI + the transport seam

Build the product UI on the local pipe, behind a VideoSource/ParamSource abstraction, and have the exe serve it offline.

Later

LAN app & WAN

A native LAN app to retire the cert question; WHIP→LiveKit + SSO for the ≤2-user remote case, which becomes real once cloud SpaceMusic exists.