Honest engineering means documenting where the implementation
knows it falls short. This post catalogues the architectural
limits I’ve verified by reading the code against the
architecture doc.
The items fall into four classes by the kind of limit, and each is
tagged by whether the fix is on the roadmap, intentional at this
scope, an operational constraint, or an infrastructure gap that
needs bigger work.
The motivating frame: publishing this is the differentiator.
RAG-based natural-language-to-code tools produce output and hope
you don’t look closely. The value here is that you can look
closely and the system has receipts — including for the parts
that don’t work yet.
01 · Semantic correctness
Cases where the system can produce wrong output even when
nothing complains. Load-bearing weaknesses, mitigated only by
the visual surface giving you a chance to review extracted
state before the pipeline proceeds.
The spec can be wrong without anyone noticing.
intentional
Nothing in the pipeline checks “did the LLM extract every
step the user asked for, in the right action types?”
If the LLM silently drops a step or misclassifies one, the
orchestrator and constraint checker see a smaller, well-formed
spec and won’t object. The visual surface (column 2,
extracted spec) exists specifically so you can catch this before
the pipeline runs further. The user’s eye is the only check.
What fixing this looks like: an automated
retry-with-feedback loop at extraction time. It would cost real
tokens, and without an eval set to measure against we can’t tell
whether it improves precision enough to justify the cost.
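For concreteness, here is a minimal sketch of what that retry-with-feedback loop could look like. The clause-counting heuristic and the `extract_fn` callable are toy stand-ins for the LLM extractor, not names from the codebase:

```python
def count_requested_steps(instruction: str) -> int:
    # Toy heuristic: one step per clause; a real check would be smarter.
    return len([c for c in instruction.replace(",", ".").split(".") if c.strip()])

def extract_with_feedback(extract_fn, instruction: str, max_retries: int = 2) -> dict:
    """Re-invoke the (LLM-backed) extractor with feedback when the
    extracted step count comes back lower than the instruction suggests."""
    expected = count_requested_steps(instruction)
    feedback = ""
    spec = {"steps": []}
    for _ in range(max_retries + 1):
        spec = extract_fn(instruction, feedback)
        if len(spec["steps"]) >= expected:
            break
        feedback = (f"You extracted {len(spec['steps'])} steps but the "
                    f"instruction appears to contain {expected}; "
                    f"re-check for dropped or misclassified steps.")
    return spec  # best effort; the visual surface remains the final check
```

The loop terminates either on apparent completeness or after the retry budget is spent, which is where the token cost mentioned above comes from.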
Citation disambiguation is not handled.
intentional
If your instruction contains the substring “100uL”
multiple times in different roles (e.g.
“Add 100uL of sample, then mix at 100uL volume”),
the LLM picks one occurrence as the cite and the verifier
trusts that pick. There’s no formal check that the cited
instance is the right instance.
What fixing this looks like: a positional cite
format (cite by character offset instead of substring) would
eliminate the class entirely. Real schema change.
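A sketch of the positional-cite idea: cite by character offsets into the instruction rather than by substring, so repeated substrings like “100uL” become unambiguous. The class and field names are illustrative, not the real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PositionalCite:
    start: int   # character offset into the raw instruction
    end: int

    def resolve(self, instruction: str) -> str:
        return instruction[self.start:self.end]

instruction = "Add 100uL of sample, then mix at 100uL volume"
volume_cite = PositionalCite(4, 9)    # the first "100uL"
mix_cite = PositionalCite(33, 38)     # the second "100uL": same text, distinct cite
```

Two cites with identical text resolve to different occurrences, which is exactly the class of ambiguity the substring format cannot express.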
Null fields carry no provenance.
intentional
A populated value carries a provenance object. A null value
carries nothing. So step.source = None could mean
“the LLM correctly inferred there was no source mentioned”
or “the LLM forgot to extract a source that was actually
there.” Both look identical to downstream stages.
Mitigated by the orchestrator’s gap detectors flagging
missing required fields, so protocols-with-required-source
aren’t silently broken — but truly-optional null fields
are unauditable.
What fixing this looks like: a separate
OmittedField Provenance subtype that carries the
reasoning for the null. Modest schema addition.
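A hedged sketch of that subtype: give null fields a provenance object carrying the reasoning for the omission, so “correctly absent” and “silently dropped” stop looking identical downstream. All names and fields are illustrative, trimmed down from whatever the real schema holds:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Provenance:
    source: str              # e.g. "explicit" or "inferred"
    cite: Optional[str]      # substring of the instruction
    reasoning: str

@dataclass
class OmittedFieldProvenance:
    reasoning: str           # why the extractor believes no value exists

FieldProvenance = Union[Provenance, OmittedFieldProvenance]

def audit(prov: Optional[FieldProvenance]) -> str:
    if prov is None:
        return "UNAUDITABLE: null with no provenance"   # today's behavior
    if isinstance(prov, OmittedFieldProvenance):
        return f"deliberately omitted: {prov.reasoning}"
    return f"extracted from {prov.cite!r}"
```

The point is the third branch: a null that arrives with an `OmittedFieldProvenance` is auditable, while a bare `None` stays flagged.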
No semantic-equivalence check between spec and generated script.
intentional
The Opentrons simulator catches code-execution problems
(loading failures, well-out-of-range, pipette mismatch). It
does not validate that the script does what the user
asked for. The user’s eye is the final arbiter on intent
fidelity.
02 · Known false-positives
Verifier complaints that fire on legitimate extractions. Annoying
when you hit them; the root causes are tracked.
One Provenance object for a whole wells list.
known bug
LocationRef.wells is a list, but
LocationRef.wells_provenance is ONE Provenance for
the whole list. The verifier checks each well against any cite
entry via substring match. When wells extracted from
multi-bullet instructions have cites that don’t literally
name each well, false-positive fabrication fires per missing
well. The detector deduplicates these into one Gap, but the
gap modal’s “Current:” label shows the last
offending element instead of the full field state.
What fixing this looks like: migrate
wells_provenance from one Provenance to
List[Provenance] aligned with wells.
Per-well grounding, per-well verification. Cascading schema
change — touches the extractor prompt, the verifier, the
apply layer, and the visual surface. Highest-impact fix on the
roadmap.
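The shape of that migration, sketched with toy schema classes and a toy substring verifier (the names mirror the post; nothing here is the real code):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Provenance:
    cite: str

@dataclass
class LocationRef:
    wells: List[str]
    wells_provenance: List[Provenance]   # was: a single Provenance

def verify_wells(loc: LocationRef) -> List[str]:
    """Return the wells whose own cite fails the (toy) substring check."""
    assert len(loc.wells) == len(loc.wells_provenance), "misaligned provenance"
    return [w for w, p in zip(loc.wells, loc.wells_provenance)
            if w not in p.cite]
```

With per-well alignment, a grounding failure implicates one well and one cite instead of the whole list, which is what lets the gap modal show the full field state correctly.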
Citation values that don’t literally appear in the cite text.
known bug
Substring matching, no semantic understanding. Cite says
“top row”, value is A1
→ false positive. Synonyms, paraphrases, abbreviations all
hit this. Fix requires either per-element cite alignment (see
the wells fix above) or a semantic check; both non-trivial.
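The failure mode is easy to reproduce with a two-line stand-in for the substring check (illustrative only):

```python
def substring_grounded(value: str, cite: str) -> bool:
    # Today's check: the extracted value must literally appear in the cite.
    return value in cite
```

`substring_grounded("A1", "dispense into A1")` passes, while `substring_grounded("A1", "fill the top row")` fails even though the extraction is correct; that second case is the false positive.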
Confidence gets overwritten on fabrication-gap accept.
known bug
When you click Accept on a fabrication gap,
the system restates the provenance as source=“inferred”
with the suggester’s reasoning. The architecture doc says
“value untouched.” But the code also overwrites
confidence with the suggester’s confidence
— the user endorsed the reasoning, not necessarily the
suggester’s confidence calibration. Unauthorized rewrite.
What fixing this looks like: one-line code
change — preserve the existing confidence and only update
source/reasoning/review_status. Trivial.
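A sketch of the fix in place, with the Provenance shape trimmed to the relevant fields (assumed, not copied from the code):

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    source: str
    reasoning: str
    confidence: float

def accept_fabrication_gap(prov: Provenance, suggester_reasoning: str,
                           suggester_confidence: float) -> Provenance:
    prov.source = "inferred"
    prov.reasoning = suggester_reasoning
    # Current buggy behavior: prov.confidence = suggester_confidence
    # Fix: leave prov.confidence untouched; the user endorsed the
    # reasoning, not the suggester's calibration.
    return prov
```

After the fix, accepting a gap with a low-confidence suggester no longer drags down a value whose confidence was already high.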
03 · Capacity limits
Operational limits on the live demo at demo.nl2protocol.com.
Lift each by deploying on heavier infrastructure; intentional for
a portfolio-scale free demo.
One pipeline at a time, demo-wide.
capacity
The live demo runs on a single small Fly machine; pipeline
state is per-process. If someone’s mid-run when you hit
Start, you get a “demo busy, try in a few
minutes” page. Lifting this requires session-keyed
thread bridges and a worker queue.
5 pipeline runs per IP per hour.
capacity
Per-IP rate limit on POST /start. Defense in
depth on top of bring-your-own-key — even though
visitors pay their own Anthropic costs, we don’t want a bot
to exhaust the demo machine’s CPU. Plenty of headroom for
a real human clicking around.
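A toy fixed-window limiter shows the shape of such a check; the live demo’s actual implementation may differ:

```python
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, limit: int = 5, window_s: int = 3600):
        self.limit, self.window_s = limit, window_s
        self.hits = defaultdict(list)   # ip -> request timestamps

    def allow(self, ip: str, now=None) -> bool:
        now = time.time() if now is None else now
        # Drop timestamps that have aged out of the window.
        recent = [t for t in self.hits[ip] if now - t < self.window_s]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False
        self.hits[ip].append(now)
        return True
```

Five requests inside an hour pass, the sixth is refused, and the counter resets once the window rolls over.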
Gap-resolution loop caps at 3 iterations.
capacity
The orchestrator’s detect → suggest → review
→ apply loop runs at most 3 passes. If gaps still remain,
the pipeline halts with an error. In practice convergence
happens in 1 pass for most protocols; the cap is a safety net
against pathological loops.
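A minimal sketch of the capped loop, with `detect` and `resolve_one` as placeholders for the orchestrator’s real stages:

```python
MAX_PASSES = 3

def resolve_gaps(detect, resolve_one, state):
    """Run the detect -> suggest -> review -> apply loop, at most MAX_PASSES times."""
    for _ in range(MAX_PASSES):
        gaps = detect(state)
        if not gaps:
            return state                       # converged
        for gap in gaps:
            state = resolve_one(state, gap)    # suggest -> review -> apply
    if detect(state):
        raise RuntimeError("gap resolution did not converge in 3 passes")
    return state
```

A resolver that never clears its gaps hits the cap and raises instead of looping forever, which is the safety-net behavior described above.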
04 · Infrastructure gaps
Architectural shortcuts taken to ship a working portfolio demo.
Each is real and will need to be addressed before any paid /
multi-user deployment.
Single-user assumption baked into the live-mode model.
infra gap
The thread-bridge between the orchestrator and the browser
confirmation handler assumes one pipeline per process. Two
concurrent users would race for the bridge. The single-pipeline
lock prevents this from breaking, but a multi-user shape
requires session-keyed bridges — a real refactor.
No persistence.
infra gap
Pipeline state lives in memory during a run; static HTML
reports are written to the local disk on the Fly machine.
Cloud-scale deployment needs a database for run history and
object storage for artifacts.
No authentication.
infra gap
Bring-your-own-key is the only access-control mechanism today.
Anyone with an Anthropic key can run the demo. Acceptable at
portfolio scale; not at paid-tier scale.
Stage 4 confirmations are only logged to state_log on abort.
known bug
The three pre-orchestrator confirmation modals (initial
contents, source containers, labware assignments) write a
state_log entry only when the user aborts. On success, the
downstream spec reflects their choices but no audit-trail
entry records what they kept versus edited. Audit gap.
What fixing this looks like: three lines per
modal — after each successful confirmation, write
state_log["stage_2_5_<modal>_confirmed"] = <choices>.
Trivial.
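In sketch form, with the modal plumbing and the abort-branch key name assumed rather than taken from the code:

```python
state_log = {}

def confirm_modal(name: str, choices: dict, aborted: bool) -> bool:
    if aborted:
        state_log[f"stage_2_5_{name}_aborted"] = True   # already logged today
        return False
    # The fix: log successes too, so the audit trail records
    # what the user kept versus edited.
    state_log[f"stage_2_5_{name}_confirmed"] = choices
    return True
```

One such line per modal closes the audit gap without touching the downstream spec handling.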
What this becomes
Each item above is also a roadmap entry. The
known bug-tagged ones are sized by how invasive
the fix is: trivial (confidence overwrite, state_log additions),
medium (per-element cite alignment for non-wells fields),
cascading (the wells_provenance schema migration). The
intentional ones are dial-tuning calls that get
revisited as the project shape changes. The
infra gap ones unlock with deploy shape decisions
(multi-user, paid).
The function-level trace of stages 5-7 (where most of these
live) is in the
GAP_LIFECYCLE.md
doc with three Mermaid sequence diagrams — the work this
post came out of.