Honest engineering means documenting where the implementation
knows it falls short. This post catalogues the architectural
limits I’ve verified by reading the code against the
architecture doc.
The items fall into four classes by the kind of limit, and each is
tagged by whether the fix is on the roadmap, intentional at this
scope, an operational constraint, or an infrastructure gap that
needs bigger work.
The motivating frame: publishing this is the differentiator.
RAG-based natural-language-to-code tools produce output and hope
you don’t look closely. The value here is that you can look
closely and the system has receipts — including for the parts
that don’t work yet.
01 · Semantic correctness
Cases where the system can produce wrong output even when
nothing complains. Load-bearing weaknesses, mitigated only by
the visual surface giving you a chance to review extracted
state before the pipeline proceeds.
The spec can be wrong without anyone noticing.
intentional
Nothing in the pipeline checks “did the LLM extract every
step the user asked for, in the right action types?”
If the LLM silently drops a step or misclassifies one, the
orchestrator and constraint checker see a smaller, well-formed
spec and won’t object. The visual surface (column 2,
extracted spec) exists specifically so you can catch this before
the pipeline runs further. The user’s eye is the only check.
What fixing this looks like: an automated
retry-with-feedback loop at extraction time. It would cost real
tokens, and without an eval set to measure against we can’t tell
whether it improves precision enough to justify the cost.
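For concreteness, here is a minimal sketch of what that retry-with-feedback loop could look like. The clause-counting heuristic and the `extract_fn` callable are toy stand-ins for the LLM extractor, not names from the codebase:

```python
def count_requested_steps(instruction: str) -> int:
    # Toy heuristic: one step per clause; a real check would be smarter.
    return len([c for c in instruction.replace(",", ".").split(".") if c.strip()])

def extract_with_feedback(extract_fn, instruction: str, max_retries: int = 2) -> dict:
    """Re-invoke the (LLM-backed) extractor with feedback when the
    extracted step count comes back lower than the instruction suggests."""
    expected = count_requested_steps(instruction)
    feedback = ""
    spec = {"steps": []}
    for _ in range(max_retries + 1):
        spec = extract_fn(instruction, feedback)
        if len(spec["steps"]) >= expected:
            break
        feedback = (f"You extracted {len(spec['steps'])} steps but the "
                    f"instruction appears to contain {expected}; "
                    f"re-check for dropped or misclassified steps.")
    return spec  # best effort; the visual surface remains the final check
```

The loop terminates either on apparent completeness or after the retry budget is spent, which is where the token cost mentioned above comes from.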
Citation disambiguation is not handled.
intentional
If your instruction contains the substring “100uL”
multiple times in different roles (e.g.
“Add 100uL of sample, then mix at 100uL volume”),
the LLM picks one occurrence as the cite and the verifier
trusts that pick. There’s no formal check that the cited
instance is the right instance.
What fixing this looks like: a positional cite
format (cite by character offset instead of substring) would
eliminate the class entirely. Real schema change.
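A sketch of the positional-cite idea: cite by character offsets into the instruction rather than by substring, so repeated substrings like “100uL” become unambiguous. The class and field names are illustrative, not the real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PositionalCite:
    start: int   # character offset into the raw instruction
    end: int

    def resolve(self, instruction: str) -> str:
        return instruction[self.start:self.end]

instruction = "Add 100uL of sample, then mix at 100uL volume"
volume_cite = PositionalCite(4, 9)    # the first "100uL"
mix_cite = PositionalCite(33, 38)     # the second "100uL": same text, distinct cite
```

Two cites with identical text resolve to different occurrences, which is exactly the class of ambiguity the substring format cannot express.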
Null fields carry no provenance.
intentional
A populated value carries a provenance object. A null value
carries nothing. So step.source = None could mean
“the LLM correctly inferred there was no source mentioned”
or “the LLM forgot to extract a source that was actually
there.” Both look identical to downstream stages.
Mitigated by the orchestrator’s gap detectors flagging
missing required fields, so protocols-with-required-source
aren’t silently broken — but truly-optional null fields
are unauditable.
What fixing this looks like: a separate
OmittedField Provenance subtype that carries the
reasoning for the null. Modest schema addition.
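A hedged sketch of that subtype: give null fields a provenance object carrying the reasoning for the omission, so “correctly absent” and “silently dropped” stop looking identical downstream. All names and fields are illustrative, trimmed down from whatever the real schema holds:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Provenance:
    source: str              # e.g. "explicit" or "inferred"
    cite: Optional[str]      # substring of the instruction
    reasoning: str

@dataclass
class OmittedFieldProvenance:
    reasoning: str           # why the extractor believes no value exists

FieldProvenance = Union[Provenance, OmittedFieldProvenance]

def audit(prov: Optional[FieldProvenance]) -> str:
    if prov is None:
        return "UNAUDITABLE: null with no provenance"   # today's behavior
    if isinstance(prov, OmittedFieldProvenance):
        return f"deliberately omitted: {prov.reasoning}"
    return f"extracted from {prov.cite!r}"
```

The point is the third branch: a null that arrives with an `OmittedFieldProvenance` is auditable, while a bare `None` stays flagged.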
No semantic-equivalence check between spec and generated script.
intentional
The Opentrons simulator catches code-execution problems
(loading failures, well-out-of-range, pipette mismatch). It
does not validate that the script does what the user
asked for. The user’s eye is the final arbiter on intent
fidelity.
02 · Known false-positives
Verifier complaints that fire on legitimate extractions. Annoying
when you hit them; the root causes are tracked.
One Provenance object for a whole wells list.
known bug
LocationRef.wells is a list, but
LocationRef.wells_provenance is ONE Provenance for
the whole list. The verifier checks each well against any cite
entry via substring match. When wells extracted from
multi-bullet instructions have cites that don’t literally
name each well, false-positive fabrication fires per missing
well. The detector deduplicates these into one Gap, but the
gap modal’s “Current:” label shows the last
offending element instead of the full field state.
What fixing this looks like: migrate
wells_provenance from one Provenance to
List[Provenance] aligned with wells.
Per-well grounding, per-well verification. Cascading schema
change — touches the extractor prompt, the verifier, the
apply layer, and the visual surface. Highest-impact fix on the
roadmap.
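The shape of that migration, sketched with toy schema classes and a toy substring verifier (the names mirror the post; nothing here is the real code):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Provenance:
    cite: str

@dataclass
class LocationRef:
    wells: List[str]
    wells_provenance: List[Provenance]   # was: a single Provenance

def verify_wells(loc: LocationRef) -> List[str]:
    """Return the wells whose own cite fails the (toy) substring check."""
    assert len(loc.wells) == len(loc.wells_provenance), "misaligned provenance"
    return [w for w, p in zip(loc.wells, loc.wells_provenance)
            if w not in p.cite]
```

With per-well alignment, a grounding failure implicates one well and one cite instead of the whole list, which is what lets the gap modal show the full field state correctly.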
Citation values that don’t literally appear in the cite text.
known bug
Substring matching, no semantic understanding. Cite says
“top row”, value is A1
→ false positive. Synonyms, paraphrases, abbreviations all
hit this. Fix requires either per-element cite alignment (see
the wells fix above) or a semantic check; both non-trivial.
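The failure mode is easy to reproduce with a two-line stand-in for the substring check (illustrative only):

```python
def substring_grounded(value: str, cite: str) -> bool:
    # Today's check: the extracted value must literally appear in the cite.
    return value in cite
```

`substring_grounded("A1", "dispense into A1")` passes, while `substring_grounded("A1", "fill the top row")` fails even though the extraction is correct; that second case is the false positive.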
Confidence gets overwritten on fabrication-gap accept.
known bug
When you click Accept on a fabrication gap,
the system restates the provenance as source=“inferred”
with the suggester’s reasoning. The architecture doc says
“value untouched.” But the code also overwrites
confidence with the suggester’s confidence
— the user endorsed the reasoning, not necessarily the
suggester’s confidence calibration. Unauthorized rewrite.
What fixing this looks like: one-line code
change — preserve the existing confidence and only update
source/reasoning/review_status. Trivial.
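A sketch of the fix in place, with the Provenance shape trimmed to the relevant fields (assumed, not copied from the code):

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    source: str
    reasoning: str
    confidence: float

def accept_fabrication_gap(prov: Provenance, suggester_reasoning: str,
                           suggester_confidence: float) -> Provenance:
    prov.source = "inferred"
    prov.reasoning = suggester_reasoning
    # Current buggy behavior: prov.confidence = suggester_confidence
    # Fix: leave prov.confidence untouched; the user endorsed the
    # reasoning, not the suggester's calibration.
    return prov
```

After the fix, accepting a gap with a low-confidence suggester no longer drags down a value whose confidence was already high.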
03 · Capacity limits
Operational limits on the live demo at demo.nl2protocol.com.
Lift each by deploying on heavier infrastructure; intentional for
a portfolio-scale free demo.
One pipeline at a time, demo-wide.
capacity
The live demo runs on a single small Fly machine; pipeline
state is per-process. If someone’s mid-run when you hit
Start, you get a “demo busy, try in a few
minutes” page. Lifting this requires session-keyed
thread bridges and a worker queue.
5 pipeline runs per IP per hour.
capacity
Per-IP rate limit on POST /start. Defense in
depth on top of bring-your-own-key — even though
visitors pay their own Anthropic costs, we don’t want a bot
to exhaust the demo machine’s CPU. Plenty of headroom for
a real human clicking around.
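A toy fixed-window limiter shows the shape of such a check; the live demo’s actual implementation may differ:

```python
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, limit: int = 5, window_s: int = 3600):
        self.limit, self.window_s = limit, window_s
        self.hits = defaultdict(list)   # ip -> request timestamps

    def allow(self, ip: str, now=None) -> bool:
        now = time.time() if now is None else now
        # Drop timestamps that have aged out of the window.
        recent = [t for t in self.hits[ip] if now - t < self.window_s]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False
        self.hits[ip].append(now)
        return True
```

Five requests inside an hour pass, the sixth is refused, and the counter resets once the window rolls over.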
Gap-resolution loop caps at 3 iterations.
capacity
The orchestrator’s detect → suggest → review
→ apply loop runs at most 3 passes. If gaps still remain,
the pipeline halts with an error. In practice convergence
happens in 1 pass for most protocols; the cap is a safety net
against pathological loops.
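A minimal sketch of the capped loop, with `detect` and `resolve_one` as placeholders for the orchestrator’s real stages:

```python
MAX_PASSES = 3

def resolve_gaps(detect, resolve_one, state):
    """Run the detect -> suggest -> review -> apply loop, at most MAX_PASSES times."""
    for _ in range(MAX_PASSES):
        gaps = detect(state)
        if not gaps:
            return state                       # converged
        for gap in gaps:
            state = resolve_one(state, gap)    # suggest -> review -> apply
    if detect(state):
        raise RuntimeError("gap resolution did not converge in 3 passes")
    return state
```

A resolver that never clears its gaps hits the cap and raises instead of looping forever, which is the safety-net behavior described above.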
04 · Infrastructure gaps
Architectural shortcuts taken to ship a working portfolio demo.
Each is real and will need to be addressed before any paid /
multi-user deployment.
Single-user assumption baked into the live-mode model.
infra gap
The thread-bridge between the orchestrator and the browser
confirmation handler assumes one pipeline per process. Two
concurrent users would race for the bridge. The single-pipeline
lock prevents this from breaking, but a multi-user shape
requires session-keyed bridges — a real refactor.
No persistence.
infra gap
Pipeline state lives in memory during a run; static HTML
reports are written to the local disk on the Fly machine.
Cloud-scale deployment needs a database for run history and
object storage for artifacts.
No authentication.
infra gap
Bring-your-own-key is the only access-control mechanism today.
Anyone with an Anthropic key can run the demo. Acceptable at
portfolio scale; not at paid-tier scale.
Stage 4 confirmations are only logged to state_log on abort.
known bug
The three pre-orchestrator confirmation modals (initial
contents, source containers, labware assignments) write a
state_log entry only when the user aborts. On success, the
downstream spec reflects their choices but no audit-trail
entry records what they kept versus edited. Audit gap.
What fixing this looks like: three lines per
modal — after each successful confirmation, write
state_log["stage_2_5_<modal>_confirmed"] = <choices>.
Trivial.
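In sketch form, with the modal plumbing and the abort-branch key name assumed rather than taken from the code:

```python
state_log = {}

def confirm_modal(name: str, choices: dict, aborted: bool) -> bool:
    if aborted:
        state_log[f"stage_2_5_{name}_aborted"] = True   # already logged today
        return False
    # The fix: log successes too, so the audit trail records
    # what the user kept versus edited.
    state_log[f"stage_2_5_{name}_confirmed"] = choices
    return True
```

One such line per modal closes the audit gap without touching the downstream spec handling.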
What this becomes
Each item above is also a roadmap entry. The
known bug-tagged ones are sized by how invasive
the fix is: trivial (confidence overwrite, state_log additions),
medium (per-element cite alignment for non-wells fields),
cascading (the wells_provenance schema migration). The
intentional ones are dial-tuning calls that get
revisited as the project shape changes. The
infra gap ones unlock with deploy shape decisions
(multi-user, paid).
The function-level trace of stages 5-7 (where most of these
live) is in the
GAP_LIFECYCLE.md
doc with three Mermaid sequence diagrams — the work this
post came out of.