What you can run and how to write instructions that work.

The scope of what nl2protocol can produce, what your config needs to declare, and the patterns in your instruction that produce clean extractions. Architectural limits live in the engineering log; this page is what you need to know before you run it.

hard limit
physical or API constraint you can’t bypass
required
something your instruction or config must provide
recommendation
not enforced, but gives noticeably better extractions
watch for
known extraction failure mode — check column 2 before proceeding

What kind of physical setup the generated scripts target.

Opentrons OT-2 only (no Flex yet).

hard limit

Generated scripts target the OT-2 Python API. The Opentrons Flex has a different protocol API surface; nl2protocol doesn’t emit Flex-compatible code in this version.

Maximum two pipettes mounted.

hard limit

As on the physical OT-2, your config can declare at most one left-mounted and one right-mounted pipette. Each must be paired with a tip rack that is also declared in the config.

Pipette volume ranges are checked at constraint time.

hard limit

Each Opentrons pipette has a fixed range (e.g. P20: 1–20uL, P300: 20–300uL, P1000: 100–1000uL). If your instruction asks for a volume that falls outside the range of every mounted pipette, the constraint checker surfaces a violation and asks you what to do instead of silently generating broken code.
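A minimal sketch of the kind of range check described here, using only the ranges quoted above. The dictionary, the function name, and the short pipette keys are hypothetical and are not nl2protocol's actual checker.

```python
# Ranges as quoted in the text above; names and structure are assumptions.
PIPETTE_RANGES_UL = {
    "P20": (1.0, 20.0),
    "P300": (20.0, 300.0),
    "P1000": (100.0, 1000.0),
}


def pipettes_for_volume(volume_ul, mounted):
    """Return the mounted pipettes whose range covers volume_ul."""
    return [
        name
        for name in mounted
        if PIPETTE_RANGES_UL[name][0] <= volume_ul <= PIPETTE_RANGES_UL[name][1]
    ]


mounted = ["P20", "P300"]
print(pipettes_for_volume(50, mounted))   # ['P300']
print(pipettes_for_volume(500, mounted))  # [] -> no pipette fits, a violation
```

With a P20 and a P300 mounted, a 500uL request matches nothing, which is the situation where you would expect to be asked how to proceed.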

Modules supported only if declared in your config.

required

Heater-shaker, thermocycler, magnetic, temperature modules: the extractor will reference them only when your config lists them. Mentioning “heat to 95C” in your instruction without a thermocycler or temperature module in the config produces a constraint violation rather than a fabricated module reference.

The system can only reference labware your config declares. Everything else surfaces as an ambiguity for you to resolve.

Only standard Opentrons load_names are supported.

hard limit

Your config’s labware entries must use Opentrons’ official labware names (e.g. corning_96_wellplate_360ul_flat, opentrons_24_tuberack_eppendorf_2ml_safelock_snapcap). Custom or third-party labware isn’t supported in v1.

Every labware referenced in your instruction must map to your config.

required

The labware resolver tries to map your instruction’s wording (“tube rack”, “the reservoir”) to a config-declared label. Anything it can’t map confidently becomes a per-piece confirmation you resolve in the UI. Anything truly absent from the config halts the pipeline.

Wells are addressed using standard plate notation.

recommendation

A1, B2, H12: row letters A–H (or A–P for 384-well) and column numbers 1–12 (or 1–24). The extractor can sometimes infer wells from descriptions like “tube 1”, but explicit notation is more reliable.
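If you want to sanity-check well names before pasting them into an instruction, something like the following would do. It is a sketch under the plate sizes mentioned above; `PLATE_SHAPES` and `is_valid_well` are hypothetical names, not part of nl2protocol.

```python
import re

# Last letter and last number for each plate size (96-well and 384-well).
PLATE_SHAPES = {96: ("H", 12), 384: ("P", 24)}

WELL_RE = re.compile(r"([A-P])([1-9][0-9]?)")


def is_valid_well(well, plate_size=96):
    """True if `well` (e.g. 'B7') exists on a plate of the given size."""
    match = WELL_RE.fullmatch(well.strip().upper())
    if match is None:
        return False
    last_letter, last_number = PLATE_SHAPES[plate_size]
    letter, number = match.group(1), int(match.group(2))
    return letter <= last_letter and number <= last_number


print(is_valid_well("H12"))       # True
print(is_valid_well("I1"))        # False on a 96-well plate
print(is_valid_well("P24", 384))  # True
print(is_valid_well("tube 1"))    # False
```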

What your natural-language instruction needs to look like.

Must pass the input validator.

required

Stage 1 is a Haiku classifier that decides whether your text is a plausible protocol instruction. Questions (“what’s a transfer?”), pure vagueness (“do an experiment”), and non-liquid-handling operations (“centrifuge this”) get rejected cheaply before any expensive call runs.

English only.

hard limit

Prompts and few-shot examples are written in English. Other languages may work in theory but aren’t tested and may produce worse extractions.

Quantitative volumes work better than hedged ones.

recommendation

“100uL” is grounded; the extractor cites it verbatim. “about 100uL” is still cited but flagged exact: false. “a little” or “some” forces the extractor to either infer a default (which you’ll see flagged in the UI) or leave the volume blank (which becomes a gap to resolve).

Multi-step protocols benefit from explicit numbering.

recommendation

Bullet points, numbered lists, or clear paragraph boundaries per step. Run-on paragraphs with multiple actions in one sentence make it more likely the extractor drops or merges steps.

Match labware wording to your config when you can.

recommendation

If your config labels a labware "sample_rack", calling it “the sample rack” in the instruction (rather than “the tube holder”) gives the labware resolver a much higher-confidence match. Not required, but reduces the number of per-piece confirmations you have to make.

LLM-driven extraction has predictable failure modes. The visual surface exists so you can catch these in column 2 (extracted spec) before the pipeline proceeds.

Dropped steps in long protocols.

watch for

Sonnet sometimes silently omits a step in protocols longer than ~15 steps. Nothing in the downstream pipeline catches this — the orchestrator and constraint checker only see what was extracted. Always count steps in column 2 against your original instruction.
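The manual count can be approximated with a few lines of Python when your instruction uses numbered or bulleted steps. This is an illustrative sketch only: `count_numbered_steps` is a hypothetical helper, and `extracted_steps` stands in for whatever you read off column 2.

```python
import re


def count_numbered_steps(instruction):
    """Count lines that start with a list marker such as '1.', '2)' or '-'."""
    marker = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+\S")
    return sum(1 for line in instruction.splitlines() if marker.match(line))


instruction = """\
1. Transfer 100uL from A1 to B1.
2. Mix B1 three times.
3. Transfer 50uL from B1 to C1.
"""
extracted_steps = ["transfer", "transfer"]  # stand-in for the column 2 spec

expected = count_numbered_steps(instruction)
if expected != len(extracted_steps):
    print(f"instruction has {expected} steps, spec has {len(extracted_steps)}")
```

A mismatch only tells you that something was dropped or merged; you still need to read column 2 to find which step.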

Mix vs transfer misclassification.

watch for

Ambiguous wording like “pipette up and down 3 times in well A1” may extract as a transfer (source=A1, dest=A1) rather than a mix. The generated code still runs but the action semantics differ. Look for steps where source and destination are the same well.
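The same-well check is easy to express in code. The sketch below assumes each extracted step is a dict with `action`, `source`, and `dest` keys; that shape is a guess made for illustration, not the real spec format.

```python
def suspicious_transfers(steps):
    """Return indices of transfer steps whose source and destination match."""
    return [
        i
        for i, step in enumerate(steps)
        if step.get("action") == "transfer"
        and step.get("source") == step.get("dest")
    ]


steps = [
    {"action": "transfer", "source": "A1", "dest": "B1"},
    {"action": "transfer", "source": "A1", "dest": "A1"},  # likely meant as a mix
    {"action": "mix", "source": "C1", "dest": "C1"},
]
print(suspicious_transfers(steps))  # [1]
```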

Range expressions can mis-expand.

watch for

“wells B1 through B4” usually expands cleanly to [B1, B2, B3, B4]. “wells B1-D4” can be ambiguous — row-by-row vs column-by-column expansion order is not always inferred correctly. Use explicit lists for non-contiguous or large ranges.
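To see why a range like B1-D4 is ambiguous, compare the two expansion orders directly. `expand_range` below is a hypothetical helper written for this page, not the extractor's own logic.

```python
def expand_range(start, end, order="row"):
    """Expand a rectangular well range such as B1-D4 in row or column order."""
    letters = [chr(code) for code in range(ord(start[0]), ord(end[0]) + 1)]
    numbers = range(int(start[1:]), int(end[1:]) + 1)
    if order == "row":
        return [f"{letter}{number}" for letter in letters for number in numbers]
    if order == "column":
        return [f"{letter}{number}" for number in numbers for letter in letters]
    raise ValueError(f"unknown order: {order!r}")


print(expand_range("B1", "B4"))                # ['B1', 'B2', 'B3', 'B4']
print(expand_range("B1", "D4", "row")[:5])     # ['B1', 'B2', 'B3', 'B4', 'C1']
print(expand_range("B1", "D4", "column")[:5])  # ['B1', 'C1', 'D1', 'B2', 'C2']
```

Both orders cover the same twelve wells, but the sequence differs from the second well onward, which matters whenever each well is paired with a different source or volume. Printing the list you intend and pasting it into the instruction removes the guesswork.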

Invented wells when the instruction is vague.

watch for

“Distribute the buffer” without naming targets can make the extractor invent a well set (often A1–A12 on a 96-well). Always be explicit about destinations when distributing.

Spread citations can trigger spurious fabrication warnings.

watch for

When the wells in a transfer step were cited from multiple bullet points and the cited phrasing doesn’t literally name each well, the verifier may flag a fabrication. The extraction itself is fine; the verifier is too strict for this shape of citation. If you see a fabrication gap on a wells field that visibly matches your instruction, accept the suggestion to move on. (Tracked as a known bug; see the engineering log for the fix plan.)

Categories of protocol the system isn’t designed for. If your protocol falls into one of them, the system will either reject the instruction at Stage 1 or produce a constraint violation.

Multi-day workflows.

hard limit

Generated scripts are single-session. Anything that requires the user to step away for hours and come back to continue (overnight incubation, multi-day cell culture) needs to be split into separate scripts, one per session.

Manual intervention beyond pauses.

hard limit

The Opentrons API exposes pause with an optional note ("user monitors heat shock timing", "swap tube rack") for manual steps. Anything more elaborate — physical rearrangement of decks, custom hardware operations — is outside the API surface and the generated script.
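For orientation, a manual step in an OT-2 script might look roughly like this. It is a hand-written sketch, not actual nl2protocol output: the labware, pipette, slot numbers, volumes, and API level are assumptions chosen for the example, and only the pause note comes from the text above.

```python
# Hand-written sketch of an OT-2 protocol containing one manual step.
metadata = {"protocolName": "Pause sketch", "apiLevel": "2.13"}


def run(protocol):
    # Labware, pipette, and slots are illustrative choices.
    plate = protocol.load_labware("corning_96_wellplate_360ul_flat", 1)
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", 2)
    pipette = protocol.load_instrument("p300_single_gen2", "left", tip_racks=[tips])

    pipette.transfer(100, plate["A1"], plate["B1"])
    # Execution stops here until the user resumes the run.
    protocol.pause("swap tube rack")
    pipette.transfer(100, plate["A2"], plate["B2"])
```

The note is all the script can express about the manual step; whatever the user does during the pause is invisible to the robot.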

Operations the Opentrons API doesn’t cover.

hard limit

Centrifugation, gel electrophoresis, autoclave cycles, plate readers, manual cell counting: the OT-2 doesn’t do these, and the Stage 1 validator will reject the instruction if it’s fundamentally a non-liquid-handling protocol. Workflows that combine liquid handling with external equipment work as long as the external steps map to pause calls.

Architectural and internal-engineering limits (where the implementation drifts from intent, contract gaps, known internal bugs) live in the engineering log. The full pipeline walkthrough is in the architecture doc on GitHub.