Multimodal Chain

Overview

The multimodal_chain node assembles text, image, audio, and video inputs from upstream nodes into a single, ordered multimodal prompt that a vision- or audio-capable model can reason over in one pass. Use it when a workflow accumulates assets across several modalities (for example, a generated image, a transcription, and a text description) and a downstream Lens or model needs them combined into one coherent context. The node does not call a provider itself; it constructs and emits an ExecutionInput envelope, including the attachments array, that the next AI node consumes. If a modality port receives no data, the node omits that slot rather than failing, so whether partial inputs are acceptable depends on the downstream provider.
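The assembly behavior described above can be sketched roughly as follows. This is an illustration only, not the node's actual implementation: the `Artifact` class, the `assemble_envelope` function, and the `prompt` / `attachments` dictionary keys are assumed names, since the real ExecutionInput schema is not shown here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Port names taken from the Inputs section; the fixed order is an assumption.
MODALITY_ORDER = ("text", "image", "audio", "video")


@dataclass
class Artifact:
    """Hypothetical media artifact: a URL plus its MIME type."""
    url: str
    mime_type: str


def assemble_envelope(
    prompt_text: str,
    inputs: dict[str, Optional[Artifact]],
    modalities: Sequence[str] = MODALITY_ORDER,
) -> dict:
    """Build an ExecutionInput-like dict (assumed shape) from media ports."""
    attachments = []
    for modality in modalities:
        if modality == "text":
            continue  # text travels in the prompt, not as an attachment
        artifact = inputs.get(modality)
        if artifact is None:
            continue  # empty port: omit the slot instead of failing
        attachments.append(
            {
                "modality": modality,
                "url": artifact.url,
                "mime_type": artifact.mime_type,
            }
        )
    return {"prompt": prompt_text, "attachments": attachments}


envelope = assemble_envelope(
    "Describe the scene.",
    {"image": Artifact("https://example.com/a.png", "image/png"), "audio": None},
)
```

In this sketch the empty audio port simply produces no attachment, which mirrors the "omit rather than fail" behavior; whether the downstream model accepts the partial envelope is left to the provider.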

Configuration

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model_id | string | Yes | Model key used by the downstream provider to process the assembled multimodal prompt (e.g. google:gemini-2.0-flash, openai:gpt-4o). Must reference a model that supports all modalities present in the chain. |
| prompt_template | string | Yes | Handlebars-style template rendered as the text part of the outgoing prompt. May reference upstream outputs via {{label}} placeholders. Combined with attached media before dispatch. |
| modalities | enum | No | Ordered list of modalities to include in the assembled context. Accepted values: text, image, audio, video. Defaults to including all connected input ports. Omitting a modality suppresses it even when the upstream port has data. |
| max_attachments | number | No | Maximum number of media attachments forwarded to the provider. Defaults to 4. Capped at 8 by the execution engine regardless of this value. |
| onParentFailure | enum | No | Behavior when an upstream node fails. Accepted values: propagate (default), skip, substitute_default. |
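As a rough illustration of how the max_attachments and modalities defaults interact, the sketch below resolves both settings. The function names are hypothetical, and two details are assumptions rather than documented behavior: the default ordering of connected ports, and raising an error on an unrecognized modality.

```python
from typing import Iterable, Optional

DEFAULT_MAX_ATTACHMENTS = 4   # documented default
ENGINE_ATTACHMENT_CAP = 8     # documented engine-side cap
ACCEPTED_MODALITIES = ("text", "image", "audio", "video")


def effective_max_attachments(configured: Optional[int] = None) -> int:
    """Apply the default, then clamp to the engine cap."""
    if configured is None:
        return DEFAULT_MAX_ATTACHMENTS
    return min(configured, ENGINE_ATTACHMENT_CAP)


def resolve_modalities(
    configured: Optional[Iterable[str]],
    connected_ports: Iterable[str],
) -> list[str]:
    """Return the modalities to include, in order."""
    connected = set(connected_ports)
    if configured is None:
        # Default: every connected port (ordering here is an assumption).
        return [m for m in ACCEPTED_MODALITIES if m in connected]
    configured = list(configured)
    unknown = [m for m in configured if m not in ACCEPTED_MODALITIES]
    if unknown:
        # Assumed behavior: reject values outside the accepted set.
        raise ValueError(f"unsupported modalities: {unknown}")
    # An explicit list wins: unlisted modalities are suppressed even if connected.
    return configured
```

For instance, `effective_max_attachments(20)` yields 8 under the documented cap, and `resolve_modalities(["image", "text"], {"image", "text", "video"})` drops video even though its port is connected.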

Inputs

| Port | Type | Description |
| --- | --- | --- |
| text | text | Text content included as the primary text part of the assembled prompt. |
| image | image | Image artifact (URL + MIME type) attached as a vision content part. |
| audio | audio | Audio artifact (URL + MIME type) attached as an audio content part. |
| video | video | Video artifact (URL + MIME type) attached as a video content part. |
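Because each media port expects a URL plus a MIME type, an upstream step may want to sanity-check artifacts before wiring them in. The helper below is a minimal sketch of such a check; `check_artifact` is a hypothetical function, and nothing here implies that the node performs this validation itself.

```python
from urllib.parse import urlparse

# Top-level MIME types expected on each media port (standard media type prefixes).
EXPECTED_MIME_PREFIX = {"image": "image/", "audio": "audio/", "video": "video/"}


def check_artifact(port: str, url: str, mime_type: str) -> list[str]:
    """Return a list of human-readable problems; empty means the artifact looks fine."""
    problems = []
    prefix = EXPECTED_MIME_PREFIX.get(port)
    if prefix is None:
        problems.append(f"unknown media port: {port}")
    elif not mime_type.lower().startswith(prefix):
        problems.append(f"MIME type {mime_type!r} does not match port {port!r}")
    if not urlparse(url).scheme:
        problems.append(f"URL has no scheme: {url!r}")
    return problems
```

An image/png artifact sent to the audio port would be flagged here, as would a bare relative path such as `clip.mp4`.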

Outputs

| Port | Type | Description |
| --- | --- | --- |
| result | text | Text response emitted by the provider after processing the assembled multimodal prompt. |
| error | error | Error envelope emitted when the provider rejects the request or a required modality is unsupported. Connect to an error_catch node to handle gracefully. |

Example

```json
{
  "nodeType": "multimodal_chain",
  "config": {
    "model_id": "google:gemini-2.0-flash",
    "prompt_template": "Describe the scene in the image, then verify the caption matches the audio transcript: {{caption}}",
    "modalities": [
      "text",
      "image",
      "audio"
    ],
    "max_attachments": 2
  }
}
```
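To make the {{caption}} placeholder in the example concrete, the sketch below substitutes simple {{label}} placeholders with upstream values. It covers only plain identifiers, not the full Handlebars syntax (no helpers, blocks, or nested paths), and leaving unknown labels untouched is an assumption about the desired behavior. `render_template` is a hypothetical name.

```python
import re

# Matches {{label}} with optional inner whitespace; labels are plain identifiers.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_template(template: str, values: dict[str, object]) -> str:
    """Replace each {{label}} with its value; unknown labels are left as-is."""

    def substitute(match: re.Match) -> str:
        label = match.group(1)
        if label in values:
            return str(values[label])
        return match.group(0)  # assumed: keep unresolved placeholders visible

    return _PLACEHOLDER.sub(substitute, template)


prompt = render_template(
    "Verify the caption matches the audio transcript: {{caption}}",
    {"caption": "A dog barks at a passing train."},
)
```

Here `prompt` ends with the caption text in place of the placeholder; a label missing from `values` would remain in the rendered string, which makes wiring mistakes easier to spot.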