Image To Audio

Overview

The Image To Audio node accepts an image and produces a spoken audio description: a multimodal model generates a caption or narrative for the image, and that text is then synthesized to speech. Use it in battle workflows where visual content must be accessible via audio, or as a pre-processing step before audio-based judging. A valid multimodal model credential and a TTS voice selection are required. If the model call or the synthesis fails, the node emits on the error port rather than halting the workflow. Output audio is returned as a base64-encoded buffer whose MIME type follows the configured `outputFormat`.
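The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the node's implementation: `run_image_to_audio`, the `describe` and `synthesize` callables, and the error `code` values are hypothetical stand-ins for the model call, the TTS call, and whatever codes the platform actually uses.

```python
import base64
from typing import Callable, Dict, Tuple


def run_image_to_audio(
    image: bytes,
    describe: Callable[[bytes], str],    # hypothetical: multimodal model call
    synthesize: Callable[[str], bytes],  # hypothetical: TTS call
    mime_type: str = "audio/mpeg",
) -> Tuple[str, Dict[str, str]]:
    """Return (port, payload). Failures are routed to the error port, not raised."""
    # Stage 1: generate the caption or narrative.
    try:
        description = describe(image)
    except Exception as exc:
        # "DESCRIPTION_FAILED" is an illustrative code, not a documented one.
        return "error", {
            "message": str(exc),
            "code": "DESCRIPTION_FAILED",
            "stage": "description",
        }

    # Stage 2: synthesize the caption to speech.
    try:
        audio_bytes = synthesize(description)
    except Exception as exc:
        # "SYNTHESIS_FAILED" is an illustrative code, not a documented one.
        return "error", {
            "message": str(exc),
            "code": "SYNTHESIS_FAILED",
            "stage": "synthesis",
        }

    return "audio", {
        "data": base64.b64encode(audio_bytes).decode("ascii"),
        "mimeType": mime_type,
        "description": description,
    }
```

Keeping the two stages in separate `try` blocks is what lets the error payload carry a meaningful `stage` value.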

Configuration

Field	Type	Required	Description
`model`	enum	Yes	Multimodal model used to generate the image description (e.g. `gpt-4o`, `gemini-1.5-pro`, `claude-3-5-sonnet`).
`ttsVoice`	enum	Yes	Text-to-speech voice used to synthesize the generated description (e.g. `alloy`, `nova`, `echo`).
`descriptionPrompt`	string	No	Custom system prompt appended to the model request to steer the style or focus of the image description. Defaults to a neutral accessibility-style caption prompt.
`outputFormat`	enum	No	Audio encoding format for the synthesized output: `mp3`, `opus`, or `wav`. Defaults to `mp3`.
`maxDescriptionTokens`	number	No	Maximum tokens allowed for the generated description before TTS synthesis. Limits cost and audio length. Defaults to 300.
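As an illustration of the required fields and defaults listed in the table, the sketch below validates a raw config dictionary before it is handed to the node. `ImageToAudioConfig` and `parse_config` are hypothetical helper names, and the rule that `maxDescriptionTokens` must be a positive integer is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

OUTPUT_FORMATS = ("mp3", "opus", "wav")


@dataclass(frozen=True)
class ImageToAudioConfig:
    model: str
    ttsVoice: str
    descriptionPrompt: Optional[str] = None  # None means "use the node's default prompt"
    outputFormat: str = "mp3"
    maxDescriptionTokens: int = 300


def parse_config(raw: dict) -> ImageToAudioConfig:
    """Apply the documented defaults and reject obviously invalid values."""
    missing = [key for key in ("model", "ttsVoice") if not raw.get(key)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))

    cfg = ImageToAudioConfig(
        model=raw["model"],
        ttsVoice=raw["ttsVoice"],
        descriptionPrompt=raw.get("descriptionPrompt"),
        outputFormat=raw.get("outputFormat", "mp3"),
        maxDescriptionTokens=raw.get("maxDescriptionTokens", 300),
    )

    if cfg.outputFormat not in OUTPUT_FORMATS:
        raise ValueError("outputFormat must be one of " + ", ".join(OUTPUT_FORMATS))
    # Assumption: a token limit only makes sense as a positive integer.
    if not isinstance(cfg.maxDescriptionTokens, int) or cfg.maxDescriptionTokens <= 0:
        raise ValueError("maxDescriptionTokens must be a positive integer")
    return cfg
```

The enum values for `model` and `ttsVoice` are not checked here because the table only lists examples, not the full set.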

Inputs

Port	Type	Description
`image`	string or binary buffer	The source image. Accepts a URL, base64-encoded data URI, or raw binary buffer. Must be a supported format (JPEG, PNG, WebP, GIF).
`promptOverride`	string	Optional runtime override for the description prompt, taking precedence over the static `descriptionPrompt` config field.
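The input rules above can be illustrated with a small pre-flight check. Everything here is a sketch under assumptions: the helper names are hypothetical, `DEFAULT_PROMPT` is a placeholder for the node's built-in prompt (its actual text is not documented here), and treating an empty or whitespace-only override as "not provided" is a guess. The format check uses the standard file signatures for JPEG, PNG, GIF, and WebP.

```python
from typing import Optional, Union

# Placeholder only; the node's real default prompt text is not shown in this page.
DEFAULT_PROMPT = "<built-in accessibility caption prompt>"


def resolve_prompt(prompt_override: Optional[str], description_prompt: Optional[str]) -> str:
    """promptOverride wins over descriptionPrompt, which wins over the default."""
    for candidate in (prompt_override, description_prompt):
        # Assumption: blank strings count as "not provided".
        if candidate and candidate.strip():
            return candidate
    return DEFAULT_PROMPT


def classify_image(image: Union[str, bytes, bytearray]) -> str:
    """Tell apart the three accepted input shapes: URL, data URI, raw buffer."""
    if isinstance(image, (bytes, bytearray)):
        return "buffer"
    if isinstance(image, str):
        if image.startswith("data:"):
            return "data_uri"
        if image.startswith(("http://", "https://")):
            return "url"
    raise ValueError("image must be a URL, a data URI, or a binary buffer")


def sniff_format(data: bytes) -> Optional[str]:
    """Identify a supported image format from its leading bytes, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None
```

Running a check like this before the node can surface an unsupported input early, instead of waiting for the model call to fail.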

Outputs

Port	Type	Description
`audio`	audio object	Emitted on success. Includes the base64-encoded audio (`data`), its MIME type (`mimeType`, e.g. `audio/mpeg`), and the model-generated caption (`description`) that was passed to synthesis.
`error`	error object	Emitted when the model call or TTS synthesis fails. Includes `message`, `code`, and `stage` (`description` or `synthesis`) so downstream nodes can branch on the failing step.
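A downstream consumer might handle the two ports along these lines. This is only a sketch: `decode_audio` and `route_error` are hypothetical helpers, and the branch labels returned by `route_error` are illustrative names rather than anything the platform defines.

```python
import base64


def decode_audio(payload: dict) -> bytes:
    """Turn the base64 `data` field of an audio payload back into raw bytes."""
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(payload["data"], validate=True)


def route_error(payload: dict) -> str:
    """Pick a follow-up branch based on which stage failed."""
    stage = payload.get("stage")
    if stage == "description":
        # The model call failed, so no caption exists yet.
        return "retry_description"
    if stage == "synthesis":
        # The caption step succeeded; only the TTS step needs another attempt.
        return "retry_synthesis"
    return "fail"
```

For example, the bytes returned by `decode_audio` can be written to a file whose extension matches the payload's `mimeType`.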

Example

```json
{
  "nodeType": "image_to_audio",
  "config": {
    "model": "gpt-4o",
    "ttsVoice": "nova",
    "descriptionPrompt": "Describe this image concisely for a visually impaired listener. Focus on subject, composition, and mood.",
    "outputFormat": "mp3",
    "maxDescriptionTokens": 250
  }
}
```
