Image To Audio

Overview

The Image To Audio node accepts an image and produces a spoken audio description in two stages: a multimodal model generates a caption or narrative for the image, and a text-to-speech (TTS) engine then synthesizes that text. Use it in battle workflows where visual content must be accessible via audio, or as a pre-processing step before audio-based judging. A valid multimodal model credential and a TTS voice selection are required. If the model call or the synthesis fails, the node emits on the error port rather than halting the workflow. Output audio is returned as base64-encoded data whose encoding, and therefore MIME type, is set by outputFormat.
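
The two-stage flow and its error routing can be sketched as follows. This is a minimal illustration, not the node's actual implementation: `run_image_to_audio`, the `describe` and `synthesize` callables, and the error `code` strings are all hypothetical stand-ins for the real model and TTS calls.

```python
import base64
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the real multimodal model call and TTS engine.
Describe = Callable[[bytes, str], str]      # (image, prompt) -> description text
Synthesize = Callable[[str, str], bytes]    # (text, voice) -> encoded audio bytes


def run_image_to_audio(
    image: bytes,
    prompt: str,
    voice: str,
    describe: Describe,
    synthesize: Synthesize,
    mime_type: str = "audio/mpeg",
) -> Tuple[str, Dict[str, str]]:
    """Return (port, payload), where port is "audio" or "error"."""
    # Stage 1: generate the description. A failure is reported on the
    # error port instead of being raised, so the workflow keeps running.
    try:
        description = describe(image, prompt)
    except Exception as exc:
        # The code value is an assumed placeholder, not a documented code.
        return "error", {"message": str(exc), "code": "DESCRIPTION_FAILED", "stage": "description"}

    # Stage 2: synthesize the description to speech.
    try:
        audio = synthesize(description, voice)
    except Exception as exc:
        return "error", {"message": str(exc), "code": "SYNTHESIS_FAILED", "stage": "synthesis"}

    return "audio", {
        "data": base64.b64encode(audio).decode("ascii"),
        "mimeType": mime_type,
        "description": description,
    }
```

Passing the two stages in as callables keeps the sketch self-contained and makes each failure path easy to exercise with stubs.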

Configuration

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | enum | Yes | Multimodal model used to generate the image description (e.g. gpt-4o, gemini-1.5-pro, claude-3-5-sonnet). |
| ttsVoice | enum | Yes | Text-to-speech voice used to synthesize the generated description (e.g. alloy, nova, echo). |
| descriptionPrompt | string | No | Custom system prompt appended to the model request to steer the style or focus of the image description. Defaults to a neutral accessibility-style caption prompt. |
| outputFormat | enum | No | Audio encoding format for the synthesized output: mp3, opus, or wav. Defaults to mp3. |
| maxDescriptionTokens | number | No | Maximum tokens allowed for the generated description before TTS synthesis. Limits cost and audio length. Defaults to 300. |
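
As a rough illustration of how the required fields and defaults above combine, the sketch below validates a config dictionary and fills in the documented defaults. The helper name `resolve_config` is hypothetical, and treating maxDescriptionTokens as a positive integer is an assumption (the field is documented only as a number).

```python
from typing import Any, Dict

# Defaults and allowed values taken from the configuration table.
DEFAULTS = {"outputFormat": "mp3", "maxDescriptionTokens": 300}
REQUIRED = ("model", "ttsVoice")
OUTPUT_FORMATS = {"mp3", "opus", "wav"}


def resolve_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Check required fields and apply defaults (hypothetical helper)."""
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")

    # Values supplied by the caller win over the defaults.
    resolved = {**DEFAULTS, **config}

    if resolved["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported outputFormat: {resolved['outputFormat']!r}")

    # Assumption: the token limit must be a positive integer.
    tokens = resolved["maxDescriptionTokens"]
    if isinstance(tokens, bool) or not isinstance(tokens, int) or tokens <= 0:
        raise ValueError("maxDescriptionTokens must be a positive integer")

    return resolved
```

descriptionPrompt is left unset when omitted, since the text of the default caption prompt is not given here.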

Inputs

| Port | Type | Description |
| --- | --- | --- |
| image | string or binary buffer | The source image. Accepts a URL, base64-encoded data URI, or raw binary buffer. Must be a supported format (JPEG, PNG, WebP, GIF). |
| promptOverride | string | Optional runtime override for the description prompt, taking precedence over the static descriptionPrompt config field. |
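
The input rules above can be sketched in a few lines. Both helpers are hypothetical; the sketch assumes that an empty promptOverride counts as "not provided" and that URLs use the http or https scheme, neither of which is stated in the table.

```python
from typing import Optional, Union


def effective_prompt(config_prompt: Optional[str], prompt_override: Optional[str]) -> Optional[str]:
    """Pick the prompt to send: the runtime override wins over the config field."""
    # Assumption: an empty or missing override falls back to descriptionPrompt.
    if prompt_override:
        return prompt_override
    # None here means "use the node's built-in default caption prompt".
    return config_prompt


def classify_image_input(image: Union[str, bytes, bytearray]) -> str:
    """Classify the image input as "buffer", "data_uri", or "url"."""
    if isinstance(image, (bytes, bytearray)):
        return "buffer"
    if image.startswith("data:"):
        return "data_uri"
    # Assumption: only http(s) URLs are accepted.
    if image.startswith(("http://", "https://")):
        return "url"
    raise ValueError("image must be a URL, a data URI, or a binary buffer")
```

For example, `effective_prompt("Describe the scene.", "Focus on colors.")` selects the override, while passing `None` as the override selects the configured prompt.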

Outputs

| Port | Type | Description |
| --- | --- | --- |
| audio | audio object | Emitted on success. Includes the base64-encoded audio (data), its MIME type (mimeType, e.g. audio/mpeg), and the model-generated caption that was passed to synthesis (description). |
| error | error object | Emitted when the model call or TTS synthesis fails. Includes message, code, and stage (description or synthesis) so downstream nodes can branch on the failing step. |
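
A downstream consumer might handle the two ports roughly as shown below. This is a sketch under assumptions: the payload field names follow the descriptions above, while the function names and the returned step names (`retry_with_fallback_model` and so on) are purely illustrative.

```python
import base64
from typing import Any, Dict


def decode_audio(payload: Dict[str, Any]) -> bytes:
    """Decode the base64 `data` field of an audio payload into raw bytes."""
    # validate=True rejects characters outside the base64 alphabet
    # instead of silently skipping them.
    return base64.b64decode(payload["data"], validate=True)


def next_step_for_error(error: Dict[str, Any]) -> str:
    """Choose a follow-up action from the failing stage (step names are hypothetical)."""
    stage = error.get("stage")
    if stage == "description":
        return "retry_with_fallback_model"
    if stage == "synthesis":
        return "retry_with_fallback_voice"
    return "fail"
```

Branching on stage rather than on message text keeps the routing stable even if the error wording changes.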

Example

```json
{
  "nodeType": "image_to_audio",
  "config": {
    "model": "gpt-4o",
    "ttsVoice": "nova",
    "descriptionPrompt": "Describe this image concisely for a visually impaired listener. Focus on subject, composition, and mood.",
    "outputFormat": "mp3",
    "maxDescriptionTokens": 250
  }
}
```