get_job_result
Check job status and result. Poll every 60 seconds — do NOT poll more frequently.
Video processing typically takes 3-5 minutes. Progress may stay at 20% during frame
analysis for 1-3 minutes — this is completely normal. Do NOT interpret slow progress
as failure. Only report failure when status is "failed" with an error message.
Returns: status (processing | transcript_ready | completed | failed | poll_error).
poll_error means a temporary connection issue — the job is still running, just retry.
transcript_ready includes transcript. completed includes video_url.
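The status handling above can be sketched as a small dispatch helper. The helper itself is illustrative, not part of the API; the status strings are the ones listed in Returns:

```python
# Illustrative mapping from get_job_result status to the agent's next step.
def next_action(status: str) -> str:
    if status == "poll_error":
        return "retry"               # temporary connection issue; job still running
    if status == "processing":
        return "wait"                # poll again in 60 seconds, no sooner
    if status == "transcript_ready":
        return "deliver_transcript"  # result includes transcript
    if status == "completed":
        return "deliver"             # result includes video_url
    if status == "failed":
        return "report_failure"      # only now report failure to the user
    raise ValueError(f"unknown status: {status}")
```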
get_upload_url
SIGNED UPLOAD URL – get a URL for uploading a local video to NarrateAI cloud storage.
Use this ONLY when running in HTTP/remote mode and the user has a local video file.
After getting the URL, upload the file with curl, then pass the returned temp_file_path
to any processing tool as the video_source.
For stdio/local mode this is NOT needed — tools can read local files directly.
Returns:
JSON with upload_url, temp_file_path, and a ready-to-use curl command.
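A minimal sketch of consuming that response. The field values and the "curl_command" key name are assumptions for illustration; the description only guarantees upload_url, temp_file_path, and a ready-to-use curl command:

```python
import json

# Hypothetical get_upload_url response (shape assumed from the description).
sample_response = json.dumps({
    "upload_url": "https://storage.example.com/signed?sig=abc",
    "temp_file_path": "/tmp/uploads/video_123.mp4",
    "curl_command": "curl -X PUT --upload-file video.mp4 'https://storage.example.com/signed?sig=abc'",
})

data = json.loads(sample_response)
# After running the curl command in a shell, pass temp_file_path on
# to any processing tool as its video_source.
video_source = data["temp_file_path"]
```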
generate_narration_script
NARRATION SCRIPT – generates an AI-written timed script for a SILENT video. No audio output.
Use when the user wants a timed narration script, text-only narration, or sync data for a silent video.
This does NOT extract existing speech (use transcribe_video for that).
This does NOT produce a video file (use narrate_video_full for that).
Runs as a background task with progress reporting. Processing takes 1-5 minutes.
Returns:
JSON with transcript, job_id, db_job_id.
narrate_video_full
FULL NARRATED VIDEO – produces a downloadable video with AI voiceover.
Use when the user wants: "narrate this video", "add voiceover", "make a narrated video".
VOICE OPTIONS — ask the user which they prefer:
1. AI voice: male1 and female1 (defaults, fastest), female2, female3, female4, male2, male3
2. Voice cloning: user provides an audio file (voice_sample) and their voice is cloned for the narration
If voice_sample is provided, it takes priority over voice_type.
Runs as a background task with progress reporting. Processing takes 2-5 minutes.
Returns:
JSON with video_url, transcript, job_id when done.
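The stated priority rule (voice_sample overrides voice_type) can be sketched as below. The helper, its return shape, and the male1 fallback are illustrative assumptions, not part of the API:

```python
# Illustrative resolution of the voice options described above.
def resolve_voice(voice_type, voice_sample):
    if voice_sample is not None:
        # A provided audio sample takes priority: clone the user's voice.
        return {"mode": "clone", "voice_sample": voice_sample}
    # Otherwise fall back to an AI voice (male1 assumed as the fallback here).
    return {"mode": "ai", "voice_type": voice_type or "male1"}
```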
abandon_job
Abandon/cancel a processing job. Call this when the user cancels on the agent side.
Stops the backend from continuing audio generation and video assembly.
Use after generate_narration_script or when continue_to_full_video was started but the user cancelled.
Returns:
JSON with success or error.
transcribe_video
TRANSCRIPTION ONLY – video with existing voice -> speech-to-text -> timed transcript.
No translation, no narrated video. Returns original speech as-is.
Use when the user wants to transcribe a video that already has spoken audio
(podcast, interview, meeting recording, etc.).
CRITICAL: source_language is REQUIRED. If the user does not specify the language
of the video, you MUST ask them concisely before calling. Supported: en, zh, yue,
fr, de, it, ja, ko, pt, ru, es (Qwen) + others via Whisper.
Runs as a background task with progress reporting. Processing takes 1-5 minutes.
Returns:
JSON with transcript, job_id, db_job_id when done.
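The language routing above can be sketched as a check. The backend names come from the description; the helper itself is hypothetical:

```python
# Codes listed as Qwen-supported; anything else falls back to Whisper.
QWEN_LANGUAGES = {"en", "zh", "yue", "fr", "de", "it", "ja", "ko", "pt", "ru", "es"}

def transcription_backend(source_language: str) -> str:
    if not source_language:
        # source_language is required; the agent must ask the user first.
        raise ValueError("source_language is required")
    return "qwen" if source_language in QWEN_LANGUAGES else "whisper"
```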
transcribe_and_translate
TRANSCRIBE & TRANSLATE (new upload) – video with voice -> speech-to-text -> translate -> translated transcript.
No TTS, no video output. Returns translated timed transcript only.
Use when the user uploads a new video and wants a translated transcript
(e.g. Spanish podcast -> English transcript).
CRITICAL: source_language and target_language are REQUIRED. Ask user if not specified.
Runs as a background task with progress reporting. Processing takes 1-5 minutes.
translate_existing_video
TRANSLATION (existing video) – Translate transcript of a video already in the user's library.
Loads transcript from cloud, translates, returns. No upload. Sync – returns immediately.
Use when the user wants to translate a video they already narrated/dubbed with NarrateAI
(e.g. "Translate my video X to French"). job_id is the completed video's job ID.
CRITICAL: source_language and target_language are REQUIRED. For narrated videos, source is
typically the narration language (e.g. en). For dubbed videos, source is the dubbed language.
dub_video_full
FULL AUTO-DUBBING – transcribe -> translate -> extract speaker voice -> TTS with cloned voice -> dubbed video.
No refinement screen. Uses the video's own speaker voice for the dubbed audio.
Use when the user wants a complete dubbed video (e.g. Spanish video -> English dubbed).
CRITICAL: source_language, target_language, and preserve_background_music are REQUIRED.
Agent MUST ask user for all three if not specified. For preserve_background_music: ask if the video
has background music they want to keep (true) or replace with silence (false).
Runs as a background task with progress reporting. Processing takes 2-5 minutes.
generate_document
DOCUMENT GENERATION – produces a structured markdown document from a silent video.
Use when the user wants: a document, article, guide, tutorial, or written content
based on a video. NOT for narrated video or voiceover.
The agent MUST ask which document type the user wants before calling:
- user_onboarding: Step-by-step onboarding guide
- tutorial_guide: Tutorial/how-to guide
- feature_showcase: Feature showcase document
- business_overview: Business overview document
- product_documentation: Product documentation
Also returns a synced transcript as a bonus – offer it to the user after the document
is delivered ("I also have a synced transcript for this video, would you like it?").
Runs as a background task with progress reporting. Processing takes 1-5 minutes.
Returns:
JSON with document_markdown, document_data, transcript, job_id, db_job_id.
generate_tts
TEXT-TO-SPEECH – generate audio from text. Returns a downloadable audio URL.
Use when the user wants: "read this aloud", "generate speech", "text to speech",
"convert text to audio", "make an audio file from this text".
VOICE OPTIONS — ask the user which they prefer:
1. AI voice: male1 and female1 (defaults, fastest), female2, female3, female4, male2, male3
2. Voice cloning: user provides an audio file (voice_sample) and their voice is cloned
If voice_sample is provided, it takes priority over voice_type.
Returns:
JSON with audio_url, text, voice, language.
narrate_batch
BATCH NARRATION – narrate multiple videos in parallel. Each gets a full narrated video with voiceover.
Use when the user has multiple videos to narrate (e.g. "narrate these 3 videos").
Maximum 5 videos per batch. Each video is processed independently – one failure does not affect others.
If the user does not specify a voice, ask them ONCE (applies to all videos).
Voice options: male1 and female1 (defaults, fastest), female2, female3, female4, male2, male3.
CRITICAL – Context handling: Before calling, you MUST ask the user about context:
1. "Would you like to provide the same context for all videos, different context per video, or no context?"
2. If same for all: use manual_context.
3. If different: use contexts_json.
4. If none: leave both empty.
Runs as a background task. Processing takes 2-10 minutes depending on video count and length.
Returns:
JSON array of results with video_url per video.
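The three context options map onto the two arguments as sketched below. The argument names manual_context and contexts_json come from the description; the helper and the per-video JSON shape are assumptions:

```python
import json

# Hypothetical helper building the context arguments for a batch call.
def batch_context_args(choice: str, contexts=None) -> dict:
    if choice == "same":
        # One string of context applied to every video.
        return {"manual_context": contexts, "contexts_json": ""}
    if choice == "different":
        # Per-video contexts serialized as JSON (shape assumed: a list).
        return {"manual_context": "", "contexts_json": json.dumps(contexts)}
    if choice == "none":
        return {"manual_context": "", "contexts_json": ""}
    raise ValueError(f"unknown choice: {choice}")
```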
batch_generate_scripts
BATCH SCRIPT GENERATION – generate AI narration scripts for multiple silent videos in parallel.
Each video gets a timed narration script (text only, no audio).
Maximum 5 videos per batch. One failure does not affect others.
CRITICAL – Context handling: Before calling, ask the user about context:
1. Same for all -> manual_context.
2. Different per video -> contexts_json.
3. No context -> leave both empty.
Runs as a background task. Processing takes 2-10 minutes.
batch_transcribe
BATCH TRANSCRIPTION – transcribe speech from multiple videos in parallel.
Each video must have existing spoken audio. Returns timed transcript per video.
CRITICAL: source_language is REQUIRED – ask user if not specified. Applies to all videos.
Maximum 5 videos per batch. One failure does not affect others.
Runs as a background task. Processing takes 2-10 minutes.
batch_dub
BATCH DUBBING – dub multiple videos into another language in parallel.
Each video gets full auto-dubbing (transcribe -> translate -> voice clone -> dubbed video).
CRITICAL: source_language, target_language, preserve_background_music are REQUIRED – ask user.
All videos share the same languages and music setting.
Maximum 5 videos per batch. One failure does not affect others.
Runs as a background task. Processing takes 2-10 minutes.
update_transcript
UPDATE TRANSCRIPT – edit the narration script before continuing to full video.
Use after generate_narration_script returns a transcript and the user wants to change
wording, timing, or content of specific segments. The user describes changes naturally;
you apply them and call this tool with the updated segments.
Also used in the translate-then-re-narrate flow: after translate_existing_video returns
a translated transcript, call this with the translated segments, reset_for_reprocessing=True,
and target_language set to the translation language (e.g. "hi", "fr") to prepare the
completed job for re-narration via continue_to_full_video.
The transcript_json must include ALL segments (not just changed ones) — it replaces the
full transcript. Each segment needs: start_time, end_time, text. Optionally: pause_duration, chunk_type.
After updating, the user can call continue_to_full_video with the same job_id.
Returns:
JSON with success status or error.
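A minimal validator for the transcript_json contract above: every segment must carry start_time, end_time, and text; pause_duration and chunk_type are optional. The helper is hypothetical, not part of the API:

```python
# Field names are the ones listed in the update_transcript description.
REQUIRED = ("start_time", "end_time", "text")
OPTIONAL = ("pause_duration", "chunk_type")

def validate_segments(segments):
    errors = []
    for i, seg in enumerate(segments):
        for key in REQUIRED:
            if key not in seg:
                errors.append(f"segment {i}: missing {key}")
        for key in seg:
            if key not in REQUIRED + OPTIONAL:
                errors.append(f"segment {i}: unknown field {key}")
        if "start_time" in seg and "end_time" in seg and seg["end_time"] < seg["start_time"]:
            errors.append(f"segment {i}: end_time before start_time")
    return errors
```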
list_videos
LIST VIDEOS – get the user's video library (previously processed videos).
Use when the user wants to see their existing videos, re-translate a previously narrated video,
or work with videos they already processed. Returns paginated list with job IDs, filenames,
status, and timestamps.
The returned job IDs can be used with translate_existing_video to translate a completed video's
transcript, or with get_job_result to check status.
Returns:
JSON with jobs array (id, filename, status, language, created_at, updated_at),
total count, page, and per_page.
continue_to_full_video
Continue from transcript to full narrated video. Use after generate_narration_script
returns a transcript and the user is satisfied with it.
VOICE OPTIONS — ask the user which they prefer:
1. AI voice: male1 and female1 (defaults, fastest), female2, female3, female4, male2, male3
2. Voice cloning: user provides an audio file (voice_sample) and their voice is cloned for the narration
If voice_sample is provided, it takes priority over voice_type.
Runs as a background task with progress reporting. Processing takes 1-3 minutes.
Returns:
JSON with video_url, job_id when done.
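The script-first flow implied across these tools can be summarized as an ordered call sequence. Tool names are from this document; the argument placeholders are illustrative only:

```python
# Illustrative order of calls for the script-review workflow.
flow = [
    ("generate_narration_script", {"video_source": "<video>"}),
    # Optional: apply user edits before rendering (must include ALL segments).
    ("update_transcript", {"job_id": "<job_id>", "transcript_json": "<all segments>"}),
    ("continue_to_full_video", {"job_id": "<job_id>", "voice_type": "female2"}),
]
tool_order = [name for name, _ in flow]
```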