Get the text and language of recorded speech
Upload an audio file to get its detected language and a transcription.
Example use cases
- Automatically creating subtitles for videos
- Keyword-based search in audio files
- Searching for topics within a video
Examples
Input
{
  "input": {
    "audio": "https://rbqktisnztholqojxlaf.s...",
    "model": "large-v3",
    "transcription": "plain text",
    "translate": false,
    "language": null,
    "temperature": null,
    "patience": null,
    "suppress_tokens": "-1",
    "initial_prompt": null,
    "condition_on_previous_text": true,
    "temperature_increment_on_fallback": 0.2,
    "compression_ratio_threshold": 2.4,
    "logprob_threshold": -1,
    "no_speech_threshold": 0.6
  },
  "id": 84
}
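As a minimal sketch, a request body like the one above could be assembled in Python as follows. The helper name build_payload is hypothetical, the audio URL is a placeholder rather than the truncated one from the example, and the top-level "id" field is left out because this page does not explain it. The endpoint and authentication are not described here either, so the sketch stops at serializing the JSON.

```python
import json

# Values copied from the example input above; None marks fields the example leaves empty.
EXAMPLE_INPUT = {
    "model": "large-v3",
    "transcription": "plain text",
    "translate": False,
    "language": None,
    "temperature": None,
    "patience": None,
    "suppress_tokens": "-1",
    "initial_prompt": None,
    "condition_on_previous_text": True,
    "temperature_increment_on_fallback": 0.2,
    "compression_ratio_threshold": 2.4,
    "logprob_threshold": -1,
    "no_speech_threshold": 0.6,
}


def build_payload(audio_url, **overrides):
    """Return a request body with the example values, changed by any overrides."""
    # Reject field names that do not appear in the example input.
    unknown = set(overrides) - set(EXAMPLE_INPUT)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    fields = {"audio": audio_url, **EXAMPLE_INPUT, **overrides}
    return {"input": fields}


# Placeholder URL (an assumption, not a real file).
payload = build_payload("https://example.com/speech.mp3", transcription="srt")
print(json.dumps(payload, indent=2))
```

Rejecting unknown field names early is a design choice of this sketch, not documented behavior of the service.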
Output
{
  "status": "succeeded",
  "output": {
    "detected_language": "english",
    "segments": [
      {
        "avg_logprob": -0.2859947681427002,
        "compression_ratio": 0.8833333333333333,
        "end": 5.72,
        "id": 0,
        "no_speech_prob": 0.04505243897438049,
        "seek": 0,
        "start": 0,
        "temperature": 0,
        "text": " Artificial intelligence will ...",
        "tokens": [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
      }
    ],
    "transcription": " Artificial intelligence will ...",
    "translation": null
  },
  "prediction_time": 3.310847
}
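The start and end fields of each segment are what a subtitle file needs. The sketch below turns segments like the one in the example output into SubRip (SRT) text. It assumes start and end are given in seconds, as in open-source Whisper output. The function names are hypothetical, and the service's own srt output may differ in detail.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm, the SRT timestamp layout."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from a list of dicts with 'start', 'end' and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Whisper-style segment text starts with a space, so strip it.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


example = [{"start": 0, "end": 5.72, "text": " Artificial intelligence will ..."}]
print(segments_to_srt(example))
```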
API Information
Input description
- audio: Audio file.
- model: This version only supports Whisper-large-v3. Default value: large-v3.
- transcription: Format of the returned transcription. Default value: plain text. Enum values: plain text, srt, vtt.
- translate: Translate the text to English when set to true.
- language: Language of the audio, given either as a code or as a name.
  Enum values (codes): af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, yue, zh
  Enum values (names): Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Cantonese, Castilian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, Flemish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Letzeburgesch, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Mandarin, Maori, Marathi, Moldavian, Moldovan, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Panjabi, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Sinhalese, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Valencian, Vietnamese, Welsh, Yiddish, Yoruba
- temperature: Temperature to use for sampling.
- patience: Optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424. The default (1.0) is equivalent to conventional beam search.
- suppress_tokens: Comma-separated list of token IDs to suppress during sampling. '-1' suppresses most special characters except common punctuation. Default value: -1.
- initial_prompt: Optional text to provide as a prompt for the first window.
- condition_on_previous_text: If true, the previous output of the model is provided as a prompt for the next window. Disabling this may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop. Default value: true.
- temperature_increment_on_fallback: Amount by which the temperature is increased when decoding fails to meet either of the two thresholds below (compression_ratio_threshold and logprob_threshold). Default value: 0.2.
- compression_ratio_threshold: If the gzip compression ratio is higher than this value, the decoding is treated as failed. Default value: 2.4.
- logprob_threshold: If the average log probability is lower than this value, the decoding is treated as failed. Default value: -1.
- no_speech_threshold: If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, the segment is considered silence. Default value: 0.6.
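The fallback and threshold settings described above work together. The sketch below shows how such a fallback loop is commonly structured, modelled loosely on the open-source Whisper reference implementation rather than on this service's internals, which this page does not document. The decode argument is a hypothetical callable standing in for the model, the Decoded class and all function names are illustrative, zlib stands in for the compression step, and capping the temperature at 1.0 is an assumption.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Decoded:
    """Hypothetical container for one decoding attempt."""
    text: str
    avg_logprob: float
    no_speech_prob: float


def compression_ratio(text):
    """Raw size divided by compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def temperature_schedule(temperature=0.0, increment=0.2, maximum=1.0):
    """Temperatures to try in order, e.g. 0.0, 0.2, ... up to the assumed cap."""
    if increment <= 0:
        return [temperature]
    temps, t = [], temperature
    while t <= maximum + 1e-6:
        temps.append(round(t, 6))
        t += increment
    return temps


def decoding_failed(result, compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """A decoding fails if it is too repetitive or too improbable."""
    too_repetitive = compression_ratio(result.text) > compression_ratio_threshold
    too_unlikely = result.avg_logprob < logprob_threshold
    return too_repetitive or too_unlikely


def is_silence(result, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Silence: high no-speech probability AND a log-probability failure."""
    return (result.no_speech_prob > no_speech_threshold
            and result.avg_logprob < logprob_threshold)


def decode_with_fallback(decode, temperature=0.0, increment=0.2):
    """Retry decode(t) at rising temperatures until an attempt passes the checks."""
    result = None
    for t in temperature_schedule(temperature, increment):
        result = decode(t)
        if not decoding_failed(result):
            break
    return result
```

With the default values, at most six attempts are made per window in this sketch (temperatures 0.0 to 1.0 in steps of 0.2), and the last attempt is returned even if it still fails the checks.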
Output JSON Schema
This represents the JSON schema that details the structure of the model's output.
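The schema itself is not reproduced here. As a rough guide based only on the example output above, the response could be modelled as in the sketch below. The class and function names are hypothetical, and the field types are inferred from a single example, so they are assumptions rather than the published schema. The find_keyword helper illustrates the keyword-search use case.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class Segment:
    id: int
    start: float          # assumed to be seconds
    end: float            # assumed to be seconds
    text: str
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float
    temperature: float


@dataclass
class Transcript:
    detected_language: str
    transcription: str
    translation: Optional[str]
    segments: List[Segment]


def parse_output(response: dict) -> Transcript:
    """Convert a response dict shaped like the example output into dataclasses."""
    out = response["output"]
    wanted = {f.name for f in fields(Segment)}
    # Ignore keys this sketch does not model, such as "seek" and "tokens".
    segments = [
        Segment(**{k: v for k, v in seg.items() if k in wanted})
        for seg in out["segments"]
    ]
    return Transcript(
        detected_language=out["detected_language"],
        transcription=out["transcription"],
        translation=out.get("translation"),
        segments=segments,
    )


def find_keyword(transcript: Transcript, keyword: str) -> List[Tuple[float, float]]:
    """Return (start, end) times of segments whose text contains the keyword."""
    needle = keyword.lower()
    return [(s.start, s.end) for s in transcript.segments if needle in s.text.lower()]
```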