Get the text and language of spoken audio

Upload an audio file to get its detected language and a transcription.

Example use cases

  • Automatically creating subtitles for videos
  • Keyword-based research on audio files
  • Searching for topics within a video

Added by: APIForAI

Price per call ≈ 0.01

Examples

Input

{
  "input": {
    "audio": "https://rbqktisnztholqojxlaf.s...",
    "model": "large-v3",
    "transcription": "plain text",
    "translate": false,
    "language": null,
    "temperature": null,
    "patience": null,
    "suppress_tokens": "-1",
    "initial_prompt": null,
    "condition_on_previous_text": true,
    "temperature_increment_on_fallback": 0.2,
    "compression_ratio_threshold": 2.4,
    "logprob_threshold": -1,
    "no_speech_threshold": 0.6
  },
  "id": 84
}
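
As a rough illustration, the request body above could be assembled in Python as follows. This is only a sketch: `build_payload` is a hypothetical helper, the `example.com` audio URL is a placeholder, and the defaults are copied from the example and the input description. The endpoint URL, authentication, and the meaning of the top-level `id` field are not described on this page, so the sketch stops at building the body.

```python
import json

# Allowed values for the "transcription" parameter, per the input description.
TRANSCRIPTION_FORMATS = {"plain text", "srt", "vtt"}


def build_payload(audio_url, *, transcription="plain text", translate=False,
                  language=None, **overrides):
    """Build a request body shaped like the example input (hypothetical helper)."""
    if transcription not in TRANSCRIPTION_FORMATS:
        raise ValueError(f"transcription must be one of {sorted(TRANSCRIPTION_FORMATS)}")
    inputs = {
        "audio": audio_url,
        "model": "large-v3",
        "transcription": transcription,
        "translate": translate,
        "suppress_tokens": "-1",
        "condition_on_previous_text": True,
        "temperature_increment_on_fallback": 0.2,
        "compression_ratio_threshold": 2.4,
        "logprob_threshold": -1,
        "no_speech_threshold": 0.6,
    }
    # Only send a language when the caller chose one.
    if language is not None:
        inputs["language"] = language
    # Any other documented parameter (temperature, patience, ...) can be passed by name.
    inputs.update(overrides)
    return {"input": inputs}


# Placeholder URL; replace with the location of a real audio file.
payload = build_payload("https://example.com/speech.mp3", transcription="srt")
body = json.dumps(payload)
```

Validating the `transcription` value locally avoids paying for a call that would be rejected for a typo.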

Output

{
  "status": "succeeded",
  "output": {
    "detected_language": "english",
    "segments": [
      {
        "avg_logprob": -0.2859947681427002,
        "compression_ratio": 0.8833333333333333,
        "end": 5.72,
        "id": 0,
        "no_speech_prob": 0.04505243897438049,
        "seek": 0,
        "start": 0,
        "temperature": 0,
        "text": " Artificial intelligence will ...",
        "tokens": [{}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]
      }
    ],
    "transcription": " Artificial intelligence will ...",
    "translation": null
  },
  "prediction_time": 3.310847
}
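
The `start`, `end`, and `text` fields of each segment are enough to build subtitles locally. The sketch below converts a list of segments into SRT text; `srt_timestamp` and `segments_to_srt` are illustrative names, not part of the API, and the sample segment is taken from the output above. Setting `transcription` to `srt` asks the API for this format directly, so a local conversion is mainly useful when the plain-text response is already in hand.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Turn segments with start/end/text keys into numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


segments = [{"start": 0, "end": 5.72, "text": " Artificial intelligence will ..."}]
srt_text = segments_to_srt(segments)
```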

API Information

Input description

audio* uri

URI of the audio file to transcribe.

model string

This version only supports Whisper-large-v3.

Default value: large-v3

transcription string

Output format of the transcription.

Default value: plain text

Enum values: plain text, srt, vtt

translate boolean

Translate the text to English when set to true.

language string

Language spoken in the audio, given as either a language code or a language name.

Enum values: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, yue, zh, Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Burmese, Cantonese, Castilian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, Flemish, French, Galician, Georgian, German, Greek, Gujarati, Haitian, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Letzeburgesch, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Mandarin, Maori, Marathi, Moldavian, Moldovan, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Panjabi, Pashto, Persian, Polish, Portuguese, Punjabi, Pushto, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Sinhalese, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Valencian, Vietnamese, Welsh, Yiddish, Yoruba

temperature number

Temperature to use for sampling.

patience number

Optional patience value to use in beam decoding, as described in https://arxiv.org/abs/2204.05424. The default (1.0) is equivalent to conventional beam search.

suppress_tokens string

Comma-separated list of token IDs to suppress during sampling; '-1' suppresses most special characters except common punctuation.

Default value: -1

initial_prompt string

Optional text to provide as a prompt for the first window.

condition_on_previous_text boolean

If true, provide the previous output of the model as a prompt for the next window. Disabling this may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop.

Default value: true

temperature_increment_on_fallback number

Amount by which the temperature is increased on each fallback, that is, each time the decoding fails to meet either of the thresholds below.

Default value: 0.2

compression_ratio_threshold number

If the gzip compression ratio of the decoded text is higher than this value, treat the decoding as failed.

Default value: 2.4
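
To make the threshold concrete: a high compression ratio means the decoded text is highly repetitive, which is typical of a model stuck in a loop. The sketch below computes the ratio the way the open-source Whisper code does, using `zlib` (the same DEFLATE algorithm gzip uses); that this service computes it identically is an assumption, and the sample strings are made up for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; higher means more repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


# A looping transcript compresses very well, so its ratio is high.
looped = "thank you " * 50
# Ordinary short text barely compresses at all.
normal = " Artificial intelligence will change many industries."

looped_failed = compression_ratio(looped) > 2.4  # would count as a failed decoding
normal_failed = compression_ratio(normal) > 2.4  # would pass
```

The segment in the example output reports a ratio of about 0.88, well under the 2.4 default.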

logprob_threshold number

If the average log probability is lower than this value, treat the decoding as failed.

Default value: -1

no_speech_threshold number

If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, treat the segment as silence.

Default value: 0.6
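
The way `temperature_increment_on_fallback` and the three thresholds interact can be sketched as a retry loop, modeled on the fallback logic in the open-source Whisper code. Whether this service implements it identically is an assumption; `DecodeResult`, `temperature_schedule`, `decode_with_fallback`, and the `decode` callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DecodeResult:
    text: str
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float


def temperature_schedule(start: float = 0.0, increment: float = 0.2) -> List[float]:
    """Temperatures to try in order: start, start + increment, ... up to 1.0."""
    if increment <= 0:
        return [start]
    temps = []
    step = 0
    while start + step * increment <= 1.0 + 1e-6:
        temps.append(round(start + step * increment, 6))
        step += 1
    return temps


def decode_with_fallback(decode: Callable[[float], DecodeResult],
                         temperatures: List[float],
                         compression_ratio_threshold: float = 2.4,
                         logprob_threshold: float = -1.0,
                         no_speech_threshold: float = 0.6) -> DecodeResult:
    """Retry decoding at higher temperatures until a result passes the thresholds."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    result = None
    for t in temperatures:
        result = decode(t)
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_unlikely = result.avg_logprob < logprob_threshold
        # Low confidence plus a high no-speech probability is read as silence,
        # so retrying at a higher temperature would not help.
        looks_silent = result.no_speech_prob > no_speech_threshold and too_unlikely
        if looks_silent or not (too_repetitive or too_unlikely):
            break
    return result
```

With the defaults, the schedule is 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, so a segment is decoded at most six times before the last result is accepted as is.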

Output JSON Schema

This represents the JSON schema that details the structure of the model's output.

{
  "type": "object",
  "title": "Output",
  "required": [
    "detected_language",
    "transcription"
  ],
  "properties": {
    "segments": {
      "title": "Segments"
    },
    "srt_file": {
      "type": "string",
      "title": "Srt File",
      "format": "uri"
    },
    "txt_file": {
      "type": "string",
      "title": "Txt File",
      "format": "uri"
    },
    "translation": {
      "type": "string",
      "title": "Translation"
    },
    "transcription": {
      "type": "string",
      "title": "Transcription"
    },
    "detected_language": {
      "type": "string",
      "title": "Detected Language"
    }
  }
}
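
A client can use the schema for a quick sanity check before reading the result. The sketch below hand-codes the required keys and string-typed properties from the schema instead of using a JSON Schema library; `check_output` is a hypothetical helper, and accepting a missing or null optional field (such as `translation` when `translate` is false) is an assumption based on the example output.

```python
REQUIRED_KEYS = ("detected_language", "transcription")
OPTIONAL_STRING_KEYS = ("translation", "srt_file", "txt_file")


def check_output(output: dict) -> list:
    """Return a list of problems found in an "output" object; empty means it looks fine."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in output:
            problems.append(f"missing required key: {key}")
        elif not isinstance(output[key], str):
            problems.append(f"{key} should be a string")
    for key in OPTIONAL_STRING_KEYS:
        value = output.get(key)
        # Optional fields may be absent or null (assumption, see lead-in).
        if value is not None and not isinstance(value, str):
            problems.append(f"{key} should be a string")
    return problems


sample = {
    "detected_language": "english",
    "transcription": " Artificial intelligence will ...",
    "translation": None,
}
issues = check_output(sample)
```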
