Audio & Transcription

Real-time audio transcription and audio file analysis

Introduction

The AlphaEdge Audio & Transcription API lets you transcribe audio files to text. This feature is optimized for high performance and accuracy.

This page guides you through using the Audio & Transcription API, from the basics to advanced use cases.

Base URL, host, and documentation

Use the gateway’s public base URL, e.g. https://api-endpoints.alphaedge-ai.com. Where a public hostname is required, do not call the gateway by raw IP address: such requests are rejected with 403. User documentation is at https://api-docs.alphaedge-ai.com/; the gateway does not expose an interactive Swagger / OpenAPI page online.
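
As a client-side guard for the hostname rule, the following minimal sketch flags a base URL whose host is a raw IP address before any request is sent. The helper name uses_raw_ip is hypothetical and not part of the API; it only inspects the URL and does not cover every case in which the gateway may answer 403 (a wrong domain, for instance).

```python
import ipaddress
from urllib.parse import urlparse


def uses_raw_ip(base_url: str) -> bool:
    """Return True when the URL host is an IPv4/IPv6 literal rather than a hostname."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so it is a hostname
        return False
    return True


print(uses_raw_ip("https://api-endpoints.alphaedge-ai.com"))  # False
print(uses_raw_ip("https://203.0.113.7/models"))              # True
```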

Slug and catalog

Public transcription exposes a single slug for this capability: alpha-audio-v1. GET /models returns the model_slug and type (audio | ocr) of each catalog entry.
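
To illustrate how the catalog might be filtered by type, here is a small sketch. It assumes the decoded GET /models body is a JSON array of objects carrying model_slug and type; the actual envelope is not specified on this page, and the OCR entry in the sample is hypothetical.

```python
def slugs_of_type(catalog: list[dict], wanted_type: str) -> list[str]:
    """Return the model_slug of every catalog entry whose type matches."""
    return [entry["model_slug"] for entry in catalog if entry.get("type") == wanted_type]


# Hand-written sample standing in for a decoded GET /models response (assumed shape)
sample_catalog = [
    {"model_slug": "alpha-audio-v1", "type": "audio"},
    {"model_slug": "example-ocr-model", "type": "ocr"},  # hypothetical entry
]

print(slugs_of_type(sample_catalog, "audio"))  # ['alpha-audio-v1']
```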

Exposed routes (transcription)

  • POST /models/{model_slug}/transcript: multipart request; synchronous HTTP 200 response (processing happens within the request).

To retrieve the model metadata (endpoints, expected file field, boolean options, accepted audio extensions, pricing), use the public endpoint GET /models/alpha-audio-v1.
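
A minimal sketch of that metadata call, using only the Python standard library. Sending the X-API-Key header is an assumption (the endpoint is described as public, so the key may be unnecessary), and the function names are illustrative. The names of the keys in the returned JSON are not listed here, so the sketch returns the decoded body as-is.

```python
import json
import urllib.request

BASE_URL = "https://api-endpoints.alphaedge-ai.com"


def build_model_request(slug: str, api_key: str) -> urllib.request.Request:
    """Build the GET /models/{slug} request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/models/{slug}",
        headers={"X-API-Key": api_key},  # assumption: may not be required on this endpoint
        method="GET",
    )


def fetch_model_metadata(slug: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body."""
    request = build_model_request(slug, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_model_metadata("alpha-audio-v1", "YOUR_API_KEY") performs the network request; build_model_request alone can be used to inspect the URL and headers offline.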

Quick start

Here is a minimal example to get started with the Audio & Transcription API:

Basic example

python
import requests

url = "https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript"
headers = {"X-API-Key": "YOUR_API_KEY"}

# The multipart file field must be named "audio"
with open("/path/to/audio.wav", "rb") as f:
    files = {"audio": ("audio.wav", f, "audio/wav")}
    data = {
        "enable_diarization": "true",
    }
    # Generous timeout: the request covers both upload and transcription
    r = requests.post(url, headers=headers, files=files, data=data, timeout=300)

print(r.status_code)
print(r.json())

bash
curl https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "audio=@/path/to/audio.wav" \
  -F "enable_diarization=false"

javascript
import fs from "node:fs";

// Let FormData set the multipart Content-Type and boundary
const form = new FormData();
form.append("audio", new Blob([fs.readFileSync("/path/to/audio.wav")]), "audio.wav");
form.append("enable_diarization", "true");

const res = await fetch("https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_API_KEY" },
  body: form
});

console.log(res.status, await res.json());

API parameters

Here are the available parameters for the Audio & Transcription API:

The model slug appears only in the URL (/models/{slug}/transcript). In the multipart body, the file field must be named audio (not file). The optional booleans are named enable_diarization and enable_postcorrect.

Accepted boolean values (case-insensitive): true/false, 1/0, yes/no, on/off. Any other value is rejected with an explicit 422, for example:

json
{
  "detail": "Champ booléen invalide pour `enable_diarization`: 'banane'. Valeurs acceptées : 1/0, true/false, yes/no, on/off."
}

Any other multipart field sent is also rejected with 422, including the list of accepted fields (audio, enable_diarization, enable_postcorrect).

Do not set Content-Type manually for multipart: let curl -F, requests (files=…), or fetch(FormData) establish multipart/form-data and the boundary.
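
The field and boolean rules above can be mirrored in a client-side pre-check, so that malformed requests fail before upload. This is a sketch under stated assumptions: the helper names are hypothetical, the error messages are not the gateway's, and the server remains the authority on what it accepts.

```python
ACCEPTED_FIELDS = ("audio", "enable_diarization", "enable_postcorrect")
BOOLEAN_FIELDS = ("enable_diarization", "enable_postcorrect")
TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}


def parse_bool(name: str, value: str) -> bool:
    """Interpret a form value using the documented case-insensitive spellings."""
    lowered = value.lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"invalid boolean for {name}: {value!r}")


def check_options(options: dict[str, str]) -> dict[str, bool]:
    """Validate the non-file form fields; raise ValueError if a documented rule is broken."""
    unknown = sorted(set(options) - set(BOOLEAN_FIELDS))
    if unknown:
        raise ValueError(f"unexpected fields {unknown}; accepted: {list(ACCEPTED_FIELDS)}")
    return {name: parse_bool(name, value) for name, value in options.items()}


print(check_options({"enable_diarization": "Yes", "enable_postcorrect": "0"}))
# {'enable_diarization': True, 'enable_postcorrect': False}
```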

PARAMETER | TYPE | REQUIRED | DEFAULT | DESCRIPTION
audio | File | Yes | - | Audio file to transcribe (multipart). Exact field name: audio.
enable_diarization | boolean | No | false | Enables speaker diarization.
enable_postcorrect | boolean | No | false | Linguistic post-correction of the ASR text (punctuation, capitalization, spelling, removal of repetitions), performed by a call to an external open source model hosted by Novita. Optimized for French; see the Post-correction section below.

Post-correction (French)

Post-correction relies on a call to an external open source model hosted by Novita: the raw ASR text is processed there to add punctuation and capitalization, correct obvious spelling mistakes, and remove the repetitions and stutters produced by speech recognition. It does not rephrase, translate, or change the meaning.

When to use it

  • Transcripts intended for a human reader (meeting notes, subtitles, publishable verbatim).
  • Long ASR outputs containing hesitations and sentences without punctuation.
  • Text that will later be indexed or searched: punctuation and capitalization improve readability and sentence segmentation.

What post-correction does

  • Adds capitalization (sentence starts, proper nouns).
  • Adds missing punctuation.
  • Corrects obvious spelling mistakes.
  • Removes immediate repetitions, stutters, and typical ASR glitches.
  • Corrects proper nouns only when certainty is high.
  • Keeps diarization: any [SPEAKER_xx] markers present in the text are preserved.

What it does not do

  • Does not rephrase, does not add content, does not translate.
  • Does not change the meaning of the text.
  • Does not re-listen to the audio: it corrects only based on the ASR text.

Supported languages

Optimized for French — the same language as the alpha-audio-v1 transcription model. For other languages, do not set enable_postcorrect=true: the text may be altered or returned unchanged with no benefit.

Activation

Simply add enable_postcorrect=true to your multipart request, on the same routes as the standard transcription. No extra field is added to the response: the corrected text is returned directly in the text key of TranscriptResponse.

bash
curl https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "audio=@/path/to/audio.wav" \
  -F "enable_postcorrect=true"

Before / after example

Raw ASR text (enable_postcorrect absent or false):

text
aujourd'hui il y a 2 mondes qui nous entourent d'une part la ville étouffante et polluée et d'autre part la forêt qui est une vraie modèle écologique en effet les problèmes majeurs

Post-corrected text (enable_postcorrect=true):

text
Aujourd'hui, il y a deux mondes qui nous entourent. D'une part, la ville étouffante et polluée, et d'autre part la forêt qui est un vrai modèle écologique. En effet, les problèmes majeurs

Latency

Roughly 1 to 4 seconds per 1,200 characters of transcript. For recordings several minutes long, post-correction runs in parallel and typically adds a few dozen seconds to the total wall-clock time. For a ~1-minute recording, the overhead is on the order of 1 to 3 seconds.

The returned inference_seconds field remains the ASR time (reported by the ASR server). Post-correction time is not exposed in the response.
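
As a worked example of the "1 to 4 seconds per 1,200 characters" figure, the sketch below turns a transcript length into a rough overhead range. It is a plain linear estimate that does not model the parallel processing mentioned above, and the 900-character input (about one minute of speech) is an illustrative assumption rather than a measured value.

```python
CHARS_PER_UNIT = 1200            # from the latency figure above
SECONDS_PER_UNIT = (1.0, 4.0)    # low and high bounds, in seconds


def postcorrect_overhead_seconds(transcript_chars: int) -> tuple[float, float]:
    """Return a (low, high) estimate of the post-correction overhead in seconds."""
    units = transcript_chars / CHARS_PER_UNIT
    low, high = SECONDS_PER_UNIT
    return (units * low, units * high)


# Assumed: roughly 900 characters for a one-minute recording
print(postcorrect_overhead_seconds(900))  # (0.75, 3.0)
```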

Billing

During the re-enablement phase, post-correction is included at no extra charge in the audio price (€0.15/h). No billing change is required on the client side.
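
For budgeting, the hourly price can be applied to the audio_duration_seconds value of a response. This sketch assumes simple pro-rata billing on duration; the actual billing granularity and rounding are not specified on this page.

```python
PRICE_EUR_PER_HOUR = 0.15  # audio price quoted above


def audio_cost_eur(audio_duration_seconds: float) -> float:
    """Estimate the cost of one transcription, assuming pro-rata billing."""
    return audio_duration_seconds / 3600 * PRICE_EUR_PER_HOUR


print(audio_cost_eur(7200))  # 0.3 (two hours of audio)
```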

Combining with diarization

enable_diarization and enable_postcorrect are independent and can be combined. When diarization is active, its [SPEAKER_xx] markers are kept in the post-corrected text.

bash
curl https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "audio=@/path/to/audio.wav" \
  -F "enable_diarization=true" \
  -F "enable_postcorrect=true"
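
Once a diarized transcript is returned, the [SPEAKER_xx] markers can be used to split the text into speaker turns. The sketch below assumes each marker appears inline, immediately before the words of that speaker (for example [SPEAKER_00]); the exact layout is not specified on this page, and the sample string is invented for illustration.

```python
import re

SPEAKER_MARKER = re.compile(r"\[SPEAKER_(\w+)\]")


def split_by_speaker(text: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, utterance) pairs, in order of appearance."""
    # With one capture group, re.split yields [preamble, id1, text1, id2, text2, ...]
    parts = SPEAKER_MARKER.split(text)
    segments = []
    preamble = parts[0].strip()
    if preamble:
        # Text before the first marker has no known speaker
        segments.append(("", preamble))
    for speaker_id, chunk in zip(parts[1::2], parts[2::2]):
        segments.append((f"SPEAKER_{speaker_id}", chunk.strip()))
    return segments


sample = "[SPEAKER_00] Bonjour à tous. [SPEAKER_01] Bonjour, merci."
print(split_by_speaker(sample))
# [('SPEAKER_00', 'Bonjour à tous.'), ('SPEAKER_01', 'Bonjour, merci.')]
```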

Behavior when unavailable

When enable_postcorrect=true is requested but post-correction is not available on the target instance (service unreachable, quota exhausted, server-side disablement), the API returns the raw ASR text with HTTP 200. No error is surfaced to the client, so your integration does not need to handle any special case: the text is simply returned uncorrected.

For deployments that require a “post-corrected or explicit failure” guarantee, a strict mode returning 502 on failure can be enabled in server configuration — contact support.

Best practices

  • For raw verbatim intended for programmatic processing, keep enable_postcorrect=false (default).
  • For a human-readable report, set enable_postcorrect=true and, if there are multiple speakers, enable_diarization=true.
  • Do not enable post-correction on transcripts in languages other than French.

Supported file formats

The AlphaEdge Audio & Transcription API supports a wide variety of audio formats for transcription. Here is the full list of supported formats:

Compressed audio formats

  • MP3 (.mp3) - Most common format, lossy compression
  • AAC (.aac, .m4a) - Apple format, good quality at low bitrate
  • OGG Vorbis (.ogg) - Open source format, efficient compression
  • OPUS (.opus) - Voice-optimized format, excellent for calls
  • WMA (.wma) - Windows Media Audio

Uncompressed and lossless audio formats

  • WAV (.wav) - Uncompressed PCM format, maximum quality
  • FLAC (.flac) - Lossless compression, high quality
  • AIFF (.aiff, .aif) - Uncompressed Apple format

Streaming audio formats

  • WebM Audio (.webm) - Modern web format
  • M4A (.m4a) - Apple container format

Technical specifications

  • Sampling rate: 8 kHz to 48 kHz (recommended: 16 kHz or 44.1 kHz)
  • Bit depth: 16 bit or 24 bit
  • Channels: Mono, stereo, or multi-channel (auto-converted to mono)
  • Size and duration: no strict limit is imposed by the gateway. For large files or long recordings, set a sufficiently high client-side HTTP timeout (upload + transcription).

Video formats (audio extraction)

The API can also extract and transcribe audio from video files:

  • MP4 (.mp4) - Video with audio track
  • AVI (.avi) - Video container format
  • MOV (.mov) - QuickTime format
  • MKV (.mkv) - Open source container format
  • WebM (.webm) - Web video format

Recommendations

  • For voice: MP3 at 128 kbps or WAV 16 kHz mono offer a good quality/size trade-off
  • For music with vocals: WAV or FLAC to preserve quality
  • For phone calls: OPUS or MP3 at 64 kbps mono
  • Avoid very low quality audio files (< 16 kHz) for best results
  • For long recordings, increase the client-side timeout to cover both upload and transcription.

Response format

Synchronous HTTP 200 response — TranscriptResponse schema (useful client fields):

json
{
  "model_slug": "alpha-audio-v1",
  "text": "…",
  "inference_seconds": 1.2,
  "enable_diarization": false,
  "audio_duration_seconds": 45.3,
  "audio_filename": "audio.wav"
}

Response fields:

  • model_slug (string): slug of the model used.
  • text (string): transcribed text (post-corrected if enable_postcorrect=true and the service is available).
  • inference_seconds (number): ASR time reported by the upstream service.
  • enable_diarization (boolean): echo of the value actually applied.
  • audio_duration_seconds (number): duration of the analyzed audio.
  • audio_filename (string): filename received in the multipart.

The internal gateway_wall_ms field is not returned in the client JSON.
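
On the client side, the documented fields can be loaded into a typed structure. The following is a minimal sketch: the dataclass mirrors the fields listed above, reads only those keys (so any extra key in the body is ignored), and raises KeyError if one of them is missing. The class is a client-side convenience, not an official SDK type.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TranscriptResponse:
    model_slug: str
    text: str
    inference_seconds: float
    enable_diarization: bool
    audio_duration_seconds: float
    audio_filename: str

    @classmethod
    def from_json(cls, body: str) -> "TranscriptResponse":
        raw = json.loads(body)
        # Pick only the documented keys so extra fields do not break parsing
        return cls(**{f.name: raw[f.name] for f in fields(cls)})


sample_body = """{
  "model_slug": "alpha-audio-v1",
  "text": "Bonjour.",
  "inference_seconds": 1.2,
  "enable_diarization": false,
  "audio_duration_seconds": 45.3,
  "audio_filename": "audio.wav"
}"""

result = TranscriptResponse.from_json(sample_body)
print(result.model_slug, result.audio_duration_seconds)  # alpha-audio-v1 45.3
```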

Advanced examples

Error handling

Here is how to handle errors properly:

python
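import requests

url = "https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript"
headers = {"X-API-Key": "YOUR_API_KEY"}

try:
    with open("/path/to/audio.wav", "rb") as f:
        files = {"audio": ("audio.wav", f, "audio/wav")}
        data = {"enable_diarization": "true"}
        r = requests.post(url, headers=headers, files=files, data=data, timeout=300)
    # Raise requests.HTTPError on any 4xx/5xx status
    r.raise_for_status()
    print(r.json()["text"])
except requests.HTTPError as e:
    # Keep the status, body, and X-Request-ID for support
    resp = e.response
    print(resp.status_code, resp.text, resp.headers.get("X-Request-ID"))
except requests.RequestException as e:
    # Connection error or client-side timeout
    print(f"Request failed: {e}")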
javascript
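import fs from "node:fs";

const form = new FormData();
form.append("audio", new Blob([fs.readFileSync("/path/to/audio.wav")]), "audio.wav");
form.append("enable_diarization", "true");

try {
  const res = await fetch("https://api-endpoints.alphaedge-ai.com/models/alpha-audio-v1/transcript", {
    method: "POST",
    headers: { "X-API-Key": "YOUR_API_KEY" },
    body: form
  });
  if (!res.ok) {
    // Keep the status, body, and X-Request-ID for support
    console.error(res.status, await res.text(), res.headers.get("X-Request-ID"));
  } else {
    console.log((await res.json()).text);
  }
} catch (err) {
  // Network failure before any HTTP response was received
  console.error("Request failed:", err);
}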

Use cases

Here are some common use cases for the Audio & Transcription API:

1. Meeting transcription

Automatically transcribe meetings for archiving and search.

2. Video subtitling

Generate automatic subtitles for your video content.

3. Podcast transcription

Create transcriptions to improve accessibility and SEO.

Limitations and best practices

Limitations

  • Supported formats: MP3, WAV, M4A, FLAC, AAC, OGG, OPUS, WMA, AIFF, WebM, and video formats (MP4, AVI, MOV, MKV, WebM). The authoritative list is exposed by GET /models/alpha-audio-v1 (accepted_extensions field).
  • Size and duration: no strict limit is enforced by the gateway. For large or long recordings, set a sufficiently high client-side HTTP timeout.
  • Rate limiting: no application-level rate limit is currently enforced by the gateway (no X-RateLimit-* headers). Still, space out your requests under heavy usage to preserve upstream stability.
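
A file's extension can be checked before upload to avoid sending an unsupported format. In this sketch, the fallback set is transcribed from the format lists on this page, while accepted_extensions stands for the list returned by GET /models/alpha-audio-v1. Whether that list includes the leading dot is not stated, so both spellings are normalized. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

# Fallback transcribed from this page; prefer the list served by the API
DOCUMENTED_EXTENSIONS = {
    ".mp3", ".aac", ".m4a", ".ogg", ".opus", ".wma",
    ".wav", ".flac", ".aiff", ".aif", ".webm",
    ".mp4", ".avi", ".mov", ".mkv",
}


def normalize_ext(ext: str) -> str:
    """Lowercase an extension and make sure it starts with a dot."""
    ext = ext.strip().lower()
    return ext if ext.startswith(".") else f".{ext}"


def is_accepted(filename: str, accepted_extensions: Optional[Iterable[str]] = None) -> bool:
    """Return True when the file extension appears in the accepted list."""
    if accepted_extensions is None:
        allowed = DOCUMENTED_EXTENSIONS
    else:
        allowed = {normalize_ext(e) for e in accepted_extensions}
    return Path(filename).suffix.lower() in allowed


print(is_accepted("Call.MP3"))   # True
print(is_accepted("notes.txt"))  # False
```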

Best practices

  • Use good quality audio files (minimum 16 kHz) for best results
  • For long recordings, increase the client-side timeout to cover both upload and transcription.
  • Handle errors properly with try/except blocks
  • Retry with exponential backoff; on 429/503/504, honor the Retry-After header returned by the gateway
  • Cache results when possible to reduce costs
  • Keep the X-Request-ID header returned by the API: it eases support in case of incident
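
The retry advice above can be sketched as follows. The helper takes a send callable that returns (status, headers, body), so it stays independent of any HTTP library; that shape, the function names, and the default delays are assumptions made for illustration. 502 is included among the retryable codes because the status table on this page associates it with a Retry-After header. Only integer Retry-After values are handled.

```python
import time
from typing import Callable, Mapping, Tuple

RETRYABLE = {429, 502, 503, 504}
Response = Tuple[int, Mapping[str, str], str]


def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next attempt (attempt starts at 0)."""
    try:
        # Honor an integer Retry-After when the gateway provides one.
        # Note: a plain dict lookup is case-sensitive, unlike requests' headers.
        return max(0.0, float(int(headers.get("Retry-After", ""))))
    except ValueError:
        # Otherwise fall back to capped exponential backoff: 1, 2, 4, ...
        return min(cap, base * 2 ** attempt)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, headers, body
        sleep(retry_delay(headers, attempt))
    raise AssertionError("unreachable")
```

With requests, send could be a small function that performs the POST and returns (r.status_code, r.headers, r.text); the file must be reopened or rewound on each attempt.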

Available models

To view all available audio & transcription models with their detailed specifications, visit the Our models page and filter by type.

Useful HTTP status codes

Short reference for integration (typical codes returned by the gateway):

HTTP status | Typical case
401 | Key missing or invalid (X-API-Key or Authorization: Bearer).
403 | Forbidden host (access by IP or wrong domain); or GET /health via the public domain.
404 | Unknown model (« Modèle introuvable. »).
422 | Non-compliant multipart (missing audio field, unrecognized boolean value, disallowed keys, etc.). Explicit message in French.
500 | Internal error; the response body includes a request_id field for support.
502 | Error from the downstream transcription service. Retry-After: 5 header recommended.
503 | Service unavailable or still starting. A Retry-After: 10 header is added automatically.
504 | Gateway-side timeout (upstream key verification). Retry-After: 5 header.

For a detailed list of error codes, see also Error codes.