Aditu live API

The live ASR (Automatic Speech Recognition) or transcription can be done in two ways:

For long continuous speech: using secure websocket connections
For short speech sections: using HTTP POST calls

The address for both is: live.aditu.eus

Get number of available transcribers (via secure websocket connections)

Number of available transcribers for a language. A JSON message is received at the beginning and then each time the number of available transcribers changes.

Only start live transcription if there are transcribers available, that is, if the number of available transcribers is not 0. Otherwise, the transcription will return an error.

Websocket connection example


wss://live.aditu.eus/[LANG]/client/ws/status

[LANG] can be either "eu", "es" or "eu-es"

Received messages example



Monolingual:

{
	"num_workers_available": 1
}
...
{
	"num_workers_available": 0
}
...
{
	"num_workers_available": 1
}

Bilingual:

{
	"num_workers_available": "1-1"
}
...
{
	"num_workers_available": "1-0"
}
...
{
	"num_workers_available": "0-0"
}
...
{
	"num_workers_available": "0-1"
}

Live transcription via secure websocket connections


wss://live.aditu.eus/[LANG]/client/ws/speech

[LANG] can be either "eu", "es" or "eu-es"

Starts a live transcription, where a raw audio signal is sent through the websocket and the recognised words and sentences are received.

This is the preferred use mode for continous transcribing of long speech sections. An example is live transcription from microphone input.

Websocket connection

The server assumes by default that incoming audio is sent using 16 kHz, mono, 16 bit little-endian format. This can be overriden using the 'content-type' request parameter in the websocket URL.

By default, the server does not perform any unexpansion of numbers, acronyms, abbreviations, etc. This can be overriden by sending an 'unexpand' parameter with the content 'true', in which case numbers, accronyms, abbreviations, etc. will come in their short form.

By default, the server does not punctuate or apply case to the recognized words. This can be overriden by sending a 'punctuationcase' parameter with the content 'true'.

After finishing the live transcription, the audio sent is saved in an audio file and recorded in the user's file logs together with the produced transcription, subtitles, etc. This can be overriden by sending a 'save' parameter with the content 'false', in which case no files will be saved and thus no space will be used (only a log of the time used in the transcription).

Parameters


content-type (optional) in GStreamer 1.0 caps format; default=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
unexpand (optional); default=false
punctuationcase (optional); default=false
save (optional); default=true

Websocket connection examples


wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1
wss://live.aditu.eus/eu/client/ws/speech?save=false
wss://live.aditu.eus/eu/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,+format=(string)S16LE,+channels=(int)1&save=false

Authentication

Before starting to send the audio, a text message with the API id and the API key must be sent for authentication and then wait for an authorization confirmation message in JSON format. If there has been a problem with authentication or the account has no time credit left, the JSON message will say so and the websocket must be closed.

Authentication message example


api_id=<YOUR_API_ID> api_key=<YOUR_API_ID>

Authorization confirmation message examples


{
	"status": 0,
	"message": "Authentication OK"
}


{
	"status": 1,
	"message": "No workers available"
}


{
	"status": 2,
	"message": "No time credit"
}


{
	"status": 3,
	"message": "No space left"
}


{
	"status": 4,
	"message": "Authentication error: no user with those credentials"
}


{
	"status": 5,
	"message": "Authentication error: no database"
}


{
	"status": 6,
	"message": "Authentication error: credentials incorrect"
}

Sending audio

After the authorization is received, the sending of the audio can begin.

Audio should be sent to the server in raw blocks of data, using the encoding specified when session was opened. It is recommended that a new block is sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.

After the last block of audio data, a text message containing the 3-byte ANSI-encoded string "EOS" ("end-of-stream") needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.

After sending "EOS", the client has to keep the websocket open to receive the final recognition results from the server. Server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same websocket after an "EOS" has been sent. In order to process a new audio stream, a new websocket connection has to be created by the client.

Audio block examples


Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿùõ×ö"úr`ò@~?Å$Ö³!ÀõY3n...
á!â,¼"´*'ðS¸ïuñU@~?Å$Ö³!ÀõY3n´ÿºÿäÿøÿÞÿ·ÿÎÿðÿÍÿ§ÿÖÿ...
...
EOS

Reading results

Server sends recognition results and other information to the client using the JSON format. The response can contain the following fields:

status: response status (integer), see codes below
message: (optional) status message
result: (optional) recognition result, containing the following fields:

hypotheses: recognized words, a list with each item containing the following:

transcript: recognized words
confidence: (optional) confidence of the hypothesis (float, 0 through 1)
Likelihood: (optional) likelihood of the hypothesis (float)
language: (optional) language of the hypothesis (for bilingual)

final: true when the hypothesis is final, i.e., doesn't change any more

The following status codes are currently in use:

0: Success. Usually used when recognition results are sent.
2: Aborted. Recognition was aborted for some reason.
1: No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
9: Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

Websocket is always closed by the server after sending a non-zero status update.

Server transcribes incoming audio on the fly. For each sentence or audio segment between silences, many non-final hypotheses are sent, followed by one final hypothesis. Non-final hypotheses are used to present partial recognition hypotheses to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment. The final hypothesis overrides the non-final hypotheses for the sentence or segment. Client is reponsible for presenting the results to the user in a way suitable for the application.

Likewise, in bilingual live transcription hypotheses for both of the languages are sent. Client is reponsible for presenting the results to the user in a way suitable for the application.

After sending a final hypothesis for a segment, server starts decoding the next segment or closes the connection if all audio sent by the client has been processed.

Recognition results examples


{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"result":
	{
		"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN ARI."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 0,
	"segment-start": 1.21,
	"segment-length": 3.29,
	"total-length": 2.1,
	"result":
	{
		"hypotheses": [{"likelihood": 1329.08, "confidence": 0.9999984230216836, "transcript": "BERRIZ ERE PROBATZEN ARI NAIZ."}],
		"final": true
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"result":
	{
		"hypotheses": [{"transcript": "BIGARREN."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"result":
	{
		"hypotheses": [{"transcript": "BIGARREN ESALDIA."}],
		"final": false
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}
...
{
	"status": 0,
	"segment": 1,
	"segment-start": 6.63,
	"segment-length": 8.11,
	"total-length": 1.5,
	"result":
	{
		"hypotheses": [{"likelihood": 713.558, "confidence": 0.9995114164171041, "transcript": "BIGARREN ESALDIA DA HAU."}],
		"final": true
	},
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}

Live transcription via HTTP POST


https://live.aditu.eus/[LANG]/client/http/recognize

[LANG] can be either "eu", "es" or "eu-es"

Immediately returns the transcription of a short section of speech.

This is the preferred use mode for transcription of pre-segmented sentences. An example is the transcription of a sentence for a smart speaker.

Headers

The server assumes by default that incoming audio is sent using 16 kHz, mono, 16 bit little-endian format. This can be overriden using the 'Content-Type' header.

If there is a 'Transfer-Encoding' header set to 'chunked', it indicates that the body is in chunked mode (see what this means below).

The 'unexpand' header indicates whether numbers, accronyms, abbreviations, etc. must come in their short form (defaults to false).

The 'punctuationcase' header indicates whether punctuation signs and case must be assigned (defaults to false).

Authentication credentials can be sent in the headers via the 'api_id' and 'api_key' headers, or in the body (see below).

The 'save' header indicates whether the audio and the transcription are to be saved in the user's file list (defaults to false).

Example


Content-Type (optional) in MIME format; default=audio/x-raw-int; rate=16000
Transfer-Encoding (optional)
unexpand (optional); default=false
punctuationcase (optional); default=false
api_id (optional)
api_key(optional)
save (optional); default=false

POST body

If there is a 'Transfer-Encoding' header set to 'chunked', then the body with the audio can be sent in chunks for the transcriber to start work without waiting for the whole audio.

Each chunk of the POST body is composed of:

a line with a binary message with the length of the chunk in hex format
a line with a binary message with the chunk

To end the transmission, a final chunk of 0 length must be sent, that is, a binary message with the text '0' plus two line jumps.

Without a 'Transfer-Encoding' header set to 'chunked', then the audio is sent in the body.

Authentication

The authentication can be done in the header (see above), or else at the beginning of the body as explained here.

In chunked mode, a chunk with the API id and the API key must be sent before starting to send the audio.

Authentication chunk example (remember, it is preceded by a line with the length of the chunk and followed by a line jump, and it is binary)


api_id=<YOUR_API_ID> api_key=<YOUR_API_ID>

In non-chunked mode, the API id and the API key followed by a newline must be sent before starting to send the audio.

Authentication line example


api_id=<YOUR_API_ID> api_key=<YOUR_API_ID>\r\n

Sending audio

In chunked mode, audio in raw format is sent in chunks as defined above.

Audio chunk examples (remember, it is preceded by a line with the length of the chunk and followef by a line jump, and it is binary)


Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿùõ×ö"úr`ò@~?Å$Ö³!ÀõY3n...

In non-chunked mode, the whole audio in raw format is sent in the body.

Audio body example


Öþ×ý¸þíþøþ:ÿ ÿ-ÿýÿFÿBÿùõ×ö"úr`ò@~?Å$Ö³!ÀõY3n...

Response

Server sends recognition results and other information to the client using the JSON format. The response can contain the following fields:

status: response status (integer), see codes below
message: (optional) status message (only in case of error)
hypotheses: (optional) recognized words (only in case of success), a list with each item containing the following:

transcript: recognized words
confidence: confidence of the hypothesis (float, 0 through 1)
likelihood: likelihood of the hypothesis (float)
language: (optional) language of the hypothesis (for bilingual)

The following status codes are currently in use:

0: Success. Usually used when recognition results are sent.
2: Aborted. Recognition was aborted for some reason.
1: No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
9: Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

In bilingual live transcription hypotheses for both of the languages are sent. Client is reponsible for presenting the results to the user in a way suitable for the application.

Recognition results examples



Monolingual:

{
	"status": 0,
	"result": {"hypotheses": [{"likelihood": 1329.08, "confidence": 0.9999984230216836, "transcript": "BERRIZ ERE PROBATZEN ARI NAIZ."}],
	"id": "9210390c-0b4b-42e6-aad7-a75567d7629f"
}

Bilingual:

{
	"status": 0,
	"result": {"hypotheses": [{"transcript": "BERRIZ ERE PROBACH EN HARINA.", "confidence": 0.9941423102866637, "likelihood": 1038.07, "language": "es"}], "final": true},
	"id": "9a165b65-b1db-4e5d-9edf-ebd529a22c91"
}
...
{
	"status": 0,
	"result": {"hypotheses": [{"transcript": "BERRIZ ERE PROBATZEN ARI NAIZ.", "confidence": 0.9999984230216836, "likelihood": 1265.48, "language": "eu"}], "final": true},
	"id": "9a165b65-b1db-4e5d-9edf-ebd529a22c91"
}

Client software examples

Websockets

Javascript

Demo of a web page that captures audio from the microphone and calls the websockets live transcription API in Javascript: https://live.aditu.eus/js-dictate-example/demo.html. The Javascript code can be downloaded from there using the browser's developer tools.

It is based on dictate.js: https://kaljurand.github.io/dictate.js, https://github.com/Kaljurand/dictate.js.

Python

Python client that calls the websockets live API with the contents of a raw audio file: https://live.aditu.eus/client.py.

It is based on client.py from kaldi-gstreamer-server: https://github.com/alumae/kaldi-gstreamer-server/blob/master/kaldigstserver/client.py.

React Native

React Native code for calling the websocket speech recognition service from a mobile app: https://live.aditu.eus/reactnative_ws.zip

HTTP POST

Python

Python clients that call the HTTP POST live API with the contents of a raw audio file:

Non-chunked, authenticaton in body: https://live.aditu.eus/client_http.py
Non-chunked, authenticaton in header: https://live.aditu.eus/client_http_authinheader.py
Chunked, authenticaton in body: https://live.aditu.eus/client_http_chunked.py
Chunked, authenticaton in header: https://live.aditu.eus/client_http_chunked_authinheader.py

React Native

React Native code for calling the HTTP speech recognition service from a mobile app: https://live.aditu.eus/reactnative_http.zip