Quick start
- Log in to obtain an authorization token called access_token
- Use access_token to obtain a task authorization token called ntx_token
- Run a recognition task using ntx_token
Tokens
JWT (JSON Web Token) is used for authentication and authorization.
Access token
Encoded information about the authenticated user.
curl \
-H "Content-Type: application/json" \
-X POST \
-d '{"username":"myid","password":"mysecret"}' \
$AUDIENCE/login/access-token
Ntx token
An access token extended with permission to run the selected task. Its expiration is inherited from the access token.
The selected task is identified by its id and label.
curl \
-H "Content-Type: application/json" \
-H "ntx-token: $ACCESS_TOKEN" \
-X POST \
-d '{"id": "ntx.v2t.engine.EngineService/pl/t-broadcast/v2t", "label": "vad+v2t+ppc" }' \
https://$DOMAIN/store/ntx-token
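The two curl calls above can be chained in code. The following is a minimal sketch rather than an official client: the request builders mirror the curl examples, while `send` and `obtainNtxToken` are hypothetical helpers that assume a runtime with a global `fetch` (for example Node 18 or later) and assume the responses carry the `accessToken` and `ntxToken` fields shown in the sample responses in the Expiration section.

```javascript
// Build the two token requests as plain objects (no network access here).
// `audience` and `domain` stand in for $AUDIENCE and $DOMAIN.
function buildLoginRequest(audience, username, password) {
  return {
    url: `${audience}/login/access-token`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, password }),
  };
}

function buildNtxTokenRequest(domain, accessToken, task) {
  return {
    url: `https://${domain}/store/ntx-token`,
    method: "POST",
    // The access token is passed in the ntx-token header, as in the curl call.
    headers: { "Content-Type": "application/json", "ntx-token": accessToken },
    body: JSON.stringify({ id: task.id, label: task.label }),
  };
}

// Hypothetical helper: sends a request with fetch and parses the JSON reply.
async function send(request) {
  const { url, ...init } = request;
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return response.json();
}

// Assumed flow: log in, then exchange the access token for an ntx token.
async function obtainNtxToken(audience, domain, credentials, task) {
  const login = await send(
    buildLoginRequest(audience, credentials.username, credentials.password));
  const ntx = await send(buildNtxTokenRequest(domain, login.accessToken, task));
  return ntx.ntxToken;
}
```

Keeping the builders separate from the network call makes the request shape easy to inspect or unit test without contacting the API.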
Using ntx-token
http
Set the ntx-token: $NTX_TOKEN header
curl -H "ntx-token: $NTX_TOKEN" https://$DOMAIN/service
websocket
Set the ntx-token=$NTX_TOKEN query param
var ws = new WebSocket("wss://$DOMAIN/service?ntx-token=$NTX_TOKEN");
grpc
Set ntx-token=$NTX_TOKEN using GRPC metadata
var meta = new grpc.Metadata();
meta.add("ntx-token", "$NTX_TOKEN");
client.send(data, meta, callback);
Caching recommendations
- should be cached for at least some fraction of its valid time (e.g. 3/4)
- should attempt at least one renewal when a request using this token returns a 401 or 403 status code
- should be renewed before expiration (e.g. when less than 1/4 of the token's valid time remains) to gracefully handle temporary API unavailability
ntx token
- the renewal interval can be set to a shorter interval when fast automatic are configured
The token's valid duration may change with every API call. Always base calculations on the actual returned values.
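The renewal rules above can be expressed as two small predicates. This is a sketch under stated assumptions: the function names are illustrative, all times are Unix timestamps in seconds, and the 1/4 threshold is simply the example value from the recommendations.

```javascript
// issuedAt: when the token was issued (e.g. the iat claim).
// expiresAt: the expiration returned with the token (or the exp claim).
function shouldRenew(issuedAt, expiresAt, now, remainingFraction = 0.25) {
  const validFor = expiresAt - issuedAt;
  if (validFor <= 0) return true; // malformed or already expired token
  // Renew once less than the given fraction of the valid time remains.
  return expiresAt - now < validFor * remainingFraction;
}

// Attempt one renewal when a request is rejected with 401 or 403.
function shouldRetryWithNewToken(statusCode, alreadyRetried) {
  return !alreadyRetried && (statusCode === 401 || statusCode === 403);
}
```

Because the valid duration may change between calls, `issuedAt` and `expiresAt` should be taken from the most recent token rather than from a constant.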
Expiration
Each token contains expiration information as a Unix timestamp (number of seconds since the epoch). The expiration time is returned in the response and is also present in the decoded token's exp attribute.
{
"iss": "test-2e240644-8179-44be-9339-7de137131b65",
"iat": 1538842805,
"exp": 1538846000,
"aud": [
"https://myaudience.com"
],
"sub": "admin"
}
{
"accessToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ0ZXN0LTJlMjQwNjQ0LTgxNzktNDRiZS05MzM5LTdkZTEzNzEzMWI2NSIsImlhdCI6MTUzODg0MjgwNSwiZXhwIjoxNTM4ODQ2MDAwLCJhdWQiOlsiaHR0cHM6Ly9teWF1ZGllbmNlLmNvbSJdLCJzdWIiOiJhZG1pbiIsInBlcm1pc3Npb25zIjpbInRhc2s6cnVuOioiXSwiZW1haWwiOiJhZG1pbkBtYWlsLmNvbSJ9.uWU4YaQyGoP99VeZ1pmPJU7Q3_YAsfRkWKvdQCut0Qo",
"expiresAt": 1508577982
}
{
"ntxToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ0ZXN0LTJlMjQwNjQ0LTgxNzktNDRiZS05MzM5LTdkZTEzNzEzMWI2NSIsImlhdCI6MTUzODg0MjgwNSwiZXhwIjoxNTM4ODQ2MDAwLCJhdWQiOlsiaHR0cHM6Ly9teWF1ZGllbmNlLmNvbSJdLCJzdWIiOiJhZG1pbiIsInBlcm1pc3Npb25zIjpbInRhc2s6cnVuOioiXSwiZW1haWwiOiJhZG1pbkBtYWlsLmNvbSIsInRhc2siOnsiaWQiOiJudHgudjJ0LmVuZ2luZS5FbmdpbmVTZXJ2aWNlL2N6L2QtZ2VuZXJhbC92MnQiLCJjZmciOnsiaW1hZ2UiOiJuZ2lueCIsImNwdXMiOjEuNywibWVtIjozMDcyfSwibGFiZWwiOiJ2MnQiLCJyb2xlIjoidW5saW1pdGVkIn19.JgYkfjDO96wYEvcf22_-Y-wjunt31eYOr0VRMubkLiU",
"expiresAt": 1508577982
}
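Reading the exp attribute from a token only requires decoding the middle segment of the JWT. The sketch below assumes Node.js (it uses `Buffer`) and does not verify the signature, so it is suitable for scheduling renewals but not for trusting token contents. The function names are illustrative.

```javascript
// Decode the payload (second segment) of a JWT without verifying it.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("not a JWT: expected 3 segments");
  // Node's "base64" decoder also accepts the URL-safe alphabet used by JWT.
  return JSON.parse(Buffer.from(parts[1], "base64").toString("utf8"));
}

// Seconds left until the token expires (negative when already expired).
function secondsUntilExpiry(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  return decodeJwtPayload(token).exp - nowSeconds;
}
```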
Selecting task
Every task is identified by a unique id and label pair.

The task id and label can be changed only when upgrading to a new major version.
Overview
The V2T API is implemented as a bidirectional flow of EngineStream messages, which are Protobuf 3 messages defined in engine.proto. A stream is started by sending and receiving a start message, followed by input/output data transfer (with optional flow control), and terminated by sending and receiving an end message.
EngineStream
message EngineStream {
  oneof payload {
    EngineContextStart start = 1;
    EventsPush push = 2;
    EventsPull pull = 3;
    EngineContextEnd end = 4;
  }
}
Start
message EngineContextStart {
  EngineContext context = 1;
}
Sent as the first message from the client and also received as the first message from the server (except in case of error). When sent from the client, it must have context set. When received from the server, context may be unset.
Push
message EventsPush {
  Events events = 1;
}
Used for sending and receiving data. See Events for detailed information.
Pull
message EventsPull {
}
Sent only when flow control is enabled. When received by the client, it means the server is requesting new data. When sent by the client, it means the client is prepared to receive results.
End
message EngineContextEnd {
  string error = 1;
}
Sent as the last message from the client and also received as the last message from the server. The operation completed successfully when the error attribute is blank. The client can also set the error attribute to signal what type of error it encountered (graceful shutdown).
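The start, push/pull, end ordering described above can be checked with a small state machine. This is an illustrative sketch, not part of any client library: messages are modelled as plain objects with exactly one of the keys `start`, `push`, `pull` or `end`, and it assumes that in the error case the server sends `end` without a preceding `start`.

```javascript
// Returns which payload of the EngineStream oneof a message carries.
function payloadKind(message) {
  const kinds = ["start", "push", "pull", "end"].filter((k) => k in message);
  if (kinds.length !== 1) throw new Error("expected exactly one payload");
  return kinds[0];
}

// Checks the order of messages flowing in one direction of the stream.
function validateSequence(messages) {
  let state = "new"; // new -> started -> ended
  for (const message of messages) {
    const kind = payloadKind(message);
    if (state === "ended") return { ok: false, reason: "message after end" };
    if (state === "new") {
      // Assumption: on error the stream may consist of a single end message.
      if (kind === "end") { state = "ended"; continue; }
      if (kind !== "start") {
        return { ok: false, reason: "first message must be start" };
      }
      state = "started";
    } else if (kind === "start") {
      return { ok: false, reason: "duplicate start" };
    } else if (kind === "end") {
      state = "ended";
    }
  }
  return state === "ended" ? { ok: true } : { ok: false, reason: "missing end" };
}
```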
Context configuration
message EngineContext {
  // voice activity detection
  message VADConfig {}
  // punctuation config
  message PNCConfig {}
  // post-processing config
  message PPCConfig {}
  // voice to text
  message V2TConfig {
    // set to enable voice activity detection
    VADConfig withVAD = 1;
    // set to enable post-processing
    PPCConfig withPPC = 3;
    // modify used lexicon
    Lexicon withLexicon = 4;
    // set to enable automatic punctuation
    PNCConfig withPNC = 5;
  }
  enum AudioChannel {
    // downmix all channels to mono
    AUDIO_CHANNEL_DOWNMIX = 0;
    // select only left channel
    AUDIO_CHANNEL_LEFT = 1;
    // select only right channel
    AUDIO_CHANNEL_RIGHT = 2;
  }
  AudioFormat audioFormat = 1;
  AudioChannel audioChannel = 2;
  oneof config {
    VADConfig vad = 3;
    V2TConfig v2t = 5;
    PPCConfig ppc = 9;
  }
}
Audio format
message AudioFormat {
  enum ChannelLayout {
    ...
  }
  enum SampleFormat {
    ...
  }
  enum SampleRate {
    ...
  }
  // make best effort guess about what audio format is used
  message AutoDetect {
    // upper limit of bytes used to detect audio format automatically, [32, INT_MAX]
    uint32 probeSizeBytes = 1;
  }
  // raw data
  message PCM {
    SampleFormat sampleFormat = 1;
    SampleRate sampleRate = 2;
    ChannelLayout channelLayout = 3;
  }
  // detect audio format from provided audio header bytes
  message Header {
    bytes header = 1;
  }
  oneof formats {
    AutoDetect auto = 1;
    PCM pcm = 2;
    Header header = 3;
  }
}
Events
message Events {
  // list of events, may be empty
  repeated Event events = 1;
  // when true, the events form a whole non-final hypothesis (it will be fully replaced by the final hypothesis)
  bool lookahead = 2;
  // optional, for client-side processing
  uint64 receivedAt = 3;
  // optional, for parallel client-side processing
  uint32 channelId = 4;
}
All attributes representing duration and offset are expressed in ticks, where 1 tick is 100 ns.
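Since 1 tick is 100 ns, there are 10,000,000 ticks in a second. A minimal conversion sketch follows; the function names are illustrative. It accepts strings because JSON output encodes these uint64 values as strings, and it uses plain numbers, which stay exact for streams far longer than any realistic recording.

```javascript
const TICKS_PER_SECOND = 1e7; // 1 tick = 100 ns

// Accepts a number or a decimal string such as "5000000".
function ticksToSeconds(ticks) {
  return Number(ticks) / TICKS_PER_SECOND;
}

function secondsToTicks(seconds) {
  return Math.round(seconds * TICKS_PER_SECOND);
}

// ticksToSeconds("5000000") is 0.5, i.e. half a second into the stream.
```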
Event
message Event {
  // point in time on the timeline (in ticks)
  message Timestamp {
    oneof value {
      // default point in time
      uint64 timestamp = 1;
      // point in time used for recovery in case of failure (can be skipped when recovery is not implemented)
      uint64 recovery = 2;
    }
  }
  // transcribed text
  message Label {
    oneof label {
      // recognized chunk of text (e.g. single word)
      string item = 1;
      // joins two items (e.g. string containing space character)
      string plus = 2;
      // an item that is not a speech
      string noise = 3;
    }
  }
  // audio input/output
  message Audio {
    // raw audio body
    bytes body = 1;
    // optional offset from stream start
    uint64 offset = 5;
    // optional raw audio body duration
    uint64 duration = 6;
  }
  // additional information about stream
  message Meta {
    message Confidence {
      double value = 1;
    }
    oneof body {
      Confidence confidence = 1;
    }
  }
  oneof body {
    Timestamp timestamp = 1;
    Label label = 2;
    Audio audio = 3;
    Meta meta = 4;
  }
}
The event stream is aligned to the timeline using timestamps.
Mapping events to text
Input
// ts -> timestamp
// rec -> recovery timestamp
rec(0), item("Hello"),
ts(1), plus(", "), item("how"), plus(" "), item("are"),
ts(2),
ts(3), item("you"), plus("?"),
rec(4)
Output
// skip all events except label/item and label/plus
Hello, how are you?
0: Hello
1: , how are
3: you?
Unknown events must be skipped.
Modules
Modules are components that process supported input events and emit output events. Active modules are determined by the task label (part of the task identity when selecting a task).
Voice activity detection (vad)
Outputs label/item events marking where speech and non-speech (silence, music) occur in the audio.
Input
- audio
Output
- item/label ['on', 'off']
- timestamp
Voice To Text (v2t)
Transforms audio events (and optional vad events) to timestamps and labels.
Input
- audio
- vad
Output
- item/label
- timestamp
Post-processing (ppc)
Transforms labels to labels according to rules (e.g. one hundred -> 100).
Input
- label
- timestamp
Output
- label
- timestamp
Punctuation (pnc)
Transforms labels to labels according to rules (Hello World! -> Hello, World)
Input
- label
- timestamp
Output
- label
- timestamp
Combining modules
Modules can be combined by joining them with a plus sign in the task label when selecting a task:
vad
vad+v2t
v2t+ppc
vad+v2t+ppc+pnc
Not all module combinations are valid (e.g. vad+ppc).
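A client can split a task label into its modules before requesting a token. The sketch below only checks that each name is one of the four modules described here. It does not know which combinations the server accepts, so a label that passes this check can still be rejected. The function name is illustrative.

```javascript
const KNOWN_MODULES = ["vad", "v2t", "ppc", "pnc"];

// Splits a label such as "vad+v2t+ppc" into module names.
function parseTaskLabel(label) {
  const modules = label.split("+");
  const unknown = modules.filter((name) => !KNOWN_MODULES.includes(name));
  if (unknown.length > 0) {
    throw new Error(`unknown module(s): ${unknown.join(", ")}`);
  }
  return modules;
}
```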
Lexicon
message Lexicon {
  message UserItem {
    // output symbol (required)
    string sym = 1;
    // pronunciation in phonetic alphabet (optional)
    string pron = 2;
    // grapheme (optional)
    string graph = 3;
    // symbol already exists in lexicon (returned by server)
    bool foundInLex = 4;
  }
  message NoiseItem {
    // output symbol
    string sym = 1;
    // pronunciation
    string pron = 2;
  }
  message MainItem {
    // output symbol
    string sym = 1;
    // pronunciation
    string pron = 2;
    // mount, equal to sym if blank
    string mnt = 3;
  }
  message LexItem {
    oneof item {
      UserItem user = 1;
      MainItem main = 2;
      NoiseItem noise = 3;
    }
  }
  // number 1 reserved
  repeated LexItem items = 2;
  // list of allowed phonemes (returned by server)
  string alpha = 3;
}
Custom words are added using user items.
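As an illustration, a lexicon containing only user items could be assembled as below. This is a sketch under a loud assumption: it follows the standard proto3 JSON mapping of the Lexicon message above, and the exact JSON accepted by the service (for example by the File API lexicon field) is not confirmed by this document. The function name is hypothetical.

```javascript
// Builds a Lexicon-shaped object from a list of custom words.
// Each word needs `sym`; `pron` and `graph` are optional.
function buildUserLexicon(words) {
  return {
    items: words.map(({ sym, pron, graph }) => {
      if (!sym) throw new Error("sym is required for user items");
      const user = { sym };
      if (pron) user.pron = pron;
      if (graph) user.graph = graph;
      return { user }; // LexItem with the `user` branch of the oneof set
    }),
  };
}
```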
Flow control
Explicit flow control together with chunk size gives the client control over how much data is in flight between client and server. It also allows for a simpler implementation, as the send/receive logic can be implemented in a single thread as a simple state machine. It is enabled by default but can be disabled.
Flow control should be disabled in environments with varying levels of latency and throughput (e.g. mobile devices, desktop applications) and should be enabled for large-scale processing (e.g. inside a cluster) and offline recordings.
The File API does not support any type of flow control.
Disabling flow control
websocket
Set the no-flow-controle=true query param
var ws = new WebSocket("wss://test.nanogrid.cloud/service?no-flow-controle=true");
grpc
Set no-flow-controle=true using GRPC metadata
var meta = new grpc.Metadata();
meta.add("no-flow-controle", "true");
client.send(data, meta, callback);
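For the websocket case, the token and the flow control switch are both query parameters, so the URL can be assembled with the standard `URL` class instead of string concatenation. A minimal sketch follows: the function name is illustrative, and the `/ws/v1/v2t` path is the Websocket API endpoint listed later in this document.

```javascript
// Builds the websocket URL with the ntx-token and, optionally,
// the parameter that disables flow control.
function buildWebSocketUrl(domain, ntxToken, { flowControl = true } = {}) {
  const url = new URL(`wss://${domain}/ws/v1/v2t`);
  url.searchParams.set("ntx-token", ntxToken);
  if (!flowControl) url.searchParams.set("no-flow-controle", "true");
  return url.toString();
}

// Usage in a browser:
//   var ws = new WebSocket(buildWebSocketUrl(domain, token, { flowControl: false }));
```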
Communication with flow control

The client must reply to a pull within 10 seconds (send a push with an empty events list if no new data is available); otherwise the connection may be terminated due to client_inactivity_timeout.
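The reply rule can be sketched as follows. This is an illustration only: `pendingChunks` is a hypothetical client-side queue of audio chunks, the messages are modelled as plain objects mirroring EngineStream, and how the audio body is encoded (raw bytes for GRPC, typically base64 text in JSON) depends on the transport and is an assumption here.

```javascript
// Builds the push message a client sends in reply to a pull.
// When no audio is waiting, an empty events list keeps the connection alive.
function replyToPull(pendingChunks) {
  const chunk = pendingChunks.shift();
  const events = chunk === undefined ? [] : [{ audio: { body: chunk } }];
  return { push: { events: { events } } };
}

// True when a pull has gone unanswered for longer than the limit.
function pullDeadlineExceeded(pullReceivedAtMs, nowMs, limitMs = 10000) {
  return nowMs - pullReceivedAtMs > limitMs;
}
```

A client loop would call `replyToPull` for every pull it receives and can use `pullDeadlineExceeded` to detect that it is about to hit the inactivity timeout.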
Communication without flow control

Chunk size
The total size of all events in the array (for audio it is usually just one event) must be below 3 MB.
Example: for 256 kbps audio, a good chunk size may be 32 kbit (4 kB; a new chunk is created every 125 ms).
Increasing chunk size reduces network overhead but increases response delay.
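The chunk size follows directly from the bitrate and the chunk duration: bits per second times seconds, divided by 8 for bytes. A small sketch of the arithmetic, assuming 1 kbps = 1000 bit/s (the function name is illustrative):

```javascript
// Size in bytes of one audio chunk for a constant-bitrate stream.
function chunkSizeBytes(bitrateBitsPerSecond, chunkDurationMs) {
  return (bitrateBitsPerSecond * chunkDurationMs) / 1000 / 8;
}

// 256 kbps audio cut every 125 ms:
// chunkSizeBytes(256000, 125) is 4000 bytes, i.e. 32 kbit per chunk.
```

Doubling the chunk duration doubles the chunk size and halves the number of messages, at the cost of results arriving later.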
Sending live audio with flow-control
Do not set your chunk size too low (e.g. less than 125 ms) when using flow control, as the network round-trip time may slow your audio transport considerably. Disable flow control when sending chunks at short intervals, e.g. every 50 ms.
GRPC API
Use the engine.proto file to generate code for your target language.
Endpoint: ${DOMAIN}:443
ntx - .NET library CLI client
Websocket API
GRPC alternative for browser-based applications.
Endpoint: wss://${DOMAIN}/ws/v1/v2t
ntx-js - Javascript library for Websocket API with examples.
File API
Endpoint: https://${DOMAIN}/api/v1/file/v2t
ntx-js - Javascript library for File API.
Limitations
- Much higher latency compared to GRPC and Websocket API.
- Whole file needs to be uploaded before processing can start.
- Max file upload size or upload speed throttling may be enabled based on SLA.
Output is one event per line.
Form data
- file
  - file to process
- lexicon (optional)
  - lexicon with user words
- channel (optional)
  - which channel to use [left, right, downmix], defaults to downmix
curl --header "ntx-token: $NTX_TASK_TOKEN" -F file=@basetext.mp3 -F lexicon=@userlex.json -F channel=right https://mycluster.nanogrid.cloud/api/v1/file/v2t
{"start":{}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"5000000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"5400000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"8000000"}},{"timestamp":{"recovery":"8000000"}},{"label":{"plus":" "}},{"label":{"item":"generálové"}},{"timestamp":{"timestamp":"17800000"}},{"label":{"plus":" "}},{"label":{"item":"čárky"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"22900000"}},{"label":{"plus":" "}},{"label":{"item":"a"}},{"timestamp":{"timestamp":"23500000"}},{"timestamp":{"recovery":"23500000"}},{"label":{"plus":" "}},{"label":{"item":"vojáci"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"30600000"}},{"label":{"plus":" "}},{"label":{"item":"a"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"32300000"}},{"timestamp":{"recovery":"32300000"}},{"label":{"plus":" "}},{"label":{"item":"civilové"}},{"timestamp":{"timestamp":"39500000"}},{"timestamp":{"recovery":"39500000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"50400000"}},{"timestamp":{"recovery":"50400000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"59200000"}},{"timestamp":{"recovery":"59200000"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"62200000"}},{"timestamp":{"recovery":"62200000"}},{"label":{"item":"\u003c"}},{"label":{"item":"new_paragraph/\u003e"}},{"timestamp":{"timestamp":"69000000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"70500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"75500000"}},{"timestamp":{"recovery":"75500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"80500000"}},{"timestamp":{"recovery":"80500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"85500000"}},{"timestamp":{"recovery":"85500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"90500000"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"recovery":"90500000"}}]}}}
{"push":{"events":{"events":[{"label":{"noise":"[n::longnoise]"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"95500000"}}]}}}
{"end":{}}
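Because each line of the response is a standalone JSON document, the output can be processed line by line. The sketch below collects the events from all push messages and reports the error carried by the end message. The function name is illustrative, and it assumes the whole response body is already available as a string.

```javascript
// Parses the line-delimited JSON returned by the File API.
function parseFileApiOutput(output) {
  const events = [];
  let ended = false;
  let error = "";
  for (const line of output.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines
    const message = JSON.parse(line);
    if (message.push) {
      events.push(...(message.push.events?.events ?? []));
    } else if (message.end) {
      ended = true;
      error = message.end.error ?? ""; // blank error means success
    }
    // start and unknown messages carry no events and are skipped
  }
  return { events, ended, error };
}
```

The returned events have the same shape as in the streaming APIs, so the text can then be rebuilt from the label items and plus events as described in Mapping events to text.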