Quick start

  1. Log in to obtain an authorization token called access_token
  2. Use access_token to obtain a task authorization token called ntx_token
  3. Run a recognition task using ntx_token


JWT is used for authentication and authorization.

Access token

Encoded information about authenticated user.

curl \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{"username":"myid","password":"mysecret"}' \

Ntx token

Access token extended with permission to run the selected task (see Selecting task). Expiration is inherited from the access token.

The selected task is identified by its id and label.

curl \
  -H "Content-Type: application/json" \
  -H "ntx-token: $ACCESS_TOKEN" \
  -X POST \
  -d '{"id": "ntx.v2t.engine.EngineService/pl/t-broadcast/v2t", "label": "vad+v2t+ppc" }' \
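
The same request can be prepared from Python. The following is a minimal sketch using only the standard library; the function name `ntx_token_request` is hypothetical, and the endpoint URL is left as a parameter because it is not shown above. The request is only built here, not sent.

```python
import json
from urllib.request import Request


def ntx_token_request(url, access_token, task_id, label):
    """Build (but do not send) the POST request that asks for an ntx token.

    `url` must be supplied by the caller; it is not defined in this sketch.
    """
    body = json.dumps({"id": task_id, "label": label}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # the access token travels in the ntx-token header, as in the curl call
            "ntx-token": access_token,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response is expected to carry the ntx token and its expiration.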

Using ntx-token


Set ntx-token: $NTX_TOKEN header

curl -H "ntx-token: $NTX_TOKEN" https://$DOMAIN/service


Set ntx-token=$NTX_TOKEN query param

var ws = new WebSocket("wss://$DOMAIN/service?ntx-token=$NTX_TOKEN");


Set ntx-token=$NTX_TOKEN using GRPC metadata

var meta = new grpc.Metadata();
meta.add("ntx-token", "$NTX_TOKEN");
client.send(data, meta, callback);
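
For clients written in Python, the header and query-parameter forms might look like the sketch below. The helper names are hypothetical and `example.invalid` is a placeholder host, not a real endpoint; the GRPC metadata form is omitted because it depends on the generated client.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ntx_header(ntx_token):
    """Header form, for plain HTTP requests."""
    return {"ntx-token": ntx_token}


def with_ntx_query(url, ntx_token):
    """Query-parameter form, e.g. for WebSocket URLs; keeps existing parameters."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("ntx-token", ntx_token))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_ntx_query("wss://example.invalid/service", "abc.def.ghi"))
# wss://example.invalid/service?ntx-token=abc.def.ghi
```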

Caching recommendations

  • tokens should be cached for at least some fraction of their validity period (e.g. 3/4)
  • at least one renewal should be attempted when a request using the token returns a 401 or 403 status code
  • tokens should be renewed before expiration (e.g. when less than 1/4 of the validity period remains) to gracefully handle temporary API unavailability

ntx token

  • renewal interval can be set to a shorter interval when fast automatic are configured

The token validity period may change with every API call; always base calculations on the values actually returned.
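
The renewal rule can be sketched in Python as follows. This is an illustration only: `jwt_claims` and `should_renew` are hypothetical helper names, the 1/4 threshold is just the example value from the recommendations, and the JWT payload is decoded without verifying the signature (enough for reading the standard `exp` claim on the client side, not for trusting the token).

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def should_renew(issued_at: float, expires_at: float,
                 now: Optional[float] = None,
                 remaining_fraction: float = 0.25) -> bool:
    """True once less than `remaining_fraction` of the validity period is left."""
    now = time.time() if now is None else now
    lifetime = expires_at - issued_at
    return (expires_at - now) < lifetime * remaining_fraction
```

Here `issued_at` can be the time the token was received and `expires_at` the expiration value returned with it, so the calculation always follows the values of the latest response.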


Each token contains its expiration time as a Unix timestamp (number of seconds since the epoch). The expiration time is returned in the response and is also present in the decoded token's exp attribute.

{
  "iss": "test-2e240644-8179-44be-9339-7de137131b65",
  "iat": 1538842805,
  "exp": 1538846000,
  "aud": ["https://myaudience.com"],
  "sub": "admin"
}

{
  "accessToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ0ZXN0LTJlMjQwNjQ0LTgxNzktNDRiZS05MzM5LTdkZTEzNzEzMWI2NSIsImlhdCI6MTUzODg0MjgwNSwiZXhwIjoxNTM4ODQ2MDAwLCJhdWQiOlsiaHR0cHM6Ly9teWF1ZGllbmNlLmNvbSJdLCJzdWIiOiJhZG1pbiIsInBlcm1pc3Npb25zIjpbInRhc2s6cnVuOioiXSwiZW1haWwiOiJhZG1pbkBtYWlsLmNvbSJ9.uWU4YaQyGoP99VeZ1pmPJU7Q3_YAsfRkWKvdQCut0Qo",
  "expiresAt": 1508577982
}

{
  "ntxToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ0ZXN0LTJlMjQwNjQ0LTgxNzktNDRiZS05MzM5LTdkZTEzNzEzMWI2NSIsImlhdCI6MTUzODg0MjgwNSwiZXhwIjoxNTM4ODQ2MDAwLCJhdWQiOlsiaHR0cHM6Ly9teWF1ZGllbmNlLmNvbSJdLCJzdWIiOiJhZG1pbiIsInBlcm1pc3Npb25zIjpbInRhc2s6cnVuOioiXSwiZW1haWwiOiJhZG1pbkBtYWlsLmNvbSIsInRhc2siOnsiaWQiOiJudHgudjJ0LmVuZ2luZS5FbmdpbmVTZXJ2aWNlL2N6L2QtZ2VuZXJhbC92MnQiLCJjZmciOnsiaW1hZ2UiOiJuZ2lueCIsImNwdXMiOjEuNywibWVtIjozMDcyfSwibGFiZWwiOiJ2MnQiLCJyb2xlIjoidW5saW1pdGVkIn19.JgYkfjDO96wYEvcf22_-Y-wjunt31eYOr0VRMubkLiU",
  "expiresAt": 1508577982
}

Selecting task

Every task is identified by a unique id and label pair.

Getting the pair from platform UI - tab Transcribe

Task id and label can be changed only when upgrading to new major version.


The V2T API is implemented as a bidirectional flow of EngineStream messages, which are Protobuf 3 messages defined in engine.proto. The flow starts by sending and receiving a start message, continues with input/output data transfer (with optional flow control), and terminates by sending and receiving an end message.


message EngineStream {
  oneof payload {
    EngineContextStart start = 1;
    EventsPush push = 2;
    EventsPull pull = 3;
    EngineContextEnd end = 4;
  }
}


message EngineContextStart {
  EngineContext context = 1;
}

Sent as the first message from the client and also received as the first message from the server (except in case of error). When sent from the client, it must have context set. When received from the server, it may not have context set.


message EventsPush {
  Events events = 1;
}

Used for sending and receiving data. See [Events] for detailed information.


message EventsPull {}

Sent only when flow control is enabled. When received by the client, it means the server is requesting new data; when sent by the client, it means the client is ready to receive results.


message EngineContextEnd {
  string error = 1;
}

Sent as the last message from the client and also received as the last message from the server. The operation completed successfully when the error message is blank. The client can set the error attribute to signal what type of error it encountered (graceful shutdown).
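
The message order on the client side can be sketched as a Python generator. This is only an illustration under stated assumptions: messages are written as dicts following the protobuf JSON mapping (bytes fields base64-encoded), flow control is disabled so no pull messages appear, and `client_messages` is a hypothetical helper name, not part of any NTX library.

```python
import base64


def client_messages(context, audio_chunks):
    """Yield the client side of one stream: start, one push per chunk, end."""
    # first message: start, with the context set (required on the client side)
    yield {"start": {"context": context}}
    for chunk in audio_chunks:
        body = base64.b64encode(chunk).decode("ascii")
        yield {"push": {"events": {"events": [{"audio": {"body": body}}]}}}
    # last message: end, with a blank error to signal success
    yield {"end": {}}
```

A real client would serialize each message with code generated from engine.proto and read the server's start, push and end messages in parallel.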

Context configuration

message EngineContext {
  // voice activity detection
  message VADConfig {}
  // punctuation config
  message PNCConfig {}
  // post-processing config
  message PPCConfig {}
  // voice to text
  message V2TConfig {
    // set to enable voice activity detection
    VADConfig withVAD = 1;
    // set to enable post-processing
    PPCConfig withPPC = 3;
    // modify used lexicon
    Lexicon withLexicon = 4;
    // set to enable automatic punctuation
    PNCConfig withPNC = 5;
  }

  enum AudioChannel {
    // downmix all channels to mono
    // select only left channel
    // select only right channel
  }

  AudioFormat audioFormat = 1;
  AudioChannel audioChannel = 2;

  oneof config {
    VADConfig vad = 3;
    V2TConfig v2t = 5;
    PPCConfig ppc = 9;
  }
}

Audio format

message AudioFormat {

  enum ChannelLayout {
  }

  enum SampleFormat {
  }

  enum SampleRate {
  }

  // make best effort guess about what audio format is used
  message AutoDetect {
    // upper limit of bytes used to detect audio format automatically, [32, INT_MAX]
    uint32 probeSizeBytes = 1;
  }

  // raw data
  message PCM {
    SampleFormat sampleFormat = 1;
    SampleRate sampleRate = 2;
    ChannelLayout channelLayout = 3;
  }

  // detect audio format from provided audio header bytes
  message Header {
    bytes header = 1;
  }

  oneof formats {
    AutoDetect auto = 1;
    PCM pcm = 2;
    Header header = 3;
  }
}

message Events {

  // list of events, may be empty
  repeated Event events = 1;
  // when true, the events form a non-final hypothesis that will be fully replaced by the final hypothesis
  bool lookahead = 2;
  // optional, for client-side processing
  uint64 receivedAt = 3;
  // optional, for parallel client-side processing
  uint32 channelId = 4;
}

All attributes representing a duration or an offset are expressed in ticks, where 1 tick is 100 ns.
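
Since 1 tick is 100 ns, one second is 10,000,000 ticks. A small Python sketch of the conversion (the helper names are illustrative, not part of the API):

```python
TICKS_PER_SECOND = 10_000_000  # 1 tick = 100 ns


def ticks_to_seconds(ticks: int) -> float:
    """Convert a tick count (offset or duration) to seconds."""
    return ticks / TICKS_PER_SECOND


def seconds_to_ticks(seconds: float) -> int:
    """Convert seconds to the nearest whole tick."""
    return round(seconds * TICKS_PER_SECOND)


print(ticks_to_seconds(5_000_000))  # 0.5
```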


message Event {

  // point in time on the timeline (in ticks)
  message Timestamp {
    oneof value {
      // default point in time
      uint64 timestamp = 1;
      // point in time used for recovery in case of failure (can be skipped when recovery is not implemented)
      uint64 recovery = 2;
    }
  }

  // transcribed text
  message Label {
    oneof label {
      // recognized chunk of text (e.g. single word)
      string item = 1;
      // joins two items (e.g. string containing space character)
      string plus = 2;
      // an item that is not speech
      string noise = 3;
    }
  }

  // audio input/output
  message Audio {
    // raw audio body
    bytes body = 1;
    // optional offset from stream start
    uint64 offset = 5;
    // optional raw audio body duration
    uint64 duration = 6;
  }

  // additional information about stream
  message Meta {
    message Confidence {
      double value = 1;
    }

    oneof body {
      Confidence confidence = 1;
    }
  }

  oneof body {
    Timestamp timestamp = 1;
    Label label = 2;
    Audio audio = 3;
    Meta meta = 4;
  }
}

Event stream is aligned to timeline using timestamps.

Mapping events to text


// ts -> timestamp
// rec -> recovery timestamp
rec(0), item("Hello"),
ts(1), plus(", "), item("how"), plus(" "), item("are"), plus(" "),
ts(3), item("you"), plus("?")


// skip all events except label/item and label/plus
Hello, how are you?
  0: Hello
  1: , how are
  3: you?

Unknown events must be skipped.
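
The mapping can be sketched in Python as below. It assumes events arrive as dicts in the protobuf JSON shape (for example `{"label": {"item": "Hello"}}`, with tick values possibly encoded as strings); `events_to_lines` is a hypothetical helper, not a library function.

```python
def events_to_lines(events):
    """Group label/item and label/plus text under the latest timestamp seen."""
    lines = []  # list of [ticks, text]
    for event in events:
        if "timestamp" in event:
            ts = event["timestamp"]
            value = ts.get("timestamp", ts.get("recovery"))
            if value is not None:
                lines.append([int(value), ""])
        elif "label" in event:
            label = event["label"]
            text = label.get("item", label.get("plus"))
            if text is not None:  # noise labels carry neither item nor plus
                if not lines:
                    lines.append([0, ""])
                lines[-1][1] += text
        # audio, meta and unknown events are skipped
    return [(ticks, text) for ticks, text in lines if text]


events = [
    {"timestamp": {"recovery": "0"}}, {"label": {"item": "Hello"}},
    {"timestamp": {"timestamp": "1"}}, {"label": {"plus": ", "}},
    {"label": {"item": "how"}}, {"label": {"plus": " "}},
    {"label": {"item": "are"}}, {"label": {"plus": " "}},
    {"timestamp": {"timestamp": "3"}}, {"label": {"item": "you"}},
    {"label": {"plus": "?"}},
]
print("".join(text for _, text in events_to_lines(events)))  # Hello, how are you?
```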


Modules are components that process supported input events and emit output events. Active modules are determined by the task label (part of the task identity used when selecting a task).

Voice activity detection (vad)

Outputs label/item events when speech and non-speech segments (silence, music) occur in the audio.

Input:

  • audio

Output:

  • item/label ['on', 'off']
  • timestamp

Voice To Text (v2t)

Transforms audio events (and optional vad events) to timestamps and labels.

Input:

  • audio
  • vad

Output:

  • item/label
  • timestamp

Post-processing (ppc)

Transforms labels to labels according to rules (e.g. one hundred -> 100).

Input:

  • label
  • timestamp

Output:

  • label
  • timestamp

Punctuation (pnc)

Transforms labels to labels according to rules (e.g. Hello World! -> Hello, World).

Input:

  • label
  • timestamp

Output:

  • label
  • timestamp

Combining modules

Modules can be combined by joining their names with a plus sign when selecting a task (e.g. vad+v2t+ppc).

Not all module combinations are valid (e.g. vad+ppc is not).


message Lexicon {

  message UserItem {
    // output symbol (required)
    string sym = 1;
    // pronunciation in phonetic alphabet (optional)
    string pron = 2;
    // grapheme (optional)
    string graph = 3;
    // symbol already exists in lexicon (returned by server)
    bool foundInLex = 4;
  }

  message NoiseItem {
    // output symbol
    string sym = 1;
    // pronunciation
    string pron = 2;
  }

  message MainItem {
    // output symbol
    string sym = 1;
    // pronunciation
    string pron = 2;
    // mount, equal to sym if blank
    string mnt = 3;
  }

  message LexItem {
    oneof item {
      UserItem user = 1;
      MainItem main = 2;
      NoiseItem noise = 3;
    }
  }

  // number 1 reserved
  repeated LexItem items = 2;
  // list of allowed phonemes (returned by server)
  string alpha = 3;
}

Custom words are added using user items (LexItem entries with the user field set).
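
As an illustration, a lexicon with user items could be assembled like this in Python. The sketch assumes the lexicon is exchanged in the protobuf JSON mapping of the Lexicon message, which this document does not state explicitly; `user_lexicon` is a hypothetical helper and the sample words are made up.

```python
import json


def user_lexicon(words):
    """Build a Lexicon dict from (sym, pron) pairs; pron may be None."""
    items = []
    for sym, pron in words:
        user = {"sym": sym}  # sym is the only required field of UserItem
        if pron:
            user["pron"] = pron
        items.append({"user": user})
    return {"items": items}


lexicon = user_lexicon([("Newton", None), ("NTX", "e n t e i k s")])
print(json.dumps(lexicon, ensure_ascii=False))
```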

Flow control

Explicit flow control together with chunk size gives the client control over how much data is in flight between client and server. It also simplifies implementation, because the send/receive logic can run in a single thread as a simple state machine. Flow control is enabled by default but can be disabled.

Flow control should be disabled in environments with varying levels of latency and throughput (e.g. mobile devices, desktop applications) and enabled for large-scale processing (e.g. inside a cluster) and offline recordings.

File API does not support any type of flow control.

Disabling flow control


Set no-flow-control=true query param

var ws = new WebSocket("wss://;


Set no-flow-control=true using GRPC metadata

var meta = new grpc.Metadata();
meta.add("no-flow-control", "true");
client.send(data, meta, callback);

Communication with flow control

The client must reply to a pull within 10 seconds (sending a push with an empty events list if no new data is available); otherwise the connection may be terminated due to client_inactivity_timeout.
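
A reply to a server pull could be built as in the following Python sketch. It assumes messages are represented as dicts in the protobuf JSON shape and that pending audio events wait in a queue; `reply_to_pull` is a hypothetical helper name.

```python
from collections import deque


def reply_to_pull(pending: deque) -> dict:
    """Build the push sent in answer to a server pull.

    Sends the next pending event if there is one, otherwise a push with an
    empty events list, which keeps the connection from timing out.
    """
    events = [pending.popleft()] if pending else []
    return {"push": {"events": {"events": events}}}
```

Calling it once per received pull keeps exactly one chunk in flight, which is the single-threaded state machine that flow control makes possible.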

Communication without flow control

Chunk size

The total size of all events in the array (for audio it is usually just one event) must be below 3 MB.

Example: for 256 kbps audio, a good chunk size may be 32 kb, i.e. 4 kB (a new chunk is created every 125 ms).

Increasing chunk size reduces network overhead but increases response delay.
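
The arithmetic behind the example is bitrate times interval: 256,000 bits/s × 0.125 s = 32,000 bits = 4,000 bytes. A small Python sketch (the helper names are illustrative, and the 3 MB limit is assumed to mean 3,000,000 bytes, the stricter reading):

```python
MAX_EVENTS_BYTES = 3_000_000  # assumption: "3 MB" read as decimal megabytes


def chunk_size_bytes(bitrate_bps: int, interval_ms: int) -> int:
    """Bytes of audio produced in one chunk interval at the given bitrate."""
    return bitrate_bps * interval_ms // (8 * 1000)


def fits_limit(size_bytes: int) -> bool:
    """True when a chunk stays below the maximum total event size."""
    return size_bytes < MAX_EVENTS_BYTES


print(chunk_size_bytes(256_000, 125))  # 4000
```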

Sending live audio with flow-control

Do not set the chunk size too low (e.g. less than 125 ms) when using flow control, because network round-trip time may slow your audio transport considerably. Disable flow control when sending a chunk e.g. every 50 ms.


GRPC API

Use the engine.proto file to generate code for your target language.

Endpoint: ${DOMAIN}:443

ntx - .NET library CLI client

Websocket API

GRPC alternative for browser based applications

Endpoint: wss://${DOMAIN}/ws/v1/v2t

ntx-js - Javascript library for Websocket API with examples.

File API

Endpoint: https://${DOMAIN}/api/v1/file/v2t

ntx-js - Javascript library for File API.


  1. Much higher latency compared to GRPC and Websocket API.
  2. Whole file needs to be uploaded before processing can start.
  3. Max file upload size or upload speed throttling may be enabled based on SLA.

Output is one event per line.

Form data

  • file: file to process
  • lexicon (optional): lexicon with user words
  • channel (optional): which channel to use [left, right, downmix], defaults to downmix

curl --header "ntx-token: $NTX_TASK_TOKEN" -F file=@basetext.mp3 -F lexicon=@userlex.json -F channel=right 
{"push":{"events":{"events":[{"timestamp":{"timestamp":"5000000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"5400000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"8000000"}},{"timestamp":{"recovery":"8000000"}},{"label":{"plus":" "}},{"label":{"item":"generálové"}},{"timestamp":{"timestamp":"17800000"}},{"label":{"plus":" "}},{"label":{"item":"čárky"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"22900000"}},{"label":{"plus":" "}},{"label":{"item":"a"}},{"timestamp":{"timestamp":"23500000"}},{"timestamp":{"recovery":"23500000"}},{"label":{"plus":" "}},{"label":{"item":"vojáci"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"30600000"}},{"label":{"plus":" "}},{"label":{"item":"a"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"32300000"}},{"timestamp":{"recovery":"32300000"}},{"label":{"plus":" "}},{"label":{"item":"civilové"}},{"timestamp":{"timestamp":"39500000"}},{"timestamp":{"recovery":"39500000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"50400000"}},{"timestamp":{"recovery":"50400000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"59200000"}},{"timestamp":{"recovery":"59200000"}}]}}}
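
Because the output is one JSON message per line, it can be consumed incrementally. The Python sketch below is one possible way to do that; `iter_events` and `transcript` are hypothetical helpers, and noise labels and timestamps are simply dropped.

```python
import json


def iter_events(lines):
    """Yield every event from File API output, one JSON message per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        yield from message.get("push", {}).get("events", {}).get("events", [])


def transcript(lines):
    """Join label/item and label/plus texts into plain text."""
    parts = []
    for event in iter_events(lines):
        label = event.get("label", {})
        for kind in ("item", "plus"):
            if kind in label:
                parts.append(label[kind])
    return "".join(parts).strip()
```

`lines` can be any iterable of strings, such as an open file or a streamed HTTP response body.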