NANOTRIX

Quick start

  1. Log in to obtain an authorization token called access_token
  2. Use the access_token to obtain a task authorization token called ntx_token
  3. Run a recognition task using the ntx_token
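A minimal sketch of these three steps in JavaScript (Node 18+ fetch; URLs and field names follow the examples in the sections below, and the fetchFn parameter is only there to make the helper easy to exercise offline):

```javascript
// Minimal sketch of the quick-start flow. Error handling is omitted;
// fetchFn is injectable so the logic can be tested without a server.
async function obtainNtxToken(audience, domain, username, password, fetchFn = fetch) {
  // 1. login -> access_token
  const loginRes = await fetchFn(`${audience}/login/access-token`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ username, password }),
  });
  const { accessToken } = await loginRes.json();

  // 2. access_token -> ntx_token for a selected task
  const storeRes = await fetchFn(`https://${domain}/store/ntx-token`, {
    method: "POST",
    headers: { "Content-Type": "application/json", "ntx-token": accessToken },
    body: JSON.stringify({
      id: "ntx.v2t.engine.EngineService/pl/t-broadcast/v2t",
      label: "vad+v2t+ppc",
    }),
  });
  const { ntxToken } = await storeRes.json();
  return ntxToken; // 3. use in the ntx-token header/param of recognition calls
}
```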

Tokens

JWT (JSON Web Token) is used for authentication and authorization.

Access token

Encodes information about the authenticated user.

curl \
 -H "Content-Type: application/json" \
 -X POST \
 -d '{"username":"myid","password":"mysecret"}' \
 $AUDIENCE/login/access-token

Ntx token

An access token extended with permission to run a selected task. Expiration is inherited from the access token.

The selected task is identified by its id and label.

curl \
  -H "Content-Type: application/json" \
  -H "ntx-token: $ACCESS_TOKEN" \
  -X POST \
  -d '{"id": "ntx.v2t.engine.EngineService/pl/t-broadcast/v2t", "label": "vad+v2t+ppc" }' \
  https://$DOMAIN/store/ntx-token

Using ntx-token

http

Set ntx-token: $NTX_TOKEN header

curl -H "ntx-token: $NTX_TOKEN" https://$DOMAIN/service

websocket

Set ntx-token=$NTX_TOKEN query param

var ws = new WebSocket("wss://$DOMAIN/service?ntx-token=$NTX_TOKEN");

grpc

Set ntx-token=$NTX_TOKEN using GRPC metadata

var meta = new grpc.Metadata();
meta.add("ntx-token", "$NTX_TOKEN");
client.send(data, meta, callback);

Caching recommendations

  • should be cached for at least some fraction of its validity period (e.g. 3/4)
  • at least one renewal should be attempted when a request using the token returns a 401 or 403 status code
  • should be renewed before expiration (e.g. when less than 1/4 of the token's validity period remains) - to gracefully handle temporary API unavailability

ntx token

  • the renewal interval can be set to a shorter value when fast automatic renewals are configured

Token validity duration may change with every API call - always base calculations on the actually returned values.
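The renewal rule above can be sketched as follows (iat and exp are the unix timestamps carried in the token):

```javascript
// Renew once less than 1/4 of the token's validity period remains.
function shouldRenew(iat, exp, nowSeconds) {
  const remaining = exp - nowSeconds;
  return remaining < (exp - iat) / 4;
}
```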

Expiration

Each token carries its expiration as a unix timestamp (number of seconds since epoch). The expiration time is returned in the response and is also present in the decoded token's exp attribute.

Decoded token payload:

{
  "iss": "test-2e240644-8179-44be-9339-7de137131b65",
  "iat": 1538842805,
  "exp": 1538846000,
  "aud": [
    "https://myaudience.com"
  ],
  "sub": "admin"
}
Access token response:

{
  "accessToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ0ZXN0LTJlMjQwNjQ0LTgxNzktNDRiZS05MzM5LTdkZTEzNzEzMWI2NSIsImlhdCI6MTUzODg0MjgwNSwiZXhwIjoxNTM4ODQ2MDAwLCJhdWQiOlsiaHR0cHM6Ly9teWF1ZGllbmNlLmNvbSJdLCJzdWIiOiJhZG1pbiIsInBlcm1pc3Npb25zIjpbInRhc2s6cnVuOioiXSwiZW1haWwiOiJhZG1pbkBtYWlsLmNvbSJ9.uWU4YaQyGoP99VeZ1pmPJU7Q3_YAsfRkWKvdQCut0Qo",
  "expiresAt": 1508577982
}
Ntx token response:

{
  "ntxToken": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJ0ZXN0LTJlMjQwNjQ0LTgxNzktNDRiZS05MzM5LTdkZTEzNzEzMWI2NSIsImlhdCI6MTUzODg0MjgwNSwiZXhwIjoxNTM4ODQ2MDAwLCJhdWQiOlsiaHR0cHM6Ly9teWF1ZGllbmNlLmNvbSJdLCJzdWIiOiJhZG1pbiIsInBlcm1pc3Npb25zIjpbInRhc2s6cnVuOioiXSwiZW1haWwiOiJhZG1pbkBtYWlsLmNvbSIsInRhc2siOnsiaWQiOiJudHgudjJ0LmVuZ2luZS5FbmdpbmVTZXJ2aWNlL2N6L2QtZ2VuZXJhbC92MnQiLCJjZmciOnsiaW1hZ2UiOiJuZ2lueCIsImNwdXMiOjEuNywibWVtIjozMDcyfSwibGFiZWwiOiJ2MnQiLCJyb2xlIjoidW5saW1pdGVkIn19.JgYkfjDO96wYEvcf22_-Y-wjunt31eYOr0VRMubkLiU",
  "expiresAt": 1508577982
}
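The exp attribute can be read locally without verifying the signature (verification is the server's job). A small sketch for Node:

```javascript
// Decode the payload (second segment) of a JWT. Maps base64url back to
// plain base64 before decoding; does NOT verify the signature.
function decodeJwtPayload(token) {
  const seg = token.split(".")[1].replace(/-/g, "+").replace(/_/g, "/");
  return JSON.parse(Buffer.from(seg, "base64").toString("utf8"));
}
```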

Selecting task

Every task is identified by a unique id and label pair.

[Screenshot: getting the id and label pair from the platform UI - Transcribe tab]

Task id and label can change only when upgrading to a new major version.

Overview

The V2T API is implemented as a bidirectional flow of EngineStream messages, defined as Protobuf 3 messages in engine.proto. A session is started by sending and receiving a start message, followed by input/output data transfer (with optional flow control), and terminated by sending and receiving an end message.
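As an illustrative sketch (not part of the API), the lifecycle can be checked as: first message start, last message end, everything in between push or pull:

```javascript
// Check that a message sequence follows the EngineStream lifecycle:
// start first, end last, only push/pull messages in between.
function isValidStreamSequence(messages) {
  if (messages.length < 2) return false;
  if (!("start" in messages[0])) return false;
  if (!("end" in messages[messages.length - 1])) return false;
  return messages.slice(1, -1).every((m) => "push" in m || "pull" in m);
}
```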

EngineStream

message EngineStream {
  oneof payload {
    EngineContextStart start = 1;
    EventsPush push = 2;
    EventsPull pull = 3;
    EngineContextEnd end = 4;
  }
}

Start

message EngineContextStart {
  EngineContext context = 1;
}

Sent as the first message from the client and also received as the first message from the server (except in case of error). When sent from the client it must have context set. When received from the server it may not have context set.

Push

message EventsPush {
  Events events = 1;
}

Used for sending and receiving data. See Events for detailed information.

Pull

message EventsPull {
}

Sent only when flow control is enabled. When received by the client, the server is requesting new data; when sent by the client, the client is ready to receive results.

End

message EngineContextEnd {
  string error = 1;
}

Sent as the last message from the client and also received as the last message from the server. The operation completed successfully when the error message is blank. The error attribute can be set by the client to signal what type of error it encountered (graceful shutdown).

Context configuration

message EngineContext {
  
  // voice activity detection
  message VADConfig {}   
  
  // punctuation config
  message PNCConfig {}  
  
  // post-processing config
  message PPCConfig {} 
  
  // voice to text
  message V2TConfig {  
    // set to enable voice activity detection
    VADConfig withVAD = 1; 
    // set to enable post-processing
    PPCConfig withPPC = 3; 
    // modify used lexicon
    Lexicon withLexicon = 4; 
    // set to enable automatic punctuation
    PNCConfig withPNC = 5; 
  }

  enum AudioChannel {
    // downmix all channels to mono
    AUDIO_CHANNEL_DOWNMIX = 0; 
    // select only left channel
    AUDIO_CHANNEL_LEFT = 1;
    // select only right channel
    AUDIO_CHANNEL_RIGHT = 2; 
  }

  AudioFormat audioFormat = 1;
  AudioChannel audioChannel = 2;

  oneof config {
    VADConfig vad = 3;
    V2TConfig v2t = 5;
    PPCConfig ppc = 9;
  }
}

Audio format

message AudioFormat {

  enum ChannelLayout {
    ...
  }

  enum SampleFormat {
    ...
  }

  enum SampleRate {
    ...
  }

  // make a best-effort guess about what audio format is used
  message AutoDetect {
    // upper limit of bytes used to detect audio format automatically, [32, INT_MAX]
    uint32 probeSizeBytes = 1;
  }
  }

  // raw data
  message PCM {
    SampleFormat sampleFormat = 1;
    SampleRate sampleRate = 2;
    ChannelLayout channelLayout = 3;
  }

  // detect audio format from provided audio header bytes
  message Header {
    bytes header = 1;
  }

  oneof formats {
    AutoDetect auto = 1;
    PCM pcm = 2;
    Header header = 3;
  }
}
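Putting EngineContext and AudioFormat together, a start context for auto-detected audio with v2t and post-processing might look like this in the proto3 JSON mapping (an illustrative sketch; field spelling must match your generated code):

```javascript
// Illustrative start message: auto-detect the audio format, downmix to
// mono, run v2t with post-processing enabled.
const startMessage = {
  start: {
    context: {
      audioFormat: { auto: { probeSizeBytes: 4096 } },
      audioChannel: "AUDIO_CHANNEL_DOWNMIX",
      v2t: { withPPC: {} }, // presence of withPPC enables post-processing
    },
  },
};
```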

Events

message Events {

  // list of events, may be empty
  repeated Event events = 1; 
  
  // when set to true, events carry a non-final hypothesis that will be fully replaced by a final hypothesis
  bool lookahead = 2; 
  
  // optional, for client-side processing
  uint64 receivedAt = 3; 
  
  // optional, for parallel client-side processing
  uint32 channelId = 4; 
}

All attributes representing duration and offset are expressed in ticks, where 1 tick is 100 ns.
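For example, converting between ticks and seconds:

```javascript
// 1 tick = 100 ns, so 10,000,000 ticks = 1 second. Tick values arrive
// as strings in the JSON mapping, hence the Number() conversion.
const TICKS_PER_SECOND = 10000000;
function ticksToSeconds(ticks) {
  return Number(ticks) / TICKS_PER_SECOND;
}
function secondsToTicks(seconds) {
  return Math.round(seconds * TICKS_PER_SECOND);
}
```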

Event

message Event {

  // point in time on the timeline (in ticks) 
  message Timestamp {
    oneof value {
      // default point in time      
      uint64 timestamp = 1;
      // point in time used for recovery in case of failure (can be skipped when recovery is not implemented)
      uint64 recovery = 2;
    }
  }

  // transcribed text 
  message Label {
    oneof label {
    
      // recognized chunk of text (e.g. single word)
      string item = 1;
      // joins two items (e.g. string containing space character)
      string plus = 2;      
      // an item that is not speech
      string noise = 3;
    }
  }
  
  // audio input/output
  message Audio {
    // raw audio body
    bytes body = 1;
    // optional offset from stream start
    uint64 offset = 5;
    // optional raw audio body duration
    uint64 duration = 6;
  }

  // additional information about the stream
  message Meta {
    message Confidence {
      double value = 1;
    }

    oneof body {
      Confidence confidence = 1;
    }
  }

  oneof body {
    Timestamp timestamp = 1;
    Label label = 2;
    Audio audio = 3;
    Meta meta = 4;
  }
}

Event stream is aligned to timeline using timestamps.

Mapping events to text

Input

// ts -> timestamp
// rec -> recovery timestamp

rec(0), item("Hello"),
ts(1), plus(", "), item("how"), plus(" "), item("are"),
ts(2),
ts(3), item("you"), plus("?"),
rec(4)

Output

// skip all events except label/item and label/plus
Hello, how are you?
  0: Hello
  1: , how are
  3: you?

Unknown events must be skipped.
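A sketch of this mapping over the JSON event shape used by the File API: keep only label/item and label/plus, skip everything else.

```javascript
// Flatten a list of events into plain text: concatenate label/item and
// label/plus values; skip timestamps, noise, audio, meta, and unknown events.
function eventsToText(events) {
  let text = "";
  for (const ev of events) {
    if (ev.label && typeof ev.label.item === "string") text += ev.label.item;
    else if (ev.label && typeof ev.label.plus === "string") text += ev.label.plus;
  }
  return text;
}
```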

Modules

Components processing supported input events and emitting output events. Active modules are determined by the task label (part of the task identity when selecting a task).

Voice activity detection (vad)

Outputs label/items marking speech and non-speech events in the audio (silence, music).

Input

  • audio

Output

  • item/label ['on', 'off']
  • timestamp

Voice To Text (v2t)

Transforms audio events (and optionally vad events) into timestamps and labels.

Input

  • audio
  • vad

Output

  • item/label
  • timestamp

Post-processing (ppc)

Transforms labels to labels according to rules (e.g. one hundred -> 100).

Input

  • label
  • timestamp

Output

  • label
  • timestamp

Punctuation (pnc)

Transforms labels to labels according to rules, adding punctuation (e.g. Hello world -> Hello, world!).

Input

  • label
  • timestamp

Output

  • label
  • timestamp

Combining modules

Modules can be combined using a plus sign when selecting a task:

  vad               
  vad+v2t           
  v2t+ppc
  vad+v2t+ppc+pnc   

Not all module combinations are valid (e.g. vad+ppc).

Lexicon

message Lexicon {

   message UserItem {
    // output symbol (required)
    string sym = 1;
    // pronunciation in phonetic alphabet (optional)
    string pron = 2;
    // grapheme (optional)
    string graph = 3;
    // symbol already exists in lexicon (returned by server)
    bool foundInLex = 4;
  }

  message NoiseItem {
    // output symbol
    string sym = 1;
    // pronunciation
    string pron = 2;
  }

  message MainItem {
    // output symbol
    string sym = 1;
    // pronunciation
    string pron = 2;
    // mount, equal to sym if blank
    string mnt = 3;
  }

  message LexItem {
    oneof item {
      UserItem user = 1;
      MainItem main = 2;
      NoiseItem noise = 3;
    }
  }

  // field number 1 reserved
  repeated LexItem items = 2;
  // list of allowed phonemes (returned by server)
  string alpha = 3;
}

Custom words are added using user items.
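A minimal lexicon with custom user words might look like this in JSON form (an illustrative sketch; the pronunciation string is a hypothetical example, and pron/graph stay optional):

```javascript
// Lexicon message in JSON form: field number 1 is reserved, user items
// carry at least the output symbol (sym).
const lexicon = {
  items: [
    { user: { sym: "Nanotrix" } },                            // symbol only
    { user: { sym: "Nanotrix", pron: "n a n o t r i k s" } }, // hypothetical pronunciation
  ],
};
```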

Flow control

Explicit flow control together with chunk size gives the client control over how much data is in flight between client and server. It also allows for a simpler implementation, as send/receive logic can be written in a single thread as a simple state machine. It is enabled by default but can be disabled.

Flow control should be disabled in environments with varying levels of latency and throughput (e.g. mobile devices, desktop applications) and should be enabled for large-scale processing (e.g. inside a cluster) and offline recordings.

File API does not support any type of flow control.

Disabling flow control

websocket

Set no-flow-controle=true query param

var ws = new WebSocket("wss://test.nanogrid.cloud/service?no-flow-controle=true");

grpc

Set no-flow-controle=true using GRPC metadata

var meta = new grpc.Metadata();
meta.add("no-flow-controle", "true");
client.send(data, meta, callback);

Communication with flow control

The client must reply to a pull within 10 seconds (send a push with an empty events list if no new data is available) - otherwise the connection may be terminated due to client_inactivity_timeout.
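The reply logic can be sketched as follows (a hypothetical helper; one audio event per push, as is typical for audio input):

```javascript
// Answer an EventsPull: send the next queued audio chunk, or an empty
// events list when no new data is available yet.
function replyToPull(chunkQueue) {
  const events = [];
  if (chunkQueue.length > 0) {
    events.push({ audio: { body: chunkQueue.shift() } });
  }
  return { push: { events: { events } } };
}
```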

Communication without flow control

Chunk size

The maximum total size of all events in the array (for audio it is usually just a single event) must be below 3 MB.

Example: for 256 kbps audio, a good chunk size may be 32 kb (a new chunk created every 125 ms).

Increasing chunk size reduces network overhead but increases response delay.
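The example above works out as:

```javascript
// Bytes per chunk for a given audio bitrate (bits/s) and chunk interval (ms).
function chunkSizeBytes(bitrateBps, intervalMs) {
  return (bitrateBps / 8) * (intervalMs / 1000);
}
// 256 kbps with a chunk every 125 ms -> 4000 bytes (32 kbit)
```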

Sending live audio with flow-control

Do not set your chunk size too low (e.g. less than 125 ms) when using flow control, as network round-trip time may slow your audio transport considerably. Disable flow control when sending chunks more often, e.g. every 50 ms.

GRPC API

Use the engine.proto file to generate code for your target language.

Endpoint: ${DOMAIN}:443

ntx - .NET library CLI client

Websocket API

A gRPC alternative for browser-based applications.

Endpoint: wss://${DOMAIN}/ws/v1/v2t

ntx-js - Javascript library for Websocket API with examples.

File API

Endpoint: https://${DOMAIN}/api/v1/file/v2t

ntx-js - Javascript library for File API.

Limitations

  1. Much higher latency compared to GRPC and Websocket API.
  2. Whole file needs to be uploaded before processing can start.
  3. Max file upload size or upload speed throttling may be enabled based on SLA.

The output is one EngineStream message per line (JSON).

Form data

  • file - the file to process
  • lexicon (optional) - Lexicon with user words
  • channel (optional) - which channel to use [left, right, downmix]; defaults to downmix

curl --header "ntx-token: $NTX_TASK_TOKEN" -F file=@basetext.mp3 -F lexicon=@userlex.json -F channel=right https://mycluster.nanogrid.cloud/api/v1/file/v2t
{"start":{}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"5000000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"5400000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"8000000"}},{"timestamp":{"recovery":"8000000"}},{"label":{"plus":" "}},{"label":{"item":"generálové"}},{"timestamp":{"timestamp":"17800000"}},{"label":{"plus":" "}},{"label":{"item":"čárky"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"22900000"}},{"label":{"plus":" "}},{"label":{"item":"a"}},{"timestamp":{"timestamp":"23500000"}},{"timestamp":{"recovery":"23500000"}},{"label":{"plus":" "}},{"label":{"item":"vojáci"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"30600000"}},{"label":{"plus":" "}},{"label":{"item":"a"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"32300000"}},{"timestamp":{"recovery":"32300000"}},{"label":{"plus":" "}},{"label":{"item":"civilové"}},{"timestamp":{"timestamp":"39500000"}},{"timestamp":{"recovery":"39500000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"50400000"}},{"timestamp":{"recovery":"50400000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"59200000"}},{"timestamp":{"recovery":"59200000"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"62200000"}},{"timestamp":{"recovery":"62200000"}},{"label":{"item":"\u003c"}},{"label":{"item":"new_paragraph/\u003e"}},{"timestamp":{"timestamp":"69000000"}},{"label":{"noise":"[n::silence]"}},{"timestamp":{"timestamp":"70500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"75500000"}},{"timestamp":{"recovery":"75500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"80500000"}},{"timestamp":{"recovery":"80500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"85500000"}},{"timestamp":{"recovery":"85500000"}},{"label":{"noise":"[n::longnoise]"}},{"timestamp":{"timestamp":"90500000"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"recovery":"90500000"}}]}}}
{"push":{"events":{"events":[{"label":{"noise":"[n::longnoise]"}}]}}}
{"push":{"events":{"events":[{"timestamp":{"timestamp":"95500000"}}]}}}
{"end":{}}