
Converting Twilio live stream from WebSocket to the supported audio format to Microsoft Azure using NodeJs #162

Open
adama19 opened this issue Apr 18, 2022 · 4 comments

Comments


adama19 commented Apr 18, 2022

Hello, I am trying to integrate the Twilio live media stream with Microsoft Azure speech-to-text in order to get a live transcription of caller audio. My problem is that I am unable to convert the payload to the WAV/PCM format that Azure supports. I found a similar solution on this topic here (https://www.twilio.com/blog/live-transcription-media-streams-azure-cognitive-services-java), but it uses Java while I am trying to do this with Node.js. Can you please help?

Below is the code I am using:

const WebSocket = require("ws")
const express = require("express")
const app = express();
const server = require("http").createServer(app)
const path = require("path")
const base64 = require("js-base64");
const alawmulaw = require('alawmulaw');
const wss = new WebSocket.Server({ server })

//Include Azure Speech service 
const sdk = require("microsoft-cognitiveservices-speech-sdk")
const subscriptionKey = '2195XXXXXXXXXXXXXXXXXX'
const serviceRegion = 'southeastasia'

// Hard code the variables 
//const variables = require("./config/variables")
const language = "en-US"

const azurePusher = sdk.AudioInputStream.createPushStream(sdk.AudioStreamFormat.getWaveFormatPCM(8000, 16, 1))
const audioConfig = sdk.AudioConfig.fromStreamInput(azurePusher);
const speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, serviceRegion);

speechConfig.speechRecognitionLanguage = language;
speechConfig.enableDictation();
const recognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);

recognizer.recognizing = (s, e) => {
  console.log(`RECOGNIZING: Text=${e.result.text}`);
};

recognizer.recognized = (s, e) => {
  if (e.result.reason == sdk.ResultReason.RecognizedSpeech) {
      console.log(`RECOGNIZED: Text=${e.result.text}`);
  }
  else if (e.result.reason == sdk.ResultReason.NoMatch) {
      console.log("NOMATCH: Speech could not be recognized.");
  }
};

recognizer.canceled = (s, e) => {
  console.log(`CANCELED: Reason=${e.reason}`);

  if (e.reason == sdk.CancellationReason.Error) {
      console.log(`CANCELED: ErrorCode=${e.errorCode}`);
      console.log(`CANCELED: ErrorDetails=${e.errorDetails}`);
      console.log("CANCELED: Did you update the key and location/region info?");
  }

  recognizer.stopContinuousRecognitionAsync();
};

recognizer.sessionStopped = (s, e) => {
  console.log("\nSession stopped event.");
  recognizer.stopContinuousRecognitionAsync();
};

recognizer.startContinuousRecognitionAsync(() => {
  console.log("Continuous Reco Started");
},
  err => {
      console.trace("err - " + err);
      recognizer.close();
  });

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        
        break;
      case "media":
        var streampayload = base64.decode(msg.media.payload)
        var data = Buffer.from(streampayload)
        var pcmdata = Buffer.from(alawmulaw.mulaw.decode(data))
        //console.log(msg.mediaFormat.encoding)

        // process.stdout.write(msg.media.payload + " " + " bytes\033[0G");
        // streampayload = base64.decode(msg.media.payload, 'base64');
        // let data = Buffer.from(streampayload);
        azurePusher.write(pcmdata)
        break;
      case "stop":
        console.log(`Call Has Ended`);
        azurePusher.close()
        recognizer.stopContinuousRecognitionAsync()
        break;
    }
  });

})

app.post("/", (req, res) => {
  res.set("Content-Type", "text/xml");

  res.send(
    `<Response>
       <Say>
            Leave a message
       </Say>
       <Start>
           <Stream url="wss://${req.headers.host}" />
       </Start>
       <Pause length="60" />
    </Response>`
)
});

server.listen(8080, () => console.log("Listening at Port 8080"));

Please help me convert the media payload, which arrives μ-law encoded, into the PCM format that Microsoft Azure supports for speech-to-text transcription.
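[Editor's note: a likely issue in the snippet above is that js-base64's `decode` returns a UTF-8 *string*, which corrupts binary audio bytes; Node's `Buffer.from(payload, "base64")` keeps them intact. Below is a minimal, dependency-free sketch of the decode path. The μ-law expansion here is the standard G.711 algorithm written out by hand rather than the `alawmulaw` package, and `mulawPayloadToPcm` is a hypothetical helper name, not part of any SDK.]

```javascript
// Expand one 8-bit mu-law octet to a 16-bit linear PCM sample (G.711).
function mulawDecodeSample(u) {
  u = ~u & 0xff;                  // mu-law bytes are transmitted complemented
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  let sample = (((mantissa << 3) + 0x84) << exponent) - 0x84; // 0x84 = bias
  return sign ? -sample : sample;
}

// Decode a Twilio media payload (base64 mu-law) to little-endian 16-bit PCM.
function mulawPayloadToPcm(base64Payload) {
  const mulawBytes = Buffer.from(base64Payload, "base64"); // keeps raw bytes
  const pcm = Buffer.alloc(mulawBytes.length * 2);
  for (let i = 0; i < mulawBytes.length; i++) {
    pcm.writeInt16LE(mulawDecodeSample(mulawBytes[i]), 2 * i);
  }
  return pcm;
}
```

In the `"media"` case of the WebSocket handler, this would replace the `base64.decode` / `alawmulaw` lines: `azurePusher.write(mulawPayloadToPcm(msg.media.payload))`, matching the 8 kHz / 16-bit / mono push-stream format configured above.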

@imkhubaibraza

imkhubaibraza commented Aug 8, 2022

I'm also facing the same problem:

  • The transcription is accurate if I stream audio chunks from an audio file.

  • I get only a few random words if I stream audio chunks from Twilio calls.

@sahilpal0

I'm also facing the same problem:

  • The transcription is accurate if I stream audio chunks from an audio file.
  • I get only a few random words if I stream audio chunks from Twilio calls.

I don't have much knowledge of low-level audio, but here is what I noticed: when I save Twilio's μ-law payload to a WAV file and play it, it works perfectly, but when I send that file's audio chunks to Azure for continuous recognition, it doesn't work. However, if I first convert that WAV file to 16 kHz, 8-bit-depth mono through an external website and then give it to Azure, it seems to work perfectly. So what I'm trying to say is that something is going wrong in our conversion: the audio seems fine and playable, but something is still missing.
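[Editor's note: one way to narrow this down, as the comment above suggests, is to wrap the converted PCM in a WAV container and listen to it before involving Azure. Below is a small sketch; `pcmToWav` is a hypothetical helper, but the 44-byte RIFF/WAVE header layout for uncompressed PCM is standard.]

```javascript
// Wrap raw PCM samples in a minimal WAV container for playback/debugging.
// Defaults match the Azure push-stream format used in the thread:
// 8 kHz sample rate, 16-bit samples, mono.
function pcmToWav(pcm, sampleRate = 8000, bitsPerSample = 16, channels = 1) {
  const byteRate = (sampleRate * channels * bitsPerSample) / 8;
  const blockAlign = (channels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}
```

Writing `pcmToWav(convertedPcm)` to disk and playing it reveals whether the μ-law-to-PCM step itself is at fault: if the WAV sounds garbled, the conversion is wrong before Azure ever sees it.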

@github-ai-user

Any solution?


ame700 commented Sep 25, 2024

I checked the Java example here (https://www.twilio.com/blog/live-transcription-media-streams-azure-cognitive-services-java), converted its MulawToPcm class to Node.js, and started using it. It's working for me:

/**
 * This class contains a single public method for mapping an array of 8-bit
 * µ-law values to 16-bit linear PCM values.
 *
 * This is needed because Twilio media-streams only produces µ-law encoded
 * audio data and some cloud speech-to-text engines only accept PCM.
 */
export class MulawToPcm {
  private static readonly mulawMapping: Int16Array = new Int16Array([
    32124, 31100, 30076, 29052, 28028, 27004, 25980, 24956,
    23932, 22908, 21884, 20860, 19836, 18812, 17788, 16764,
    15996, 15484, 14972, 14460, 13948, 13436, 12924, 12412,
    11900, 11388, 10876, 10364, 9852, 9340, 8828, 8316,
    7932, 7676, 7420, 7164, 6908, 6652, 6396, 6140,
    5884, 5628, 5372, 5116, 4860, 4604, 4348, 4092,
    3900, 3772, 3644, 3516, 3388, 3260, 3132, 3004,
    2876, 2748, 2620, 2492, 2364, 2236, 2108, 1980,
    1884, 1820, 1756, 1692, 1628, 1564, 1500, 1436,
    1372, 1308, 1244, 1180, 1116, 1052, 988, 924,
    876, 844, 812, 780, 748, 716, 684, 652,
    620, 588, 556, 524, 492, 460, 428, 396,
    372, 356, 340, 324, 308, 292, 276, 260,
    244, 228, 212, 196, 180, 164, 148, 132,
    120, 112, 104, 96, 88, 80, 72, 64,
    56, 48, 40, 32, 24, 16, 8, 0,
    -32124, -31100, -30076, -29052, -28028, -27004, -25980, -24956,
    -23932, -22908, -21884, -20860, -19836, -18812, -17788, -16764,
    -15996, -15484, -14972, -14460, -13948, -13436, -12924, -12412,
    -11900, -11388, -10876, -10364, -9852, -9340, -8828, -8316,
    -7932, -7676, -7420, -7164, -6908, -6652, -6396, -6140,
    -5884, -5628, -5372, -5116, -4860, -4604, -4348, -4092,
    -3900, -3772, -3644, -3516, -3388, -3260, -3132, -3004,
    -2876, -2748, -2620, -2492, -2364, -2236, -2108, -1980,
    -1884, -1820, -1756, -1692, -1628, -1564, -1500, -1436,
    -1372, -1308, -1244, -1180, -1116, -1052, -988, -924,
    -876, -844, -812, -780, -748, -716, -684, -652,
    -620, -588, -556, -524, -492, -460, -428, -396,
    -372, -356, -340, -324, -308, -292, -276, -260,
    -244, -228, -212, -196, -180, -164, -148, -132,
    -120, -112, -104, -96, -88, -80, -72, -64,
    -56, -48, -40, -32, -24, -16, -8, 0
  ]);

  /**
   * Converts a Buffer of µ-law encoded audio data to PCM.
   *
   * @param buffer Buffer of 8-bit µ-law values
   * @return Uint8Array of 16-bit PCM values. Each byte of µ-law converts
   *         to 2 bytes of PCM, so the output array is twice as long as
   *         the input. Pairs of PCM bytes are little-endian, i.e. the
   *         least-significant byte is first in the pair.
   */
  public static transcode(buffer: Buffer): Uint8Array {
    const mulawBytes: Uint8Array = this.toArrayBuffer(buffer);
    const output = new Uint8Array(mulawBytes.length * 2);

    for (let i = 0; i < mulawBytes.length; i++) {
      // The Java original indexed the table with (signedByte + 128); for
      // the unsigned bytes of a Uint8Array the equivalent index is
      // (byte ^ 0x80), which stays within 0..255.
      const pcmData: number = this.mulawMapping[mulawBytes[i] ^ 0x80];

      // least-significant byte first
      output[2 * i] = pcmData & 0xff;
      // most-significant byte second
      output[2 * i + 1] = pcmData >> 8;
    }

    return output;
  }

  private static toArrayBuffer(buffer: Buffer): Uint8Array {
    const view = new Uint8Array(buffer.length);
    for (let i = 0; i < buffer.length; ++i) {
      view[i] = buffer[i];
    }
    return view;
  }
}
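[Editor's note: as a side check, not from the thread, the little-endian byte packing that transcode relies on can be verified with a quick round-trip using a DataView.]

```javascript
// Pack signed 16-bit samples as little-endian byte pairs, the same layout
// MulawToPcm.transcode emits (low byte first, high byte second).
function packInt16LE(samples) {
  const out = new Uint8Array(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    out[2 * i] = samples[i] & 0xff;            // least-significant byte first
    out[2 * i + 1] = (samples[i] >> 8) & 0xff; // most-significant byte second
  }
  return out;
}

// Read the byte pairs back as signed 16-bit little-endian samples.
function unpackInt16LE(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Int16Array(bytes.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(2 * i, true); // true = little-endian
  }
  return samples;
}
```

If the round-trip preserves every sample, any remaining distortion comes from the µ-law table lookup or from how the payload was base64-decoded, not from the byte packing.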
