HTML5 Rocks

Web apps that talk - Introduction to the Speech Synthesis API

By Eric Bidelman

The Web Speech API adds voice recognition (speech to text) and speech synthesis (text to speech) to JavaScript. This post briefly covers the latter, as the API recently landed in Chrome 33 (mobile and desktop). If you're interested in speech recognition, Glen Shires had a great writeup a while back on the voice recognition feature, "Voice Driven Web Apps: Introduction to the Web Speech API".

Basics

The most basic use of the synthesis API is to pass speechSynthesis.speak() an utterance:

var msg = new SpeechSynthesisUtterance('Hello World');
window.speechSynthesis.speak(msg);

Try it!

However, you can also alter parameters to affect the volume, speech rate, pitch, voice, and language:

var msg = new SpeechSynthesisUtterance();
var voices = window.speechSynthesis.getVoices();
msg.voice = voices[10]; // Note: some voices don't support altering params
msg.voiceURI = 'native';
msg.volume = 1; // 0 to 1
msg.rate = 1; // 0.1 to 10
msg.pitch = 2; // 0 to 2
msg.text = 'Hello World';
msg.lang = 'en-US';

msg.onend = function(e) {
  console.log('Finished in ' + e.elapsedTime + ' seconds.');
};

speechSynthesis.speak(msg);

Setting a voice

The API also allows you to get a list of the voices the engine supports:

speechSynthesis.getVoices().forEach(function(voice) {
  console.log(voice.name, voice.default ? '(default)' :'');
});

Then, to use a different voice, set .voice on the utterance object:

var msg = new SpeechSynthesisUtterance('I see dead people!');
msg.voice = speechSynthesis.getVoices().filter(function(voice) { return voice.name == 'Whisper'; })[0];
speechSynthesis.speak(msg);

Demo

In my Google I/O 2013 talk, "More Awesome Web: features you've always wanted" (www.moreawesomeweb.com), I showed a Google Now/Siri-like demo of using the Web Speech API's SpeechRecognition service with the Google Translate API to auto-translate microphone input into another language:

DEMO: http://www.moreawesomeweb.com/demos/speech_translate.html

Unfortunately, it used an undocumented (and unofficial) API to perform the speech synthesis. Well, now we have the full Web Speech API to speak back the translation! I've updated the demo to use the synthesis API.
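
For illustration, the speak-back step can be as simple as handing the translated string to an utterance and setting its language. This is just a sketch; translatedText and targetLang are placeholder names standing in for whatever the translation step returns:

// Placeholder values standing in for the translation step's output.
var translatedText = 'Hola mundo';
var targetLang = 'es-ES';

var utterance = new SpeechSynthesisUtterance(translatedText);
utterance.lang = targetLang; // speak using a voice for the target language
window.speechSynthesis.speak(utterance);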

Browser Support

Chrome 33 has full support for the Web Speech API, while Safari for iOS7 has partial support.

Feature detection

Since browsers may support each portion of the Web Speech API separately (as is the case with Chromium, for example), you may want to detect support for each feature on its own:

if ('speechSynthesis' in window) {
 // Synthesis support. Make your web apps talk!
}

if ('SpeechRecognition' in window) {
  // Speech recognition support. Talk to your apps!
}

Voice Driven Web Apps: Introduction to the Web Speech API

By Glen Shires

The new JavaScript Web Speech API makes it easy to add speech recognition to your web pages. This API allows fine control and flexibility over the speech recognition capabilities in Chrome version 25 and later. Here's an example with the recognized text appearing almost immediately while speaking.

DEMO / SOURCE

Let’s take a look under the hood. First we check to see if the browser supports the Web Speech API by checking if the webkitSpeechRecognition object exists. If not, we suggest the user upgrade their browser. (Since the API is still experimental, it's currently vendor prefixed.) Lastly, we create the webkitSpeechRecognition object, which provides the speech interface, and set some of its attributes and event handlers.

if (!('webkitSpeechRecognition' in window)) {
  upgrade();
} else {
  var recognition = new webkitSpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;

  recognition.onstart = function() { ... }
  recognition.onresult = function(event) { ... }
  recognition.onerror = function(event) { ... }
  recognition.onend = function() { ... }
  ...

The default value for continuous is false, meaning that when the user stops talking, speech recognition will end. This mode is great for simple text like short input fields. In this demo, we set it to true, so that recognition will continue even if the user pauses while speaking.

The default value for interimResults is false, meaning that the only results returned by the recognizer are final and will not change. The demo sets it to true so we get early, interim results that may change. Watch the demo carefully: the grey text is interim and sometimes changes, whereas the black text consists of responses from the recognizer that are marked final and will not change.

To get started, the user clicks on the microphone button, which triggers this code:

function startButton(event) {
  ...
  final_transcript = '';
  recognition.lang = select_dialect.value;
  recognition.start();

We set the spoken language for the speech recognizer "lang" to the BCP-47 value that the user has selected via the selection drop-down list, for example “en-US” for English-United States. If this is not set, it defaults to the lang of the HTML document root element and hierarchy. Chrome speech recognition supports numerous languages (see the “langs” table in the demo source), as well as some right-to-left languages that are not included in this demo, such as he-IL and ar-EG.

After setting the language, we call recognition.start() to activate the speech recognizer. Once it begins capturing audio, it calls the onstart event handler, and then for each new set of results, it calls the onresult event handler.

  recognition.onresult = function(event) {
    var interim_transcript = '';

    for (var i = event.resultIndex; i < event.results.length; ++i) {
      if (event.results[i].isFinal) {
        final_transcript += event.results[i][0].transcript;
      } else {
        interim_transcript += event.results[i][0].transcript;
      }
    }
    final_transcript = capitalize(final_transcript);
    final_span.innerHTML = linebreak(final_transcript);
    interim_span.innerHTML = linebreak(interim_transcript);
  };
}

This handler concatenates all the results received so far into two strings: final_transcript and interim_transcript. The resulting strings may include "\n", such as when the user speaks “new paragraph”, so we use the linebreak function to convert these to HTML tags <br> or <p>. Finally it sets these strings as the innerHTML of their corresponding <span> elements: final_span which is styled with black text, and interim_span which is styled with gray text.

interim_transcript is a local variable, and is completely rebuilt each time this event is called because it’s possible that all interim results have changed since the last onresult event. We could do the same for final_transcript simply by starting the for loop at 0. However, because final text never changes, we’ve made the code here a bit more efficient by making final_transcript a global, so that this event can start the for loop at event.resultIndex and only append any new final text.

That’s it! The rest of the code is there just to make everything look pretty. It maintains state, shows the user some informative messages, and swaps the GIF image on the microphone button between the static microphone, the mic-slash image, and mic-animate with the pulsating red dot.

The mic-slash image is shown when recognition.start() is called, and then replaced with mic-animate when onstart fires. Typically this happens so quickly that the slash is not noticeable, but the first time speech recognition is used, Chrome needs to ask the user for permission to use the microphone, in which case onstart only fires when and if the user allows permission. Pages hosted on HTTPS do not need to ask repeatedly for permission, whereas HTTP hosted pages do.
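
A rough sketch of that image swap (the element name and image paths here are placeholders, not the demo's actual code):

// Show the slash image while waiting for microphone permission
// (inside the startButton handler, right before recognition.start()):
mic_image.src = 'mic-slash.gif';

// ...and switch to the animated mic once audio capture actually begins.
recognition.onstart = function() {
  mic_image.src = 'mic-animate.gif';
};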

So make your web pages come alive by enabling them to listen to your users!

We’d love to hear your feedback...

Live Web Audio Input Enabled!

By Chris Wilson

I'm really excited by a new feature that went into yesterday's Chrome Canary build (23.0.1270.0) - the ability to get low-latency access to live audio from a microphone or other audio input on OSX! (This has not yet been enabled on Windows - but don't worry, we're working on it!)

[UPDATE Oct 8, 2012: live audio input is now enabled for Windows, as long as the input and output device are using the same sample rate!]

To enable this, you need to go into chrome://flags/ and enable the "Web Audio Input" item near the bottom, and relaunch the browser; now you're ready to roll! Note: If you're using a microphone, you may need to use headphones for any output in order to avoid feedback. If you are using a different audio source, such as a guitar or external audio feed, or there's no audio output from the demo, this may not be a problem. You can test out live audio input by checking out the spectrum of your input using the live input visualizer.

For those Web Audio coders among you, here's how to request the audio input stream, and get a node to connect to any processing graph you like!

// success callback when requesting audio input stream
function gotStream(stream) {
    window.AudioContext = window.AudioContext || window.webkitAudioContext;
    var audioContext = new AudioContext();

    // Create an AudioNode from the stream.
    var mediaStreamSource = audioContext.createMediaStreamSource( stream );

    // Connect it to the destination to hear yourself (or any other node for processing!)
    mediaStreamSource.connect( audioContext.destination );
}

navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia;
navigator.getUserMedia( {audio:true}, gotStream );

There are many rich possibilities for low-latency audio input, particularly in the musical space. You can see a quick example of how to make use of this in a simple pitch detector I threw together - try plugging in a guitar, or even just whistling into the microphone.
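
If you're curious how a pitch detector can sit on top of this, here's a naive autocorrelation sketch (my own illustration, not the demo's code), reusing the audioContext and mediaStreamSource names from the snippet above:

// Feed the live input into an analyser so we can read time-domain samples.
var analyser = audioContext.createAnalyser();
analyser.fftSize = 2048;
mediaStreamSource.connect( analyser );

function estimatePitch() {
  var buf = new Uint8Array( analyser.fftSize );
  analyser.getByteTimeDomainData( buf ); // samples centered around 128

  // Find the lag where the signal best matches a shifted copy of itself.
  var bestLag = -1, bestCorrelation = 0;
  for (var lag = 40; lag < 1000; lag++) { // roughly 44 Hz to 1.1 kHz at 44.1 kHz
    var correlation = 0;
    for (var i = 0; i < buf.length - lag; i++) {
      correlation += (buf[i] - 128) * (buf[i + lag] - 128);
    }
    if (correlation > bestCorrelation) {
      bestCorrelation = correlation;
      bestLag = lag;
    }
  }
  return bestLag > 0 ? audioContext.sampleRate / bestLag : 0; // estimated Hz
}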

And, as promised, I've added live audio as an input source to the Vocoder I wrote for Google IO - just select "live input" under modulator. You may need to adjust the Modulator Gain and the Synth Level. There's a slight lag due to processing (not due to input latency). Now that I have live audio input, it's time for another round of tweaking!

Finally, you may want to take a look at the collection of my web audio demos - by the time you read this, I may have some more live audio demos up!

WebGL and Web Audio API demo roundup

By Ilmari Heikkinen

Here’s a look at some cool WebGL and Web Audio API demos that I’ve seen over the past couple weeks.

EVE Online ship viewer is a great-looking online ship viewer app built with WebGL. Very nice way to showcase the artwork in the game universe.

Web Audio API samples page has several compelling examples on how to do audio processing using it. WebGL City is one of the demos linked from the samples page. It's a small demo of a helicopter flying around a night cityscape. The helicopter (disable music by pressing 'm', enable helicopter sound by pressing 'n') uses the Web Audio API's spatial audio features to pan the helicopter audio from one speaker to the other.
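
For a rough sense of how that panning can be wired up (a sketch with made-up names and positions, not the demo's actual code), you route the sound's source through a panner node and move its position over time:

// Sketch: place a sound in space with a panner node.
var panner = context.createPanner();
helicopterSource.connect(panner);    // helicopterSource is a placeholder name
panner.connect(context.destination);

// Negative x is to the listener's left, positive x to the right.
function moveHelicopter(x) {
  panner.setPosition(x, 0, 0);
}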

Some enterprising soul implemented a snake game using nothing but a WebGL fragment shader on the GLSL Sandbox. I’m flabbergasted.

The Big Bang may look like any other WebGL particle animation, but the particle simulation is actually run on the GPU. The simulator is a fragment shader that reads the previous particle positions from a texture and writes the new particle positions into an FBO texture.

Blocky Earth takes Google Earth data and MineCrafts it. It communicates differences in height well. For example, I was looking at Australia and the Antarctic ice sheet, and you can see how the continental ice is several kilometers thick.

The Midem Music Machine is a fun music demo by Mr.doob and Paul Lamere. It's a sort of ball-driven music box, with balls bouncing off bits 'n' bops. CreativeJS has a good writeup on it; check it out.

Continuing on the computer music visualization theme, I recently ran across this page about bytebeat, a form of music generated from minimalistic code formulas. The page links to one cool WebGL visualization of the music. Gregg Tavares took to the idea and built a bytebeat sandbox for making and sharing your own bytebeat tunes directly from the browser.
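
To give a flavor of what bytebeat actually is, here's a tiny sketch that evaluates a classic one-line formula from the bytebeat scene into an AudioBuffer and plays it (the Web Audio wiring is my own illustration):

var context = new webkitAudioContext();
var seconds = 5;
var buffer = context.createBuffer(1, seconds * context.sampleRate, context.sampleRate);
var data = buffer.getChannelData(0);

for (var i = 0; i < data.length; i++) {
  var t = Math.floor(i * 8000 / context.sampleRate); // formulas are traditionally run at 8 kHz
  var sample = (t * (42 & (t >> 10))) & 255;         // one classic bytebeat formula
  data[i] = sample / 128 - 1;                        // map 0..255 onto -1..1
}

var source = context.createBufferSource();
source.buffer = buffer;
source.connect(context.destination);
source.noteOn(0);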

HTML5 <audio> and the Web Audio API are BFFs!

By Eric Bidelman

DEMO

As part of the MediaStream Integration with WebRTC, the Web Audio API recently landed an undercover gem known as createMediaElementSource(). Basically, it allows you to hook up an HTML5 <audio> element as the input source to the API. In layman's terms...you can visualize HTML5 audio, do realtime sound mutations, filtering, etc!

Normally, the Web Audio API works by loading a song via XHR2, file input, whatever... and you're off. Instead, this hook allows you to combine HTML5 <audio> with the visualization, filter, and processing power of the Web Audio API.

Integrating with <audio> is ideal for streaming fairly long audio assets. Say your file is 2 hours long. You don't want to decode that entire thing! It's also interesting if you want to build a high-level media player API (and UI) for play/pause/seek, but wish to apply some additional processing/analysis.

Here's what it looks like:

// Create an <audio> element dynamically.
var audio = new Audio();
audio.src = 'myfile.mp3';
audio.controls = true;
audio.autoplay = true;
document.body.appendChild(audio);

var context = new webkitAudioContext();
var analyser = context.createAnalyser();

// Wait for window.onload to fire. See crbug.com/112368
window.addEventListener('load', function(e) {
  // Our <audio> element will be the audio source.
  var source = context.createMediaElementSource(audio);
  source.connect(analyser);
  analyser.connect(context.destination);

  // ...call requestAnimationFrame() and render the analyser's output to canvas.
}, false);

As noted in the code, there's a bug that requires the source setup to happen after window.onload.

The next logical step is to fix crbug.com/112367. Once that puppy is ready, you'll be able to wire up WebRTC (the navigator.getUserMedia() API in particular) to pipe audio input (e.g. mic, mixer, guitar) to an <audio> tag, then visualize it using the Web Audio API. Mega boom!

Web Audio FAQ

By Boris Smus

Over the past few months, the WebKit Web Audio API has emerged as a compelling platform for games and audio applications on the web. As developers familiarize themselves with it, I hear similar questions creep up repeatedly. This quick update is an attempt to address some of the more frequently asked questions to make your experience with the Web Audio API more pleasant.

Q: Halp, I can't make sounds!

A: If you're new to the Web Audio API, take a look at the getting started tutorial, or Eric's recipe for playing audio based on user interaction.

Q: How many Audio Contexts should I have?

A: Generally, you should include one AudioContext per page, and a single audio context can support many nodes connected to it. Though you may include multiple AudioContexts on a single page, this can lead to a performance hit.
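
In practice that just means creating the context once, up front, and reusing it for every node you make. A minimal sketch, using the prefixed constructor and method names of the time:

// One shared context for the whole page.
var context = new webkitAudioContext();

// All nodes hang off the same context.
var gainNode = context.createGainNode();
var analyser = context.createAnalyser();
gainNode.connect(analyser);
analyser.connect(context.destination);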

Q: I’ve got an AudioBufferSourceNode that I just played back with noteOn(), and I want to play it again, but noteOn() doesn’t do anything! Help!

A: Once a source node has finished playing back, it can’t be played back again. To play back the underlying buffer again, create a new AudioBufferSourceNode and call noteOn().

Though re-creating the source node may feel inefficient, source nodes are heavily optimized for this pattern. Plus, if you keep a handle to the AudioBuffer, you don't need to make another request to the asset to play the same sound again. If you find yourself needing to repeat this pattern, encapsulate playback with a simple helper function like playSound(buffer).
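
A minimal version of that helper might look like this (a sketch, assuming a shared context variable and the noteOn() method used at the time):

// Create a fresh source node per playback; the decoded AudioBuffer is reused.
function playSound(buffer) {
  var source = context.createBufferSource();
  source.buffer = buffer;
  source.connect(context.destination);
  source.noteOn(0); // start playback immediately
  return source;
}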

Q: When playing back a sound, why do you need to make a new source node every time?

A: The idea of this architecture is to decouple the audio asset from its playback state. Taking a record player analogy, buffers are analogous to records and sources to play-heads. Because many applications involve multiple versions of the same buffer playing simultaneously, this pattern is essential.

Q: How can I process sound from audio and video tags?

A: MediaElementAudioSourceNode is in the works! When available, it will work roughly as follows (adding a filter effect to a sample playing via the audio tag):

<audio src="sounds/sample.wav" controls></audio>

var audioElement = document.querySelector('audio');
var mediaSourceNode = context.createMediaElementSource(audioElement);
mediaSourceNode.connect(filter);
filter.connect(context.destination);

This feature is tracked in this crbug. Note that in this setup, there is no need to call mediaSourceNode.noteOn(); the audio tag controls playback.

Q: When can I get sound from a Microphone?

A: The audio input part of this will be implemented as part of WebRTC using getUserMedia, and will be available as a special source node in the Web Audio API. It will work in conjunction with createMediaElementSource.

Q: How can I check when an AudioSourceNode has finished playing?

A: Currently you have to use a JavaScript timer, since the Web Audio API does not support this functionality. The following snippet from the Getting Started with Web Audio API tutorial is an example of this in action:

// Assume source and buffer are previously defined.
source.noteOn(0);
var timer = setTimeout(function() {
  console.log('playback finished');
}, buffer.duration * 1000);

There is an open bug to make Web Audio API implement a more accurate callback.

Q: Loading sounds causes the whole UI thread to lock up and my UI becomes unresponsive. Help!

A: Use the decodeAudioData API for asynchronous loading to avoid blocking the main thread. See this example.
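
A rough sketch of that pattern (the URL is a placeholder, and playSound is the kind of helper described above):

// Fetch the asset as an ArrayBuffer and decode it off the main thread.
var request = new XMLHttpRequest();
request.open('GET', 'sounds/sample.wav', true);
request.responseType = 'arraybuffer';
request.onload = function() {
  context.decodeAudioData(request.response, function(buffer) {
    playSound(buffer); // decoding finished; safe to play
  }, function(e) {
    console.log('Error decoding audio data', e);
  });
};
request.send();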

Q: Can the Web Audio API be used to process sounds faster than realtime?

A: Yes, a solution is being worked on. Please stay tuned!

Q: I’ve made an awesome Web Audio API application, but whenever the tab it's running in goes into the background, sounds go all weird!

A: This is probably because you are using setTimeouts, which behave differently when the page is in the background. In the future, the Web Audio API will be able to call back at specific times using its internal timer (the context.currentTime attribute). For more information, please see this feature request.

In general, it may be a good idea to stop playback when your app goes into the background. You can detect when a page goes to the background using the Page Visibility API.
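
A small sketch of that idea, using the prefixed Page Visibility API of the time (masterGain stands in for whatever gain node your app routes its output through):

document.addEventListener('webkitvisibilitychange', function() {
  if (document.webkitHidden) {
    masterGain.gain.value = 0; // page went to the background: silence output
  } else {
    masterGain.gain.value = 1; // page is visible again: restore output
  }
}, false);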

Q: How can I change the pitch of a sound using the Web Audio API?

A: Change the playbackRate on the source node.
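
For example, playing a buffer at twice the speed (which also raises the pitch by an octave):

var source = context.createBufferSource();
source.buffer = buffer;
source.playbackRate.value = 2; // double speed, one octave up
source.connect(context.destination);
source.noteOn(0);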

Q: Can I change pitch without changing speed?

A: The Web Audio API could have a PitchNode in the audio context, but this is hard to implement. This is because there is no straightforward pitch-shifting algorithm in the audio community. Known techniques create artifacts, especially in cases where the pitch shift is large. There are two kinds of approaches to tackle this problem:

  • Time domain algorithms, which cause repeated-segment echo artifacts.
  • Frequency domain techniques, which cause reverberant sound artifacts.

Though there is no native node for doing these techniques, you can do it with a JavaScriptAudioNode. This code snippet might serve as a starting point.

Q: How can I create an AudioContext at a sample rate of my choosing?

A: Currently there is no support for this, but we’re looking into it. See this feature request.

If you have additional questions, feel free to ask them on StackOverflow using the web-audio tag.

Introducing Video Player Sample

By Pete LePage

Have you ever wanted a fun and beautiful way to publish videos on your own site like the new 60 Minutes or RedBull.tv apps from the Chrome Web Store? I'm excited to announce the release of The Video Player Sample web app! The Video Player Sample is an open source video player web app built using the same architecture as the 60 Minutes and RedBull.tv apps. It can be customized, extended, or just used out of the box and populated with your own content.

How it works

When a user opens the Video Player Sample, they can choose to watch a single video or create a playlist of videos/episodes from the content that has been added to the app. The app's configuration and the information about the videos are stored in JSON files (config.json and data.json, respectively), both of which are located in the data directory.

Key features

  • A beautiful video watching experience, including a full screen view
  • Ability to subscribe to shows, watch episodes, create play lists
  • Support for multiple video formats depending on what the user’s browser supports (including WebM, Ogg, MP4, and even a Flash fallback)
  • A Categories page with an overview of the different shows/categories available in the app
  • Notifications of new episodes (when the app is installed via the Chrome Web Store)
  • Built in support for sharing to Google+, Twitter and Facebook
  • To ensure easy customization, all source files, including the Photoshop PSDs, are included

How it's built

The Video Player Sample is written for the open web platform using HTML and JavaScript, broadly following the Model View Controller pattern and structure.

Browser Support

In addition to working as an app that can be installed through the Chrome Web Store, the Video Player Sample has been tested and works in all of the modern browsers.

Try it out

You can see a demo of the video player in action in the demo app, or by adding it to Chrome through the Chrome Web Store. To learn more about how the app works, check out the documentation.

You can grab the code from Google Code.

Enjoy!

"Stream" video using the MediaSource API

By Eric Bidelman

The MediaSource API extends the HTMLMediaElement to allow JavaScript to generate media streams for playback. Allowing JavaScript to generate streams facilitates a variety of use cases like adaptive streaming and time shifting live streams.

Here is a quick demo and example usage of the API:

const NUM_CHUNKS = 5;

var video = document.querySelector('video');
video.src = video.webkitMediaSourceURL;

video.addEventListener('webkitsourceopen', function(e) {
  var chunkSize = Math.ceil(file.size / NUM_CHUNKS);

  // Slice the video into NUM_CHUNKS and append each to the media element.
  for (var i = 0; i < NUM_CHUNKS; ++i) {
    var startByte = chunkSize * i;

    // file is a video file.
    var chunk = file.slice(startByte, startByte + chunkSize);

    var reader = new FileReader();
    reader.onload = (function(idx) {
      return function(e) {
        video.webkitSourceAppend(new Uint8Array(e.target.result));
        logger.log('appending chunk:' + idx);
        if (idx == NUM_CHUNKS - 1) {
          video.webkitSourceEndOfStream(HTMLMediaElement.EOS_NO_ERROR);
        }
      };
    })(i);

    reader.readAsArrayBuffer(chunk);
  }
}, false);

DEMO

The example splits a .webm video into 5 chunks using the File APIs. The entire movie is then 'streamed' to a <video> tag by appending each chunk to the element using the MediaSource API.

If you're interested in learning more about the API, see the specification.

Support: Currently, the MediaSource API is only available in Chrome Dev Channel 17+ with the --enable-media-source flag set or enabled via about:flags.

Let Your Content Do the Talking: Fullscreen API

By Eric Bidelman

Most browsers have had the ability to enter a fullscreen or kiosk mode for a while now. Basically, the browser's chrome UI gets out of the way and the content takes over. For apps installed from the Chrome Web Store, it's even been possible for users to manually configure an app to run fullscreen when it's opened from the New Tab Page. Manual fullscreen is good. Programmatic fullscreen is better!

The Fullscreen API allows web apps to programmatically tell any content on the page to enter the browser's fullscreen viewing mode, from JavaScript. This means WebGL and <canvas> games can finally become fully immersive, videos can feel like the silver screen, and online magazines can feel like the real deal. I love the browser, but we shouldn't be constrained by it :)

If you want to skip the details, here's a demo:

So how does the API work? If you wanted a <div>, for example, to go fullscreen, simply tell it to:

div.webkitRequestFullscreen(Element.ALLOW_KEYBOARD_INPUT);
div.mozRequestFullScreen();
div.msRequestFullscreen();
div.requestFullscreen(); // standard

Then to exit fullscreen, the document exposes a method for that:

document.webkitExitFullscreen();
document.mozCancelFullScreen();
document.msExitFullscreen();
document.exitFullscreen();

Note: exitFullscreen() is the spec version. However, both FF and WebKit still expose the prefixed cancelFullScreen().
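
If you'd like one entry point that works across today's prefixes, a small wrapper like this sketch does the trick (it simply checks which method exists; it's not part of the API itself):

function enterFullscreen(elem) {
  if (elem.requestFullscreen) {
    elem.requestFullscreen();
  } else if (elem.webkitRequestFullscreen) {
    elem.webkitRequestFullscreen(Element.ALLOW_KEYBOARD_INPUT);
  } else if (elem.mozRequestFullScreen) {
    elem.mozRequestFullScreen();
  } else if (elem.msRequestFullscreen) {
    elem.msRequestFullscreen();
  }
}

function exitFullscreen() {
  if (document.exitFullscreen) {
    document.exitFullscreen();
  } else if (document.webkitExitFullscreen) {
    document.webkitExitFullscreen();
  } else if (document.mozCancelFullScreen) {
    document.mozCancelFullScreen();
  } else if (document.msExitFullscreen) {
    document.msExitFullscreen();
  }
}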

Content in fullscreen mode is centered in the browser's viewport. It's left to the developer to style that content in the most appropriate way for viewing. Typically, you'll want your <div> to take up the entire screen real estate. The good news is that the API includes new CSS pseudo-selectors for this:

div:-webkit-full-screen {
  width: 100% !important;
}
div:-moz-full-screen {
  width: 100% !important;
}
div:-ms-fullscreen {
  width: 100% !important;
}
div:fullscreen {
  width: 100% !important;
}

/* While in fullscreen, hide any children with class 'tohide' */   
:-webkit-full-screen .tohide {
  display: none;
}
:-moz-full-screen .tohide {
  display: none;
}
:-ms-fullscreen .tohide {
  display: none;
}
:fullscreen .tohide {
  display: none;
}

CSS pseudo-selectors make it very easy to style fullscreen content any way you want.

The Fullscreen API is enabled by default in Chrome 15, FF 14, IE 11 and Opera 12.1. For more information on the rest of the API, see the spec.

Updated 2012-10-11: to be in line with spec changes. Lowercased the "S" in requestFullscreen() and changed document.webkitCancelFullScreen() to document.webkitExitFullscreen(). Updated browser compatibility comment.

Updated 2014-02-11: to include prefixes for IE, add the standard CSS syntax, and update the browser compatibility comment. Thanks @dstorey and @patrickkettner.