Say It Ain't So: Implementing Speech Recognition in Your App

SiriKit was one of the more talked about features announced at WWDC this year; unfortunately, its initial implementation is limited to a small number of use cases. But all is not lost! Apple introduced a collection of general purpose Speech APIs in iOS 10 that provide simple speech-to-text conversion from streaming voice or audio files in over 50 languages. In this talk from try! Swift, Marc Brown walks you through the new Speech APIs, discusses their limitations, and delivers a practical use case by adding speech recognition to a text-based search app.


Introduction (0:00)

I manage the mobile team at Blue Apron. Our mission is to make incredible home cooking accessible to everyone, and our mobile team is doing that in a 100% Swift code base. I also organize the Brooklyn Swift Developers meetup. Today, I am going to be talking about implementing speech recognition in your app.

A (Brief) History of Speech Recognition (1:03)

  • In 1952, Bell Labs built the first speech recognition device. It was limited to numbers only.
  • By the late ’50s, they were able to get it up to about 10 words. However, it could only recognize a single speaker - it would only work for the person who trained it.
  • In the ’60s, a group of Soviet researchers created the dynamic time warping algorithm, which made it easier to find the same word at different speaking speeds. With dynamic time warping, we were also able to see a decent increase in word count to about 200.
  • In the ’70s, DARPA funded the Speech Understanding Research Program, which was a five year initiative with the lofty goal of creating a system capable of a 1,000 word vocabulary (the vocabulary of a three year old). A group at Carnegie Mellon achieved this in 1976 through the use of beam search.
  • The ’80s brought us many advancements, one of them being a voice-activated typewriter called Tangora, created by IBM. It was capable of a 20,000 word vocabulary. We also started seeing the first uses of the hidden Markov model in speech recognition.
  • In the ’90s, the system vocabularies grew larger than a typical adult’s. Some other breakthroughs were:
    • Continuous speech recognition: you could have a more conversational tone with the device, versus having to pause after every single word.
    • Speaker independence: if one person trained it, anyone else could use it, albeit with some minimal voice onboarding. This opened the floodgates for commercially viable speech recognition products, with a company called Dragon leading the way with its “NaturallySpeaking” program.
  • In the early 2000s, speech recognition was hovering around 80% accuracy (which sounds high until you think about having to autocorrect one out of every five words). This era also saw the first introduction of long short-term memory, or LSTM, the first deep learning method applied to speech recognition.
  • By 2007, LSTM was outperforming all of the traditional methods. 2007 was also a banner year because of the advent of smartphones. Additional commercial products became available - Google Voice Search and later Siri were able to acquire massive amounts of data to train their models, to the point where today, accuracy rates are closer to the mid-90% range.

WWDC 2016: Introducing SiriKit (5:00)

Fast forward to WWDC this year - Apple finally opens up Siri to third party developers (something that we have wanted for a number of years).

In order to grasp the potential impact of this, let’s pretend that you work for the hot new startup: Weezer Pay. It allows you to send and receive payments with the band members of Weezer. It is a niche market, but highly underserved. If you added SiriKit support, you could say something like, “Hey Siri, send Rivers Cuomo five dollars on Weezer Pay for recording the Blue Album”. SiriKit will then deconstruct that. It can determine that the domain is “payments” and that the intent is to actually send a payment. From there, it figures out that you want to use the Weezer Pay app, so it digs into more payment-specific details (e.g. the actual payee, the amount you want to send, and the reason you want to send it).

This is great, except for one thing: the list of initially supported domains. It is great if you work for Slack, Uber, or even Weezer Pay, but not so great for everyone else.

However, all is not lost, because at WWDC they also introduced a collection of speech APIs.

Speech Framework (7:21)

The Speech framework uses the same speech technology as Siri. You can stream live audio from the device’s microphone, or pull in a prerecorded audio file from disk. Not only does it return the recommended transcription, but it also returns an array of alternative transcriptions, as well as their associated confidence values. It supports 50+ languages and dialects.

Info.plist (7:57)

Beginning in iOS 10, you are required to explain why you need access to protected services (e.g. the user’s location). For our purposes, we need access to two services: speech recognition and the microphone. You need to make sure that you add both NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription to your Info.plist.
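
As a sketch, the two entries might look something like this (the usage strings below are placeholders; write your own explanation of why you need each service):

<key>NSSpeechRecognitionUsageDescription</key>
<string>Your voice searches are sent to Apple’s servers for speech-to-text conversion.</string>
<key>NSMicrophoneUsageDescription</key>
<string>The microphone is used to capture your voice searches.</string>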

Import Speech (8:26)

Next step, import Speech. This is where you get all the fun APIs.

Speech Recognizer (8:55)

// Specific language
private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))

// Native language
private let speechRecognizer = SFSpeechRecognizer(locale: Locale.current)

The speech recognizer allows you to define the context of the query by setting the Locale. You can set a specific language (e.g. US English), or you could even use the user’s current locale.

If you are wondering whether or not the speech APIs will support these different locales, you are probably covered: we are talking 58 total locales. This is great because it allows you to connect with your users in their language.
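
If you want to verify support at runtime, SFSpeechRecognizer exposes supportedLocales() and isAvailable. A minimal sketch (the fallback to US English is my own choice, not from the talk):

// Fall back to US English if the user’s current locale is not supported.
let supported = SFSpeechRecognizer.supportedLocales()
let locale = supported.contains(Locale.current) ? Locale.current : Locale(identifier: "en-US")
let speechRecognizer = SFSpeechRecognizer(locale: locale)

// The recognizer can also become temporarily unavailable (e.g. no network),
// so check isAvailable before kicking off a request.
if speechRecognizer?.isAvailable == true {
    // OK to start a recognition request
}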

Request Authorization (9:21)

SFSpeechRecognizer.requestAuthorization { authStatus in
    OperationQueue.main.addOperation {

        switch authStatus {
            case .authorized:
                // Enable your recording UI.
                break
            case .denied:
                // The user declined; keep the recording UI disabled.
                break
            case .restricted:
                // Speech recognition is restricted on this device.
                break
            case .notDetermined:
                // The user has not been asked yet.
                break
        }
    }
}

The appropriate time to prompt for access to both speech recognition and the microphone is when the user initiates their first request; app launch is probably not the right time to do it. When they push a mic button and are ready to speak, that is when you want to ask.

Let’s walk through constructing a live speech-to-text query.

Recognition Request (9:52)

The recognitionRequest defines your audio source.

private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?

...

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

recognitionRequest?.shouldReportPartialResults = true

There’s a difference between live and prerecorded speech here. Use SFSpeechAudioBufferRecognitionRequest() if you are doing a live stream. If you want to reference a file on disk, use SFSpeechURLRecognitionRequest(). Make sure to set shouldReportPartialResults to true if you want to display the query during processing. This is helpful when the user is saying a long sentence and you want to show immediate feedback.
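
For the prerecorded case, a rough sketch might look like this (the file name is just an example):

if let fileURL = Bundle.main.url(forResource: "voice-memo", withExtension: "m4a") {
    let fileRequest = SFSpeechURLRecognitionRequest(url: fileURL)
    fileRequest.shouldReportPartialResults = false  // the whole file is already on disk

    // In a real app, keep a reference to the returned task so you can cancel it.
    _ = speechRecognizer?.recognitionTask(with: fileRequest) { result, error in
        if let result = result, result.isFinal {
            print(result.bestTranscription.formattedString)
        }
    }
}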

Recognition Task (10:31)

A recognitionTask handles the result of your recognitionRequest.

private var recognitionTask: SFSpeechRecognitionTask?

...

guard let recognitionRequest = recognitionRequest else { return }

recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) {
    result, error in
    var isFinal = false

    if let result = result {
        // Show the best transcription so far back to the user.
        print(result.bestTranscription.formattedString)
        isFinal = result.isFinal
    }

    if error != nil || isFinal {
        // Done (e.g. the 60 sec limit was reached) - stop recording and clean up.
    }
}

Here, we display the results back to the user. Apple sets a 60-second limit on an individual speech request, so if you happen to reach that threshold, it will return isFinal as true.
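
If you also want the alternative transcriptions and confidence values mentioned earlier, you can inspect them inside the same result handler. A small sketch (segment confidences are only populated once the result is final):

if let result = result, result.isFinal {
    for transcription in result.transcriptions {
        // Each transcription is made up of segments, each with its own confidence value.
        let confidences = transcription.segments.map { $0.confidence }
        print("\(transcription.formattedString) -> \(confidences)")
    }
}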

Audio Session (11:05)

Use AVAudioSession to initialize your audio:

let audioSession = AVAudioSession.sharedInstance()

try audioSession.setCategory(AVAudioSessionCategoryRecord)

try audioSession.setMode(AVAudioSessionModeMeasurement)

try audioSession.setActive(true, with: .notifyOthersOnDeactivation)

Capture Audio Stream (11:28)

This is another thing you do only when you are ready for input, because it will interrupt most of the audio streams on the device. There are certain audio streams that have higher priority (e.g. making a phone call).

private let audioEngine = AVAudioEngine()

...

let recordingFormat = audioEngine.inputNode.outputFormat(forBus: 0)

audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) {
    (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    self.recognitionRequest?.append(buffer)
}

audioEngine.prepare()

try audioEngine.start()

AVAudioEngine gives you access to the device’s audio input. This is where you define your recording format, install your audio tap, and kick off the whole thing. When everything is initialized and you are ready to go, call audioEngine.start().
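
The talk does not cover teardown, but a minimal sketch of stopping everything - for example, once isFinal comes back true - might look like this:

func stopRecording() {
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)

    // Tell the request that no more audio is coming so it can finish up.
    recognitionRequest?.endAudio()
    recognitionRequest = nil
    recognitionTask = nil
}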

SFSpeechRecognizerDelegate (11:51)

I also wanted to bring up SFSpeechRecognizerDelegate:

import Speech

public class ViewController: UIViewController, SFSpeechRecognizerDelegate {

    ...

    override public func viewDidLoad() {
        super.viewDidLoad()
        speechRecognizer?.delegate = self
    }

    ...

    public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer,
        availabilityDidChange available: Bool) {
        // Enable or disable your recording UI based on `available`.
    }
}

There is only one delegate method, availabilityDidChange. If you are processing a request and the device goes offline for some reason, the server stops responding; you will get an availability change, and you can then respond appropriately.

Gotchas (13:40)

  1. Make sure you set the correct permission strings in your Info.plist. You need to explain why you need access to these services; otherwise, your app will crash.
  2. There are usage limits on a per device and per app basis. Unfortunately, these are not published by Apple. If you are anticipating lots of usage for your particular use case, you might want to reach out to them.
  3. To reiterate, you only have up to 60 seconds of live audio.

UX Considerations (14:25)

  • Choose your context wisely. Think about the scenarios where it makes sense to use speech recognition. Text search makes some sense, and anywhere you have short dictation (e.g. a to-do list) makes sense. If you are making an app to dictate your novel, it’s less likely to work.
  • Your user’s sense of security is incredibly important. If you break their trust, it is going to be very hard to gain it back. As such, you always want to alert the user when you are recording their voice. Additionally, if you think that people will be hitting the max threshold of 60 secs, alert them when they are approaching it. Otherwise, they’ll just think the app stopped working.
  • International language support: instead of defaulting to English, this is your chance to allow users to speak their native language. That will improve the overall experience for everyone. Two caveats with this:
    1. If you have not localized your app into other languages, mention somewhere in your app that you support native-language speech recognition.
    2. If you are hitting endpoints, third-party APIs, or private APIs, you need to make sure that those endpoints will support native languages as well.

References (16:18)

I recommend watching the speech recognition video from WWDC by Henry Mason - it is very short (12 minutes!). I pulled much of my information from there. There is also the sample app, Speech Recognition Demo; feel free to play around with it!


Marc Brown

Marc is the Mobile Engineering Manager at Blue Apron and has been building iOS apps since 2009. Previously, he worked for Etsy and a handful of startups. Marc runs the Brooklyn Swift Meetup and loves encouraging others to learn Swift. In his spare time, he enjoys retweeting Arrested Development quotes.