Speech Recognition

A sub-field of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers.

About Speech Recognition

Speech recognition is the task of detecting spoken words, but it involves more than recognizing individual sounds in the audio: sequences of sounds need to match existing words, and sequences of words should make sense in the language. Modelling which word sequences are plausible is called “language modelling.” Language models are typically trained over very large corpora of text, often orders of magnitude larger than the acoustic data.
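
As a rough illustration of why word sequences matter, the sketch below scores two candidate transcriptions that sound alike and keeps the one the language model finds more plausible. It is a toy bigram model with add-one smoothing, not the method of any tool listed on this page; the three-sentence corpus, the candidate sentences and all function names are made up for the example.

```python
import math
from collections import Counter

# Toy training text (an assumption for this sketch); a real language model
# would be trained on a very large text corpus.
CORPUS = [
    "it is hard to recognize speech",
    "computers can recognize speech",
    "we walk on a nice beach",
]


def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_best(candidates, unigrams, bigrams):
    """Return the candidate transcription with the highest log-probability."""
    return max(candidates, key=lambda s: log_prob(s, unigrams, bigrams))


UNIGRAMS, BIGRAMS = train_bigram_model(CORPUS)
CANDIDATES = [
    "it is hard to recognize speech",
    "it is hard to wreck a nice beach",
]
BEST = pick_best(CANDIDATES, UNIGRAMS, BIGRAMS)
print(BEST)
```

In a real recogniser the language model score would be combined with an acoustic score for each candidate rather than used on its own, and the model would be far larger than a bigram table.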

Whilst speech recognition has been around for decades, recent advances in deep learning have finally made it accurate enough to be useful outside of carefully controlled environments. Speech recognition is built into our phones, our game consoles and our smart watches. It’s even automating our homes.

Common Tools and Libraries

AI Speech Lab

AI Singapore (AISG) has set up an AI Speech Lab to develop a speech recognition system that can interpret and process the unique vocabulary used by Singaporeans in conversation, including Singlish and dialects.

SpeechLab technology is available as a service for both batch and near-real-time processing. Please contact AI Singapore for further information.


Kaldi

Kaldi is an open-source toolkit for working with speech data. It is used in voice-related applications, mostly for speech recognition but also for other tasks such as speaker recognition and speaker diarisation.

Kaldi GStreamer: https://github.com/jcsilva/docker-kaldi-gstreamer-server


Porcupine

Porcupine is a self-service, highly accurate and lightweight wake word (voice control) engine. It enables developers to build always-listening, voice-enabled applications and platforms.

Developer's Resource: https://github.com/Picovoice/Porcupine


Google Cloud Speech-to-Text

Speech-to-text conversion powered by machine learning and available for short-form or long-form audio.

Developer's Resource: https://cloud.google.com/speech-to-text/

Azure Cognitive Services

Create apps, websites and bots with intelligent algorithms that see, hear, speak, understand and interpret your users' needs through natural methods of communication.

Developer's Resource: https://azure.microsoft.com/en-us/services/cognitive-services/


PaddlePaddle DeepSpeech

An open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on Baidu's Deep Speech 2 paper and built on the PaddlePaddle platform.

Developer's Resource: https://github.com/PaddlePaddle/DeepSpeech

100E Use Cases

  1. SCDF – Uses SpeechLab technology to support verbatim transcription of calls, so that call-takers can focus on listening, rather than on typing and transcribing into English, and can better understand the conversation. Transcripts can also be used for further analysis.
  2. Socibot – This AI demonstration platform will use SpeechLab technology, integrated with an Azure Cognitive Services knowledge base, to answer questions from Singaporeans more accurately. Socibot also uses Porcupine for wake word detection to reduce latency and improve the user experience.

Open Datasets

National Speech Corpus

Contains 2,000 hours of locally accented audio and text transcriptions.

Free Spoken Digit Dataset

A simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz.
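
A minimal sketch of inspecting one such recording with Python's standard `wave` module is shown here. It assumes file names of the form `<digit>_<speaker>_<index>.wav`; that naming scheme, the function name and the example path in the usage note are assumptions for illustration and are not stated on this page.

```python
import wave
from pathlib import Path


def describe_recording(path):
    """Return the label and basic audio properties of one WAV recording.

    Assumes the file name looks like <digit>_<speaker>_<index>.wav;
    this naming scheme is an assumption made for the sketch.
    """
    digit, speaker, index = Path(path).stem.split("_")
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()  # expected to be 8000 for 8 kHz audio
        n_frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "digit": int(digit),
        "speaker": speaker,
        "index": int(index),
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "duration_s": n_frames / sample_rate,
    }
```

For example, `describe_recording("recordings/7_jackson_32.wav")` (a hypothetical path) would return a dictionary with the digit label parsed from the file name and the sample rate read from the WAV header, which is a quick way to confirm that every file really is sampled at 8 kHz before training on it.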


LibriSpeech

A large-scale corpus of around 1,000 hours of English speech.

The Spoken Wikipedia Corpora

Corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia.


TIMIT

A collection of recordings of 630 speakers of American English.

Google Audioset

Large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos.

Related Articles

  1. How to start with Kaldi and Speech Recognition
    Link to article: https://towardsdatascience.com/how-to-start-with-kaldi-and-speech-recognition-a9b7670ffff6
  2. Simple guide to Kaldi – an efficient open source speech recognition tool for extreme beginners
    Link to article: https://medium.com/@nikhilamunipalli/simple-guide-to-kaldi-an-efficient-open-source-speech-recognition-tool-for-extreme-beginners-98a48bb34756
  3. Creating voice assistant for games tutorial for Fifa
    Link to article: https://towardsdatascience.com/creating-voice-assistant-for-games-tutorial-for-fifa-71cfbe428bd1