Building up a sentences database for voice assistant tasks #212

fquirin · 2022-11-12T12:01:28Z

fquirin
Nov 12, 2022
Maintainer

Hello fellow open-source voice assistant creators and enthusiasts 😃 ,

Open-source speech recognition has come a long way, but even with the latest "production-ready" systems, real-time transcription quality degrades quickly if your microphone isn't the best, if your environment is noisy, if you stand more than 2 m away from the microphone (smart speaker), or if you work in a speech domain that was not prominent in the training data. Unfortunately, the latter is often the case, even for something as obvious as "voice assistants". A simple example: "set a timer" becomes "set a time".

Since v2.7.0, the SEPIA client app and smart-services can set an active "task" parameter that can be used (among other things) to dynamically switch speech recognition models. This is very useful if your service belongs to a domain that is typically hard to handle because of a very large vocabulary, like navigation (thousands of street and city names) or music (thousands of artists, titles and genres).
For example, in the default mode the client could use the "general" speech recognition model, optimized for voice assistant input (command & control etc.). Then, if you say "play some music", it will switch to the "music" task, activate the optimized model and ask the user "what do you want to hear?". Once the service is finished, it switches back to the general model.
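
The switching logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SEPIA's actual implementation: the model names, the `TASK_MODELS` table and the `SttSession` class are all made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from task name to STT model name (illustrative only).
TASK_MODELS = {
    "music": "model-music",
    "navigation": "model-navigation",
}
DEFAULT_MODEL = "model-general"


@dataclass
class SttSession:
    """Tracks the active task and derives the model to use from it."""

    active_task: Optional[str] = None

    @property
    def model(self) -> str:
        # No task, or a task without a dedicated model, falls back to the general model.
        return TASK_MODELS.get(self.active_task, DEFAULT_MODEL)

    def start_task(self, task: str) -> str:
        self.active_task = task
        return self.model

    def finish_task(self) -> str:
        self.active_task = None
        return self.model
```

A service would call `start_task("music")` when it takes over the dialog and `finish_task()` when it is done, so the client always knows which model to load next.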

To make this work, I'm planning to train a handful of new language models for the SEPIA STT-Server, and this will require building up a larger database of sentences.
Since this list of sentences could be equally useful for training NLU models, I thought about adding additional info to each sentence, like intent and maybe even parameters (entities, actions, variables, ... whatever you want to call them ;-)).
The existing SEPIA sentences are very limited, since most of the NLU is rules-based and the format hasn't really aged well 🙈 , so I think it is best to start from scratch and develop the new system hand-in-hand with a new NLU module for the SEPIA pipeline (complementing the existing ones).

In this thread I'd like to discuss ideas about the file format, how to store, use and expand the data. Here are some ideas:

- I've created a new repository to store the data: https://github.com/SEPIA-Framework/sepia-training-data
- The file format could be a subset of the JSpeech Grammar Format (JSGF) (keep it simple)
- Data should be stored in text files, organized somehow by language and task
- Tasks I currently use: assistant, yes_or_no, smart_home, control, time, schedule, numbers, math, web_search, news_search, music, navigation, conversation, translation
- There should be scripts to expand the JSGF data into a list of sentences for language model training
- A web UI to edit the files would be nice 😎
- I can support 2 languages myself (German and English), but for other languages I need some help. Translations should be relatively easy, though, once one dataset exists
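
To make the "expand into a list of sentences" idea concrete, here is a minimal sketch of such a script. It assumes a tiny JSGF-like subset with only `[optional]` parts and `(a | b)` alternatives; the function name `expand` and the exact subset are assumptions for illustration, not a format that has been decided on.

```python
import re
from itertools import product

# Brackets, parentheses and '|' are single tokens; everything else is a word.
TOKEN = re.compile(r"[\[\]()|]|[^\s\[\]()|]+")


def expand(template):
    """Expand a template with [optional] and (a | b) groups into sentences."""
    tokens = TOKEN.findall(template)
    pos = 0

    def parse_alternatives(closing):
        # alternatives := sequence ('|' sequence)*
        nonlocal pos
        results = parse_sequence(closing)
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            results += parse_sequence(closing)
        return results

    def parse_sequence(closing):
        # sequence := (word | '[' alternatives ']' | '(' alternatives ')')*
        nonlocal pos
        results = [[]]  # every entry is one partial sentence (list of words)
        while pos < len(tokens) and tokens[pos] not in ("|", closing):
            tok = tokens[pos]
            pos += 1
            if tok == "[":
                options = parse_alternatives("]") + [[]]  # [] = "left out"
                pos += 1  # skip the closing ']'
            elif tok == "(":
                options = parse_alternatives(")")
                pos += 1  # skip the closing ')'
            else:
                options = [[tok]]
            results = [r + o for r, o in product(results, options)]
        return results

    sentences = [" ".join(words) for words in parse_alternatives(None)]
    return list(dict.fromkeys(sentences))  # drop duplicates, keep order
```

For example, `expand("(show me | open) [the] news")` yields the four sentences "show me the news", "show me news", "open the news" and "open news". The sketch does no error handling for unbalanced brackets.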

I hope this data will become useful to more open-source projects, so questions, ideas, contributions are very welcome! 🙂

@synesthesiam I hope this will be interesting for Rhasspy as well. Maybe you already have a corpus to start from? 😁
@thorstenMueller It would be awesome to have you on board for this 😃. Maybe there is some data in the OpenVoice-Tech Wiki?

I'll start adding some examples and suggestions to the mentioned repository in the next few days.

thorstenMueller · 2022-11-12T16:25:10Z

thorstenMueller
Nov 12, 2022

My first thought was to take a look at the Common Voice text corpus samples, but if it should be geared more towards a smart-speaker use case, it may be worth checking the Mycroft Skills. There are texts (including templates/placeholders) available for many languages.
https://github.com/MycroftAI/mycroft-skills

1 reply

fquirin Nov 13, 2022
Maintainer Author

I'll have a look at what I can gather from there 👍

synesthesiam · 2022-11-12T18:29:51Z

synesthesiam
Nov 12, 2022

Rhasspy's template language is a subset of JSGF, with a few extras to make life easier 🙂
The main highlights are:

- Sentence templates are grouped by intent
- Words and groups of words can have substitutions, like foo:bar, where you say "foo" but "bar" comes out in the transcript (great for normalization)
- JSGF tags are used to identify entities, e.g. (living room lamp){light_name}
- External lists of words and phrases can be inserted with $listName
  - Each line in the list is actually a sentence template too! So you can do substitutions, and even reference other $lists.
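
As a small illustration of the substitution idea (foo:bar), the following sketch splits a template into what is spoken and what should appear in the transcript. It only handles single-word `spoken:output` tokens and is not Rhasspy's parser; the function name `split_substitutions` is made up for this example.

```python
def split_substitutions(template):
    """Return (spoken, transcript) for a template with word-level 'spoken:output' tokens."""
    spoken, transcript = [], []
    for token in template.split():
        said, has_substitution, output = token.partition(":")
        spoken.append(said)
        # "fifty:50" -> say "fifty", emit "50"; a plain word is emitted unchanged.
        transcript.append(output if has_substitution else said)
    return " ".join(spoken), " ".join(transcript)
```

With this, `split_substitutions("set brightness to fifty:50 percent")` returns the spoken form "set brightness to fifty percent" for the ASR side and "set brightness to 50 percent" for the NLU side.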

I think the sentence templates and the word lists could be grouped by task/language with a directory structure like <task>/<language>/sentences.ini and <task>/<language>/lists/mylist
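
A loader for such a layout could look roughly like the sketch below. It assumes exactly the `<task>/<language>/sentences.ini` structure suggested above; the function name `find_sentence_files` is hypothetical.

```python
from pathlib import Path


def find_sentence_files(root):
    """Map (task, language) to the sentences.ini found under root/<task>/<language>/."""
    found = {}
    for path in sorted(Path(root).glob("*/*/sentences.ini")):
        language_dir = path.parent
        task_dir = language_dir.parent
        found[(task_dir.name, language_dir.name)] = path
    return found
```

Keying the result by `(task, language)` makes it easy to pick, for example, all German files or all files of the "music" task when training a model.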

@fquirin A useful feature of the Rhasspy Template Language for your new language models: I have tools that will convert a set of sentence templates and word lists directly into n-gram counts for use with KenLM, etc. I emphasize directly because there is no need for a step where all possible text strings are generated, which could be millions. This lets you compactly describe a corpus, since you can do things like "set brightness to 0..100 percent" and get the appropriate n-gram counts for all 101 sentences without expanding them.
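
To show why no expansion step is needed, here is a simplified sketch that counts bigrams for a template made of consecutive "slots", where each slot is a list of single-word alternatives. This is a much more restricted model than a real template graph and is not Rhasspy's tooling; the function names are made up. The brute-force variant is included only to check the shortcut.

```python
from collections import Counter
from itertools import product
from math import prod


def bigram_counts(slots):
    """Count bigrams over all sentences of a slot template without enumerating them.

    `slots` is a list of non-empty lists of single-word alternatives.
    """
    total = prod(len(slot) for slot in slots)  # number of sentences described
    counts = Counter()
    for left, right in zip(slots, slots[1:]):
        # Each (a, b) pair occurs once for every combination of the other slots.
        share = total // (len(left) * len(right))
        for a in left:
            for b in right:
                counts[(a, b)] += share
    return counts


def bigram_counts_brute_force(slots):
    """Reference implementation that really expands every sentence."""
    counts = Counter()
    for sentence in product(*slots):
        counts.update(zip(sentence, sentence[1:]))
    return counts
```

For `[["set"], ["brightness"], ["to"], [str(n) for n in range(101)], ["percent"]]` both functions give the same counts, but the first one only loops over neighbouring slots instead of over every sentence.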

@thorstenMueller @fquirin One feature I'd like to talk about is the need to modify certain words in a sentence due to the language's grammar constraints. German's gender is a good example, where the determiner changes depending on the noun's gender (der/die/das).

In English, I can write a template like "turn on the $lightNames" and it will work for any noun in $lightNames. But replacing "the" with (der | die | das) in a German sentence template doesn't work, because the correct form depends on the gender of each name in $lightNames. I think it would be possible to guess the gender (or use a look up table), but the sentence template would need some new syntax that means "replace with der/die/das depending on the gender of the next noun". For example (@det $lightNames) might guess/look up the correct determiner for each value in $lightNames.
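
A look-up-table version of the (@det $lightNames) idea could be sketched as follows. The tiny lexicon, the suffix-based guess for compounds and the function names are all assumptions for illustration. Note that the grammatical case matters too: a command like "schalte ... ein" needs the accusative, where the masculine determiner is "den" rather than "der".

```python
# Hypothetical mini-lexicon: noun (lower-case) -> grammatical gender.
GENDER = {"lampe": "f", "fernseher": "m", "licht": "n"}

DETERMINERS = {
    "nominative": {"m": "der", "f": "die", "n": "das"},
    "accusative": {"m": "den", "f": "die", "n": "das"},
}


def lookup_gender(noun):
    """Look up the gender of a noun, matching by suffix to cover compounds."""
    lowered = noun.lower()
    # German compounds take the gender of their last component ("Stehlampe" -> "Lampe").
    for known, gender in GENDER.items():
        if lowered.endswith(known):
            return gender
    raise KeyError(f"unknown gender for {noun!r}")


def expand_det(template, nouns, case="accusative", placeholder="(@det $lightNames)"):
    """Replace the placeholder with 'determiner + noun' for every noun."""
    sentences = []
    for noun in nouns:
        determiner = DETERMINERS[case][lookup_gender(noun)]
        sentences.append(template.replace(placeholder, f"{determiner} {noun}"))
    return sentences
```

Suffix matching is a crude heuristic and can misfire on words that merely end in the same letters, so a real implementation would need a proper lexicon or morphological analysis.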

I'm trying to imagine if this kind of thing could be generalized to help solve some of the date and time issues that @fquirin has mentioned in the past.

6 replies

fquirin Nov 13, 2022
Maintainer Author

@synesthesiam Did you write the JSGF parser with extra features yourself? I'm thinking about ways to implement this in Java 🤔

synesthesiam Nov 16, 2022

I did write the JSGF parser myself. I tried once to implement it with ANTLR, but it was so slow it was basically unusable. Maybe I didn't do it right :/ I tried again with pyparsing, but ran into problems with recursive groupings.

fquirin Nov 17, 2022
Maintainer Author

I'm thinking about translating JSGF to regular expressions, to make use of the internal Java optimizations for quick search.
Any tips on a fast way to compare a sentence against a list of JSGF rules? :-)
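
The regular-expression idea can be sketched like this (in Python here for brevity; the same translation should carry over to other regex engines). It assumes a small subset with only `[optional]` parts and `(a | b)` alternatives, and all function names are made up. The trick is that every word also consumes the whitespace after it, so optional words do not leave stray spaces behind.

```python
import re

# Brackets, parentheses and '|' are single tokens; everything else is a word.
TOKEN = re.compile(r"[\[\]()|]|[^\s\[\]()|]+")


def template_to_regex(template):
    """Translate a template with [optional] and (a | b) groups into a compiled regex."""
    parts = []
    for tok in TOKEN.findall(template):
        if tok in ("(", "["):
            parts.append("(?:")
        elif tok == ")":
            parts.append(")")
        elif tok == "]":
            parts.append(")?")
        elif tok == "|":
            parts.append("|")
        else:
            # Each word swallows its trailing whitespace.
            parts.append(re.escape(tok) + r"\s+")
    return re.compile("".join(parts))


def matches(pattern, sentence):
    # Normalize whitespace and append one space so the last word can match too.
    return pattern.fullmatch(" ".join(sentence.split()) + " ") is not None


def first_match(rules, sentence):
    """Return the name of the first rule that matches, or None."""
    for name, pattern in rules.items():
        if matches(pattern, sentence):
            return name
    return None
```

Compiling all rules once and then looping over them with `first_match` is the simplest way to compare a sentence against a list of rules; whether it is fast enough for a large rule set would have to be measured.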

synesthesiam Nov 17, 2022

I wrote a JSGF sentence generator in Java years ago using code from Sphinx: https://github.com/synesthesiam/jsgf-gen
Transforming that into a search instead of generator wouldn't be too difficult, and it could even be parallelized across rules with some tweaking.

fquirin Nov 17, 2022
Maintainer Author

Oh nice. I used Sphinx in ILA way back, but was looking for modern, lightweight, standalone JSGF Java libs a few weeks ago. Unfortunately I did not find yours but only this (which seemed a bit experimental).
I'll check it out and see if I can build on top of it 🙂 .

synesthesiam · 2022-11-16T02:30:02Z

synesthesiam
Nov 16, 2022

@fquirin Have you ever looked into VoiceXML?

7 replies

synesthesiam Nov 17, 2022

I see some similarities to AIML, but VoiceXML looked to me like it brought together a bunch of W3C standards, like JSGF, SSML, XML lexicons, and even the state charts thingy. I've debated whether VoiceXML would be worth pursuing to increase accessibility, but it seems like a ton of work.

Your parsers remind me of some of the work done in Lingua Franca. I like the library overall, but I feel like we should try and move this stuff out into Rust or C/C++ to make it usable in more languages.

fquirin Nov 17, 2022
Maintainer Author

"ton of work" yeah, that's what I thought too 🙈. It could be very useful though ...

> Your parsers remind me of some of the work done in Lingua Franca

Indeed. Tbh I didn't know they had this ^^. Anyway, as you said, it should be rewritten in C++/Rust or something to be able to wrap it for Python, Java, Node.js ... and gain some speed 🙂. Unfortunately I don't really know C++ or Rust (yet).

fquirin Nov 17, 2022
Maintainer Author

Btw is Lingua Franca English only?

thorstenMueller Nov 17, 2022

> Btw is Lingua Franca English only?

According to here it's working in multiple languages.

fquirin Nov 17, 2022
Maintainer Author

Interesting, maybe I should check out what they did for German because these tasks are usually a major pain in the a** 😁

synesthesiam · 2022-12-15T21:15:22Z

synesthesiam
Dec 15, 2022

@fquirin @thorstenMueller Paulus from Home Assistant and I have started on a sentence database and file format. Would either of you be interested in helping out with German?

6 replies

fquirin Dec 16, 2022
Maintainer Author

Quick question, do you build these parts dynamically from user devices and rooms etc.?

expansion_rules:
  name: "[the] {name}"
  area: "[the] {area}"
...

I'm assuming in the end you put everything together into the JSGF grammar or maybe directly into a graph?

I thought about a new implementation for SEPIA to handle the grammar files and a possible approach could be to build a custom graph in Java. I think this is what you did in Rhasspy right?

synesthesiam Dec 16, 2022

I think this is pretty close to what we had discussed. The template language is similar to Rhasspy's, but simplified a bit and embedded in YAML. So things are grouped by language and intent. But the YAML files are all loaded into a single dictionary in the end, so they can be grouped however.

Yes, the {name} and {area} lists will be built dynamically by Home Assistant (or whatever host). We have some test lists that are used in the pytest scripts.

You should be able to easily parse the ANTLR grammar in Java. I created a custom listener that converts the ANTLR parse into Python objects, but it could just as easily become a graph like I did originally in Rhasspy.

synesthesiam Dec 16, 2022

I'm also writing a script to generate the possible sentences (given a list of names and areas), so that will be usable for training a machine learning NLU system 🙂
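
A minimal sketch of such a generator, assuming templates that only contain `{list}` slots (no optional parts) and that each slot name appears once per template. The function name `generate` and the example lists are made up and are not taken from the Home Assistant repository. Returning the slot values next to each sentence gives labelled data for NLU training.

```python
import re
from itertools import product

SLOT = re.compile(r"\{(\w+)\}")


def generate(template, lists):
    """Yield (sentence, slots) for every combination of list values."""
    slot_names = SLOT.findall(template)
    for values in product(*(lists[name] for name in slot_names)):
        slots = dict(zip(slot_names, values))
        yield template.format(**slots), slots
```

For example, `list(generate("turn on the {name} in the {area}", {"name": ["lamp", "fan"], "area": ["kitchen"]}))` produces two sentences, each paired with its `name` and `area` values.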

fquirin Dec 17, 2022
Maintainer Author

Sounds awesome so far. I'm currently developing some concepts for the new SEPIA NLU module and will try to implement a reader for HA files as well :-).

Just wanted to fork your HA repository but it seems the community was faster 😁 home-assistant/intents#13 .
Translations look ok to me. Reading the sentences I realized that there will probably be a lot of new vocabulary for the ASR system (e.g. "Schlafzimmerlampe" etc.). With the right scripts and data for G2P (or anything similar) it should be possible to automate the process, but it is something I often failed to do because of incompatible phoneme sets or buggy results :-(

fquirin Dec 18, 2022
Maintainer Author

Some more thoughts:
How do you plan to handle arbitrary numbers? 😅 . I think to handle "all" numbers you have to add them as text to be able to combine things like "one hundred eighty two" to 182, but in your grammar you'll likely want the numbers not the words. Do you plan to expand the numbers for the ASR LM?
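
For illustration, combining English number words into a value can be sketched like this. It is a simplified example that covers numbers below one million and does not validate word order, so it is not a complete solution to the problem raised above.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(text):
    """Convert English number words (e.g. 'one hundred eighty two') to an int."""
    total = 0    # finished groups (thousands)
    current = 0  # group currently being built (0..999)
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100   # bare "hundred" counts as 100
        elif word == "thousand":
            total += max(current, 1) * 1000   # bare "thousand" counts as 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current
```

Here `words_to_number("one hundred eighty two")` returns 182, which is the kind of normalization the grammar would need between the ASR output (words) and the slot value (digits).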

fquirin · 2022-12-30T16:33:15Z

fquirin
Dec 30, 2022
Maintainer Author

Hi @synesthesiam,

I've had some time to work on the Rhasspy sentences.ini support (still in planning phase) and was wondering if the format supports multiple optionals like:

([show me] | [open]) [the] news

or maybe:

[(show me | open)] [the] news

This is something I use a lot in my regular expressions module.

6 replies

fquirin Jan 12, 2023
Maintainer Author

Top 👍

Something else I was wondering. How do you expand {0..100}? From your docs I read that you use num2words to get the words for numbers, but in this case you would need to translate it to (zero | one | two | three | ... | ninety nine | one hundred | hundred) or something for the ASR system, which seems very inefficient. Do you actually apply this restriction to the LM or simply accept all numbers and add some kind of placeholder for training? (Not sure how this would affect accuracy)

synesthesiam Jan 12, 2023

I actually generate the entire list of words, but they're cached. Unless you use a lot of unique number ranges, it ends up being pretty efficient (and even then, you can re-use shared segments).
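
The "generate once, then cache" approach can be illustrated with a small sketch. It is a hand-written English example for 0 to 100 using `functools.lru_cache`, not Rhasspy's actual code; the function names are made up.

```python
from functools import lru_cache

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer between 0 and 100 in English."""
    if not 0 <= n <= 100:
        raise ValueError("only 0..100 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n == 100:
        return "one hundred"
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


@lru_cache(maxsize=None)
def range_to_alternatives(start, stop):
    """Build the '(zero | one | ...)' alternative group for start..stop (inclusive)."""
    return "(" + " | ".join(number_to_words(n) for n in range(start, stop + 1)) + ")"
```

Because the result is cached per `(start, stop)` pair, a range such as 0..100 that appears in many templates is only spelled out once.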

For Kaldi specifically, it's also possible to use GrammarFSTs to make this very efficient. You produce the FST of the number range separately, and then just reference its name in the intent grammar. Kaldi expands this at runtime, so you can alter the number range FST without recompiling the whole thing.

fquirin Jan 13, 2023
Maintainer Author

Sounds good :-)

> For Kaldi specifically, it's also possible to use GrammarFSTs to make this very efficient

This will generate an LM that strictly follows the rules, right? I usually prefer LMs with reduced vocabulary but flexible paths, at least as the default model that handles the initial input, since SEPIA's NLU does not only use fixed phrases but also a lot of rules with open vocabulary.

synesthesiam Jan 14, 2023

Right, that would only be for the strict grammar. Have you ever come across a system that has a flexible grammar but lets you have placeholders like <NUMBER> that are defined externally?

fquirin Jan 15, 2023
Maintainer Author

Unfortunately not :-( To me it looks like dynamic LMs are not explored very well in the open-source world yet. In Vosk you can have models with ad-hoc vocabulary restriction and in Coqui you can boost words, but in practice it hasn't been very useful yet, especially since the words have to be part of the model initially.
