Project Ears #169
StuartIanNaylor started this conversation in Ideas
-
Interesting project! 🙂 So in your current concept you are planning to use an ESP32-S3 with microphones on board? Some kind of special version?
-
Hopefully you are all ears :)
I am going to be starting on my little project Ears, aiming first at the ESP32-S3, but with the same system likely for a Pi Zero 2, maybe even a Zero.
So what is Ears? Basically a very simple, interoperable client/server KWS system that, to an ASR, looks essentially like a local KWS: detected audio plays back through a loopback (snd-aloop) sink and ends up as a normal capture source, or if queued, is just dropped as files.
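To make the loopback idea concrete, here is a minimal sketch, not the project's actual code: it assumes the snd-aloop kernel module is loaded (modprobe snd-aloop) and simply pipes received PCM into the playback half of the loopback via aplay, so anything capturing the other half (hw:Loopback,1,0) sees a normal source.

```python
# Minimal sketch: feed received KW audio into an ALSA loopback so a
# downstream ASR sees it as an ordinary capture source.
# Assumes snd-aloop is loaded and the playback side is hw:Loopback,0,0.
import subprocess

def open_loopback_sink(rate=16000):
    """Open an aplay process writing raw 16-bit mono PCM to the loopback."""
    return subprocess.Popen(
        ["aplay", "-D", "hw:Loopback,0,0", "-t", "raw",
         "-f", "S16_LE", "-r", str(rate), "-c", "1", "-"],
        stdin=subprocess.PIPE,
    )

sink = open_loopback_sink()

def play_chunk(pcm_bytes: bytes):
    # Whatever arrives from the client after a keyword hit goes straight
    # to the loopback; an ASR capturing hw:Loopback,1,0 receives it.
    sink.stdin.write(pcm_bytes)
    sink.stdin.flush()
```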
It connects via WebSockets, due to the ease of telling binary frames from text headers, and has a very simple protocol of just a couple of control messages.
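Something like the following sketch, using recent versions of the third-party Python websockets package; the message names here are purely illustrative, not the actual Ears protocol:

```python
# Sketch of a text-control / binary-audio split over one WebSocket.
import asyncio
import websockets

async def handle(ws):
    async for message in ws:
        if isinstance(message, str):
            # Text frame: one of a couple of simple control messages.
            if message == "hello":
                await ws.send("ready")   # acknowledge the client
            elif message == "bye":
                break
        else:
            # Binary frame: raw PCM streamed after a keyword hit; forward
            # it on (e.g. to the loopback sink or a file queue).
            print(f"got {len(message)} audio bytes")

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()   # run forever

asyncio.run(main())
```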
Audio out is a completely different thing and is left to choice: run whatever audio RTP system you like, be it AirPlay, Snapcast or Squeezelite; the services run in complete isolation.
It goes old school to what Linux is, a file system, as that is very transparent, has a huge upstream of support, is extensible through enterprise-proven NFS/Samba, and works needing zero knowledge of the ASR->Skill pipeline further downstream.
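As an illustration of that file-drop approach (the paths and naming are hypothetical, not a defined layout): queued captures land as timestamped WAVs in a directory that could just as well be an NFS or Samba share, and the downstream side only has to watch it.

```python
# Sketch of a file-based capture queue over a plain directory/share.
import time
import wave
from pathlib import Path

QUEUE_DIR = Path("/srv/ears/queue")  # hypothetical share path

def drop_capture(pcm_bytes: bytes, rate: int = 16000) -> Path:
    """Write one capture as a WAV file named by timestamp."""
    QUEUE_DIR.mkdir(parents=True, exist_ok=True)
    path = QUEUE_DIR / f"{time.time():.3f}.wav"
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm_bytes)
    return path

def poll_queue():
    """Downstream side: consume queued captures oldest first, then delete."""
    for path in sorted(QUEUE_DIR.glob("*.wav")):
        yield path  # hand to the ASR, then...
        path.unlink()
```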
As per usual I don't really give a damn about branding or ownership, and it's called Ears because that is a term I have been using of late to deliberately distance it from satellite bloat.
It will also use on-server training to ship models out OTA, and I am hoping at some stage to get a targeted speech extraction model running server side; for KWS there isn't such a need, but the same goes wherever intensive DSP processing can be shared centrally.
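One plausible shape for the OTA part, sketched under assumed names (the server URL, endpoints and file paths are all hypothetical, not anything the project has defined): the client polls a version tag and re-downloads the model when it changes.

```python
# Hypothetical OTA model update: poll a version tag, fetch on change.
import urllib.request
from pathlib import Path

SERVER = "http://ears-server.local:8000"   # hypothetical server
MODEL_PATH = Path("kws_model.tflite")
VERSION_PATH = Path("kws_model.version")

def update_model() -> bool:
    """Return True if a new model was downloaded and should be reloaded."""
    latest = urllib.request.urlopen(f"{SERVER}/model.version").read().decode().strip()
    current = VERSION_PATH.read_text().strip() if VERSION_PATH.exists() else ""
    if latest != current:
        MODEL_PATH.write_bytes(urllib.request.urlopen(f"{SERVER}/model.tflite").read())
        VERSION_PATH.write_text(latest)
        return True
    return False
```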
I have been really lucky, as the author of the microcontroller model I have been watching for some time, a DS-CNN, has done just about everything I need for a basis (a rough sketch of the architecture follows the links):
https://github.com/42io/dataset
https://github.com/42io/tflite_kws
https://github.com/42io/c_keyword_spotting
https://github.com/42io/esp32_kws
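For anyone unfamiliar with the model family: a DS-CNN is just a standard convolution followed by depthwise-separable blocks over an MFCC "image". A minimal Keras sketch, with layer counts and sizes that are illustrative rather than 42io's exact configuration:

```python
# Illustrative DS-CNN keyword spotter. Input is an MFCC "image":
# 49 frames x 13 coefficients for 1 s of 16 kHz audio.
import tensorflow as tf
from tensorflow.keras import layers

def ds_cnn(n_keywords: int, frames: int = 49, coeffs: int = 13) -> tf.keras.Model:
    inp = layers.Input(shape=(frames, coeffs, 1))
    # Standard convolution front end.
    x = layers.Conv2D(64, (10, 4), strides=(2, 2), padding="same", use_bias=False)(inp)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for _ in range(4):  # depthwise-separable blocks
        x = layers.DepthwiseConv2D((3, 3), padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(64, (1, 1), use_bias=False)(x)  # pointwise conv
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_keywords, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```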
Pretty damn amazing for me, as somehow we are attuned: every time I go back the author has added another element that makes the whole implementation so much easier, and in wonderfully optimized C too.
But I might just keep watching, as if 42io suddenly goes client/server I wouldn't be surprised :)
Having https://mlcommons.org/en/multilingual-spoken-words/ has just made KWS a doddle; it's just some simple dataset-production methods that probably need a bit of tweaking versus what is being done, but the MLCommons addition makes things so much easier, as I had given up on using an aligner myself to extract words.
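Roughly what that dataset production looks like against MSWC's per-word clip layout; the directory structure, word choices and counts below are illustrative assumptions, not a spec:

```python
# Sketch of KWS dataset production from MSWC-style per-word directories.
import random
from pathlib import Path

MSWC_DIR = Path("mswc/en/clips")   # hypothetical: clips/<word>/<clip files>
TARGETS = ["hey", "ears"]          # illustrative keyword choices

def build_splits(unknown_per_target: int = 2000):
    """Return per-keyword clip lists plus a random 'unknown' pool."""
    targets = {w: sorted((MSWC_DIR / w).iterdir()) for w in TARGETS}
    others = [p for d in MSWC_DIR.iterdir()
              if d.is_dir() and d.name not in TARGETS
              for p in d.iterdir()]
    n = min(len(others), unknown_per_target * len(TARGETS))
    return targets, random.sample(others, n)
```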
I was originally thinking of using MFCC as a codec, being very lossy but hugely compressive, but with many algorithms now doing targeted-speaker speech extraction based mainly on spectrograms, standard audio codecs are a better option.
For dataset storage, though, nothing loads as fast or is as small as a numpy .npz MFCC array, and it should be no problem to hold hours of dataset and user captures for on-server training.
A 1,102,660-clip dataset of 16 kHz mono 1 s WAVs is a whopping 35.4 GB and 306 hours in duration, whilst stored as a numpy .npz MFCC array it shrinks down to 2.6 GB, roughly a 13x reduction. That really is a 10x-size KW dataset, out of the normal range, but it gives an indication of compression ratios ideally suited to embedded, and it is exceptionally fast for training as it is already in NumPy MFCC format.
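A minimal sketch of that storage path, assuming librosa for feature extraction and 13 coefficients per frame (the project may well use different settings):

```python
# Convert a directory of 1 s 16 kHz mono WAVs to MFCCs and store the
# whole dataset as one compressed .npz.
from pathlib import Path
import numpy as np
import librosa

def wavs_to_npz(wav_dir: str, out_path: str, n_mfcc: int = 13):
    feats, names = [], []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        y, sr = librosa.load(wav, sr=16000, mono=True)
        # (n_mfcc, frames) float32 array, tiny compared with the raw PCM.
        feats.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).astype(np.float32))
        names.append(wav.stem)
    np.savez_compressed(out_path, x=np.stack(feats), names=np.array(names))

# Training-side load is a single fast call:
# data = np.load("dataset.npz"); x = data["x"]
```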
Anyway, still waiting for new hardware releases, and I will be checking 42io for new updates. There is absolutely no need to embed system or branding specifics into a KWS system; I guess someone at one stage tried to sell MS Office-only keyboards and mice that wouldn't work with anything else, and we probably don't hear much about those now, as you can guess what happens with those sorts of ideas.