Project Ears #169
StuartIanNaylor started this conversation in Ideas
-
Interesting project! 🙂 So in your current concept you are planning to use an ESP32-S3 with microphones on board? Some kind of special version?
-
Hopefully you are all ears :)
I am going to be starting on my little project Ears, aiming first at the ESP32-S3, but with the same system likely for a Pi Zero 2, maybe even a Zero.
So what is Ears? Basically a very simple, interoperable client/server KWS system that, to an ASR, looks essentially like a local KWS: detected audio plays back through a loopback (snd-aloop) sink and ends up as a normal capture source, or if queued, is just dropped as files.
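To make the loopback idea concrete, here is a minimal sketch, not the project's actual code: it assumes the snd-aloop kernel module is loaded (modprobe snd-aloop) and simply pipes received PCM into the playback half of the loopback via aplay, so anything capturing the other half (hw:Loopback,1,0) sees a normal source.

```python
# Minimal sketch: feed received KW audio into an ALSA loopback so a
# downstream ASR sees it as an ordinary capture source.
# Assumes snd-aloop is loaded and the playback side is hw:Loopback,0,0.
import subprocess

def open_loopback_sink(rate=16000):
    """Open an aplay process writing raw 16-bit mono PCM to the loopback."""
    return subprocess.Popen(
        ["aplay", "-D", "hw:Loopback,0,0", "-t", "raw",
         "-f", "S16_LE", "-r", str(rate), "-c", "1", "-"],
        stdin=subprocess.PIPE,
    )

sink = open_loopback_sink()

def play_chunk(pcm_bytes: bytes):
    # Whatever arrives from the client after a keyword hit goes straight
    # to the loopback; an ASR capturing hw:Loopback,1,0 receives it.
    sink.stdin.write(pcm_bytes)
    sink.stdin.flush()
```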
It connects via WebSockets, due to the ease of telling binary frames from text headers, and has a very simple protocol of just a couple of control messages.
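Something like the following sketch, using recent versions of the third-party Python websockets package; the message names here are purely illustrative, not the actual Ears protocol:

```python
# Sketch of a text-control / binary-audio split over one WebSocket.
import asyncio
import websockets

async def handle(ws):
    async for message in ws:
        if isinstance(message, str):
            # Text frame: one of a couple of simple control messages.
            if message == "hello":
                await ws.send("ready")   # acknowledge the client
            elif message == "bye":
                break
        else:
            # Binary frame: raw PCM streamed after a keyword hit; forward
            # it on (e.g. to the loopback sink or a file queue).
            print(f"got {len(message)} audio bytes")

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()   # run forever

asyncio.run(main())
```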
Audio out is a completely different thing and is left to choice: run whatever audio RTP system you like, be it AirPlay, Snapcast or Squeezelite; the services run in complete isolation.
It goes old school to what Linux is, a file system, as that is very transparent, has a huge upstream of support, is extensible through enterprise-proven NFS/Samba, and works needing zero knowledge of the ASR->Skill pipeline further downstream.
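As an illustration of that file-drop approach (the paths and naming are hypothetical, not a defined layout): queued captures land as timestamped WAVs in a directory that could just as well be an NFS or Samba share, and the downstream side only has to watch it.

```python
# Sketch of a file-based capture queue over a plain directory/share.
import time
import wave
from pathlib import Path

QUEUE_DIR = Path("/srv/ears/queue")  # hypothetical share path

def drop_capture(pcm_bytes: bytes, rate: int = 16000) -> Path:
    """Write one capture as a WAV file named by timestamp."""
    QUEUE_DIR.mkdir(parents=True, exist_ok=True)
    path = QUEUE_DIR / f"{time.time():.3f}.wav"
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm_bytes)
    return path

def poll_queue():
    """Downstream side: consume queued captures oldest first, then delete."""
    for path in sorted(QUEUE_DIR.glob("*.wav")):
        yield path  # hand to the ASR, then...
        path.unlink()
```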
As per usual I don't really give a damn about branding or ownership, and it's called Ears because that is a term I have been using of late to deliberately distance it from satellite bloat.
It will also use on-server training to ship models out OTA, and I am hoping at some stage to get a targeted speech extraction model running server side; for KWS there isn't such a need, but the same goes wherever intensive DSP processing can be shared centrally.
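One plausible shape for the OTA part, sketched under assumed names (the server URL, endpoints and file paths are all hypothetical, not anything the project has defined): the client polls a version tag and re-downloads the model when it changes.

```python
# Hypothetical OTA model update: poll a version tag, fetch on change.
import urllib.request
from pathlib import Path

SERVER = "http://ears-server.local:8000"   # hypothetical server
MODEL_PATH = Path("kws_model.tflite")
VERSION_PATH = Path("kws_model.version")

def update_model() -> bool:
    """Return True if a new model was downloaded and should be reloaded."""
    latest = urllib.request.urlopen(f"{SERVER}/model.version").read().decode().strip()
    current = VERSION_PATH.read_text().strip() if VERSION_PATH.exists() else ""
    if latest != current:
        MODEL_PATH.write_bytes(urllib.request.urlopen(f"{SERVER}/model.tflite").read())
        VERSION_PATH.write_text(latest)
        return True
    return False
```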
I have been really lucky, as the author of the microcontroller model I have been watching for some time, a DS-CNN, has done just about everything I need for a basis (a rough sketch of the architecture follows the links):
https://github.com/42io/dataset
https://github.com/42io/tflite_kws
https://github.com/42io/c_keyword_spotting
https://github.com/42io/esp32_kws
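For anyone unfamiliar with the model family: a DS-CNN is just a standard convolution followed by depthwise-separable blocks over an MFCC "image". A minimal Keras sketch, with layer counts and sizes that are illustrative rather than 42io's exact configuration:

```python
# Illustrative DS-CNN keyword spotter. Input is an MFCC "image":
# 49 frames x 13 coefficients for 1 s of 16 kHz audio.
import tensorflow as tf
from tensorflow.keras import layers

def ds_cnn(n_keywords: int, frames: int = 49, coeffs: int = 13) -> tf.keras.Model:
    inp = layers.Input(shape=(frames, coeffs, 1))
    # Standard convolution front end.
    x = layers.Conv2D(64, (10, 4), strides=(2, 2), padding="same", use_bias=False)(inp)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for _ in range(4):  # depthwise-separable blocks
        x = layers.DepthwiseConv2D((3, 3), padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(64, (1, 1), use_bias=False)(x)  # pointwise conv
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_keywords, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```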
Pretty damn amazing for me, as somehow we are attuned: every time I go back the author has added another element that makes the whole implementation so much easier, and in wonderfully optimized C too.
But I might just keep watching, as if 42io suddenly goes client/server I wouldn't be surprised :)
Having https://mlcommons.org/en/multilingual-spoken-words/ has just made KWS a doddle; it's just some simple dataset-production methods that probably need a bit of tweaking versus what is being done, but the MLCommons addition makes things so much easier, as I had given up on using an aligner myself to extract words.
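Roughly what that dataset production looks like against MSWC's per-word clip layout; the directory structure, word choices and counts below are illustrative assumptions, not a spec:

```python
# Sketch of KWS dataset production from MSWC-style per-word directories.
import random
from pathlib import Path

MSWC_DIR = Path("mswc/en/clips")   # hypothetical: clips/<word>/<clip files>
TARGETS = ["hey", "ears"]          # illustrative keyword choices

def build_splits(unknown_per_target: int = 2000):
    """Return per-keyword clip lists plus a random 'unknown' pool."""
    targets = {w: sorted((MSWC_DIR / w).iterdir()) for w in TARGETS}
    others = [p for d in MSWC_DIR.iterdir()
              if d.is_dir() and d.name not in TARGETS
              for p in d.iterdir()]
    n = min(len(others), unknown_per_target * len(TARGETS))
    return targets, random.sample(others, n)
```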
I was originally thinking of using MFCC as a codec, being very lossy but hugely compressive, but with many algorithms now doing targeted-speaker speech extraction based mainly on spectrograms, standard audio codecs are a better option.
For dataset storage, though, nothing loads as fast or is as small as a numpy .npz MFCC array, and it should be no problem to hold hours of dataset and user captures for on-server training.
A 1,102,660-clip dataset of 16 kHz mono 1 s WAVs is a whopping 35.4 GB and 306 hours in duration, whilst stored as a numpy .npz MFCC array it shrinks down to 2.6 GB, roughly a 13x reduction. That really is a 10x-size KW dataset, out of the normal range, but it gives an indication of compression ratios ideally suited to embedded, and it is exceptionally fast for training as it is already in NumPy MFCC format.
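A minimal sketch of that storage path, assuming librosa for feature extraction and 13 coefficients per frame (the project may well use different settings):

```python
# Convert a directory of 1 s 16 kHz mono WAVs to MFCCs and store the
# whole dataset as one compressed .npz.
from pathlib import Path
import numpy as np
import librosa

def wavs_to_npz(wav_dir: str, out_path: str, n_mfcc: int = 13):
    feats, names = [], []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        y, sr = librosa.load(wav, sr=16000, mono=True)
        # (n_mfcc, frames) float32 array, tiny compared with the raw PCM.
        feats.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).astype(np.float32))
        names.append(wav.stem)
    np.savez_compressed(out_path, x=np.stack(feats), names=np.array(names))

# Training-side load is a single fast call:
# data = np.load("dataset.npz"); x = data["x"]
```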
Anyway, still waiting for new hardware releases, and I will be checking 42io for new updates. There is absolutely no need to embed system or branding specifics into a KWS system; I guess someone at one stage tried to sell MS Office-only keyboards and mice that wouldn't work with anything else, and we probably don't hear much about those now, as you can guess what happens with those sorts of ideas.