• by vunderba on 1/22/2024, 4:11:32 AM

    All the major smart speaker manufacturers have plans to eventually back their services with LLMs.

    https://www.theverge.com/2023/9/20/23880764/amazon-ai-alexa-...

    I've already put one together for personal use on the ESP32-S3. The advantage of using something like this instead of a bare Raspberry Pi is that you get a basic far-field microphone, a screen, wake-word support, etc. out of the box, and then it's just a matter of wiring up the audio to go to Whisper for recognition, passing the transcript to a large language model (in my case, Mistral), generating TTS with Mycroft, and sending the audio back.
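    Roughly the shape of the server loop, as a sketch only (I'm assuming the openai-whisper Python package for recognition, an Ollama instance serving Mistral, and Mycroft's Mimic 3 CLI for TTS; swap in whatever stack you actually run):

      # server.py - sketch of the pipeline the ESP32 talks to.
      # Assumptions: openai-whisper for STT, Ollama serving Mistral
      # on localhost:11434, and Mycroft's Mimic 3 CLI ("mimic3",
      # default voice) for TTS.
      import subprocess
      import requests
      import whisper

      stt = whisper.load_model("base")  # load once at startup

      def handle_utterance(wav_path: str) -> bytes:
          # 1. Speech-to-text: transcribe the clip the ESP32 uploaded.
          text = stt.transcribe(wav_path)["text"].strip()

          # 2. LLM: send the transcript to Mistral via Ollama.
          resp = requests.post(
              "http://localhost:11434/api/generate",
              json={"model": "mistral", "prompt": text, "stream": False},
              timeout=60,
          )
          answer = resp.json()["response"]

          # 3. TTS: Mimic 3 writes WAV audio to stdout.
          wav = subprocess.run(
              ["mimic3", answer], capture_output=True, check=True
          ).stdout
          return wav  # ship back to the ESP32 for playback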

    The biggest annoyance is needing a dedicated server, since the ESP is simply not powerful enough to do real-time voice recognition and LLM inference.

    It's relatively easy to do - I had a workable prototype within a weekend.

  • by freddealmeida on 1/22/2024, 5:38:41 AM

    A number of companies have tried, or are still trying, to build this. I did myself, back in 2018: Alyc.ai, if you care to look. We had voice, vision, an LLM, domain-specific training (what would be called fine-tuning today, I suppose), emotion detection, gesture detection, pose estimation, action recognition (i.e. "guy is drinking a beer"), multiple stereoscopic cameras, a microphone mesh, an infrared camera (used for depth and night vision), and Nvidia Jetson chipsets to run models at the edge. Other models ran in the cloud. Our LLM was about 2B parameters, trained on about 800MB of Japanese data. The interface was a sophisticated Pepper's ghost (today I would use light field displays), but it was fun. We didn't have RLHF feedback loops, so Alyc was a bit complex. Still, people loved her. COVID killed the project.