If you're less concerned about privacy, I use Gemini 2.5 Flash for this and it's exceptionally good and fast as a HA assistant while being much cheaper than the electricity that would be needed to keep a 3090 awake.
The thing that kills this for me (and they even mentioned it) is wake word detection. I have both the HA voice preview and FPH Satellite1 devices, plus have experimented with a few other options like a Raspberry Pi with a conference mic.
Somehow nothing is even 50% as good as my Echo devices at picking up the wake word. The assistant itself is far better, but that doesn't matter if it takes 2-3 tries to get it to listen to you. If someone solves this problem with open hardware I'll be immediately buying several.
On the plus side, mine misdetected a wake word during a funny conversation and said "Sorry, I can't find any area called _____[60 second repeat of funny conversation]___" and it made my family laugh harder than we've laughed in a really long time. I even went into the tts cache and saved the wav b/c it was sooo funny.
Ha, I had something similar happen as well that had us rolling. I think the hilarity was a result of the conversation snippet being taken completely out of context by the recording. Wish I'd saved the wav, I didn't even think of that :-(
How about a button?
I'd prefer physically pressing a button on an intercom box to having something churning away constantly processing sound.
If I have to go to a thing and push a button, I'd rather the button do the thing I wanted in the first place. Voice assistants are for when my hands are full or I don't want to get up. (I wrote more about my home automation philosophy in another comment[1]).
Also I have all my voice assistant devices mounted to the ceiling
What if you have two things? You'd then need two buttons.
The push button is a perfectly viable option, it just needs to be in a form factor that works. Could be as simple as a tiny low-energy Bluetooth board with a coin battery that will last several months.
The pebble index seems like the optimal form for this.
Could be pressed even if your hands were busy.
Most of what I (and in my experience many people) want a voice assistant for, is setting+ending timers... which for me happens mostly in the kitchen, while I'm simultaneously holding a hot pan or hand-tossing a salad or paper-towelling off some raw chicken. In none of those cases would I want a ring anywhere near my hands, let alone a smart ring. (And nor, in half of those cases, is it convenient/hygienic to use my oven timer.)
That being said, we could solve for fully 50% of in-home voice-assistant use-cases just by developing an extremely domain-specific voice assistant that has an extremely small (ideally burned-into-a-DSP) voice model that only knows how to recognize commands to manage kitchen timers. If such a device existed, and was cheap enough that you could assume anyone who wanted this functionality would just buy one, then this would make truly hands-free activation of a "real" voice-assistant much less necessary, as there'd be far fewer user-stories that would really "need" that. The rest of those user-stories really mostly could work with some kind of ring / belt buckle / shirt comm badge / etc.
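To make the point concrete, the command space such a timer-only device would need is tiny. A sketch of the text-level grammar (the patterns below are made up for illustration; a real DSP device would spot these phrases acoustically, but the domain is the same: set, cancel, query):

```python
import re

def parse_timer_command(utterance: str):
    """Toy grammar for a kitchen-timer-only assistant (hypothetical)."""
    text = utterance.lower()
    # "set a 5 minute timer" / "start a 10 minutes timer"
    m = re.search(r"(?:set|start)\s+a?\s*(\d+)\s*minutes?\s+timer", text)
    if m:
        return ("set", int(m.group(1)) * 60)   # seconds
    # "cancel the timer" / "stop timer"
    if re.search(r"(?:cancel|stop)\s+(?:the\s+)?timer", text):
        return ("cancel", None)
    # "how long is left" / "how much time left"
    if re.search(r"how\s+(?:long|much time)", text):
        return ("query", None)
    return None  # out of domain: hand off to the "real" assistant
```

Anything that falls through to `None` is exactly the residue that could be deferred to a ring / badge / full assistant.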
If you want to relax some constraints, I made something similar for $10: https://www.stavros.io/posts/i-made-a-voice-note-taker/
Like a light switch?
Or do you mean a button that activates chunked recording, passes it to a speech-to-text model, forwards to an LLM to infer intent, which triggers HA to issue a command, over a wireless network, to the computer with the light attached, to tell the light to turn on.
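That chain, spelled out as code, really is this long (every function below is a hypothetical stub, not a real API):

```python
# Hypothetical stubs for the button-to-light chain described above.
def record_chunk() -> bytes:
    return b"...audio..."            # button press starts capture

def speech_to_text(audio: bytes) -> str:
    return "turn on the light"       # STT model

def infer_intent(text: str) -> dict:
    # LLM maps transcript to a structured intent
    return {"action": "turn_on", "entity": "light.kitchen"}

def send_to_ha(intent: dict) -> str:
    # HA forwards this over the network to the machine with the light
    return f"{intent['entity']} -> {intent['action']}"

def button_pressed():
    return send_to_ha(infer_intent(speech_to_text(record_chunk())))
```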
Rules out a bunch of cases where your hands are busy handling ingredients in the kitchen, etc
Put it at foot level and kick it.
I have a feeling beamforming microphone arrays might help here, something like this could improve the audio being processed substantially - https://www.minidsp.com/products/usb-audio-interface/uma-8-m....
That's a good call. I have a PS3(?) mic/camera that I was using when I was running the original Mycroft project on a Pi. I wonder if that would help with the inbuilt HA mic not waking for most of my family, most of the time. I will have to look at my VA Preview device and its specs later because I'm not sure if you can connect an external mic to it out-of-the-box.
Alexa devices have these (or used to at least), but Google Home's never did. So it shouldn't be necessary.
Yeah a small (ideally personalized) wakeword model would probably outperform just about any audio wizardry.
What's been surprising in my experience regarding the wake word is that it recognizes me (adult male) saying the wake word ~95% of the time. However, it only registers the rest of my family (women and children) ~30% of the time.
I have no firsthand knowledge, but I'd strongly bet that the Home Assistant effort to donate training data mostly gets adult males, and nearly zero children.
This was 2021 (so pre-llm), but I used to work for a company that gathered data for training voice commands (Alexa, Toyota, Sonos, were some clients). Basically, we paid people to read digital assistant scripts at scale.
Your assumptions about training data do not match the demographics of data I collected. The majority of what our work revolved around was getting diversity into the training data. We specifically recruited kids, older folks, women, people with accented/dialected English and just about every variety of speech that we could get our hands on. The companies we worked with were insanely methodical about ensuring that different people were included.
I remember when those systems first started collecting data, they were worried kids wouldn't be handled well, but they didn't know how to handle the privacy issues with recording kids, so they discouraged it. Women being missed is not a surprise in hindsight, but it wasn't anticipated.
Oh, I'm sure you're right. I've had people in my personal life (non-technical; "AI enthusiasts") laugh at me over concerns about training bias but this is likely a real world example of it.
I thought all people's voices had to be trained, and if you didn't go through it the match % was much lower.
With Siri this is true. I'm not positive on the others.
actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.
the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.
the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.
btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
Coqui TTS is actually deprecated; the company shut down. I have a voice assistant that uses gpt-5.4 and opus 4.6 via the subsidized Codex and Claude Code plans, and it uses STT and TTS from mlx-audio so those portions are locally hosted: https://github.com/Blaizzy/mlx-audio
Here are the models I found work well:
- Qwen ASR and TTS are really good. Qwen ASR is faster than OpenAI Whisper on Apple Silicon from my tests. And the TTS model has voice cloning support so you can give it any voice you want. Qwen ASR is my default.
- Chatterbox Turbo also does voice cloning TTS and is more efficient to run than Qwen TTS. Chatterbox Turbo is my default.
- Kitten TTS is good as a small model, better than Kokoro
- Soprano TTS is surprisingly really good for a small model, but it has glitches that prevent it from being my default
But overall the mlx-audio library makes it really easy to try different models and see which ones I like.
Do you know which HA integration I would use if I want to try out Qwen 3 ASR in HA? Some screenshots in the OP reference Qwen 3 ASR for STT but I can't seem to find any reference to which integration I'd use.
I've been working on the flip side of this with ASR models, but the problem space is the same: conversational/real-world data is needed. Whisper often mistook actual words I said and hallucinated all the time when I spoke technical jargon. The solution was to fine-tune Whisper with my own data. The hardest part, imo, was getting the actual data, which in turn got me to build listenr (https://github.com/rebreda/listenr). It's an always-on VAD-based audio dataset builder. Could it be used for building conversational/real-world voice datasets for TTS models too?
After getting it working I was motivated to build out the full fine-tuning pipeline. I wrote a little post about it all: https://quickthoughts.ca/posts/listenr-asr-training-data-pro...
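The always-on capture loop can be approximated with even a naive energy gate (a toy stand-in for a real VAD like WebRTC's or Silero, not what listenr actually uses):

```python
import math

def is_speech(frame, threshold=0.02):
    """Toy energy-based activity check on a frame of float samples.
    Real VADs use spectral features, but the loop shape is the same."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

def segment(frames, threshold=0.02):
    """Group consecutive active frames into utterance segments,
    i.e. the clips you'd keep for a training dataset."""
    segments, current = [], []
    for frame in frames:
        if is_speech(frame, threshold):
            current.append(frame)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```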
> actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.
I would argue that the hardest part is correctly recognizing that it's being addressed. 98% of my frustration with voice assistants is them not responding when spoken to. The other 2% is realizing I want them to stop talking.
80% of my home voice assistant requests really need no response other than an affirmative sound effect.
100% agree. I don't want a "Yes", "Got it", "Will do", or even worse, "I have turned on the Bedroom Light". I want a soft success ding or a low failure boop.
Talk back is how you make sure what you asked for is what happens.
An affirmative beep but the light does not turn on means you have to guess what did.
I turned on the new 'sassy' personality for Alexa. Now, if you ask it to "set a 5 minute alarm," half the time she'll go off on a short rant about how she must obviously not be good for anything but keeping track of time for us humans.
I haven't figured out how to set her personality to 'brief and succinct' for me, but 'sassy' for my wife.
Star Trek got it right: two beeps. "Low High" = yup, "High Low" = nope.
why would you want an audio notification for a light? it either turns on and it worked, or it doesn't turn on. i see no value in having a ding or anything of the kind
if i imagine constant dinging whenever i enter a room and the motion sensor toggles the light in it, i'd go mad
That's what Google Home does. "Hey, Google, good night". Beep response, then it turns off the lights, brings down the blinds, etc., but if something is out of whack it talks. I find it convenient.
One that I have been experimenting with is using analog phones (including rotary ones!) to act as the satellites. I live in an older home and have phone jacks in most of the rooms already so I only had to use a single analog telephone adapter. [0] The downside is I don't have wake word support, but it makes it more private and I don't find myself missing my smart speakers that much. At some point I would like to also support other types of calls on the phones, but for now I need to get an LLM hooked up to it.
[0] https://www.home-assistant.io/voice_control/worlds-most-priv...
I wish I was remotely closer to being this kind of hacker :(
I believe in you.
Do people like talking to voice assistants? I've used one occasionally (mostly for timers when I'm cooking), but most of the time it would be faster to just do it myself, and it feels much less awkward than talking to empty air, asking it to do things for me. It might be because I just really don't like making more noise than I have to.
(Yes, I appreciate that some people may be disabled in such a way that it makes sense to use voice assistants, eg motor problems)
I consider each time I need to pull out my phone and "do it myself" to be a failure of my smart home system.
If a light cannot be automatically on when I need it (like a motion sensor) or controlled with a dedicated button within arms reach (like a remote on my desk) then the third best option is one that lets me control it without interrupting what I'm doing, moving from where I am, using my hands, or possessing anything (a voice assistant).
Do you not just turn the light on when you go in a room, and turn it off again when you go out? All the rooms in my flat have switches next to the door
My lights adjust their brightness and color spectrum automatically throughout the day while also understanding the time of year and sun position. This alone is next level. All are voice/tablet controlled. When I start a movie at night, lights will adjust automatically in my open floor plan first level. All of this operates without me ever having to give any mental energy beyond the initial setup.
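That kind of schedule is mostly interpolation under the hood. A minimal sketch (the breakpoints below are invented; real setups such as HA's Adaptive Lighting integration derive the curve from actual sun elevation):

```python
# Hypothetical schedule: (hour of day, color temperature in Kelvin).
SCHEDULE = [(0, 2200), (7, 2700), (12, 5500), (18, 3500), (23, 2200)]

def color_temp(hour: float) -> float:
    """Linearly interpolate the schedule for the given hour."""
    for (h0, k0), (h1, k1) in zip(SCHEDULE, SCHEDULE[1:]):
        if h0 <= hour <= h1:
            return k0 + (k1 - k0) * (hour - h0) / (h1 - h0)
    return SCHEDULE[-1][1]  # late-night tail
```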
This is not just flip a switch territory.
Many homes have a bunch of lights with their own switches, like lamps. Also there are rooms with multiple entrances, like a living room with a bedroom on the other side from the front door entrance, which would involve walking to the side of the room with the switch, then walking back through a dark room after you turn it off. Being able to just get into bed and say "Alexa, turn off all of the lights" is way more convenient than checking 14 light switches around my home.
Yes, that would be a button within arms reach, something I explicitly prefer over the voice assistant. I use them frequently.
I don't have just one light per room though, some spaces like my workshop or living room have a lot of lighting options, and flitting around the room flipping a bunch of switches is clumsy and unnecessary. The preference is always towards automation (e.g. when I play a movie in Jellyfin, the lights dim) but there are situations where I just need to ask for the workbench light.
The Sun moves around, while I am in a room. It might be high up when I enter a room, but after a while there may be clouds or it may have set.
When watching a movie one may dim the lights. Once finished, one may need more light.
When going to bed I may want to switch all lights off. When getting up it may need some extra light.
A switch on the door is nice. More switches is better. Being able to control from anywhere may be even nicer.
Do you have a wife / kids? If so how do you "teach" them this?
My point being that it might be a failure to you but not them, some people don't want it.
This is my struggle, how to get the automation to do what I want without affecting everyone else equally. (And vise versa)
I use it frequently for reminders and calendar events when not at a computer, as voice is faster than the mobile interface (with so many screens) for setting something up
I guess most of my use is whilst driving, to start/stop music or audiobooks, change navigation etc. Although changing navigation through Siri is somewhat painful as it often gets my intended destination wrong lol.
I prefer voice strongly. I don't want to stop what i am doing, find a device, open the app, wait for it refresh, navigate and click to get Milk on a list. Sure you can bring this down a few steps, but all of which still require me to move, have a hand and eye free.
I'm still waiting for the promise of voice AI that was shown during the OpenAI demo in 2024 to somehow become real. It's not clear to me why there has been zero progress since then.
What tech can do vs applying it requires it often to be configured and packaged to be usable in that way.
It also needs to work at least 99% of the time, if not more. Not easy to do this with nondeterministic models.
If my lights and heat were 99% reliable, I'd be getting new lights and heat.
Not easy, but doable, especially if it's a local model that is converting inputs into decisions and commands.
Cloud hosted models definitely can not always be consistent, but it's where I'm learning that prompt durability is a thing.
Depending on the use case, it is possible.
It is an adjustment coming from deterministic software and adding non-deterministic software to it, which can be improved by the quality of language and input into it.
Their first version is most likely already 10x better than Siri.
> Understands when it is in a particular area and does not ask "which light?" when there is only one light in the area, but does correctly ask when there are multiple of the device type in the given area.
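The rule being quoted reduces to a small lookup: only ask a clarifying question when the current area genuinely has more than one matching device. A sketch (entity names are hypothetical):

```python
# Hypothetical area -> light-entity map, in the style of HA entity IDs.
DEVICES = {
    "office":  ["light.office_ceiling"],
    "kitchen": ["light.kitchen_ceiling", "light.kitchen_counter"],
}

def resolve_light(area: str):
    lights = DEVICES.get(area, [])
    if len(lights) == 1:
        return lights[0]              # unambiguous: just act
    if len(lights) > 1:
        return "ask: which light?"    # genuinely ambiguous: clarify
    return "ask: which area?"         # nothing matches here
```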
One of my favorite episodes:
I set 2 timers for the same thing somehow. I then tried to cancel one of them.
>"Siri, cancel the second timer"
"You have 2 timers running, would you like me to cancel one of them?"
>"Yes"
"Yes is an English rock band from the 70s…"
>"Siri, please cancel the timer with 2 minutes and 10 seconds on it"
"Would you like me to cancel the timer with 2 minutes and 8 seconds on it?"
>"Yes"
"Yes is an English rock band from the 70s…"
Eventually they both rang and she listened when I said stop.
My favorite is when I ask Siri to set a timer and get back "there are no timers running."
"Siri stop"
"There's nothing to stop"
> me, suddenly aware of how the AI takeover will happen
My other favorite is when I ask Siri to set a timer on my watch and it does a web search.
> "Stop" is a song by English girl group the Spice Girls from their second studio album, Spiceworld (1997).
At that point I would be very impressed if you could remember what the timers are for.
Helping my kid get ready for shower I had this exchange:
Me: "Text Jane Would you mind dropping down the robe and underpants"
Siri: Sends Jane "Would you mind dropping down"
Me: rolls eyes "Text Jane robe and underpants"
Siri: "I don't see a Jane Robe in your contacts."
Me: wishes I could drown Siri in the bathtub
It's wild to me that Apple pretty much 100% solved the actual speech-to-text part more than half a decade ago, yet struggles in 2026 to turn streams of very simple, correctly-transcribed text into intents in ways that even a local model can figure out. Siri is good STT, plus a bunch of serviceable APIs that can control lots of stuff, with the digital equivalent of a brain-damaged cat sitting at the center of it guaranteeing the worst possible experience.
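For contrast, even a handful of regexes over correct transcripts get the common cases right (a toy sketch with made-up intent names, nothing like Siri's actual stack):

```python
import re

# Toy intent router over already-correct transcripts, to show how little
# is needed once STT is solved. Intent names and slots are hypothetical.
RULES = [
    (re.compile(r"set a (\d+) minute timer"),      "timer.set"),
    (re.compile(r"cancel the (\w+) timer"),        "timer.cancel"),
    (re.compile(r"turn (on|off) the (\w+) light"), "light.toggle"),
]

def route(text: str):
    for pattern, intent in RULES:
        m = pattern.search(text.lower())
        if m:
            return intent, m.groups()
    return "fallback.web_search", ()   # the failure mode everyone knows
```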
It's wild how many of you have issues with Siri - and to be clear, I'm not here to discount those issues, and I very much believe all of the anecdotes here.
For me, Siri on either phone or watch is pretty much perfect - I don't ask for much, mostly timers or making reminders.
Google's Nest Minis, though? "Lights on" has a 50/50 shot of being a song of the same name, or a similar name, or a totally unrelated name. Same for "lights off". If I don't enunciate "play rain sounds" clearly enough, I get an album called "Rain Songs" that is very much NOT calming for bedtime. It doesn't help that none of these understand that if I whisper a command, it should respond quietly - honestly, it feels like the Siris and Nests and Alexas all got one iteration and then stopped.
I want more features but less LLM. I want more control and more predictability. E.g., if every night around 1am I say "play rain sounds", my god, just learn that I'm not, in all likelihood, asking to hear an album I've never listened to!
I bought a Home Assistant Voice Preview Edition to try out. It's surprisingly good, but still falls short when compared to Google Home speakers:
- Wake word detection isn't as good as the Google Homes (more false positives, more false negatives - so I can't just tune sensitivity).
- Mic and speakers are both of poor quality in comparison to Google Home devices.
- Flow is awkward. On a Google Home device, you can say "Okay Google, turn on the lights" with no pause. On the Voice PE, you have to say "Hey Mycroft [awkward pause while you wait for the acknowledgement noise] turn on the lights" - it seems like the Google Home devices start buffering immediately after the wake word, but the Voice PE doesn't.
- Voice fingerprints don't exist, so this prevents the device from figuring out that two separate people are talking, or who is talking to it.
- The device has poor identification of background noise, so if you talk to it while there is a TV playing speech in the background, it will continue to listen to the speech from the TV. It will eventually transcribe everything you said + everything from the TV and get confused. (This probably folds into the voice print thing as well.)
On the upside, though:
- Setting it up was really easy.
- All of the entities I want to control with it are already available, without needing to export them or set them up separately in Google Home.
- Despite all of the above complaints, the device is probably 80-90% of what I realistically need to use it day-to-day. If they throw a better speaker and mic array in, I'd likely be comfortable replacing all of my Google Homes.
> it seems like the Google Home devices start buffering immediately after the wake word, but the Voice PE doesn't.
Google Home devices are always buffering. The wake word just tells it to look back in the buffer and start processing.
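That look-back scheme is just a ring buffer. A minimal sketch with `collections.deque` (frame type and capacity are placeholders):

```python
from collections import deque

class AlwaysOnBuffer:
    """Keep the last `capacity` audio frames; on wake word, hand back
    everything already buffered so speech right after (or overlapping)
    the wake phrase isn't lost."""
    def __init__(self, capacity: int = 50):
        self.frames = deque(maxlen=capacity)  # old frames fall off

    def push(self, frame):
        self.frames.append(frame)

    def on_wake_word(self):
        # Start processing from the buffered past, not from "now".
        captured = list(self.frames)
        self.frames.clear()
        return captured
```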
I picked up the same model; including the shipping to Canada, it ended up costing a lot for what it is.
How are you hosting your LLM locally? I tried Ollama on an M4 Mac mini, even with a smaller LLM, the performance was very poor.
I've recently purchased a couple of the Home Assistant Voice Preview Edition devices, and they leave a lot to be desired.
The wake word detection isn't great, and the audio quality is abysmal (for voice responses, not music).
Amazon has ruined their Alexa and Echo devices with ads and annoying nag messages.
I'd really like an open alternative, but the basics are lacking right now.
Can those devices (Amazon) be _jail broken_? I was just wondering that this morning while taking a shower.
Generally no. Big tech companies have gotten good at locking down devices to the boot loader. Some of the signing keys for certain OTA versions have leaked, but you can't rely on that.
Some of the devices contain browsers, and people have set up hacky ways to turn them into thin clients through that, but it's not particularly reliable IME.
I heard some Chinese brands which made similar hardware for Chinese consumers don't lock their devices down, letting you flash an open install of Android on them, but I haven't seen anyone try that IRL.
Youtube is trying to push me to watch a video about jail breaking the Echo Show for a week now. I didn't watch it, but it's probably easy to find.