• by Mizza on 6/11/2025, 9:52:09 PM

    Demos here: https://resemble-ai.github.io/chatterbox_demopage/ (not mine)

    This is a good release if they're not too cherry picked!

    I say this every time it comes up, and it's not as sexy to work on, but in my experiments voice AI is really held back by transcription, not TTS. Unless that's changed recently.

  • by xnx on 6/11/2025, 9:27:00 PM

  • by travisvn on 6/12/2025, 6:15:01 AM

    Chatterbox is fantastic.

    I created an API wrapper that also makes installation easier (Dockerized as well) https://github.com/travisvn/chatterbox-tts-api/

    Best voice cloning option available locally by far, in my experience.

  • by teraflop on 6/11/2025, 10:42:57 PM

    > Every audio file generated by Chatterbox includes Resemble AI's Perth (Perceptual Threshold) Watermarker - imperceptible neural watermarks that survive MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

    Am I misunderstanding, or can you trivially disable the watermark by simply commenting out the call to the apply_watermark function in tts.py? https://github.com/resemble-ai/chatterbox/blob/master/src/ch...

    I thought the point of this sort of watermark was that it was embedded somehow in the model weights, so that it couldn't easily be separated out. If you're going to release an open-source model that adds a watermark as a separate post-processing step, then why bother with the watermark at all?
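    A minimal sketch of the structure being described, assuming the watermark really is a separate post-processing call. Only the name `apply_watermark` comes from the comment above; `synthesize`, `generate`, and the toy signal math are hypothetical stand-ins, not Chatterbox's code or the Perth algorithm.

```python
def synthesize(text: str) -> list[float]:
    """Hypothetical stand-in for the model: maps text to fake audio samples."""
    return [float(ord(ch) % 7) for ch in text]


def apply_watermark(samples: list[float], strength: float = 0.001) -> list[float]:
    """Toy watermark: adds a tiny alternating offset to each sample.

    This is NOT the Perth watermarker, only a placeholder for a
    post-processing step applied after synthesis.
    """
    return [s + strength * (1 if i % 2 == 0 else -1) for i, s in enumerate(samples)]


def generate(text: str, watermark: bool = True) -> list[float]:
    """Synthesize audio, then optionally watermark it as a separate step."""
    samples = synthesize(text)
    if watermark:
        # Because this runs after the model, skipping it leaves the raw
        # model output untouched.
        samples = apply_watermark(samples)
    return samples
```

    In a pipeline shaped like this, anyone with the source can skip the final call, which is the concern raised above. A watermark produced by the weights themselves would have no separate step to remove.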

  • by pryelluw on 6/11/2025, 10:22:37 PM

    Silly question, what’s the lowest spec hardware this will run on?

  • by ineedasername on 6/11/2025, 11:11:10 PM

    The emotional exaggeration is interesting, though I don't think I've come across anything quite so versatile and easy to "sculpt" as ElevenLabs and its ability to generate a voice on the basis of a description of how you want the voice to sound. SparkTTS allows some additional parameters, and its project on GitHub has placeholders in its code that indicate the model might be refined for more fine-grained emotional control. As it is, I've had some success with it and other models by trying to influence prosody and tonality with some heavy-handed cues in the text, which can then be used with VC to get closer to desired results, but it's a much more cumbersome process than Eleven.

  • by nmstoker on 6/11/2025, 10:25:51 PM

    I've found it excellent with really common accents, but with other accents (that are pretty common too) it can easily get stuck picking a different accent. For instance, several Scottish recordings ended up Australian, as did a fairly mild Yorkshire accent.

  • by abraxas on 6/11/2025, 9:51:42 PM

    Are these things good enough to narrate a book convincingly or does the voice lose coherence after a few paragraphs being spoken?

  • by iambateman on 6/12/2025, 12:44:31 AM

    Just a regular reminder to tell your friends and family to be extra skeptical about phone conversations.

    It’s becoming much more likely that the friend who desperately needs a gift card to Walmart isn’t the friend at all. :(

  • by audiala on 6/12/2025, 5:17:39 AM

    What is the current state of the art for open source multilingual TTS? I have found Kokoro to be great for English, but am still searching for a good solution for French, Japanese, German...

  • by philipkiely on 6/12/2025, 12:47:55 AM

    Example implementation with sample inference code + voice cloning example:

    https://github.com/basetenlabs/truss-examples/tree/main/chat...

    Still working on streaming

  • by tevon on 6/12/2025, 5:03:36 AM

    I just tested it out locally, really excellent quality, the server was easy to set up and well documented.

    I'd love to get to real-time generation if that's in the pipeline? Would like to use it along with Home Assistant.

  • by stevage on 6/11/2025, 11:25:13 PM

    Interesting demo. A few observations, having uploaded a snippet of my own voice, and testing with some of my own text:

    - the output had some of the qualities of my voice, but wasn't super similar. (Then again, the fact it could even do this from such a tiny snippet was impressive)

    - increasing "CFG/pace" (whatever CFG is) even a little bit often just breaks down into total gibberish

    - it was very inconsistent whether it would come out with a kind of British accent or an American one. (My accent is Australian...)

    - the emotional exaggeration was interesting, but it seemed to vary a lot exactly what kind of emotion would come out

  • by j2kun on 6/11/2025, 10:02:55 PM

    They should put the meaning of "TTS" in the readme somewhere, probably near the top. Or their website.

  • by lukeinator42 on 6/12/2025, 11:02:28 PM

    Does anyone know of an open-source TTS like this that can also encode speech to do voice conversion alongside TTS? i.e. a model that would take speech as input and convert it to one of the pretrained TTS voices.

  • by causality0 on 6/11/2025, 11:14:40 PM

    Anyone know how this compares to Kokoro? I've found Kokoro very useful for generating audiobooks, but it almost always pronounces words with paired vowels incorrectly. Daisy becomes die-zee, leave becomes lay-ve, etc.

  • by b0a04gl on 6/12/2025, 6:42:48 PM

    > the emotion intensity control is killer. actual param you can tune per line.

    > and the perth watermarking baked into every output, that’s the part most people are sleeping on. survives mp3, editing, even resampling. no plugin, no postprocess.

    > also noticed the chatterboxtoolkitui floating in the org, with audiobook mode and batch voice conversion already wired in.

    is it a banger??? yes ig so, a full setup ready for indies shipping voicefirst products right now.

  • by pzo on 6/12/2025, 1:37:36 AM

    It's only for English sadly

  • by palmfacehn on 6/12/2025, 9:09:08 AM

    Has anyone developed a way to annotate the input to provide emotional context?

    In the past I've used different samples from the same speaker for this.

  • by racecar789 on 6/12/2025, 4:51:17 AM

    I’d sign up for a service that calls a pharmacy on my behalf to refill prescriptions. In certain situations, pharmacies will not list prescriptions on their websites, even though they have the prescriptions on file, which forces the customer to call by phone — a frustrating process.

    I do feel bad for pharmacists, their job is challenging in so many ways.

  • by Shopper0552 on 6/11/2025, 11:39:14 PM

    Anyone know a good free open source speech-to-text tool? Looking for something for my laptop, which is running Fedora KDE Plasma.

  • by MrThoughtful on 6/12/2025, 4:58:20 AM

    How do you set the voice?

    On the Hugging Face demo, there seems to be no option for it.

    It has a female voice. Any way to set it to a male voice?

  • by DHolzer on 6/13/2025, 8:00:31 AM

    I love Chatterbox, it's my favourite. While the generation speed is quick, I wonder what performance optimizations I could try on my 3090 to improve throughput. It's not quite enough for realtime.

  • by ipsum2 on 6/12/2025, 5:08:45 AM

    The voice cloning is okay, not as good as ElevenLabs. There's a Rick (from Rick and Morty) voice example, and the generated audio sounds muffled and low quality. I appreciate that it's open source though.

  • by kiririn7 on 6/11/2025, 11:57:14 PM

    definitely worse than the new elevenlabs model(v3). that model is really good

  • by andy_xor_andrew on 6/11/2025, 10:48:38 PM

    in my experience, TTS has been a "pick two" situation:

    - fast / cheap to run

    - can clone voices

    - sounds super realistic

    from what I can tell, Chatterbox is the first that apparently lets you pick 3! (have not tried it myself yet, this is just what I can deduce)

  • by SV_BubbleTime on 6/12/2025, 10:28:49 PM

    Fun stuff... I don't know how or why, but connecting Bluetooth while on this site made all of the audio clips play at once (Firefox, Linux). Not the best listening experience.

  • by bachittle on 6/12/2025, 5:57:12 PM

    I always have issues with TTS models that do not allow you to send large chunks of text. Seems this one does not resolve this either. Always has a limit of like 2-3 sentences.
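    A common workaround for short input limits is to split the text at sentence boundaries and synthesize each chunk separately. Below is a minimal sketch of the splitting step only; `split_into_chunks` and the `max_chars` value are illustrative assumptions, not part of Chatterbox or any documented limit.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk
    rather than being cut mid-sentence.
    """
    # Split after ., ! or ? followed by whitespace (a rough heuristic that
    # will mis-split abbreviations such as "e.g. this").
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

    Each chunk would then go through the TTS call of your choice, and the resulting audio segments get concatenated. Prosody can still shift at chunk boundaries, so this works around the limit rather than fixing it.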

  • by andymcsherry on 6/12/2025, 3:20:49 AM

  • by 3ds on 6/12/2025, 9:32:31 AM

    There are only English voices, even in the paid version. Using them in other languages results in an accent.

  • by ojw0816 on 6/13/2025, 5:09:28 AM

    Looks good! What is the difference between the open-source version and the priced version?

  • by az226 on 6/11/2025, 10:26:08 PM

    How does one train a TTS model with an LLM backbone? Practically, how does this work?

  • by monksy on 6/13/2025, 4:24:37 AM

    How would I install this alongside librechat or ollama using docker?

  • by init0 on 6/12/2025, 3:31:40 PM

  • by decide1000 on 6/11/2025, 10:21:31 PM

    How does it perform on multi-lingual tasks?

  • by benob on 6/12/2025, 5:52:06 AM

    Watermarking is easily disabled in the code. I am wondering when they will release model weights with embedded watermarking.

  • by andrewstuart on 6/12/2025, 11:29:36 AM

    There’s been surprisingly little advancement in TTS after a rapid leap forward three years ago or so.

    There’s ElevenLabs, which is quite good but not incredible, and very expensive.

    Everything else... all the big AI companies... have TTS systems that are kinda meh.

    Everything else in AI has advanced in leaps and bounds, but TTS remains deep in the uncanny valley.

  • by pradeepodela on 6/12/2025, 5:25:28 AM

    What is the latency?

  • by tuananh on 6/12/2025, 12:31:36 AM

    for this, what does it take to support another language?

  • by ash1224 on 6/13/2025, 8:13:04 AM

    wow! 200 ms, very good!

  • by internet_points on 6/12/2025, 8:49:17 AM

    > Supported Lanugage

    > Currenlty only English.

    meh

  • by _andrei_ on 6/12/2025, 12:52:41 PM

    very cherry picked

  • by hsavit1 on 6/12/2025, 2:11:18 AM

    another TTS that is only supporting English. This really irritates me

  • by andyferris on 6/12/2025, 3:36:59 AM

    It took me ages to understand what TTS means!