
Open-source TTS model support with timestamps #77

Open
alesstracker21 opened this issue Nov 30, 2024 · 1 comment

@alesstracker21

First of all, I'd like to say this is a great project! I am looking for ways to integrate it into my own project.
I have an open-ended question here. It feels to me like this project relies heavily on cloud services, but I am running everything locally because I aim to end up with a self-contained service I can use myself. As far as I can tell, no open-source TTS model (at least none that I know of, e.g. Melo/Parler/Coqui and so on) supports timestamps, so the phonemes output by the TTS might have to be converted into timestamps manually.
Has anybody managed to make this run with open-source TTS models while keeping timepoint data available for lipsync?

Thanks for being a part of the conversation!

@alesstracker21 alesstracker21 changed the title Open-source TTS model support Open-source TTS model support with timestamps Nov 30, 2024
@met4citizen
Owner

met4citizen commented Dec 2, 2024

I've been looking for such a TTS project for some time (with no luck), so thank you for starting this thread.

One potential candidate on my radar is Piper. I haven't had time to explore it in depth, but based on the documentation and demos, it has several qualities I like: it uses neural voices, it is fast, it is released under an MIT License, and the project seems active. There are even WASM versions available, so you could run it entirely in a browser. Additionally, related to the TalkingHead project, there appears to be an open PR for generating word-level alignment data.

(Edit: It seems that the referenced PR generates audio twice: once for the entire sentence and then again separately for each word. This approach is not optimal. The duration of each phoneme is already available from the first run, but this information is simply not exposed in the lower-level API. More about this here. It seems, however, that one of the maintainers is currently rewriting Piper due to some licensing concerns and new features, so for now it might be best to wait and see how the project evolves.)
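
For reference, once those per-phoneme durations are exposed, deriving word-level timestamps from a single synthesis run is just a matter of accumulating them. Here is a minimal Python sketch; the input format (each word paired with the durations of its phonemes, in seconds) is hypothetical and not part of Piper's current API:

```python
# A minimal sketch, not Piper's actual API: assume the TTS front end
# returns each word together with the durations (in seconds) of the
# phonemes that make it up, taken from the single synthesis pass.

def words_to_timestamps(words):
    """words: list of (word, [phoneme_duration_s, ...]) pairs.
    Returns a list of (word, start_s, end_s) tuples."""
    timestamps = []
    t = 0.0
    for word, durations in words:
        start = t
        t += sum(durations)  # a word ends where its last phoneme ends
        timestamps.append((word, start, t))
    return timestamps

# Example with made-up durations:
words = [("hello", [0.08, 0.06, 0.09, 0.11]),
         ("world", [0.10, 0.07, 0.12])]
for word, start, end in words_to_timestamps(words):
    print(f"{word}: {start:.2f}s - {end:.2f}s")
```

(In practice you would also have to account for inter-word pauses and silence tokens, but the principle stands: no second synthesis pass should be needed.)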
