
New Fish model #58

Open
jmtatsch opened this issue Sep 13, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@jmtatsch

Have you seen the new fish speech model https://github.com/fishaudio/fish-speech ?
Wonderful voice cloning and intonation performance.
Would you consider supporting it?

@matatonic
Owner

I am considering it. So far I've heard it's not as good as xtts, but I haven't tried it myself yet.

@jmtatsch
Author

Imho it's far superior to xtts: less robotic and more emotional.
https://www.youtube.com/watch?v=Ghc8cJdQyKQ
The only catch is its non-commercial license.

@thiswillbeyourgithub

I see a major reason to implement support for Fish: it seems to support quantization.

I have an old GPU with 8 GB of VRAM, so every byte matters to me, and I really struggled to find any good information on how to quantize XTTS. I conclude that it's not something that can be relied upon, so seeing this PR that adds quantization support for Fish Speech makes me very interested!

PS: what's up with deepspeed for XTTS btw? I see that it takes a pip install deepspeed. If you can't support it in the official image, could you give me some pointers to use it on my side? XTTS is pretty slow for me, too slow for interactivity.

@matatonic
Owner

> I see a major reason to implement support for Fish: it seems to support quantization.
>
> I have an old GPU with 8G of RAM so every byte matters to me and I really struggled to find any good information on how to quantize XTTS. I conclude that it's not something that can be relied upon so seeing this PR that adds quantization support for Fish Speech makes me very interested!

That's a great point, thanks for that.

Re: deepspeed, can you start a new issue or discussion? It's worth its own space. I know it would help low-VRAM folks a lot, but it's a bit complex, especially for Windows.
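For anyone who lands here before that discussion exists, here is a minimal sketch of what "enabling deepspeed" usually amounts to for Coqui XTTS. The availability check is plain Python; the commented-out loading call assumes the `use_deepspeed` flag that recent Coqui TTS versions expose on `Xtts.load_checkpoint`, and the checkpoint paths are placeholders:

```python
# Hedged sketch (not this project's code): guard DeepSpeed usage behind an
# availability check, since `pip install deepspeed` can fail on Windows.
import importlib.util

def deepspeed_available() -> bool:
    """True if the deepspeed package is installed and importable."""
    return importlib.util.find_spec("deepspeed") is not None

# Loading the model itself needs the XTTS weights, so it is only sketched:
# from TTS.tts.configs.xtts_config import XttsConfig
# from TTS.tts.models.xtts import Xtts
# config = XttsConfig(); config.load_json("/path/to/config.json")
# model = Xtts.init_from_config(config)
# model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/",
#                       use_deepspeed=deepspeed_available())

print(deepspeed_available())
```

The guard lets the same code run with or without deepspeed installed, which is handy in a shared Docker image.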


@thiswillbeyourgithub

thiswillbeyourgithub commented Oct 12, 2024

Hi, I took a quick look at fish audio again. I'm sharing this to make it easier to give it a try!

Their reference is at https://speech.fish.audio/, but I ended up doing my own thing:

    git clone https://github.com/fishaudio/fish-speech/
    cd fish-speech

Then create docker-compose.yml with content:

    services:
      fish-speech:
        image: fishaudio/fish-speech:latest-dev  # avoid building it
        volumes:
          - ./:/exp
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
        network_mode: host  # to access their gradio

Run docker compose up, then open localhost:7860 to check out their gradio UI.

My takeaway is that it's of super high quality, and quite fast. Hard to quantify, but I never saw it take more than 2.2 GB of VRAM, whereas xtts often took all of my 8 GB (might actually be a bug, come to think of it?!). Fish on my old GPU seems to take 60s to generate 30s of audio, but I have done zero optimization. I don't really understand how to enable quantization. There seem to be some args like --compile and --half, but I don't have the time right now.
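For reference, that speed measurement corresponds to a real-time factor (RTF) of 2; a quick check with the numbers from above (the helper function is just for illustration):

```python
# Real-time factor: generation time divided by audio duration.
# RTF < 1 would be needed for interactive use; lower is faster.
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    return generation_seconds / audio_seconds

print(rtf(60.0, 30.0))  # 2.0: twice as slow as real time on this old GPU
```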

I think to go further I would need to build it from the repo and modify the entry point to run the other python gradio scripts. Some of those relate directly to quantization.


3 participants