
Feat/llamafile: adding llamafile as engine & ModelFactory mechanism rewrite suggestion & haystack parsing/write enhancements #10

Merged · 38 commits · Oct 8, 2024

Conversation

@sariola (Collaborator) commented Oct 6, 2024

Added:

  • Llamafile as engine

    • Spawns a server with a robust checking implementation
    • Llamafile downloading
    • Uses a with-context to spawn and clean up the server process on demand during usage
    • Handles nested contexts, as well as subsequent context and no-context invocations
    • Intelligently checks whether a server is already spawned
    • Generate features
      • Uses OpenAI as the client library
      • Sync has been tested with the Haystack integration and it passes parsing 20/20, for both generate and batch_generate.
      • Caveat: for batch_generate we need to make changes to batch_evaluate at the higher abstraction level, since this approach needs the server and has to pass it within the context, whereas the previous implementation of batch_evaluate just called generate repeatedly. We check whether the instance is a Llamafile and use the functionality of the class (see the sketch after this list).
      • TODO: Test async generate and async batch_generate. Both are untested and may not make sense for most of the systems this llamafile approach targets.
  • ModelConfig can be used directly to instantiate an engine

    • TODO: Show how this can be used, in a notebook example
    • TODO: Show how ModelConfig templates can be imported and altered conveniently in a notebook
    • This change implies removing ModelFactory and putting that very slight logic in the __init__ of the engine classes instead. This is my suggestion. It also helps users with importing and auto-completing the engines, removing the need to refer to them by string.
    • If our devs like this approach, then TODO: update the __init__s of the vLLM and HF engines to benefit from and standardize on the same pattern.
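To make the batch_generate caveat concrete, here is a minimal sketch of how batch_evaluate could branch on the engine type; the function shape and names are assumptions for illustration, not the actual flow-judge code:

```python
from flow_judge.models.llamafile import Llamafile  # assumed import path


def batch_evaluate(model, prompts: list[str]) -> list[str]:
    """Sketch: keep one llamafile server alive for the whole batch."""
    if isinstance(model, Llamafile):
        # Enter the server context once instead of letting each call
        # re-check / re-spawn the server.
        with model:
            return model.batch_generate(prompts)
    # Previous behaviour for other engines: call generate() repeatedly.
    return [model.generate(p) for p in prompts]
```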

Standard usage: the with-context spawns the server and automatically cleans it up.
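As a rough stand-in for the screenshot, a minimal sketch of what this could look like; the import path and exact signatures are assumptions inferred from this PR, not the verified API:

```python
from flow_judge.models.llamafile import Llamafile  # assumed import path

model = Llamafile()

# Entering the context downloads the llamafile if needed, spawns the server,
# and tears it down again when the block exits.
with model:
    result = model.generate("Evaluate the following response ...")
    results = model.batch_generate(["prompt one ...", "prompt two ..."])
```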

This also works without the context, because both generate and batch_generate ensure a server exists and select the already running one. Additionally, the Llamafile class exposes useful methods for checking and controlling the server if you want to use it in a custom way.
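Again only as a sketch, with the server-control helper names being hypothetical placeholders rather than the real method names:

```python
from flow_judge.models.llamafile import Llamafile  # assumed import path

model = Llamafile()

# generate()/batch_generate() check whether a server is already running and
# start one if needed, so no explicit `with` block is required.
result = model.generate("Evaluate the following response ...")

# Hypothetical helpers for controlling the server lifecycle in a custom way.
if not model.is_server_running():
    model.start_server()
try:
    results = model.batch_generate(["prompt one ...", "prompt two ..."])
finally:
    model.stop_server()
```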

Creating a custom-configured engine looks like this:
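A sketch of direct ModelConfig instantiation; the field names and import paths below are illustrative assumptions, not the actual schema in the repo:

```python
from flow_judge.models.common import ModelConfig    # assumed import path
from flow_judge.models.llamafile import Llamafile   # assumed import path

# Placeholder fields for illustration only.
config = ModelConfig(
    model_id="flowaicom/Flow-Judge-v0.1",  # hypothetical model id
    generation_params={"temperature": 0.1, "max_new_tokens": 1024},
)

# With ModelFactory removed, the config is passed straight to the engine's
# __init__ instead of the engine being looked up by a string name.
model = Llamafile(config=config)
```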

  • Haystack parsing was throwing errors
    • Fixed a parsing case where the score appeared inside the feedback tags and/or the closing feedback tag was missing (a sketch follows this list)
    • Fixed the code refusing to produce results when parsing errors occurred, even when boolean False had been passed to stop producing results on parsing errors
    • Added a try block to the result writer
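A sketch of the more tolerant parsing described above; the tag names and regexes are illustrative rather than the exact ones used in the repo:

```python
import re

FEEDBACK_RE = re.compile(r"<feedback>(.*?)(?:</feedback>|$)", re.DOTALL)
SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")


def parse_output(text: str) -> tuple[str, int | None]:
    """Tolerate a score inside the feedback tags or a missing closing tag."""
    feedback_match = FEEDBACK_RE.search(text)
    feedback = feedback_match.group(1).strip() if feedback_match else ""

    # Search the whole output so the score is found whether it was emitted
    # inside or outside the feedback block.
    score_match = SCORE_RE.search(text)
    score = int(score_match.group(1)) if score_match else None

    # Drop a stray <score> block that leaked into the feedback body.
    feedback = SCORE_RE.sub("", feedback).strip()
    return feedback, score
```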

This PR also implies using the OpenAI client for the Baseten sync and async functionality, since Baseten exposes the same OpenAI-compatible API as Llamafile; it incurs no custom work to implement, other than having the async batching receive results via the proxy.
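For illustration, the same OpenAI client call works against any OpenAI-compatible endpoint just by swapping base_url, which is why sharing the client between llamafile and Baseten is nearly free; the URL, port, key, and model name below are assumptions:

```python
from openai import OpenAI

# Point the client at a local llamafile server (or a Baseten OpenAI-compatible
# endpoint) instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="flow-judge",  # hypothetical model name
    messages=[{"role": "user", "content": "Evaluate the following response ..."}],
)
print(response.choices[0].message.content)
```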

Additional TODOs beyond those already mentioned:

  • Create a few unit tests
  • Test on MacOS (8 GB and 16 GB RAM)
  • Potential improvements to the init values based on VRAM & RAM available
  • Test on CPU only (will be very slow)

@sariola sariola requested a review from bergr7 October 6, 2024 08:24
@sariola sariola marked this pull request as draft October 7, 2024 06:24
@bergr7 (Contributor) commented Oct 7, 2024

REMINDER: We must update all notebooks and the README to accommodate the ModelFactory replacement.

cc: @sariola

@bergr7 (Contributor) commented Oct 7, 2024

Another thing that should be fixed with the ModelFactory replacement is that currently, the model factory forces you to have all the extras installed. No bueno

@sariola (Collaborator, Author) commented Oct 7, 2024

Reverted haystack.py and flow-judge.py: the notebook still works normally, as it should ✔️

Kept the parsing addition where the score can be found inside or outside the feedback tags.

Base automatically changed from feat/haystack-integration to main October 7, 2024 11:19
@bergr7 (Contributor) commented Oct 7, 2024

REMINDER - Update model cards on HF

codecov bot commented Oct 7, 2024

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

@sariola (Collaborator, Author) commented Oct 7, 2024

Okay, it seems the shortcut imports from __init__.py are forcing installation of all the extras. Will need to fix this in the morning.

For now just power through it with `python -m pip install -e .[dev,vllm,hf,llamafile]`


Additional

Added a self-hosted [GitHub Actions](https://github.com/flowaicom/flow-judge/actions) test runner on tsukuyomi (Hetzner), which we can make GPU-enabled for E2E tests. It currently runs [Python 3.10/3.11/3.12](https://github.com/flowaicom/flow-judge/blob/feat/llamafile/.github/workflows/python-package.yml), calculates code coverage, and creates the badges for the README. Pretty neat.

There are quite a few tests to create to close the coverage gap, but I think we could storm through that collaboratively. Also registered on TestPyPI, and we can use my PyPI account that we used for autoeval to publish to the index.

@sariola (Collaborator, Author) commented Oct 7, 2024

Note to self:

  • A MacBook with 8 GB RAM and VS Code open needs -ctk q4_0 -ctv q4_0 -fa on line 213 in llamafile.py
  • Toggling nkvo does not currently work because the args are not passed correctly, but generation seems to work even without it, using 7.2 GB of the 8 GB RAM when we use the q4 KV cache and flash attention on Metal. Evaluation scores parse correctly! 🎉

@bergr7 (Contributor) commented Oct 8, 2024

> Okay, it seems the shortcut imports from __init__.py are forcing installation of all the extras. Will need to fix this in the morning. […]

Ahh, I see. Sorry I didn't pay enough attention to those imports.

@bergr7 (Contributor) commented Oct 8, 2024

Starting the review already

@bergr7 (Contributor) commented Oct 8, 2024

> Added a self-hosted GitHub Actions test runner on tsukuyomi (Hetzner) […]

This is great for including model tests as well, not just code tests.

@sariola (Collaborator, Author) commented Oct 8, 2024

Great!!

The async path and the LlamaIndex tutorial are completely untested; I'll get into those right away.

@bergr7 (Contributor) left a review:

Overall looks very solid!

I added minor comments here and there.

I executed the Haystack notebook successfully, but the regex is producing many parsing errors. I haven't pushed the executed notebook.

Review threads:
  • README.md
  • flow_judge/models/common.py (outdated)
  • flow_judge/models/huggingface.py (outdated)
  • flow_judge/models/huggingface.py (outdated)
  • flow_judge/models/huggingface.py
  • flow_judge/utils/result_writer.py (outdated)
  • tests/README.md (outdated)
  • tests/README.md
  • tests/README.md
  • examples/1_quickstart.ipynb (outdated)
@sariola sariola marked this pull request as ready for review October 8, 2024 11:01
@bergr7 (Contributor) left a review:

Approved ✅

Only a minor comment on the metadata file extension

Review thread: flow_judge/utils/result_writer.py (outdated)
@sariola (Collaborator, Author) commented Oct 8, 2024

> Approved ✅
>
> Only a minor comment on the metadata file extension

Thanks, I'll double-check that. I'll also look into the llamafile server lingering, make it dependent on the existence of the kernel process, and create an exit trap for it to handle edge cases.

Edit: Checked it out. Also, I think that with 11514bd we now have more robust server cleanup: the server spawns as part of a child process group that stays tied to the existence of the object, and the whole group is wiped out on cleanup.
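As a rough illustration of the process-group idea (the command, port, and cleanup hook are placeholders, not what 11514bd actually implements):

```python
import atexit
import os
import signal
import subprocess

# Launch the llamafile server in its own session/process group so it and any
# children it spawns can be torn down together.
proc = subprocess.Popen(
    ["./flow-judge.llamafile", "--server", "--port", "8080"],  # placeholder command
    start_new_session=True,
)


def _cleanup() -> None:
    try:
        # Wipe out the whole group, not just the immediate child.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except ProcessLookupError:
        pass  # server already exited


atexit.register(_cleanup)
```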

@sariola sariola merged commit b20ca8d into main Oct 8, 2024
7 checks passed