Feat/llamafile: adding llamafile as engine & ModelFactory mechanism rewrite suggestion & haystack parsing/write enhancements (#10)

* llamafile & modelfactory removal suggestion & haystack parsing

* revert flow_judge.py and haystack.py & make eval_data_types closer to original

* revert haystack notebook & first edits for vllm and hf; combine vllm sync & async

* model configs into engine files; model types and base into common

* changed imports from modelfactory to direct init

* harmonized the configs and provided structure where they are extended classes, and can take sensible args

* updated notebooks to load the models using the new init

* added import checks for the extras and reverted eval_data_types parsing

* Create python-package.yml

* ruff format & isort run

* ruff format & isort run & test [dev,vllm,hf,llamafile]

* updated readme & added tests readme with icicle viz

* updated codecov action

* added test results upload to codecov

* upgrade actions setup-python to v5

* update youtube badge

* test codecov badge

* python versions badge

* rm py versions badge

* clean up a misplaced title

* realign

* chore: update readme

* init fix for extras

* chore: executed notebooks + minor update

* standardized genparams & non-supported model warning & llamafile quant kv + fa

* add torch to llamafile extra as dep

* add gpu check into ci flow

* fixed tests README

* fixed redundant vllm import error

* fixed model to model_id in vllm and llamafile

* fixed llamafile args quoting

* fixed metadata file writing to json from jsonl

* fixed default model name for Llamafile & tests graph to starburst

* small fix in the readme from old usage of Flow-Judge-v0.1_HF to Hf()

* testing out llamafile server cleanup from abrupt situations

* fixed gen params passing for vllm and hf init

* change vllm default param dtype back to bfloat16 from auto

---------

Co-authored-by: Bernardo Garcia <bernardo@flow-ai.com>
sariola and bergr7 authored Oct 8, 2024
1 parent d86fbb6 commit b20ca8d
Showing 25 changed files with 2,461 additions and 1,121 deletions.
52 changes: 52 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,52 @@
name: Python package

on:
  push:
    branches: [ "main", "feat/llamafile" ]
  pull_request:
    branches: [ "main", "feat/llamafile" ]

jobs:
  build:
    runs-on: self-hosted
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install .[dev,vllm,hf,llamafile]
    - name: Verify GPU availability
      run: |
        nvidia-smi
        python -c "import torch; print(torch.cuda.is_available())"
    - name: Lint with ruff
      run: |
        ruff check . || true
    - name: Format with black
      run: |
        black --check --diff . || true
    - name: Sort imports with isort
      run: |
        isort --check-only --diff . || true
    - name: Test with pytest and generate coverage
      run: |
        pytest --cov=./ --junitxml=junit.xml
    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v4
      with:
        token: ${{ secrets.CODECOV_TOKEN }}
        fail_ci_if_error: true
    - name: Upload test results to Codecov
      if: ${{ !cancelled() }}
      uses: codecov/test-results-action@v1
      with:
        token: ${{ secrets.CODECOV_TOKEN }}
5 changes: 5 additions & 0 deletions .gitignore
@@ -45,3 +45,8 @@ output/

# data
data/

.cache

flake.nix
flake.lock
111 changes: 90 additions & 21 deletions README.md
@@ -1,4 +1,5 @@
# `flow-judge`

<p align="center">
<img src="img/flow_judge_banner.png" alt="Flow Judge Banner">
</p>
@@ -16,6 +17,27 @@
<code>flow-judge</code> is a lightweight library for evaluating LLM applications with <code>Flow-Judge-v0.1</code>.
</p>

<p align="center">
<a href="https://github.com/flowaicom/flow-judge/stargazers/" target="_blank">
<img src="https://img.shields.io/github/stars/flowaicom/flow-judge?style=social&label=Star&maxAge=3600" alt="GitHub stars">
</a>
<a href="https://github.com/flowaicom/flow-judge/releases" target="_blank">
<img src="https://img.shields.io/github/v/release/flowaicom/flow-judge?color=white" alt="Release">
</a>
<a href="https://www.youtube.com/@flowaicom" target="_blank">
<img alt="YouTube Channel Views" src="https://img.shields.io/youtube/channel/views/UCo2qL1nIQRHiPc0TF9xbqwg?style=social">
</a>
<a href="https://github.com/flowaicom/flow-judge/actions/workflows/python-package.yml" target="_blank">
<img src="https://github.com/flowaicom/flow-judge/actions/workflows/python-package.yml/badge.svg" alt="Build">
</a>
<a href="https://codecov.io/gh/flowaicom/flow-judge" target="_blank">
<img src="https://codecov.io/gh/flowaicom/flow-judge/branch/feat%2Fllamafile/graph/badge.svg?token=AEGC7W3DGE" alt="Code coverage">
</a>
<a href="https://github.com/flowaicom/flow-judge/blob/main/LICENSE" target="_blank">
<img src="https://img.shields.io/static/v1?label=license&message=Apache%202.0&color=white" alt="License">
</a>
</p>

## Model
`Flow-Judge-v0.1` is an open, small yet powerful language model evaluator trained by Flow AI on a synthetic dataset of LLM system evaluation data.

@@ -41,22 +63,31 @@ pip install 'flash_attn>=2.6.3' --no-build-isolation
```

Extras available:
- `dev` to install development dependencies
- `hf` to install Hugging Face Transformers dependencies
- `vllm` to install vLLM dependencies
- `llamafile` to install Llamafile dependencies

## Quick Start

Here's a simple example to get you started:

```python
from flow_judge import Vllm, Llamafile, Hf, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# If you are running on an Ampere GPU or newer, create a model using vLLM
model = Vllm()

# If other applications are taking up VRAM, you can reduce usage by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)

# Or, if you are not running on an Ampere GPU or newer, create a Hugging Face Transformers model with flash attention disabled
# model = Hf(flash_attn=False)

# Or, if you are not running an NVIDIA GPU (for example, on Apple Silicon macOS), create a model using Llamafile
# model = Llamafile()

# Initialize the judge
faithfulness_judge = FlowJudge(
@@ -88,14 +119,57 @@ display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score

## Usage

### Inference Options

The library supports multiple inference backends to accommodate different hardware configurations and performance needs:

1. **vLLM**:
- Best for NVIDIA GPUs with Ampere architecture or newer (e.g., RTX 3000 series, A100, H100)
- Offers the highest performance and throughput
- Requires CUDA-compatible GPU

```python
from flow_judge import Vllm

model = Vllm()
```

2. **Hugging Face Transformers**:
- Compatible with a wide range of hardware, including older NVIDIA GPUs
- Supports CPU inference (slower but universally compatible)
   - Slower than vLLM, but generally compatible with more hardware

If you are running on an Ampere GPU or newer:
```python
from flow_judge import Hf

model = Hf()
```

If you are not running on an Ampere GPU or newer, disable flash attention:
```python
from flow_judge import Hf

model = Hf(flash_attn=False)
```

3. **Llamafile**:
- Ideal for non-NVIDIA hardware, including Apple Silicon
- Provides good performance on CPUs
   - Self-contained, easy-to-deploy option

```python
from flow_judge import Llamafile

model = Llamafile()
```

Choose the inference backend that best matches your hardware and performance requirements. The library provides a unified interface for all these options, making it easy to switch between them as needed.
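
As a minimal sketch of that unified interface (using only the classes and metric already shown above), switching backends is a one-line change:

```python
from flow_judge import Vllm, Hf, Llamafile, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Pick the backend that matches your hardware; the judge setup stays the same.
model = Vllm()                   # NVIDIA Ampere or newer
# model = Hf(flash_attn=False)   # older NVIDIA GPUs or CPU inference
# model = Llamafile()            # non-NVIDIA hardware, e.g. Apple Silicon

faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model,
)
```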

### Evaluation Metrics

`Flow-Judge-v0.1` was trained to handle any custom metric that can be expressed as a combination of evaluation criteria and rubric, and required inputs and outputs.

#### Pre-defined Metrics

@@ -114,21 +188,20 @@ For efficient processing of multiple inputs, you can use the `batch_evaluate` method
```python
# Read the sample data
import json
from flow_judge import Vllm, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Initialize the model
model = Vllm()

# Initialize the judge
faithfulness_judge = FlowJudge(
metric=RESPONSE_FAITHFULNESS_5POINT,
model=model
)

# Load some sample data
with open("sample_data/csr_assistant.json", "r") as f:
data = json.load(f)

@@ -157,13 +230,9 @@ for i, result in enumerate(results):

## Advanced Usage

> [!WARNING]
> There is currently a reported issue with Phi-3 models that produces gibberish outputs with contexts longer than 4096 tokens, including input and output. This issue has recently been fixed in the transformers library, so we recommend using the `Hf()` model configuration for longer contexts at the moment. For more details, refer to: [#33129](https://github.com/huggingface/transformers/pull/33129) and [#6135](https://github.com/vllm-project/vllm/issues/6135)

We are working on adding API-based usage as well as better options for CPU.

### Custom Metrics

