Feat/llamafile: adding llamafile as engine & ModelFactory mechanism rewrite suggestion & haystack parsing/write enhancements (#10)

* llamafile & modelfactory removal suggestion & haystack parsing

* revert flow_judge.py and haystack.py & make eval_data_types closer to original

* revert haystack notebook & first edits for vllm and hf; combine vllm sync & async

* model configs into engine files; model types and base into common

* changed imports from modelfactory to direct init

* harmonized the configs and provided structure where they are extended classes, and can take sensible args

* updated notebooks to load the models using the new init

* added import checks for the extras and reverted eval_data_types parsing

* Create python-package.yml

* ruff format & isort run

* ruff format & isort run & test [dev,vllm,hf,llamafile]

* updated readme & added tests readme with icicle viz

* updated codecov action

* added test results upload to codecov

* upgrade actions setup-python to v5

* update youtube badge

* test codecov badge

* python versions badge

* rm py versions badge

* clean up a misplaced title

* realign

* chore: update readme

* init fix for extras

* chore: executed notebooks + minor update

* standardized genparams & non-supported model warning & llamafile quant kv + fa

* add torch to llamafile extra as dep

* add gpu check into ci flow

* fixed tests README

* fixed redundant vllm import error

* fixed model to model_id in vllm and llamafile

* fixed llamafile args quoting

* fixed metadata file writing to json from jsonl

* fixed default model name for Llamafile & tests graph to starburst

* small fix in the readme from old usage of Flow-Judge-v0.1_HF to Hf()

* testing out llamafile server cleanup from abrupt situations

* fixed gen params passing for vllm and hf init

* change vllm default param dtype back to bfloat16 from auto

---------

Co-authored-by: Bernardo Garcia <bernardo@flow-ai.com>
sariola and bergr7 authored Oct 8, 2024
1 parent d86fbb6 commit b20ca8d
Showing 25 changed files with 2,461 additions and 1,121 deletions.
52 changes: 52 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,52 @@
name: Python package

on:
  push:
    branches: [ "main", "feat/llamafile" ]
  pull_request:
    branches: [ "main", "feat/llamafile" ]

jobs:
  build:
    runs-on: self-hosted
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install .[dev,vllm,hf,llamafile]
    - name: Verify GPU availability
      run: |
        nvidia-smi
        python -c "import torch; print(torch.cuda.is_available())"
    - name: Lint with ruff
      run: |
        ruff check . || true
    - name: Format with black
      run: |
        black --check --diff . || true
    - name: Sort imports with isort
      run: |
        isort --check-only --diff . || true
    - name: Test with pytest and generate coverage
      run: |
        pytest --cov=./ --junitxml=junit.xml
    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v4
      with:
        token: ${{ secrets.CODECOV_TOKEN }}
        fail_ci_if_error: true
    - name: Upload test results to Codecov
      if: ${{ !cancelled() }}
      uses: codecov/test-results-action@v1
      with:
        token: ${{ secrets.CODECOV_TOKEN }}
5 changes: 5 additions & 0 deletions .gitignore
@@ -45,3 +45,8 @@ output/

# data
data/

.cache

flake.nix
flake.lock
111 changes: 90 additions & 21 deletions README.md
@@ -1,4 +1,5 @@
# `flow-judge`

<p align="center">
<img src="img/flow_judge_banner.png" alt="Flow Judge Banner">
</p>
@@ -16,6 +17,27 @@
<code>flow-judge</code> is a lightweight library for evaluating LLM applications with <code>Flow-Judge-v0.1</code>.
</p>

<p align="center">
<a href="https://github.com/flowaicom/flow-judge/stargazers/" target="_blank">
<img src="https://img.shields.io/github/stars/flowaicom/flow-judge?style=social&label=Star&maxAge=3600" alt="GitHub stars">
</a>
<a href="https://github.com/flowaicom/flow-judge/releases" target="_blank">
<img src="https://img.shields.io/github/v/release/flowaicom/flow-judge?color=white" alt="Release">
</a>
<a href="https://www.youtube.com/@flowaicom" target="_blank">
<img alt="YouTube Channel Views" src="https://img.shields.io/youtube/channel/views/UCo2qL1nIQRHiPc0TF9xbqwg?style=social">
</a>
<a href="https://github.com/flowaicom/flow-judge/actions/workflows/python-package.yml" target="_blank">
<img src="https://github.com/flowaicom/flow-judge/actions/workflows/python-package.yml/badge.svg" alt="Build">
</a>
<a href="https://codecov.io/gh/flowaicom/flow-judge" target="_blank">
<img src="https://codecov.io/gh/flowaicom/flow-judge/branch/feat%2Fllamafile/graph/badge.svg?token=AEGC7W3DGE" alt="Code coverage">
</a>
<a href="https://github.com/flowaicom/flow-judge/blob/main/LICENSE" target="_blank">
<img src="https://img.shields.io/static/v1?label=license&message=Apache%202.0&color=white" alt="License">
</a>
</p>

## Model
`Flow-Judge-v0.1` is an open, small yet powerful language model evaluator trained by Flow AI on a synthetic dataset of LLM system evaluation data.

@@ -41,22 +63,31 @@ pip install 'flash_attn>=2.6.3' --no-build-isolation
```

Extras available:
- `dev` to install development dependencies
- `hf` to install Hugging Face Transformers dependencies
- `vllm` to install vLLM dependencies
- `llamafile` to install Llamafile dependencies

## Quick Start

Here's a simple example to get you started:

```python
from flow_judge import Vllm, Llamafile, Hf, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# If you are running on an Ampere GPU or newer, create a model using vLLM
model = Vllm()

# If other applications are taking up VRAM, you can reduce usage by setting gpu_memory_utilization to a lower value.
# model = Vllm(gpu_memory_utilization=0.70)

# Or, if you are not running on an Ampere GPU or newer, create a Hugging Face Transformers model with flash attention disabled
# model = Hf(flash_attn=False)

# Or, if you are not running an NVIDIA GPU (for example, on Apple Silicon macOS), create a model using Llamafile
# model = Llamafile()

# Initialize the judge
faithfulness_judge = FlowJudge(
@@ -88,14 +119,57 @@ display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score

## Usage

### Inference Options

The library supports multiple inference backends to accommodate different hardware configurations and performance needs:

1. **vLLM**:
- Best for NVIDIA GPUs with Ampere architecture or newer (e.g., RTX 3000 series, A100, H100)
- Offers the highest performance and throughput
- Requires CUDA-compatible GPU

```python
from flow_judge import Vllm

model = Vllm()
```

2. **Hugging Face Transformers**:
- Compatible with a wide range of hardware, including older NVIDIA GPUs
- Supports CPU inference (slower but universally compatible)
   - Slower than vLLM, but generally compatible with more hardware

If you are running on an Ampere GPU or newer:
```python
from flow_judge import Hf

model = Hf()
```

If you are not running on an Ampere GPU or newer, disable flash attention:
```python
from flow_judge import Hf

model = Hf(flash_attn=False)
```

3. **Llamafile**:
- Ideal for non-NVIDIA hardware, including Apple Silicon
- Provides good performance on CPUs
   - Self-contained, easy-to-deploy option

```python
from flow_judge import Llamafile

model = Llamafile()
```

Choose the inference backend that best matches your hardware and performance requirements. The library provides a unified interface for all these options, making it easy to switch between them as needed.
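
As a minimal sketch of that unified interface (using only the classes and metric already shown above), switching backends is a one-line change:

```python
from flow_judge import Vllm, Hf, Llamafile, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Pick the backend that matches your hardware; the judge setup stays the same.
model = Vllm()                   # NVIDIA Ampere or newer
# model = Hf(flash_attn=False)   # older NVIDIA GPUs or CPU inference
# model = Llamafile()            # non-NVIDIA hardware, e.g. Apple Silicon

faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model,
)
```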

### Evaluation Metrics

`Flow-Judge-v0.1` was trained to handle any custom metric that can be expressed as a combination of evaluation criteria and rubric, and required inputs and outputs.

#### Pre-defined Metrics

@@ -114,21 +188,20 @@ For efficient processing of multiple inputs, you can use the `batch_evaluate` method
```python
# Read the sample data
import json
from flow_judge import Vllm, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Initialize the model
model = Vllm()

# Initialize the judge
faithfulness_judge = FlowJudge(
metric=RESPONSE_FAITHFULNESS_5POINT,
model=model
)

# Load some sample data
with open("sample_data/csr_assistant.json", "r") as f:
data = json.load(f)

@@ -157,13 +230,9 @@ for i, result in enumerate(results):

## Advanced Usage

> [!WARNING]
> There is currently a reported issue with Phi-3 models that produces gibberish outputs with contexts longer than 4096 tokens, including input and output. This issue has recently been fixed in the transformers library, so we recommend using the `Hf()` model configuration for longer contexts at the moment. For more details, refer to: [#33129](https://github.com/huggingface/transformers/pull/33129) and [#6135](https://github.com/vllm-project/vllm/issues/6135)

We are working on adding API-based usage as well as better options for CPU.

### Custom Metrics

