Feat/llamafile: adding llamafile as engine & ModelFactory mechanism rewrite suggestion & haystack parsing/write enhancements #10

Merged
merged 38 commits on Oct 8, 2024

Changes from 25 commits
Commits (38)
7af7b1b
llamafile & modelfactory removal suggestion & haystack parsing
sariola Oct 6, 2024
ca85a43
revert flow_judge.py and haystack.py & make eval_data_types closer to…
sariola Oct 7, 2024
761f3d6
revert haystack notebook & first edits for vllm and hf; combine vllm …
sariola Oct 7, 2024
f4f7124
model configs into engine files; model types and base into common
sariola Oct 7, 2024
fad0f23
changed imports from modelfactory to direct init
sariola Oct 7, 2024
b34a64b
harmonized the configs and provided structure where they are extended…
sariola Oct 7, 2024
4a3ad23
updated notebooks to load the models using the new init
sariola Oct 7, 2024
50c173d
added import checks for the extras and reverted eval_data_types parsing
sariola Oct 7, 2024
afa3660
Create python-package.yml
sariola Oct 7, 2024
3276739
Merge pull request #12 from flowaicom/python-package-workflow
sariola Oct 7, 2024
fe6aa2f
ruff format & isort run
sariola Oct 7, 2024
64c411c
ruff format & isort run & test [dev,vllm,hf,llamafile]
sariola Oct 7, 2024
0aa829a
updated readme & added tests readme with icicle viz
sariola Oct 7, 2024
a0421c2
updated codecov action
sariola Oct 7, 2024
cbf8075
added test results upload to codecov
sariola Oct 7, 2024
7641e03
upgrade actions setup-python to v5
sariola Oct 7, 2024
985ff41
update youtube badge
sariola Oct 7, 2024
ed6ace7
test codecov badge
sariola Oct 7, 2024
edf91ef
python versions badge
sariola Oct 7, 2024
2c630ec
rm py versions badge
sariola Oct 7, 2024
1d028b5
clean up a misplaced title
sariola Oct 7, 2024
84b7e61
realign
sariola Oct 7, 2024
60036bc
chore: update readme
bergr7 Oct 8, 2024
3435564
init fix for extras
sariola Oct 8, 2024
992ca46
chore: executed notebooks + minor update
bergr7 Oct 8, 2024
a54bbee
standardized genparams & non-supported model warning & llamafile quan…
sariola Oct 8, 2024
6a6b772
add torch to llamafile extra as dep
sariola Oct 8, 2024
bad2065
add gpu check into ci flow
sariola Oct 8, 2024
3dd4bf3
fixed tests README
sariola Oct 8, 2024
f0598d3
fixed redundant vllm import error
sariola Oct 8, 2024
1b56a93
fixed model to model_id in vllm and llamafile
sariola Oct 8, 2024
2d7f16b
fixed llamafile args quoting
sariola Oct 8, 2024
5b6719d
fixed metadata file writing to json from jsonl
sariola Oct 8, 2024
5968caa
fixed default model name for Llamafile & tests graph to starburst
sariola Oct 8, 2024
91937b8
small fix in the readme from old usage of Flow-Judge-v0.1_HF to Hf()
sariola Oct 8, 2024
11514bd
testing out llamafile server cleanup from abrupt situations
sariola Oct 8, 2024
c309fc9
fixed gen params passing for vllm and hf init
sariola Oct 8, 2024
ebc9ef6
change vllm default param dtype back to bfloat16 from auto
sariola Oct 8, 2024
48 changes: 48 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,48 @@
name: Python package

on:
  push:
    branches: [ "feat/llamafile" ]
  pull_request:
    branches: [ "feat/llamafile" ]

jobs:
  build:
    runs-on: self-hosted
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install .[dev,vllm,hf,llamafile]
      - name: Lint with ruff
        run: |
          ruff check . || true
      - name: Format with black
        run: |
          black --check --diff . || true
      - name: Sort imports with isort
        run: |
          isort --check-only --diff . || true
      - name: Test with pytest and generate coverage
        run: |
          pytest --cov=./ --junitxml=junit.xml
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          fail_ci_if_error: true
      - name: Upload test results to Codecov
        if: ${{ !cancelled() }}
        uses: codecov/test-results-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
5 changes: 5 additions & 0 deletions .gitignore
@@ -45,3 +45,8 @@ output/

# data
data/

.cache

flake.nix
flake.lock
109 changes: 89 additions & 20 deletions README.md
@@ -1,4 +1,5 @@
# `flow-judge`

<p align="center">
<img src="img/flow_judge_banner.png" alt="Flow Judge Banner">
</p>
@@ -16,6 +17,27 @@
<code>flow-judge</code> is a lightweight library for evaluating LLM applications with <code>Flow-Judge-v0.1</code>.
</p>

<p align="center">
<a href="https://github.com/flowaicom/flow-judge/stargazers/" target="_blank">
<img src="https://img.shields.io/github/stars/flowaicom/flow-judge?style=social&label=Star&maxAge=2592000" alt="GitHub stars">
</a>
<a href="https://github.com/flowaicom/flow-judge/releases" target="_blank">
<img src="https://img.shields.io/github/v/release/flowaicom/flow-judge?color=white" alt="Release">
</a>
<a href="https://www.youtube.com/@flowaicom" target="_blank">
<img alt="YouTube Channel Views" src="https://img.shields.io/youtube/channel/views/UCo2qL1nIQRHiPc0TF9xbqwg?style=social">
</a>
<a href="https://github.com/flowaicom/flow-judge/actions/workflows/python-package.yml" target="_blank">
<img src="https://github.com/flowaicom/flow-judge/actions/workflows/python-package.yml/badge.svg" alt="Build">
</a>
<a href="https://codecov.io/gh/flowaicom/flow-judge" target="_blank">
<img src="https://codecov.io/gh/flowaicom/flow-judge/branch/feat%2Fllamafile/graph/badge.svg?token=AEGC7W3DGE" alt="Code coverage">
</a>
<a href="https://github.com/flowaicom/flow-judge/blob/main/LICENSE" target="_blank">
<img src="https://img.shields.io/static/v1?label=license&message=Apache%202.0&color=white" alt="License">
</a>
</p>

## Model
`Flow-Judge-v0.1` is an open, small yet powerful language model evaluator trained on a synthetic dataset containing LLM system evaluation data by Flow AI.

@@ -41,22 +63,31 @@ pip install 'flash_attn>=2.6.3' --no-build-isolation
```

Extras available:
- `dev` for development dependencies
- `hf` for Hugging Face Transformers support
- `vllm` for vLLM support
- `dev` to install development dependencies
- `hf` to install Hugging Face Transformers dependencies
- `vllm` to install vLLM dependencies
- `llamafile` to install Llamafile dependencies

## Quick Start

Here's a simple example to get you started:

```python
from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import EvalInput, FlowJudge
from flow_judge import Vllm, Llamafile, Hf, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Create a model using ModelFactory
model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")
# If you are running on an Ampere GPU or newer, create a model using vLLM
model = Vllm()

# If other applications are taking up VRAM, set gpu_memory_utilization to a lower value to reduce memory use.
# model = Vllm(gpu_memory_utilization=0.70)

# If you are not running on an Ampere GPU or newer, create a model using Hugging Face Transformers without flash attention
# model = Hf(flash_attn=False)

# Or create a model using Llamafile if you are not running an NVIDIA GPU (e.g., on Apple Silicon macOS)
# model = Llamafile()

# Initialize the judge
faithfulness_judge = FlowJudge(
@@ -88,14 +119,57 @@ display(Markdown(f"__Feedback:__\n{result.feedback}\n\n__Score:__\n{result.score

## Usage

### Supported Model Types
### Inference Options

The library supports multiple inference backends to accommodate different hardware configurations and performance needs:

1. **vLLM**:
- Best for NVIDIA GPUs with Ampere architecture or newer (e.g., RTX 3000 series, A100, H100)
- Offers the highest performance and throughput
- Requires CUDA-compatible GPU

```python
from flow_judge import Vllm

model = Vllm()
```

2. **Hugging Face Transformers**:
- Compatible with a wide range of hardware, including older NVIDIA GPUs
- Supports CPU inference (slower but universally compatible)
- Slower than vLLM, but compatible with a broader range of hardware

If you are running on an Ampere GPU or newer:
```python
from flow_judge import Hf

model = Hf()
```

If you are not running on an Ampere GPU or newer, disable flash attention:
```python
from flow_judge import Hf

model = Hf(flash_attn=False)
```

3. **Llamafile**:
- Ideal for non-NVIDIA hardware, including Apple Silicon
- Provides good performance on CPUs
- Self-contained and easy to deploy

```python
from flow_judge import Llamafile

model = Llamafile()
```

Choose the inference backend that best matches your hardware and performance requirements. The library provides a unified interface for all these options, making it easy to switch between them as needed.
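As an illustration of this unified interface, here is a minimal sketch (not part of the diff; it only reuses the `Vllm`, `Hf`, `Llamafile`, and `FlowJudge` classes shown above, and the backend-selection flag is a hypothetical convenience) showing how the same judge can be built on top of any backend:

```python
from flow_judge import Vllm, Hf, Llamafile, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Hypothetical switch for illustration: "vllm", "hf", or "llamafile"
USE_BACKEND = "hf"

if USE_BACKEND == "vllm":
    model = Vllm()                  # Ampere or newer NVIDIA GPU
elif USE_BACKEND == "hf":
    model = Hf(flash_attn=False)    # older GPUs or CPU inference
else:
    model = Llamafile()             # non-NVIDIA hardware, e.g. Apple Silicon

# The judge is constructed the same way regardless of the backend.
judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=model)
```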

- Hugging Face Transformers (`hf_transformers`)
- vLLM (`vllm`)

### Evaluation Metrics

Flow-Judge-v0.1 was trained to handle any custom metric that can be expressed as a combination of evaluation criteria and rubric, and required inputs and outputs.
`Flow-Judge-v0.1` was trained to handle any custom metric that can be expressed as a combination of evaluation criteria and rubric, and required inputs and outputs.

#### Pre-defined Metrics

@@ -114,21 +188,20 @@ For efficient processing of multiple inputs, you can use the `batch_evaluate` method:
```python
# Read the sample data
import json
from flow_judge.models.model_factory import ModelFactory
from flow_judge.flow_judge import EvalInput, FlowJudge
from flow_judge import Vllm, EvalInput, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
from IPython.display import Markdown, display

# Create a model using ModelFactory
model = ModelFactory.create_model("Flow-Judge-v0.1-AWQ")
# Initialize the model
model = Vllm()

# Initialize the judge
faithfulness_judge = FlowJudge(
metric=RESPONSE_FAITHFULNESS_5POINT,
model=model
)

# Load data
# Load some sample data
with open("sample_data/csr_assistant.json", "r") as f:
data = json.load(f)

@@ -157,13 +230,9 @@ for i, result in enumerate(results):

## Advanced Usage

### Model configurations
> [!WARNING]
> There is a reported issue with Phi-3 models that produces gibberish outputs for contexts longer than 4096 tokens, including input and output. This issue has recently been fixed in the transformers library, so for longer contexts we currently recommend the `Flow-Judge-v0.1_HF` model configuration. For more details, refer to [#33129](https://github.com/huggingface/transformers/pull/33129) and [#6135](https://github.com/vllm-project/vllm/issues/6135)
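In practice, the workaround above amounts to using the Hugging Face backend for long inputs. A minimal sketch, assuming the `Hf` and `FlowJudge` classes from the Quick Start (the variable names are illustrative only):

```python
from flow_judge import Hf, FlowJudge
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# For prompts approaching or exceeding ~4096 tokens, prefer the HF backend.
model = Hf()  # add flash_attn=False on pre-Ampere GPUs
long_context_judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=model)
```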

We currently support vLLM engine (recommended) and Hugging Face Transformers.

We are working on adding API-based usage as well as better options for CPU.

### Custom Metrics
