From dc562bc40e9797300347982a720ccdfc9db1ff8a Mon Sep 17 00:00:00 2001
From: elenasamuylova <67064421+elenasamuylova@users.noreply.github.com>
Date: Wed, 11 Sep 2024 13:24:20 +0100
Subject: [PATCH] Docs for v0.4.37 (#1297)

---
 docs/book/SUMMARY.md                          |   5 +-
 .../customization/add-custom-descriptor.md    | 110 ++++++++++++++++++
 .../add-custom-metric-or-test.md              |  13 ++-
 docs/book/customization/llm_as_a_judge.md     |  12 +-
 docs/book/evaluations/no_code_evals.md        |  14 ++-
 docs/book/examples/cookbook_llm_judge.md      |   1 +
 .../cookbook_llm_regression_testing.md        |  75 ++++++------
 docs/book/get-started/cloud_quickstart_llm.md |  16 +--
 docs/book/installation/cloud_account.md       |  20 ++--
 docs/book/installation/self_hosting.md        |  52 ++++++---
 docs/book/reference/all-metrics.md            |  13 ++-
 docs/book/tests-and-reports/output_formats.md |   6 +-
 .../tests-and-reports/text-descriptors.md     |  16 ++-
 13 files changed, 259 insertions(+), 94 deletions(-)
 create mode 100644 docs/book/customization/add-custom-descriptor.md

diff --git a/docs/book/SUMMARY.md b/docs/book/SUMMARY.md
index 4c94b65067..bb78e515c8 100644
--- a/docs/book/SUMMARY.md
+++ b/docs/book/SUMMARY.md
@@ -80,11 +80,12 @@
   * [Feature importance in data drift](customization/feature-importance.md)
   * [Text evals with LLM-as-judge](customization/llm_as_a_judge.md)
   * [Text evals with HuggingFace](customization/huggingface_descriptor.md)
+  * [Add a custom text descriptor](customization/add-custom-descriptor.md)
+  * [Add a custom drift method](customization/add-custom-drift-method.md)
+  * [Add a custom Metric or Test](customization/add-custom-metric-or-test.md)
   * [Customize JSON output](customization/json-dict-output.md)
   * [Show raw data in Reports](customization/report-data-aggregation.md)
   * [Add text comments to Reports](customization/text-comments.md)
-  * [Add a custom drift method](customization/add-custom-drift-method.md)
-  * [Add a custom Metric or Test](customization/add-custom-metric-or-test.md)
   * [Change color
schema](customization/options-for-color-schema.md) * [How-to guides](how-to-guides/README.md) diff --git a/docs/book/customization/add-custom-descriptor.md b/docs/book/customization/add-custom-descriptor.md new file mode 100644 index 0000000000..e0377b60b4 --- /dev/null +++ b/docs/book/customization/add-custom-descriptor.md @@ -0,0 +1,110 @@ +--- +description: How to add custom text descriptors. +--- + +You can implement custom row-level evaluations for text data that you will later use just like any other descriptor across Metrics and Tests. You can implement descriptors that use a single column or two columns. + +Note that if you want to use LLM-based evaluations, you can write custom prompts using [LLM judge templates](llm_as_a_judge.md). + +# Code example + +Refer to a How-to example: + +{% embed url="https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_use_llm_judge_template.ipynb" %} + +# Custom descriptors + +Imports: + +```python +from evidently.descriptors import CustomColumnEval, CustomPairColumnEval +``` + +## Single column descriptor + +You can create a custom descriptor that will take a single column from your dataset and run a certain evaluation for each row. + +**Implement your evaluation as a Python function**. It will take a pandas Series as input and return a transformed Series. + +Here, the `is_empty_string_callable` function takes a column of strings and returns an "EMPTY" or "NON EMPTY" outcome for each. + +```python +def is_empty_string_callable(val1): + return pd.Series(["EMPTY" if val == "" else "NON EMPTY" for val in val1], index=val1.index) +``` + +**Create a custom descriptor**. Create an example of `CustomColumnEval` class to wrap the evaluation logic into an object that you can later use to process specific dataset input. 
+ +```python +empty_string = CustomColumnEval( + func=is_empty_string_callable, + feature_type="cat", + display_name="Empty response" +) +``` + +Where: +* `func: Callable[[pd.Series], pd.Series]` is a function that returns a transformed pandas Series. +* `display_name: str` is the new descriptor's name that will appear in Reports and Test Suites. +* `feature_type` is the type of descriptor that the function returns (`cat` for categorical, `num` for numerical) + +**Apply the new descriptor**. To create a Report with a new Descriptor, pass it as a `column_name` to the `ColumnSummaryMetric`. This will compute the new descriptor for all rows in the specified column and summarize its distribution: + +```python +report = Report(metrics=[ + ColumnSummaryMetric(column_name=empty_string.on("response")), +]) +``` + +Run the Report on your `df` dataframe as usual: + +```python +report.run(reference_data=None, + current_data=df) +``` + +## Double column descriptor + +You can create a custom descriptor that will take two columns from your dataset and will run a certain evaluation for each row. (For example, for pairwise evaluators). + +**Implement your evaluation as a Python function**. Here, the `exact_match_callable` function takes two columns and checks whether each pair of values is the same, returning "MATCH" if they are equal and "MISMATCH" if they are not. + +```python +def exact_match_callable(val1, val2): + return pd.Series(["MATCH" if val else "MISMATCH" for val in val1 == val2]) +``` + +**Create a custom descriptor**. Create an example of the `CustomPairColumnEval` class to wrap the evaluation logic into an object that you can later use to process two named columns in a dataset. 
+ +```python +exact_match = CustomPairColumnEval( + func=exact_match_callable, + first_column="response", + second_column="question", + feature_type="cat", + display_name="Exact match between response and question" +) +``` + +Where: + +* `func: Callable[[pd.Series, pd.Series], pd.Series]` is a function that returns a transformed pandas Series after evaluating two columns. +* `first_column: str` is the name of the first column to be passed into the function. +* `second_column: str` is the name of the second column to be passed into the function. +* `display_name: str` is the new descriptor's name that will appear in Reports and Test Suites. +* `feature_type` is the type of descriptor that the function returns (`cat` for categorical, `num` for numerical). + +**Apply the new descriptor**. To create a Report with a new Descriptor, pass it as a `column_name` to the ColumnSummaryMetric. This will compute the new descriptor for all rows in the dataset and summarize its distribution: + +```python +report = Report(metrics=[ + ColumnSummaryMetric(column_name=exact_match.as_column()) +]) +``` + +Run the Report on your `df` dataframe as usual: + +```python +report.run(reference_data=None, + current_data=df) +``` diff --git a/docs/book/customization/add-custom-metric-or-test.md b/docs/book/customization/add-custom-metric-or-test.md index cf1570da45..a0e8cada43 100644 --- a/docs/book/customization/add-custom-metric-or-test.md +++ b/docs/book/customization/add-custom-metric-or-test.md @@ -1,4 +1,13 @@ -There are two ways to add a custom Metric to Evidently. +There are two ways to add a custom Metric or Test to Evidently: +* Add it as a Python function (Recommended). +* Implement a custom metric with custom Plotly render. + +Implementing a new Metric or Test means that you implement a completely custom column- or dataset-level evaluation. 
+ +There are other ways to customize your evaluations that do not require creating Metrics or Tests from scratch: +* Add a custom descriptor for row-level evaluations. Read on [adding custom text descriptors](add-custom-descriptor.md). +* Write a custom LLM-based evaluator using templates. Read on [designing LLM judges](llm_as_a_judge.md). +* Add a custom data drift detection method, re-using the existing Data Drift metric render. Read on [drift method customization](add-custom-drift-method.md) option. # 1. Add a new Metric or Test as a Python function. (Recommended). @@ -9,8 +18,6 @@ This is a recommended path to add custom Metrics. Using this method, you can sen Example notebook: {% embed url="https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_build_metric_over_python_function.ipynb" %} -**Note**: if you want to add a custom data drift method, there is a separate [drift method customization](add-custom-drift-method.md) option. In this case, you will re-use the existing render. - # 2. Implement a new Metric and Test from scratch. You can also implement a new Metric or Test from scratch, defining both the calculation method and the optional visualization. diff --git a/docs/book/customization/llm_as_a_judge.md b/docs/book/customization/llm_as_a_judge.md index 7cf7ab07ba..e34797ff71 100644 --- a/docs/book/customization/llm_as_a_judge.md +++ b/docs/book/customization/llm_as_a_judge.md @@ -31,7 +31,7 @@ You can use built-in evaluators that include pre-written prompts for specific cr **Imports**. Import the `LLMEval` and built-in evaluators you want to use: ```python -from evidently.descriptors import LLMEval, NegativityLLMEval, PIILLMEval, DeclineLLMEval +from evidently.descriptors import LLMEval, NegativityLLMEval, PIILLMEval, DeclineLLMEval, BiasLLMEval, ToxicityLLMEval, ContextQualityLLMEval ``` **Get a Report**. 
To create a Report, simply list them like any other descriptor: @@ -58,6 +58,16 @@ report = Report(metrics=[ ]) ``` +**Run descriptors over two columns**. An evaluator that assesses if the context contains enough information to answer the question requires both columns. Run the evaluation over the `context` column and pass the name of the column containing the `question` as a parameter. + +```python +report = Report(metrics=[ + TextEvals(column_name="context", descriptors=[ + ContextQualityLLMEval(question="question"), + ]) +]) +``` + {% hint style="info" %} **Which descriptors are there?** See the list of available built-in descriptors in the [All Metrics](../reference/all-metrics.md) page. {% endhint %} diff --git a/docs/book/evaluations/no_code_evals.md b/docs/book/evaluations/no_code_evals.md index 6191dcb684..86f6d37cf8 100644 --- a/docs/book/evaluations/no_code_evals.md +++ b/docs/book/evaluations/no_code_evals.md @@ -6,11 +6,11 @@ The platform supports several evaluations directly from the user interface. | Name | Type | Description | |-------------------------|------------|------------------------------------------------------------------------------------------------------------------------------------------| -| Text Evals | Report | Analyze texts using methods from regular expressions to LLM judges. | -| Data Quality | Report | Get descriptive statistics and distribution overviews for all columns. | +| Text Evals | Report | Analyze text data, from regular expressions to LLM judges. | +| Data Quality | Report | Get descriptive statistics and distributions for all columns. | | Classification Quality | Report | Evaluate the quality of a classification model. | | Regression Quality | Report | Evaluate the quality of a regression model. | -| Data Quality Tests | Test Suite | Automatically check for issues like missing values, duplicates, etc. | +| Data Quality Tests | Test Suite | Automatically check for missing values, duplicates, etc. 
| Before you start, pick a dataset to evaluate. For example, this could be a CSV file containing inputs and outputs of your AI system, like chatbot logs. @@ -77,6 +77,10 @@ Select specific checks one by one: Each evaluation result is called a **Descriptor**. No matter the method, you’ll get a label or score for every evaluated text. Some, like “Sentiment,” work instantly, while others may need setup. +{% hint style="info" %} +**What other evaluators are there?** Check the list of Descriptors on the [All Metrics](../reference/all-metrics.md) page. +{% endhint %} + Here are few examples of Descriptors and how to configure them: ## Words presence @@ -111,8 +115,8 @@ For a binary classification template, you can configure: * **Target/Non-target Category**: labels you want to use. * **Uncertain Category**: how the model should respond when it can’t decide. * **Reasoning**: choose to include explanation (Recommended). -* **Category** and/or **Score**: have the LLM respond with the category (Recommended) or also return a score. -* **Visualize as**: when both Category and Score are computed, choose which to display in the report. +* **Category** and/or **Score**: have the LLM respond with the category (Recommended) or score. +* **Visualize as**: when both Category and Score are computed, choose which to display in the Report. {% hint style="info" %} **What other evaluators are there?** Check the list of Descriptors on the [All Metrics](../reference/all-metrics.md) page. diff --git a/docs/book/examples/cookbook_llm_judge.md b/docs/book/examples/cookbook_llm_judge.md index f1584f584e..aa948d48ed 100644 --- a/docs/book/examples/cookbook_llm_judge.md +++ b/docs/book/examples/cookbook_llm_judge.md @@ -365,6 +365,7 @@ verbosity_report.datasets().current ``` Preview: + ![](../.gitbook/assets/cookbook/llmjudge_verbosity_examples.png) Don't fully agree with the results? 
Use these labels as a starting point, and correct the decision where you see fit - now you've got your golden dataset! Next, iterate on your judge prompt. diff --git a/docs/book/examples/cookbook_llm_regression_testing.md b/docs/book/examples/cookbook_llm_regression_testing.md index d8a501b6a2..3c8c6974be 100644 --- a/docs/book/examples/cookbook_llm_regression_testing.md +++ b/docs/book/examples/cookbook_llm_regression_testing.md @@ -2,22 +2,20 @@ description: How to run regession testing for LLM outputs. --- -In this tutorial, we'll show you how to run regression testing for LLM outputs: for example, compare new and old responses to the same inputs after you change a prompt, model, or anything else in your system. - -When you make changes, you may want to re-run the same inputs and compare the results to find which ones changed significantly. This will allow you to push your updates to production with confidence or find issues to address. +In this tutorial, we’ll show you how to do regression testing for LLM outputs. You’ll learn how to compare new and old responses after changing a prompt, model, or anything else in your system. By re-running the same inputs, you can spot any significant changes. This helps you push updates with confidence or identify issues to fix. # Tutorial scope Here's what we'll do: -* **Create a toy dataset**. Create a toy Q&A dataset with answers nd reference responses. -* **Imitate generating new answers**. Imitate generating new answers to the same question we want to compare. -* **Create and run a Test Suite**. We'll compare the answers using the LLM-as-a-judge approach to evaluate length, correctness and style match. -* **Get a monitoring Dashboard**. Build a dashboard to monitor the results of Tests over time. +* **Create a toy dataset**. Build a small Q&A dataset with answers and reference responses. +* **Get new answers**. Imitate generating new answers to the same question we want to compare. +* **Create and run a Test Suite**. 
Compare the answers using LLM-as-a-judge to evaluate length, correctness and style match. +* **Build a monitoring Dashboard**. Get plots to track the results of Tests over time. To complete the tutorial, you will need: * Basic Python knowledge.  * An OpenAI API key to use for the LLM evaluator. -* An Evidently Cloud account to track test results. If not yet, [sign up](../installation/cloud_account.md) for a free account. +* An Evidently Cloud account to track test results. If not yet, [sign up](https://www.evidentlyai.com/register) for a free account. Use the provided code snippets or run a sample notebook. @@ -26,8 +24,6 @@ Jupyter notebook: Or click to [open in Colab](https://colab.research.google.com/github/evidentlyai/community-examples/blob/main/tutorials/Regression_testing_with_debugging.ipynb). -We will work with a toy dataset, which you can replace with your production data or calls to your LLM app to generate a new set of answers for the input data. - # 1. Installation and Imports Install Evidently: @@ -101,7 +97,7 @@ project.save() # 3. Prepare the Dataset -Create a dataset with questions and reference answers. This data will be used to compare the new responses of your LLM : +Create a dataset with questions and reference answers. We'll later compare the new LLM responses against them: ```python data = [ @@ -143,7 +139,7 @@ report.run(reference_data=None, report ``` -If you run the example in a non-interactive Python environment, call `report.as_dict()` or `report.json()` instead. +If you work in a non-interactive Python environment, call `report.as_dict()` or `report.json()` instead. Here is the distribution of text length: @@ -186,25 +182,23 @@ Here is the resulting dataset with the added new column: ![](../.gitbook/assets/cookbook/llmregesting_text_new.png) {% hint style="info" %} -**How to run it in production?** In practice, replace this step with calling your LLM app to score the inputs. After you get the new responses, add them to a DataFrame. 
You can also use our **tracing** library to instrument your app or function and collect traces, that will be automatically converted into a tabular dataset. Evidently will then let you pull a tabular dataset with collected traces. Check the tutorial that shows this [tracing workflow](../examples/tutorial_tracing.md). +**How to run it in production?** In practice, replace this step with calling your LLM app to score the inputs. After you get the new responses, add them to a DataFrame. You can also use our **tracing** library to instrument your app and get traces as a tabular dataset. Check the tutorial with [tracing workflow](../examples/tutorial_tracing.md). {% endhint %} # 5. Design the Test suite -Now, you must decide on the evaluation metrics. The goal is to compare if the new answers are different from the old ones. +To compare new answers with old ones, we need evaluation metrics. You can use deterministic or embeddings-based metrics like SemanticSimilarity. However, you often need more custom criteria. Using LLM-as-a-judge is useful for this, letting you define what to detect. -You can use some deterministic or embeddings-based metrics like SemanticSimilarity, but often, you'd want to make the comparison using more specific criteria you control. In this case, using LLM-as-a-judge is a valuable approach that allows you to formulate custom criteria on what you want to detect. +Let’s design our Tests: +* **Length check**. All new responses must be between 80 and 200 symbols. +* **Correctness**. All new responses should give the same answer without contradictions. +* **Style**. All new responses should match the style of the reference. -Let's say we want to Test that: -* All responses must be within the expected length. Looking at our previous responses, we've set it as between 80 and 200 symbols. -* All new responses must give essentially the same answer without contradiction. We will call this criterion CORRECTNESS and implement it using LLM-as-a-judge. 
-* That all new responses are the same in style compared to the reference. We will call this criterion STYLE and implement it using LLM-as-a-judge. - -Evidently has a few built-in LLM-based evaluators, but this time we'll write our custom LLM judges. +Text length is easy to check, but for Correctness and Style checks, we'll write our custom LLM judges. ## Correctness judge -Here is how we implement the correctness evaluator, using a built-in Evidenty template for binary classification. We are asking LLM to classify every response as correct or not based on what is in the `{target_response}` column and give a reasoning for that decision. +We implement the correctness evaluator, using an Evidenty template for binary classification. We ask the LLM to classify each response as correct or incorrect based on the {target_response} column and provide reasoning for its decision. ```python correctness_eval= LLMEval( @@ -233,7 +227,7 @@ REFERENCE: ) ``` -We strongly recommend splitting evaluation criteria into individual judges and using more straightforward grading scale like binary classifiers for reliability. +We recommend splitting each evaluation criterion into separate judges and using a simple grading scale, like binary classifiers, for better reliability. {% hint style="info" %} **Don't forget to evaluate your judge!** Each LLM evaluator is a small ML system you should tune to align with your preferences. We recommend running a couple of iterations to tune it. Check the tutorial on [creating LLM judges](cookbook_llm_judge.md). @@ -245,7 +239,7 @@ We strongly recommend splitting evaluation criteria into individual judges and u ## Style judge -Using a similar approach, we will create a judge for style. We add additonal clarifications to specify what we mean by style match. +Using a similar approach, we'll create a judge for style. We'll also add clarifications to define what we mean by a style match. 
```python style_eval= LLMEval( @@ -282,13 +276,12 @@ You must focus only on STYLE. Ignore any differences in contents. ## Complete Test Suite -Now, we can create a Test Suite that incorporates the correctness evaluation, style matching, and text length checks. -* To formulate what we Test, we pick an appropriate Evidently column-level Test like `TestCategoryCount` and `TestShareOfOutRangeValues`. (You can pick other Tests, like `TestColumnValueMin` , `TestColumnValueMean`, etc.) -* Set additional parameters for each Test, if applicable. For example, we use `left` and `right` parameters to set the allowed range for Text Length. -* To set the Test condition, we use parameters like `gt` (greater than), `lt` (less than), `eq` (equal), etc. -* You can also identify some non-critical Tests, as we do with the Style match check. If such Test fails, it will give a Warning, but not an Error. This helps divide it visually on monitoring Panels and to set up alerting only for Fails. +Now, we can create a Test Suite that includes checks for correctness, style matching, and text length. +* **Choose Tests**. We select Evidently column-level tests like `TestCategoryCount` and `TestShareOfOutRangeValues`. (You can pick other Tests, like `TestColumnValueMin` or `TestColumnValueMean`). +* **Set Parameters and Conditions**. Some Tests require parameters: for example, `left` and `right` to set the allowed range for Text Length. For Test fail conditions, use parameters like `gt` (greater than), `lt` (less than), `eq` (equal), etc. +* **Set non-critical Tests**. Identify non-critical Tests, like the style match check, to trigger warnings instead of fails. This helps visually separate them on monitoring panels and set alerts only for critical failures. -Here is how we formulate the complete Test Suite. We reference two of our judges, `style_eval` and `correctness_eval,` and point out that they should apply to the `response` column in our dataset. 
For Text Length, we use a built-in `TextLength()` descriptor. +We reference our two LLM judges, `style_eval` and `correctness_eval`, and apply them to the `response` column in our dataset. For text length, we use the built-in `TextLength()` descriptor for the same column. ```python test_suite = TestSuite(tests=[ @@ -310,14 +303,14 @@ test_suite = TestSuite(tests=[ ]) ``` -Note that in this example, we always expect the share of failures to be zero using the `eq=0` condition. For example, you can swap the Test condition for text length for `lte=0.1`, which would stand for "less than 10%". In this case, the Test will fail if > 10% of rows in the dataset are longer than the set range. +In this example, we expect the share of failures to be zero using the `eq=0` condition. You can adjust this, such as using lte=0.1, which means "less than 10%". This would cause the Test to fail if more than 10% of rows are out of the set length range. Allowing some share of Tests to fail is convenient for real-world applications. You can add additional Tests as you see fit for regular expressions, word presence, etc. and Tests for other columns in the same Test Suite. {% hint style="info" %} -**Understand Tests**. Learn how to set [custom Test conditions](../tests-and-reports/run-tests.md) and use [Tests with text Descriptors](../tests-and-reports/text-descriptors.md). See the list of [All tests](../reference/all-tests.md). +**Understand Tests**. Learn how to set [Test conditions](../tests-and-reports/run-tests.md) and use [Tests with text data](../tests-and-reports/text-descriptors.md). See the list of [All tests](../reference/all-tests.md). {% endhint %} {% hint style="info" %} @@ -328,13 +321,13 @@ You can add additional Tests as you see fit for regular expressions, word presen Now that our Test Suite is ready - let's run it! 
-To apply this Test Suite to the `eval_data`, we prepared earlier: +To apply this Test Suite to the `eval_data` that we prepared earlier: ```python test_suite.run(reference_data=None, current_data=eval_data) ``` -This will compute the Test Suite! But how do you see it? You can actually preview the results directly in your Python notebook (call `test_suite`), but we'll now send it to Evidently Cloud, together with the scored data: +This will compute the Test Suite: but how do you see it? You can preview the results in your Python notebook (call `test_suite`). However, we’ll now send it to Evidently Cloud along with the scored data: ```python ws.add_test_suite(project.id, test_suite, include_data=True) @@ -342,9 +335,9 @@ ws.add_test_suite(project.id, test_suite, include_data=True) Including data is optional but useful for most LLM use cases since you'd want to see not just the aggregate Test results but also the raw texts to debug when Tests fail. -Now, navigate to the Evidently Platform UI. Go to the ([Home Page](https://app.evidently.cloud/)), enter your Project, and find the "Test Suites" section in the left menu. As you enter it, you will see the resulting Test Suite that you can Explore. Once you open this view, you will see both the summary Test results and the Dataset with the added scores and explanations. +To view the results, navigate to the Evidently Platform. Go to the ([Home Page](https://app.evidently.cloud/)), enter your Project, and find the "Test Suites" section in the left menu. Here, you'll see the Test Suite you can explore. -You can browse your Dataset view by selecting individual Columns to zoom in on the results of specific evaluations. For example, you might want to sort all columns by Text Length to find the one that is out of bounds or to find the style-mismatched answers as labeled by the LLM judge. +You'll find both the summary Test results and the Dataset with added scores and explanations. 
You can zoom in on specific evaluations, such as sorting the data by Text Length or finding rows labeled as "incorrect" or "style-mismatched". ![](../.gitbook/assets/cookbook/llmregtesting_test_1.png) @@ -402,13 +395,13 @@ If you go and open the new Test Suite results, you can again explore the outcome # 8. Get a Dashboard -You can continue running Test Suites in this manner. As you run multiple Test Suites, you may want to track their results over time clearly. +You can continue running Test Suites in this manner. As you run multiple, you may want to track Test results over time. -You can easily create a Dashboard to track Test outcomes over time. You can edit the monitoring Panels in the UI or programmatically. Let's create a couple of Panels using Dashboards as a code approach. +You can easily add this to a Dashboard, both in UI or programmatically. Let's create a couple of Panels using Dashboards as a code approach. The following code will add: -* A counter panel to show the SUCCESS rate of the latest test run. -* A test monitoring panel to show all test results over time. +* A counter panel to show the SUCCESS rate of the latest Test run. +* A test monitoring panel to show all Test results over time. ```python project.dashboard.add_panel( @@ -434,7 +427,7 @@ project.dashboard.add_panel( project.save() ``` -When you navigate to the UI, you will now see the Dashboard, which shows a summary of Test results (Success, Failure, and Warning) for each Test Suite we ran. As you add more Tests to the same Project, the Panels will be automatically updated to show new Test results. +When you navigate to the UI, you will now see a Panel which shows a summary of Test results (Success, Failure, and Warning) for each Test Suite we ran. As you add more Tests to the same Project, the Panels will be automatically updated to show new Test results. 
![](../.gitbook/assets/cookbook/llmregesting_test_dashboard.png) diff --git a/docs/book/get-started/cloud_quickstart_llm.md b/docs/book/get-started/cloud_quickstart_llm.md index 59ec29a735..d3c03d5050 100644 --- a/docs/book/get-started/cloud_quickstart_llm.md +++ b/docs/book/get-started/cloud_quickstart_llm.md @@ -97,9 +97,9 @@ You have two options: {% tab title="Only local methods" %} **Define your evals**. You will evaluate all "Answers" for: -* Sentiment: from -1 for negative to 1 for positive -* Text length: character count -* Presence of "sorry" or "apologize": True/False +* Sentiment: from -1 for negative to 1 for positive. +* Text length: character count. +* Presence of "sorry" or "apologize": True/False. ```python text_evals_report = Report(metrics=[ @@ -118,7 +118,7 @@ text_evals_report.run(reference_data=None, current_data=evaluation_dataset) {% tab title="LLM as a judge" %} -**Set the OpenAI key** (it's best to set it as an environment variable). [See Open AI docs](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) for tips. +**Set the OpenAI key**. It's best to set an environment variable: [see Open AI docs](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) for tips. ```python ## import os @@ -126,9 +126,9 @@ text_evals_report.run(reference_data=None, current_data=evaluation_dataset) ``` **Define your evals**. Evaluate all "Answers" for: -* Sentiment: from -1 for negative to 1 for positive) -* Text length: character count -* Whether the chatbot denied an answer: returns "OK" or "Denial" labels with explanations. This uses LLM-as-a-judge (defaults to `gpt-4o-mini`) with a template Evidently prompt. +* Sentiment: from -1 for negative to 1 for positive. +* Text length: character count. +* Whether the chatbot denied an answer: returns "OK" / "Denial" labels with explanations. This uses LLM-as-a-judge (defaults to `gpt-4o-mini`) with a template Evidently prompt. 
```python text_evals_report = Report(metrics=[ @@ -157,7 +157,7 @@ Each evaluation is a `descriptor`. You can choose from multiple built-in evaluat ws.add_report(project.id, text_evals_report, include_data=True) ``` -**View the Report** on Evidently Cloud. Go to the [home page](https://app.evidently.cloud/)), open your Project, and navigate to "Reports" in the left. +**View the Report**. Go to [Evidently Cloud](https://app.evidently.cloud/), open your Project, and navigate to "Reports" in the left. You will see the scores summary, and the dataset with new descriptor columns. For example, you can sort to find all answers with "Denials". diff --git a/docs/book/installation/cloud_account.md b/docs/book/installation/cloud_account.md index 9d58d96446..54a70a02db 100644 --- a/docs/book/installation/cloud_account.md +++ b/docs/book/installation/cloud_account.md @@ -2,17 +2,15 @@ description: How to set up Evidently Cloud account. --- -# Set Up Evidently Cloud - -## 1. Create an Account +# 1. Create an Account If not yet, [sign up for a free Evidently Cloud account](https://app.evidently.cloud/signup). -## 2. Create an Organization +# 2. Create an Organization After logging in, create an **Organization** and name it. -## 3. Create a Team +# 3. Create a Team Go to the **Teams** icon in the left menu, create a Team, and name it. ([Team page](https://app.evidently.cloud/teams)). @@ -20,19 +18,19 @@ Go to the **Teams** icon in the left menu, create a Team, and name it. ([Team pa **Do I always need a Team?** Yes. Every Project must be within a Team. Teams act as "folders" to organize your work, and you can create multiple Teams. If you work alone, simply create a Team without external users. {% endhint %} -# Connect from Python +# 4. Connect from Python You will need an access token to interact with Evidently Cloud from your Python environment. {% hint style="info" %} -**Does every user need this?** No. 
You only need a token if you’re setting up data uploads or running evaluations in Python. If you’re only viewing data and dashboards or want to upload data as CSV and run no-code evaluations, you don’t need a token. +**Do I always need this?** No, only for data uploads or to run evaluations in Python. You can view data, edit dashboards, upload CSVs and run no-code evaluations without the token. {% endhint %} -## 4. Get a Token +## Get a Token Click the **Key** icon in the left menu to open the ([Token page](https://app.evidently.cloud/token)). Generate and save the token securely. -## 5. Connect from Python +## Install Evidently To connect to the Evidently Cloud from Python, first [install the Evidently Python library](install-evidently.md). @@ -40,6 +38,8 @@ To connect to the Evidently Cloud from Python, first [install the Evidently Pyth pip install evidently ``` +## Connect + Import the cloud workspace and pass your API token to connect: ```python @@ -50,4 +50,4 @@ token="YOUR_TOKEN_HERE", url="https://app.evidently.cloud") ``` -Now, you are all set to start using Evidently Cloud! Choose your [next step](../get-started/quickstart-cloud.md). +Now, you are all set to start using Evidently Cloud! Create your first Project and choose your [next step](../get-started/quickstart-cloud.md). diff --git a/docs/book/installation/self_hosting.md b/docs/book/installation/self_hosting.md index fd7ea5e3cc..cb7ceca14e 100644 --- a/docs/book/installation/self_hosting.md +++ b/docs/book/installation/self_hosting.md @@ -5,13 +5,17 @@ description: How to self-host the open-source Evidently UI service. In addition to using Evidently Python library, you can self-host the UI Service to get a monitoring Dashboard and organize the results of your evaluations. This is optional: you can also run evaluations and render results directly in Python or export them elsewhere. 
{% hint style="info" %} -**Evidently Cloud.** Sign up for a free [Evidently Cloud](cloud_account.md) account to instantly get a managed version with extra features. +**Evidently Cloud.** Sign up for a free [Evidently Cloud](cloud_account.md) account to get a managed version with extra features. {% endhint %} {% hint style="info" %} -**Evidently Enterprise.** This page describes self-hosting the open-source version of the platform. To get a Enterprise version with support and extra features, [contact us](https://www.evidentlyai.com/get-demo). You can host the platform in your private cloud or on-premises. +**Evidently Enterprise.** This page explains how to self-host the open-source platform. For the Enterprise version with extra features and support, [contact us](https://www.evidentlyai.com/get-demo). Host it in your cloud or on-premises. {% endhint %} +To get a self-hostable Dashboard, you must: +1. Create a Workspace (local or remote) to store your data. +2. Launch the UI service. + # 1. Create a Workspace Once you [install Evidently](install-evidently.md), you will need to create a `workspace`. This designates a remote or local directory where you will store the evaluation results (as JSON `snapshots` of the Evidently `Reports` or `Test Suites`). The UI Service will read the data from this source. @@ -19,12 +23,18 @@ Once you [install Evidently](install-evidently.md), you will need to create a `w There are three scenarios, based on where you run the UI Service and store data. * **Local Workspace**. Both the UI Service and data storage are local. * **Remote Workspace**. Both the UI Service and data storage are remote. -* **Workspace with remote data storage**. You run the UI Service and store data on different servers. +* **Workspace with remote data storage**. You run the UI Service and store data on different servers. ## Local Workspace In this scenario, you generate, store the snapshots and run the monitoring UI on the same machine. 
+Imports: +```python +from evidently.ui.workspace import Workspace +from evidently.ui.workspace import WorkspaceBase +``` + To create a local Workspace and assign a name: ```python @@ -34,13 +44,21 @@ ws = Workspace.create("evidently_ui_workspace") You can pass a `path` parameter to specify the path to a local directory. {% hint style="info" %} -**Code example** [Self-hosting tutorial](../examples/tutorial-monitoring.md) shows a complete Python script to create and populate a local Workspace. +**Code example**. [Self-hosting tutorial](../examples/tutorial-monitoring.md) shows a complete Python script to create and populate a local Workspace. {% endhint %} ## Remote Workspace In this scenario, you send the snapshots to a remote server. You must run the Monitoring UI on the same remote server. It will directly interface with the filesystem where the snapshots are stored. +Imports: + +``` +from evidently.ui.remote import RemoteWorkspace +from evidently.ui.workspace import Workspace +from evidently.ui.workspace import WorkspaceBase +``` + To create a remote Workspace (UI should be running at this address): ```python @@ -74,6 +92,19 @@ FSSPEC_S3_KEY=my_key FSSPEC_S3_SECRET=my_secret evidently ui --workspace s3://my_bucket/workspace ``` +## [DANGER] Delete Workspace + +To delete a Workspace (for example, an empty or a test Workspace), run the command from the Terminal: + +``` +cd src/evidently/ui/ +rm -r workspace +``` + +{% hint style="danger" %} +**You are deleting all the data**. This command will delete the snapshots stored in the folder. To maintain access to the generated snapshots, you must store them elsewhere. +{% endhint %} + # 2. Launch the UI service To launch the Evidently UI service, you must run a command in the Terminal. @@ -97,16 +128,3 @@ evidently ui --workspace ./workspace --port 8080 ``` To view the Evidently interface, go to URL http://localhost:8000 or a specified port in your web browser. 
- -## [DANGER] Delete Workspace - -To delete a Workspace (for example, an empty or a test Workspace), run the command from the Terminal: - -``` -cd src/evidently/ui/ -rm -r workspace -``` - -{% hint style="danger" %} -**You are deleting all the data**. This command will delete the snapshots stored in the folder. To maintain access to the generated snapshots, you must store them elsewhere. -{% endhint %} diff --git a/docs/book/reference/all-metrics.md b/docs/book/reference/all-metrics.md index 25a76e6ceb..d697ba09d1 100644 --- a/docs/book/reference/all-metrics.md +++ b/docs/book/reference/all-metrics.md @@ -255,7 +255,7 @@ DatasetMissingValuesMetric(missing_values=["", 0, "n/a", -9999, None], replace=T # Text Evals -Text Evals only apply to text columns. To compute a Descriptor for a single text column, use a `TextEvals` Preset. +Text Evals only apply to text columns. To compute a Descriptor for a single text column, use a `TextEvals` Preset. Read [docs](../tests-and-reports/text-descriptors.md). You can also explicitly specify the Evidently Metric (e.g., `ColumnSummaryMetric`) to visualize the descriptor, or pick a [Test](all-tests.md) (e.g., `TestColumnValueMin`) to run validations. @@ -292,9 +292,12 @@ Use external LLMs with an evaluation prompt to score text data. (Also known as L | Descriptor | Parameters | | - | - | | **LLMEval()**

Scores the text using the user-defined criteria, automatically formatted in a templated evaluation prompt.| See [docs](../customization/llm_as_a_judge.md) for examples and parameters.| -| **DeclineLLMEval()**

Classifies texts into those containing a refusal or a rejection to do something or not. Returns a label or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| -| **PIILLMEval()**

Classifies texts into those containing PII (Personally Identifiable Information) or not. Returns a label or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| -| **NegativityLLMEval()**

Classifies texts into Positive or Negative. Returns a label or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| +| **DeclineLLMEval()**

Detects texts containing a refusal or a rejection to do something. Returns a label (DECLINE or OK) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| +| **PIILLMEval()**

Detects texts containing PII (Personally Identifiable Information). Returns a label (PII or OK) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| +| **NegativityLLMEval()**

Detects negative texts (containing critical or pessimistic tone). Returns a label (NEGATIVE or POSITIVE) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| +| **BiasLLMEval()**

Detects biased texts (containing prejudice for or against a person or group). Returns a label (BIAS or OK) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| +| **ToxicityLLMEval()**

Detects toxic texts (containing harmful, offensive, or derogatory language). Returns a label (TOXICITY or OK) or score.| See [docs](../customization/llm_as_a_judge.md) for parameters.| +| **ContextQualityLLMEval()**

Evaluates if CONTEXT is VALID (has sufficient information to answer the QUESTION) or INVALID (has missing or contradictory information). Returns a label (VALID or INVALID) or score.| Run the descriptor over the `context` column and pass the `question` column as a parameter. See [docs](../customization/llm_as_a_judge.md) for parameters.| ## Descriptors: Model-based @@ -302,7 +305,7 @@ Use pre-trained machine learning models for evaluation. | Descriptor | Parameters | | - | - | -| **Semantic Similarity()** Example use:
`ColumnSummaryMetric(column_name=SemanticSimilarity().on(["response", "new_response"]))`. | **Required:** **Optional:** | +| **SemanticSimilarity()** Example use:
`SemanticSimilarity(with_column="response")` | **Required:** **Optional:** | | **Sentiment()** | **Required:**
n/a

**Optional:** | | **HuggingFaceModel()**

Scores the text using the user-selected HuggingFace model.| See [docs](../customization/huggingface_descriptor.md) with some example models (classification by topic, emotion, etc.)| | **HuggingFaceToxicityModel()** | **Optional**: | diff --git a/docs/book/tests-and-reports/output_formats.md b/docs/book/tests-and-reports/output_formats.md index 6a788003e8..b9fc287e99 100644 --- a/docs/book/tests-and-reports/output_formats.md +++ b/docs/book/tests-and-reports/output_formats.md @@ -52,6 +52,10 @@ To get the dictionary: drift_report.as_dict() ``` +{% hint style="info" %} +**Include/exclude**. Check how to [manage verbosity](../customization/json-dict-output.md) of `json` or `as_dict` output. +{% endhint %} + # Scored DataFrame If you generated text Descriptors during your evaluation, you can retrieve a DataFrame with all generated descriptors added to each row of your original input data. @@ -67,7 +71,7 @@ This returns the complete original dataset with new scores. You can save the output of a Report or Test Suite as an Evidently JSON `snapshot`. {% hint style="info" %} -**How is a JSON snapshot different from `json()`?**. A snapshot contains all supplementary and render data. This lets you restore the output in any available Evidently format without accessing the initial raw data. +**How is a JSON snapshot different from `json()`?** A snapshot contains all supplementary and render data. This lets you restore the output in any Evidently format (like HTML) without accessing the initial raw data. {% endhint %} This is a rich JSON format used for storing the evaluation results on Evidently platform. When you save Reports or Test Suites to the platform, a snapshot is generated automatically. However, you can also generate and save a snapshot explicitly. 
diff --git a/docs/book/tests-and-reports/text-descriptors.md b/docs/book/tests-and-reports/text-descriptors.md index ccae81f8e9..ab5f4e2b37 100644 --- a/docs/book/tests-and-reports/text-descriptors.md +++ b/docs/book/tests-and-reports/text-descriptors.md @@ -149,6 +149,16 @@ report = Report(metrics=[ ]) ``` +**Multi-column descriptors**. Some Descriptors like `SemanticSimilarity` require a second column. Pass it as a parameter: + +```python +report = Report(metrics=[ + TextEvals(column_name="question", descriptors=[ + SemanticSimilarity(with_column="response") + ]), +]) +``` + Some Descriptors, like custom LLM judges, might require a more complex setup, but you can still include them in the Report just like any other Descriptor. {% hint style="info" %} @@ -159,6 +169,10 @@ Some Descriptors, like custom LLM judges, might require a more complex setup, bu **LLM-as-a-judge**. For a detailed guide on setting up LLM-based evals, check the guide to [LLM as a jugde](../customization/llm_as_a_judge.md). {% endhint %} +{% hint style="info" %} +**Custom descriptors**. You can implement descriptors as Python functions. Check the [guide on custom descriptors](../customization/add-custom-descriptor.md). +{% endhint %} + ## Using Metrics The `TextEvals` Preset works by generating a `ColumnSummaryMetric` for each Descriptor you calculate. You can achieve the same results by explicitly creating this Metric for each Descriptor: @@ -170,7 +184,7 @@ report = Report(metrics=[ ]) ``` -**Semantic Similariy**. To compare Semantic Similarity between two columns, you should use this approach instead of `TextEvals` to be able to process two columns at once. Pass both columns in a list: +For a two-column descriptor like `SemanticSimilarity()`, pass both columns as a list: ```python report = Report(metrics=[