Docs for v0.4.37 (#1297)
elenasamuylova authored Sep 11, 2024
1 parent cdd39cf commit dc562bc
Showing 13 changed files with 259 additions and 94 deletions.
5 changes: 3 additions & 2 deletions docs/book/SUMMARY.md
@@ -80,11 +80,12 @@
* [Feature importance in data drift](customization/feature-importance.md)
* [Text evals with LLM-as-judge](customization/llm_as_a_judge.md)
* [Text evals with HuggingFace](customization/huggingface_descriptor.md)
* [Add a custom text descriptor](customization/add-custom-descriptor.md)
* [Add a custom drift method](customization/add-custom-drift-method.md)
* [Add a custom Metric or Test](customization/add-custom-metric-or-test.md)
* [Customize JSON output](customization/json-dict-output.md)
* [Show raw data in Reports](customization/report-data-aggregation.md)
* [Add text comments to Reports](customization/text-comments.md)
* [Add a custom drift method](customization/add-custom-drift-method.md)
* [Add a custom Metric or Test](customization/add-custom-metric-or-test.md)
* [Change color schema](customization/options-for-color-schema.md)
* [How-to guides](how-to-guides/README.md)

110 changes: 110 additions & 0 deletions docs/book/customization/add-custom-descriptor.md
@@ -0,0 +1,110 @@
---
description: How to add custom text descriptors.
---

You can implement custom row-level evaluations for text data and then use them just like any other descriptor across Metrics and Tests. A custom descriptor can work over a single column or a pair of columns.

Note that if you want to use LLM-based evaluations, you can write custom prompts using [LLM judge templates](llm_as_a_judge.md).

# Code example

Refer to a How-to example:

{% embed url="https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_use_llm_judge_template.ipynb" %}

# Custom descriptors

Imports:

```python
import pandas as pd

from evidently.descriptors import CustomColumnEval, CustomPairColumnEval
```

## Single column descriptor

You can create a custom descriptor that will take a single column from your dataset and run a certain evaluation for each row.

**Implement your evaluation as a Python function**. It will take a pandas Series as input and return a transformed Series.

Here, the `is_empty_string_callable` function takes a column of strings and returns an "EMPTY" or "NON EMPTY" outcome for each.

```python
def is_empty_string_callable(val1):
    return pd.Series(["EMPTY" if val == "" else "NON EMPTY" for val in val1], index=val1.index)
```

**Create a custom descriptor**. Create an instance of the `CustomColumnEval` class to wrap the evaluation logic into an object that you can later use to process specific dataset input.

```python
empty_string = CustomColumnEval(
    func=is_empty_string_callable,
    feature_type="cat",
    display_name="Empty response"
)
```

Where:
* `func: Callable[[pd.Series], pd.Series]` is a function that returns a transformed pandas Series.
* `display_name: str` is the new descriptor's name that will appear in Reports and Test Suites.
* `feature_type` is the type of descriptor that the function returns (`cat` for categorical, `num` for numerical).

**Apply the new descriptor**. To create a Report with a new Descriptor, pass it as a `column_name` to the `ColumnSummaryMetric`. This will compute the new descriptor for all rows in the specified column and summarize its distribution:

```python
report = Report(metrics=[
    ColumnSummaryMetric(column_name=empty_string.on("response")),
])
```

Run the Report on your `df` dataframe as usual:

```python
report.run(reference_data=None,
           current_data=df)
```
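
For reference, here is how the pieces above fit together end to end. This is a minimal sketch: it assumes the usual `Report` and `ColumnSummaryMetric` imports from Evidently 0.4.x, and the toy `df` with a `response` column is made up for illustration.

```python
import pandas as pd

from evidently.descriptors import CustomColumnEval
from evidently.metrics import ColumnSummaryMetric
from evidently.report import Report

def is_empty_string_callable(val1):
    # Label each value as "EMPTY" or "NON EMPTY", keeping the original index.
    return pd.Series(["EMPTY" if val == "" else "NON EMPTY" for val in val1], index=val1.index)

empty_string = CustomColumnEval(
    func=is_empty_string_callable,
    feature_type="cat",
    display_name="Empty response",
)

# Toy data: a single text column named "response" (hypothetical example).
df = pd.DataFrame({"response": ["Hello!", "", "Thanks for reaching out.", ""]})

report = Report(metrics=[
    ColumnSummaryMetric(column_name=empty_string.on("response")),
])
report.run(reference_data=None, current_data=df)
report.save_html("empty_response_report.html")  # or report.show() in a notebook
```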

## Double column descriptor

You can create a custom descriptor that will take two columns from your dataset and run a certain evaluation for each row (for example, for pairwise evaluators).

**Implement your evaluation as a Python function**. Here, the `exact_match_callable` function takes two columns and checks whether each pair of values is the same, returning "MATCH" if they are equal and "MISMATCH" if they are not.

```python
def exact_match_callable(val1, val2):
    return pd.Series(["MATCH" if val else "MISMATCH" for val in val1 == val2])
```

**Create a custom descriptor**. Create an instance of the `CustomPairColumnEval` class to wrap the evaluation logic into an object that you can later use to process two named columns in a dataset.

```python
exact_match = CustomPairColumnEval(
    func=exact_match_callable,
    first_column="response",
    second_column="question",
    feature_type="cat",
    display_name="Exact match between response and question"
)
```

Where:

* `func: Callable[[pd.Series, pd.Series], pd.Series]` is a function that returns a transformed pandas Series after evaluating two columns.
* `first_column: str` is the name of the first column to be passed into the function.
* `second_column: str` is the name of the second column to be passed into the function.
* `display_name: str` is the new descriptor's name that will appear in Reports and Test Suites.
* `feature_type` is the type of descriptor that the function returns (`cat` for categorical, `num` for numerical).

**Apply the new descriptor**. To create a Report with a new Descriptor, pass it as a `column_name` to the `ColumnSummaryMetric`. This will compute the new descriptor for all rows in the dataset and summarize its distribution:

```python
report = Report(metrics=[
    ColumnSummaryMetric(column_name=exact_match.as_column())
])
```

Run the Report on your `df` dataframe as usual:

```python
report.run(reference_data=None,
           current_data=df)
```
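
As a usage sketch, the pair descriptor expects both named columns to be present in the DataFrame you pass to `report.run()`. The toy data below is made up for illustration:

```python
import pandas as pd

# Hypothetical example data with the two columns referenced above.
df = pd.DataFrame({
    "question": ["What is your name?", "How do I reset my password?"],
    "response": ["What is your name?", "Follow the reset link in the email."],
})

report.run(reference_data=None, current_data=df)
report.save_html("exact_match_report.html")  # first row is a MATCH, second a MISMATCH
```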
13 changes: 10 additions & 3 deletions docs/book/customization/add-custom-metric-or-test.md
@@ -1,4 +1,13 @@
There are two ways to add a custom Metric to Evidently.
There are two ways to add a custom Metric or Test to Evidently:
* Add it as a Python function (Recommended).
* Implement a custom Metric with a custom Plotly render.

Implementing a new Metric or Test means creating a completely custom column- or dataset-level evaluation.

There are other ways to customize your evaluations that do not require creating Metrics or Tests from scratch:
* Add a custom descriptor for row-level evaluations. Read more on [adding custom text descriptors](add-custom-descriptor.md).
* Write a custom LLM-based evaluator using templates. Read more on [designing LLM judges](llm_as_a_judge.md).
* Add a custom data drift detection method, re-using the existing Data Drift metric render. Read more on the [drift method customization](add-custom-drift-method.md) option.

# 1. Add a new Metric or Test as a Python function (Recommended).

@@ -9,8 +18,6 @@ This is a recommended path to add custom Metrics. Using this method, you can sen
Example notebook:
{% embed url="https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_build_metric_over_python_function.ipynb" %}
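
For orientation, a rough sketch of this path is below. It is modeled on the linked notebook: the `CustomValueMetric` class, its import path, and the `func`/`title` parameters are assumptions taken from that example and may differ in your Evidently version, so treat the snippet as illustrative rather than definitive.

```python
from evidently.base_metric import InputData
from evidently.metrics.custom_metric import CustomValueMetric  # assumed import path, see the notebook
from evidently.report import Report

# A dataset-level evaluation written as a plain Python function.
def share_of_empty_responses(data: InputData) -> float:
    return float((data.current_data["response"] == "").mean())

report = Report(metrics=[
    CustomValueMetric(func=share_of_empty_responses, title="Share of empty responses"),
])
```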

**Note**: if you want to add a custom data drift method, there is a separate [drift method customization](add-custom-drift-method.md) option. In this case, you will re-use the existing render.

# 2. Implement a new Metric and Test from scratch.

You can also implement a new Metric or Test from scratch, defining both the calculation method and the optional visualization.
12 changes: 11 additions & 1 deletion docs/book/customization/llm_as_a_judge.md
@@ -31,7 +31,7 @@ You can use built-in evaluators that include pre-written prompts for specific cr
**Imports**. Import the `LLMEval` and built-in evaluators you want to use:

```python
from evidently.descriptors import LLMEval, NegativityLLMEval, PIILLMEval, DeclineLLMEval
from evidently.descriptors import LLMEval, NegativityLLMEval, PIILLMEval, DeclineLLMEval, BiasLLMEval, ToxicityLLMEval, ContextQualityLLMEval
```

**Get a Report**. To create a Report, simply list them like any other descriptor:
@@ -58,6 +58,16 @@ report = Report(metrics=[
])
```
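
For context, listing several built-in evaluators in one Report typically looks like the sketch below. This assumes a `response` column, the `TextEvals` preset import, and an OpenAI API key available in the environment for the LLM-based descriptors.

```python
from evidently.descriptors import DeclineLLMEval, NegativityLLMEval, PIILLMEval
from evidently.metric_preset import TextEvals
from evidently.report import Report

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        NegativityLLMEval(),
        PIILLMEval(),
        DeclineLLMEval(),
    ])
])
report.run(reference_data=None, current_data=df)
```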

**Run descriptors over two columns**. An evaluator that assesses whether the context contains enough information to answer the question requires both columns. Run the evaluation over the `context` column and pass the name of the column containing the `question` as a parameter.

```python
report = Report(metrics=[
    TextEvals(column_name="context", descriptors=[
        ContextQualityLLMEval(question="question"),
    ])
])
```

{% hint style="info" %}
**Which descriptors are there?** See the list of available built-in descriptors in the [All Metrics](../reference/all-metrics.md) page.
{% endhint %}
14 changes: 9 additions & 5 deletions docs/book/evaluations/no_code_evals.md
Expand Up @@ -6,11 +6,11 @@ The platform supports several evaluations directly from the user interface.

| Name | Type | Description |
|-------------------------|------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Text Evals | Report | Analyze texts using methods from regular expressions to LLM judges. |
| Data Quality | Report | Get descriptive statistics and distribution overviews for all columns. |
| Text Evals | Report | Analyze text data, from regular expressions to LLM judges. |
| Data Quality | Report | Get descriptive statistics and distributions for all columns. |
| Classification Quality | Report | Evaluate the quality of a classification model. |
| Regression Quality | Report | Evaluate the quality of a regression model. |
| Data Quality Tests | Test Suite | Automatically check for issues like missing values, duplicates, etc. |
| Data Quality Tests | Test Suite | Automatically check for missing values, duplicates, etc. |

Before you start, pick a dataset to evaluate. For example, this could be a CSV file containing inputs and outputs of your AI system, like chatbot logs.

@@ -77,6 +77,10 @@ Select specific checks one by one:

Each evaluation result is called a **Descriptor**. No matter the method, you’ll get a label or score for every evaluated text. Some, like “Sentiment,” work instantly, while others may need setup.

{% hint style="info" %}
**What other evaluators are there?** Check the list of Descriptors on the [All Metrics](../reference/all-metrics.md) page.
{% endhint %}

Here are a few examples of Descriptors and how to configure them:

## Words presence
@@ -111,8 +115,8 @@ For a binary classification template, you can configure:
* **Target/Non-target Category**: labels you want to use.
* **Uncertain Category**: how the model should respond when it can’t decide.
* **Reasoning**: choose to include explanation (Recommended).
* **Category** and/or **Score**: have the LLM respond with the category (Recommended) or also return a score.
* **Visualize as**: when both Category and Score are computed, choose which to display in the report.
* **Category** and/or **Score**: have the LLM respond with the category (Recommended) or score.
* **Visualize as**: when both Category and Score are computed, choose which to display in the Report.

{% hint style="info" %}
**What other evaluators are there?** Check the list of Descriptors on the [All Metrics](../reference/all-metrics.md) page.
1 change: 1 addition & 0 deletions docs/book/examples/cookbook_llm_judge.md
@@ -365,6 +365,7 @@ verbosity_report.datasets().current
```

Preview:

![](../.gitbook/assets/cookbook/llmjudge_verbosity_examples.png)

Don't fully agree with the results? Use these labels as a starting point, and correct the decision where you see fit - now you've got your golden dataset! Next, iterate on your judge prompt.