updated documentation on data loader #1
pranav-kural committed Jul 13, 2024
1 parent 47807ad commit f0c1849
Showing 6 changed files with 3,733 additions and 1,603 deletions.
10 changes: 5 additions & 5 deletions CONTRIBUTING.md
@@ -1,10 +1,10 @@
# Contributing Guidelines

Thank you for your interest in contributing to our project! We welcome contributions from the community to help improve our project and make it even better. Before you get started, please take a moment to review the following guidelines.
Thank you for your interest in contributing to QvikChat documentation! We welcome contributions from the community to help improve the QvikChat documentation and make it even better. Before you get started, please take a moment to review the following guidelines.

## Getting Started

To contribute to our project, please follow these steps:
To contribute to QvikChat documentation, please follow these steps:

1. Fork the repository.
2. Create a new branch for your contribution.
@@ -15,7 +15,7 @@ To contribute to our project, please follow these steps:

## Code Style

We follow a specific code style in our project to maintain consistency and readability. Please make sure to adhere to the following guidelines:
We follow a specific code style in QvikChat documentation to maintain consistency and readability. Please make sure to adhere to the following guidelines:

- Use meaningful variable and function names.
- Indent code using spaces, not tabs. Use prettier to format your code if possible.
@@ -24,7 +24,7 @@ We follow a specific code style in our project to maintain consistency and reada

## Reporting Issues

If you encounter any issues or bugs while using our project, please report them by following these steps:
If you encounter any issues or bugs while using QvikChat documentation, please report them by following these steps:

1. Check if the issue has already been reported by searching our issue tracker.
2. If the issue hasn't been reported, create a new issue and provide a detailed description of the problem.
@@ -44,4 +44,4 @@ When submitting a pull request, please ensure the following:

We expect all contributors to adhere to our code of conduct. Please review our [Code of Conduct](CODE_OF_CONDUCT.md) before contributing.

Thank you for your contributions and helping us improve our project!
Thank you for your contributions and helping us improve QvikChat documentation!
2 changes: 1 addition & 1 deletion pages/integrations/langchain.mdx
@@ -14,4 +14,4 @@ By default, QvikChat uses an in-memory vector store, but you can easily provide

## Data Loaders

Out of the box, QvikChat provides support for loading data from text, PDF, JSON, CSV, or a code of a supported programming language. However, if you want to load a file not supported by QvikChat by default, you can simply provide an instance of any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) as the `dataLoader` parameter to the `retrieverConfig` or in the configurations parameters of the `getDataRetriever` method. For more info, check [Data Loaders](/rag-guide/data-loaders).
QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, or a code file of a supported programming language. However, if you want to load a file not supported by QvikChat by default, you can simply use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever. Check the [Loading Custom Data](/rag-guide/data-loaders#loading-custom-data) section on the data loaders page for more information.
4 changes: 2 additions & 2 deletions pages/rag-guide/_meta.json
@@ -1,7 +1,7 @@
{
"data-ingestion": "Data Ingestion",
"data-retrieval": "Data Retrieval",
"data-loaders": "Data Loaders",
"embedding-models": "Embedding Models",
"vector-stores": "Vector Stores",
"data-loaders": "Data Loaders"
"vector-stores": "Vector Stores"
}
69 changes: 6 additions & 63 deletions pages/rag-guide/data-ingestion.mdx
@@ -12,71 +12,14 @@ The data ingestion process in RAG involves the following steps:
4. **Embedding Generation**: Generate embeddings for each chunk of data. This is done using an embedding model. An embedding model converts text data into a numerical representation, and the distance between these numerical representations is used to determine the similarity between two pieces of text.
5. **Storage**: The generated vector embeddings are then stored in an efficient vector store.

## Data Loading Options
## Data Loaders

With QvikChat, you can configure the data loading options when calling the `getDataRetriever` method.
QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, or a code file of a supported programming language. To learn more about data loading or how you can use custom data not supported by QvikChat, refer to the [Data Loaders](data-loaders) page.

```typescript copy
dataType: SupportedDataLoaderTypes; // specify the data type (helps ascertain best splitting strategy when not specified)
filePath: string; // path to the file to load
jsonLoaderKeysToInclude?: JSONLoaderKeysToInclude; // specify keys to include when loading JSON data
csvLoaderOptions?: CSVLoaderOptions; // specify options when loading CSV data
pdfLoaderOptions?: PDFLoaderOptions; // specify options when loading PDF data
dataSplitterType?: SupportedDataSplitterTypes; // if you want to specify the data splitter type
chunkingConfig?: ChunkingConfig; // data chunking configuration
splitterConfig?: DataSplitterConfig; // data splitter configuration
vectorStore?: VectorStore; // vector store instance to use
embeddingModel?: EmbeddingsInterface; // embedding model to use for generating embeddings
```
## Embedding Models

## Data Chunking
You can provide your own embedding model to generate embeddings for the data. There are more than 20 embedding models supported by QvikChat through LangChain. To learn more about embedding models, refer to the [Embedding Models](embedding-models) page.

Be careful when specifying the data chunking configuration. In most cases, the chunk size, overlap, and other parameters depend highly on the data, so it is recommended to experiment with different configurations to find the best one for your use case. There are, however, sensible defaults; for CSV data, for example, each row (or line) can be a chunk. If unsure, start with the default configurations and experiment from there.
## Vector Stores

## Embedding Model

You can provide your own embedding model to generate embeddings for the data. There are more than 20 embedding models supported by QvikChat through LangChain. To check the list of available embedding models, refer to the [Embedding models](https://js.langchain.com/v0.2/docs/integrations/text_embedding) page.

To use an embedding model, simply provide the instance to the `getDataRetriever` method. The example below shows how you can use an OpenAI embedding model to generate embeddings for the data.

```typescript copy
// import getDataRetriever and the embedding model
import { getDataRetriever } from "@oconva/qvikchat/data-retrievers";
import { OpenAIEmbeddings } from "@langchain/openai";

// Index data and get retriever
const dataRetriever = await getDataRetriever({
  dataType: "csv",
  filePath: "test.csv",
  generateEmbeddings: true,
  embeddingModel: new OpenAIEmbeddings({
    apiKey: process.env.OPENAI_API_KEY, // checks for OPENAI_API_KEY in .env file by default if not provided
    batchSize: 512, // default value if omitted is 512; max is 2048
    model: "text-embedding-3-large", // model name
  }),
});
```

## Vector Store

The vector store is used to store and efficiently query the generated embeddings.

QvikChat provides support for more than 30 vector stores, such as Faiss, Pinecone, and Chroma, through LangChain. To see the available vector stores, refer to the [Vector stores](https://js.langchain.com/v0.2/docs/integrations/vectorstores) page.

To use a vector store, simply provide the instance to the `getDataRetriever` method. The example below shows how you can use a Faiss vector store to store the embeddings. You will need to provide the vector store instance with the embedding model you want to use. If you wish to use a Google Gen AI or an OpenAI embedding model, you can use the `getEmbeddingModel` method to get the embedding model instance.

```typescript copy
import { getDataRetriever } from "@oconva/qvikchat/data-retrievers";
import { getEmbeddingModel } from "@oconva/qvikchat/embedding-models";
import { FaissStore } from "@langchain/community/vectorstores/faiss";

// Index data and get retriever
const dataRetriever = await getDataRetriever({
  dataType: "csv",
  filePath: "test.csv",
  generateEmbeddings: true,
  vectorStore: new FaissStore(getEmbeddingModel(), {
    index: "test-index",
  }),
});
```
The vector store is used to store and efficiently query the generated embeddings. QvikChat provides support for more than 30 vector stores, such as Faiss, Pinecone, and Chroma, through LangChain. To learn more about vector stores, refer to the [Vector Stores](vector-stores) page.
71 changes: 70 additions & 1 deletion pages/rag-guide/data-loaders.mdx
@@ -1,3 +1,72 @@
# Data Loaders

Out of the box, QvikChat provides support for loading data from text, PDF, JSON, CSV, or a code of a supported programming language. However, if you want to load a file not supported by QvikChat by default, you can simply provide an instance of any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) as the `dataLoader` parameter to the `retrieverConfig` or in the configurations parameters of the `getDataRetriever` method.
QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, or a code file of a supported programming language. However, if you want to load a file not supported by QvikChat by default, you can simply use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property when configuring the retriever. Check [Loading Custom Data](#loading-custom-data) for more information.

## Built-in Data Loaders

QvikChat provides built-in support for loading data from the following file types:

- **Text files:** Any documents containing text data, plus any code files of supported programming languages.
- **PDF files:** Can use the `pdfLoaderOptions` to specify additional options.
- **JSON and JSONLines files:** Can use the `jsonLoaderKeysToInclude` property to specify the keys containing relevant data.
- **CSV files:** Can use the `csvLoaderOptions` property to specify the delimiter and other options.

The type of data you are providing is inferred from the file extension given in the `filePath` value that you provide when configuring the retriever. For information on configuring the retriever, see the [Retriever Configuration](/rag-guide/data-retrieval) page.
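
For instance, a retriever configured with a `.pdf` file path will use the PDF loader automatically. A minimal sketch (the file path is hypothetical, and `splitPages` is an assumed LangChain PDF loader option):

```typescript
// The data type is inferred from the ".pdf" extension, so `dataType`
// can be omitted. The file path and options below are illustrative only.
const retrieverConfig = {
  filePath: "src/data/knowledge-bases/handbook.pdf",
  generateEmbeddings: true,
  pdfLoaderOptions: { splitPages: true }, // assumed option; check the PDF loader docs
};
```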

## Data Splitting

Data in the given file is first loaded into processable `Document` objects, and is then split into smaller chunks using a chunking strategy. This is important for two reasons: (1) it makes it easier to index data, and (2) it makes it easier to query data. Furthermore, since most LLM models have a finite context window (or input size), having smaller chunks of data ensures relevant context information isn't lost.

The chunking strategy used can have a significant impact on the performance of a chat service that responds to queries based on the data. However, since data comes in all shapes and sizes, it is recommended to experiment with different configurations to find the best one for your use case. There are, however, sensible defaults; for CSV data, for example, each row (or line) can be a chunk. If unsure, start with the default configurations and experiment from there.

You can use the `chunkingConfig` property to specify the data chunking configuration when configuring the retriever. The `chunkingConfig` object can contain the following properties:

- `chunkSize`: The size of each chunk in the data. The default value is 1000.
- `chunkOverlap`: The number of tokens to overlap between chunks. The default value is 200.

Here is an example of how you can specify the data chunking configuration:

```typescript
import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";

// endpoint with custom chunking configurations
defineChatEndpoint({
  endpoint: "rag",
  enableRAG: true,
  topic: "Test",
  retrieverConfig: {
    filePath: "src/data/knowledge-bases/test-data.csv",
    generateEmbeddings: true,
    chunkingConfig: {
      chunkSize: 500,
      chunkOverlap: 50,
    },
  },
});
```

## Data Loading Options

Below are the properties that can be used to configure the retriever for loading data:

**Required properties**

- `filePath`: The path to the file to load the data from.

**Optional properties**

- `dataType`: The type of data to load. This helps ascertain the best splitting strategy. If not specified, the data type is inferred from the file extension.
- `docs`: An array of `Document` objects containing the data to load. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data and provide the documents as the `docs` property.
- `splitDocs`: An array containing documents that have been processed through a data splitter. This is useful when you want to load data from a source not supported by QvikChat by default. You can use any [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters) to split the data and provide the split documents as the `splitDocs` property. If providing `splitDocs`, you do not need to provide `docs`.
- `jsonLoaderKeysToInclude`: An object containing the keys to include when loading JSON data. This is useful when you want to load only specific keys from the JSON data.
- `csvLoaderOptions`: An object containing options to specify when loading CSV data. This is useful when you want to specify the delimiter and other options when loading CSV data.
- `pdfLoaderOptions`: An object containing options to specify when loading PDF data. This is useful when you want to specify additional options when loading PDF data.
- `dataSplitterType`: Use this to specify a particular text splitter you want to use. If not specified, the default text splitter is used based on the data type.
- `chunkingConfig`: An object containing the data chunking configuration. This is useful when you want to specify the data chunking configuration.
- `splitterConfig`: An object containing the data splitter configuration. Use this to specify or override data splitting strategy.
- `vectorStore`: A vector store instance to use for retrieving relevant context data for a given query. When `generateEmbeddings` is set to `true`, the generated embeddings are stored in the vector store. If using a hosted vector store, you can provide the instance to the `vectorStore` property. To learn more, check out the [Vector Store](/rag-guide/vector-stores) page.
- `embeddingModel`: An embedding model instance to use for generating embeddings for the data. When `generateEmbeddings` is set to `true`, the embeddings are generated using the provided embedding model. If you want to use a specific embedding model, you can provide the instance to the `embeddingModel` property. By default, QvikChat will use either the Gemini API or OpenAI for the embedding model, depending on how you have configured the project. To learn more, check out the [Embedding Models](/rag-guide/embedding-models) page.
- `generateEmbeddings`: A boolean value to specify whether to generate embeddings for the data. If set to `true`, embeddings are generated for the data. If set to `false`, embeddings are not generated for the data, only the retriever instance is returned.
- `retrievalOptions`: Use this to configure the data retrieval strategy. Check out the [Data Retrieval](/rag-guide/data-retrieval) page for more information.
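
Several of these options can be combined in a single retriever configuration. The sketch below is illustrative only (the file path and option values are hypothetical, and the `csvLoaderOptions` fields assume LangChain's CSV loader):

```typescript
// Illustrative retriever configuration combining several of the options above.
const retrieverConfig = {
  filePath: "src/data/knowledge-bases/test-data.csv",
  dataType: "csv", // optional here: inferred from the ".csv" extension
  generateEmbeddings: true,
  csvLoaderOptions: {
    column: "answer", // assumed LangChain CSVLoader option
    separator: ",",
  },
  chunkingConfig: {
    chunkSize: 500,
    chunkOverlap: 50,
  },
};
```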

## Loading Custom Data
To load data from a source not supported by QvikChat by default, use any [LangChain-supported data loader](https://js.langchain.com/v0.2/docs/integrations/document_loaders) to load the data, and provide the resulting documents as the `docs` property when configuring the retriever. If you have already split the documents using a [LangChain-supported text splitter](https://js.langchain.com/v0.2/docs/how_to/#text-splitters), you can provide them as the `splitDocs` property instead.
