
Milestone 4 - Subtask 3: Finetune model with Q/A format dataset #108

Open
TobeTek opened this issue Oct 10, 2024 · 3 comments
TobeTek (Collaborator) commented Oct 10, 2024

Potential data format that gives courses to take and the reasoning behind them. Will be replaced with LPI dataset courses/info.

input: I want to learn about LLMs and how to finetune them. I'm intermediate and I want to build a RAG pipeline
output: Given your intermediate level and interest in building a RAG pipeline using LLMs, I recommend the following courses and modules:

| Course/Module | Source | Level | Duration (Estimate) | Keywords | Reason |
| --- | --- | --- | --- | --- | --- |
| Introduction to Large Language Models (LLMs) | Various online courses (e.g., Coursera, edX) | Beginner/Intermediate | Varies (4-8 weeks) | LLMs, transformers, attention mechanisms, language modeling, tokenization | Provides foundational knowledge of LLMs, their architecture, and capabilities. Essential before tackling fine-tuning or RAG. |
| Natural Language Processing (NLP) Fundamentals | Various online courses (e.g., Stanford NLP) | Intermediate | Varies (6-10 weeks) | NLP, text preprocessing, word embeddings, sentiment analysis, named entity recognition | Necessary for understanding how LLMs process and understand text. Many RAG techniques rely on NLP for data preprocessing and query understanding. |
| Fine-tuning LLMs | Hugging Face Course, Papers with Code | Intermediate/Advanced | Varies (2-4 weeks) | Fine-tuning, transfer learning, hyperparameter tuning, model evaluation | Teaches you how to adapt pre-trained LLMs to specific tasks, crucial for building a high-performing RAG system. Learn techniques like prompt engineering. |
| Retrieval Augmented Generation (RAG) Techniques | Research Papers, Blogs, Tutorials | Advanced | Varies (Ongoing Study) | RAG, vector databases, embedding generation, knowledge retrieval, question answering | Focuses on the architecture and implementation of RAG pipelines. You'll learn to select and integrate components like vector databases (e.g., Pinecone, Weaviate) and retrieval methods. |
| Python for Data Science (if needed) | DataCamp, Codecademy, Fast.ai | Intermediate | Varies (2-4 weeks) | Python, pandas, numpy, scikit-learn | Reinforce your Python skills for data manipulation and model building within your RAG pipeline. |
| Vector Databases (if needed) | Pinecone, Weaviate documentation | Intermediate | Varies (1-2 weeks) | Vector databases, similarity search, indexing, scalability | Understanding vector databases is crucial for efficient knowledge retrieval in a RAG pipeline. Learn about different databases and their strengths. |
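For reference, the input/output pairs above could be serialized as JSON Lines (one example per line), which is a common layout for instruction-finetuning datasets. This is only a sketch of the proposed format; the helper name `make_example` and the exact field names are assumptions, not settled project conventions.

```python
import json

def make_example(user_query: str, recommendation: str) -> dict:
    """Wrap one Q/A pair in the input/output schema shown above.

    (Hypothetical helper; field names mirror the example in this comment.)
    """
    return {"input": user_query, "output": recommendation}

examples = [
    make_example(
        "I want to learn about LLMs and how to finetune them. "
        "I'm intermediate and I want to build a RAG pipeline",
        "Given your intermediate level and interest in building a RAG "
        "pipeline using LLMs, I recommend the following courses and modules: ...",
    ),
]

# One JSON object per line -- the usual JSONL layout for finetuning data.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
```

Most finetuning tooling (e.g., Hugging Face `datasets`) can load such a JSONL file directly.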
Tacoman99 (Collaborator) commented Oct 21, 2024

Description moved to #108 (comment)

Tacoman99 self-assigned this Oct 21, 2024
neomatrix369 (Owner) commented Oct 23, 2024

Description moved to #108 (comment)

As discussed on Discord, you should create a new dataset based on the LPI dataset and use that for your training and other purposes.

The LPI dataset is still needed for our other purposes (and DOES NOT need to be replaced); it also holds metadata about the courses, which is additional information that can be used if needed.

Can you please amend your wording to remove any confusion for others who are also reading or referring to this issue?
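The approach described above (deriving a new Q/A dataset from the LPI dataset rather than replacing it) could be sketched as follows. The record fields (`title`, `level`, `keywords`, `reason`) are hypothetical stand-ins, since the real LPI dataset's schema isn't shown in this thread.

```python
# Hypothetical LPI course records; the real LPI dataset's schema may differ.
lpi_courses = [
    {
        "title": "Fine-tuning LLMs",
        "level": "Intermediate/Advanced",
        "keywords": ["fine-tuning", "transfer learning"],
        "reason": "Adapt pre-trained LLMs to specific tasks.",
    },
]

def to_qa_pair(course: dict) -> dict:
    """Turn one course record into an input/output training example."""
    question = (
        f"What should I take to learn about {', '.join(course['keywords'])}?"
    )
    answer = (
        f"I recommend {course['title']} ({course['level']}): {course['reason']}"
    )
    return {"input": question, "output": answer}

# The derived dataset is used for training; the LPI dataset itself stays intact.
qa_dataset = [to_qa_pair(c) for c in lpi_courses]
```

This keeps the LPI dataset as the source of truth while the generated Q/A pairs feed the finetuning run.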

neomatrix369 (Owner) commented

@Tacoman99 just moved the ticket to In Progress and also added it to the LPI project

Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants