Commit
[Feat] Rag scrap notice and embedding for vectorDB (#192)
* setting: configure the Chroma Vector DB dependency
* feat: update the configuration files
* feat(QueryVectorStoreAdapter): implement QueryVectorStoreAdapter using ChromaVectorStore
* feat(Notice): add an embedded boolean field to the Notice table. Adds a column that records whether the notice has been embedded
* feat(NoticeTextParserTemplate): implement a ParserTemplate that parses a notice's body, title, and ID
* test: set up a ChromaDB test container
* feat(NoticeApiClient): implement requestSinglePageWithUrl, which scrapes a single page
* fix(NoticeJdbcRepository): adjust part of the bulk insert method for the embedded field added to notices
* feat(NoticeRepository): implement the updateNoticeEmbeddingStatus and findNotYetEmbeddingNotice methods
* fix(KuisHomepageNoticeTextParser): add logic to parse the additional tags that contain the body
* feat(KuisHomepageNoticeInfo): add the textParser dependency
* feat(ChromaVectorStoreAdapter): implement ChromaVector
* test(KuisHomepageNoticeScraperTemplateTest): write the embedding test scrapForEmbedding
* feat(RAGConfiguration): implement the RAG configuration
* feat(NoticeEmbeddingUpdater): implement an Updater for notice embedding
* feat: change the run time of the notice updater job
* chore: add collection-name to the configuration file
* fix(ChromaVectorStoreAdapter): fix the embedding method and add tests
* feat(ChromaVectorStoreAdapter): remove the similarity threshold. An answer is always generated, even when similarity is low
* feat: remove the unused RestTemplateConfig
* chore: remove public access modifiers
* feat(ChromaVectorStoreAdapter): change Top-K to 2
* feat(User): change the number of questions allowed per month to 3
* feat(UserUpdater#questionCountReset): implement a job that resets user question counts on the last day of each month
* feat(UserRegisterNonChainingFilter): log duplicate user registration exceptions
* feat(UserUpdater): suspend the user removal job
* setting: change the AI max tokens to 1000
* feat(RAGQueryApiV2): document RAGQueryApi
* refactor: use a constant in SecurityRequirement
* feat(User): limit the user question count to 2
* feat(RAGConfiguration): register the test ChromaVectorStore as a bean in the prod environment as well
1 parent c8cd35a, commit 1c8c99c
Showing 61 changed files with 2,343 additions and 121 deletions.
64 changes: 64 additions & 0 deletions
src/main/java/com/kustacks/kuring/ai/adapter/out/persistence/ChromaVectorStoreAdapter.java
@@ -0,0 +1,64 @@
package com.kustacks.kuring.ai.adapter.out.persistence;

import com.kustacks.kuring.ai.application.port.out.CommandVectorStorePort;
import com.kustacks.kuring.ai.application.port.out.QueryVectorStorePort;
import com.kustacks.kuring.notice.domain.CategoryName;
import com.kustacks.kuring.worker.parser.notice.PageTextDto;
import lombok.RequiredArgsConstructor;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.ChromaVectorStore;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.context.annotation.Profile;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
@Profile("prod | local")
@RequiredArgsConstructor
public class ChromaVectorStoreAdapter implements QueryVectorStorePort, CommandVectorStorePort {

    private static final int TOP_K = 2;

    private final ChromaVectorStore chromaVectorStore;

    @Override
    public List<String> findSimilarityContents(String question) {
        return chromaVectorStore.similaritySearch(
                SearchRequest.query(question).withTopK(TOP_K)
        ).stream()
                .map(Document::getContent)
                .toList();
    }

    @Override
    public void embedding(List<PageTextDto> extractTextResults, CategoryName categoryName) {
        TokenTextSplitter textSplitter = new TokenTextSplitter();

        for (PageTextDto textResult : extractTextResults) {
            if (textResult.text().isBlank()) continue;

            List<Document> documents = createDocuments(categoryName, textResult);
            List<Document> splitDocuments = textSplitter.apply(documents);
            chromaVectorStore.accept(splitDocuments);
        }
    }

    private List<Document> createDocuments(CategoryName categoryName, PageTextDto textResult) {
        Resource resource = new ByteArrayResource(textResult.text().getBytes()) {
            @Override
            public String getFilename() {
                return textResult.title();
            }
        };

        TextReader textReader = new TextReader(resource);
        textReader.getCustomMetadata().put("articleId", textResult.articleId());
        textReader.getCustomMetadata().put("category", categoryName.getName());
        return textReader.get();
    }
}
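The retrieval behaviour described in the commit message (no similarity threshold, Top-K of 2) can be illustrated without Spring AI or Chroma. The following is a minimal sketch under stated assumptions, not the adapter's actual implementation: `TopKSketch`, its `Entry` record, and the toy two-dimensional vectors are all hypothetical, and cosine similarity is used here only as a common choice of metric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory stand-in for a vector store; this is NOT the Chroma or Spring AI API.
final class TopKSketch {

    // One stored chunk: its text and a precomputed embedding (toy values, assumed).
    record Entry(String content, double[] vector) {}

    private final List<Entry> entries = new ArrayList<>();

    void add(String content, double[] vector) {
        entries.add(new Entry(content, vector));
    }

    // Cosine similarity between two vectors of equal length; 0.0 if either has zero norm.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the contents of the k most similar entries, best match first.
    // No minimum-similarity filter is applied, so low-scoring entries are still returned.
    List<String> findSimilarContents(double[] query, int k) {
        return entries.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(query, e.vector())).reversed())
                .limit(k)
                .map(Entry::content)
                .toList();
    }

    public static void main(String[] args) {
        TopKSketch store = new TopKSketch();
        store.add("scholarship notice", new double[]{1.0, 0.0});
        store.add("dorm notice", new double[]{0.0, 1.0});
        store.add("library notice", new double[]{0.9, 0.1});

        // prints [scholarship notice, library notice]
        System.out.println(store.findSimilarContents(new double[]{1.0, 0.0}, 2));
    }
}
```

Because nothing is filtered by score, a query that matches nothing well still yields two chunks of context, which is consistent with the stated goal of always generating an answer.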
20 changes: 0 additions & 20 deletions
src/main/java/com/kustacks/kuring/ai/adapter/out/persistence/QueryVectorStoreAdapter.java
This file was deleted.
10 changes: 10 additions & 0 deletions
src/main/java/com/kustacks/kuring/ai/application/port/out/CommandVectorStorePort.java
@@ -0,0 +1,10 @@
package com.kustacks.kuring.ai.application.port.out;

import com.kustacks.kuring.notice.domain.CategoryName;
import com.kustacks.kuring.worker.parser.notice.PageTextDto;

import java.util.List;

public interface CommandVectorStorePort {
    void embedding(List<PageTextDto> extractTextResults, CategoryName categoryName);
}
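Because `CommandVectorStorePort` is a plain interface, callers can be exercised against an in-memory fake instead of a running Chroma instance. The sketch below is illustrative only: `PageTextDto` and `CategoryName` are redeclared as simplified stand-ins (their real field order, types, and values are not shown in this diff and are assumptions here), and `InMemoryCommandVectorStore` is a hypothetical class that does not appear in the commit. It mirrors only the blank-text skip from the adapter's `embedding` method.

```java
import java.util.ArrayList;
import java.util.List;

final class CommandPortSketch {

    // Stand-in for the project's PageTextDto; accessor names follow the diff,
    // but the field order and the String type of articleId are assumptions.
    record PageTextDto(String title, String articleId, String text) {}

    // Stand-in for the project's CategoryName; the single value is hypothetical.
    enum CategoryName {
        BACHELOR("bachelor");

        private final String name;

        CategoryName(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Same method signature as the port added in this commit.
    interface CommandVectorStorePort {
        void embedding(List<PageTextDto> extractTextResults, CategoryName categoryName);
    }

    // Hypothetical fake: records "category:articleId" keys instead of calling a vector store.
    static final class InMemoryCommandVectorStore implements CommandVectorStorePort {
        private final List<String> stored = new ArrayList<>();

        @Override
        public void embedding(List<PageTextDto> extractTextResults, CategoryName categoryName) {
            for (PageTextDto page : extractTextResults) {
                if (page.text().isBlank()) continue; // skip empty bodies, as the adapter does
                stored.add(categoryName.getName() + ":" + page.articleId());
            }
        }

        // Immutable snapshot of what has been "embedded" so far.
        List<String> stored() {
            return List.copyOf(stored);
        }
    }
}
```

A fake like this keeps unit tests for code that depends on the port fast, while the ChromaDB test container mentioned in the commit message can cover the real adapter.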
23 changes: 0 additions & 23 deletions
src/main/java/com/kustacks/kuring/config/RestTemplateConfig.java
This file was deleted.