Implementation of Dense Retrievals #49

Open
DelaramRajaei opened this issue Dec 27, 2023 · 4 comments

Comments

@DelaramRajaei
Member

In this issue, I will keep a record of my findings as I work on refining all aspects of the retrieval system on different datasets using dense retrieval.

@DelaramRajaei DelaramRajaei added the documentation, experiment, and Dataset labels Dec 27, 2023
@DelaramRajaei DelaramRajaei self-assigned this Dec 27, 2023
@DelaramRajaei
Member Author

Hey @hosseinfani,
As mentioned here, I've downloaded the dbpedia and antique datasets. Could you please share the robust04 files with me so that I can start the dense indexing? There appears to be a problem extracting the tar files stored in Teams when using Windows.

Looking ahead, our next steps are to obtain the clueweb12, clueweb09, and gov2 datasets. As with robust04, for gov2 we'll need to sign a contract, after which they will send us a copy on a drive, as explained here.
I can begin by indexing the antique and dbpedia datasets.

@hosseinfani
Member

Hi @DelaramRajaei
I'm uploading the extracted files to our RePair > Datasets .. > Corpora >> Robust04
Can you upload the rest there as well?
I've submitted the request for gov2.

@DelaramRajaei
Member Author

@hosseinfani
Yes, I will upload the raw datasets to Teams.

@DelaramRajaei
Member Author

DelaramRajaei commented Jan 12, 2024

Hi @hosseinfani,

I wanted to give you an update on the indexing process. I downloaded the antique and dbpedia corpora and converted them to the required jsonl format, as described in the documentation. I uploaded the jsonl files to Teams > RePair channel > files > Datasets & indexes > Corpora. I'm currently having trouble indexing with pyserini. There seemed to be a conflict with pygaggle; I removed pygaggle and used other libraries instead, but I'm still encountering some issues with the library.
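For reference, the conversion step can be sketched roughly as follows. This is a minimal sketch, not the exact script used here: it assumes the target is Pyserini's JSON collection layout, where each line is one JSON object with an `id` and a `contents` field. The function name `write_pyserini_jsonl` and the `(doc_id, text)` input shape are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple


def write_pyserini_jsonl(docs: Iterable[Tuple[str, str]], out_path: str) -> int:
    """Write (doc_id, text) pairs as one JSON object per line.

    Hypothetical helper: assumes the indexer expects the fields
    "id" and "contents" on every line. Returns the number of documents written.
    """
    path = Path(out_path)
    # Create the output folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for doc_id, text in docs:
            record = {"id": str(doc_id), "contents": text}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as `write_pyserini_jsonl(pairs, "corpus_jsonl/antique.jsonl")` (the path is only an example) would produce a folder that can be handed to the indexer as the input collection.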

Hi @yogeswarl,

I noticed that you created the dense indexes for the aol dataset. I followed the steps you described in the Readme and in pyserini's documentation, but I'm running into some problems. One is that torch does not detect CUDA: I installed torch with CUDA support, but it still doesn't recognize it. Have you ever encountered this? I also have another question: given the size of the datasets and the risk of running out of memory, did you create the indexes on your local system or somewhere else?
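To narrow down the CUDA problem, a small diagnostic along these lines may help. It is only a sketch under assumptions: the function name `cuda_report` is made up for illustration, and it only collects what torch itself reports, so it cannot tell a driver problem from an environment problem. A common cause worth ruling out is that a CPU-only torch build ended up in the environment, in which case `torch.version.cuda` is `None`.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect what torch reports about CUDA (hypothetical helper)."""
    # Avoid an ImportError if torch is missing from this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        # The version string often carries a build suffix such as "+cpu".
        "torch_version": torch.__version__,
        # CUDA version torch was compiled against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report
```

Printing `cuda_report()` from the same environment that runs the indexing shows whether the installed torch was built with CUDA at all, which is the first thing to check before looking at drivers.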

@DelaramRajaei DelaramRajaei removed the documentation label Feb 14, 2024