This project implements an inverted index: Apache Spark builds the index, and a relational database (e.g. SQLite) stores it. The project is written in Python (PySpark). Storing the index in a database lets us rely on the B-tree structure that the database already provides instead of building one from scratch.
- Build the index using a document collection.
- Create database tables for storing the inverted index.
- Implement the keyword search functionality.
- Implement result ranking using the TF-IDF measure.
- Implement a simple interface for giving keyword queries and showing results.
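The goals above can be illustrated with a minimal sketch. It is not the project's actual code: the toy `DOCS` collection, the table names `documents` and `postings`, and the helper functions are all hypothetical, the regex tokenizer is a stand-in for NLTK, and the index is built in plain Python rather than PySpark so that the example runs without a Spark installation. The scoring uses one common TF-IDF variant (raw term frequency times `log(N / df)`); other weightings are possible.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical toy collection: doc_id -> text.
DOCS = {
    1: "spark builds the index",
    2: "sqlite stores the index in a b-tree",
    3: "spark and sqlite work together",
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in for NLTK tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(docs):
    """Return (term, doc_id, tf) rows.

    Plain-Python stand-in for the Spark job: conceptually a map over
    documents followed by a per-(term, doc_id) count.
    """
    rows = []
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            rows.append((term, doc_id, tf))
    return rows


def create_index(conn, docs):
    """Create the (assumed) schema and load the documents and postings."""
    conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, body TEXT)")
    # The composite primary key is backed by a B-tree index in SQLite,
    # so looking up all postings for a term does not scan the table.
    conn.execute(
        "CREATE TABLE postings ("
        " term TEXT NOT NULL,"
        " doc_id INTEGER NOT NULL,"
        " tf INTEGER NOT NULL,"
        " PRIMARY KEY (term, doc_id))"
    )
    conn.executemany("INSERT INTO documents VALUES (?, ?)", list(docs.items()))
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", build_postings(docs))
    conn.commit()


def search(conn, query):
    """Return (doc_id, score) pairs, best score first, ties broken by doc_id."""
    n_docs = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    scores = Counter()
    for term in set(tokenize(query)):
        hits = conn.execute(
            "SELECT doc_id, tf FROM postings WHERE term = ?", (term,)
        ).fetchall()
        if not hits:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(hits))  # len(hits) is the document frequency
        for doc_id, tf in hits:
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Usage: build an in-memory index and run a keyword query.
conn = sqlite3.connect(":memory:")
create_index(conn, DOCS)
for doc_id, score in search(conn, "spark index"):
    print(doc_id, round(score, 3))
```

Keeping one row per (term, document) pair means a keyword query only touches the rows for its own terms, and the document frequency needed for IDF falls out of the same lookup. A term that occurs in every document gets an IDF of zero under this variant, so it contributes nothing to the ranking.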
Python (PySpark)
SQLite
NLTK package
Google Colab