A distributed search engine where users control what they can search: they can manually crawl specific websites of their liking, based on what they want to work with every day. Users can crawl the web with the click of a button, and while a crawl is running they can keep using the search feature to query the webpages already in their database. Crawling might take some time because of security measures and network-throttling mechanisms used by different website authors.
TF-IDF: "In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general.Like the bag-of-words model, it models a document as a multiset of words, without word order. It is a refinement over the simple bag-of-words model, by allowing the weight of words to depend on the rest of the corpus." source.
- Despite being fast at ranking each document relative to the user's search query, it disregards the context of the query: as long as a document matches the terms in the query it scores, regardless of word order, e.g. "dog bites man" and "man bites dog" are identical in this model.
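For illustration, here is a minimal sketch of the classic per-term TF-IDF weight (not zensearch's actual code; the function and variable names are just for the example):

```go
package main

import (
	"fmt"
	"math"
)

// tfidf returns the weight of a single term for one document: its relative
// frequency in the document, scaled down when the term is common in the corpus.
func tfidf(termCount, docTerms, totalDocs, docsWithTerm float64) float64 {
	tf := termCount / docTerms                      // how often the term appears in this document
	idf := math.Log(totalDocs / (1 + docsWithTerm)) // terms that are rare across the corpus get a bigger boost
	return tf * idf
}

func main() {
	// "dog" appears 3 times in a 100-word page, and 10 of 1000 pages mention it.
	fmt.Printf("%.4f\n", tfidf(3, 100, 1000, 10))
}
```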
Document Length Normalization: a technique that mitigates the length bias of documents whose content might not actually be relevant to the user's query. Because a long document contains more words than the other documents matching the query, its raw term frequency can be much higher than theirs if we disregard how concentrated the term is within it.
- Document length normalization mitigates this by dividing `currentDocLength / avgDocLength`, controlled by `b`, which determines how strongly that ratio affects the score, i.e. how much the concentration of the term in the document matters. If the term frequency is proportionate to the document length, the document is likely relevant to the query; if the document is longer than average and the term frequency is not proportionate, the term is most likely not the main focus of that document.
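Concretely, this is usually folded into a single normalization factor, `1 - b + b * (docLength / avgDocLength)`. A small sketch (illustrative only, not zensearch's code):

```go
package main

import "fmt"

// lengthNorm is the BM25-style length-normalization factor: 1 for a document
// of average length, larger for longer documents (dampening their term
// frequencies), smaller for shorter ones.
func lengthNorm(b, docLen, avgDocLen float64) float64 {
	return 1 - b + b*(docLen/avgDocLen)
}

func main() {
	fmt.Println(lengthNorm(0, 2000, 1000))    // b = 0: length ignored entirely -> 1
	fmt.Println(lengthNorm(0.75, 2000, 1000)) // twice-average document -> 1.75, dampened
	fmt.Println(lengthNorm(1, 500, 1000))     // half-average document -> 0.5, boosted
}
```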
Beyond TF-IDF using BM25: I am no expert, but from my understanding BM25 is an instance of TF-IDF with superpowers, where the relevancy of a document is controlled by the constants `k1` and `b`. `k1` controls the weight of term frequency in a document, i.e. how much impact repeated occurrences of a term have throughout a document.
- If `k1` is set to a lower value, the term frequency saturates very quickly: its contribution diminishes as the count grows and stops mattering past a certain point. If it is set to a higher value, e.g. `k1 = 2`, the score keeps growing a bit longer before the rate of growth slows down.
- `b` controls the normalization of the document length relative to the term's relevancy, i.e. the concentration of the term in the document. If `b` is high (towards 1) and the term is sparse in a long document, long documents are punished and scored lower; if a document mentions the term frequently and the term is concentrated throughout the whole document, it is scored higher. Setting `b` to `0` disables length normalization, so only the raw term frequency is considered, ignoring whether the document's length dilutes the term.
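Putting `k1` and `b` together, here is a rough sketch of the standard Okapi BM25 per-term score (zensearch's actual implementation may differ; all names below are just for the example):

```go
package main

import (
	"fmt"
	"math"
)

// bm25Term scores a single query term against a single document using the
// standard Okapi BM25 formula.
// tf: occurrences of the term in this document, df: documents containing the
// term, n: total documents, docLen/avgDocLen: this document's length vs. the
// corpus average, k1: term-frequency saturation, b: length normalization.
func bm25Term(tf, df, n, docLen, avgDocLen, k1, b float64) float64 {
	idf := math.Log(1 + (n-df+0.5)/(df+0.5))
	norm := 1 - b + b*(docLen/avgDocLen) // the length-normalization factor from above
	return idf * (tf * (k1 + 1)) / (tf + k1*norm)
}

func main() {
	// With a low k1 the score saturates quickly as tf grows; with a higher k1
	// additional occurrences keep adding a little more for longer.
	for _, tf := range []float64{1, 5, 20} {
		fmt.Printf("tf=%2.0f  k1=0.5: %.3f  k1=2.0: %.3f\n",
			tf,
			bm25Term(tf, 100, 10000, 1200, 1000, 0.5, 0.75),
			bm25Term(tf, 100, 10000, 1200, 1000, 2.0, 0.75))
	}
}
```

Running this shows the saturation effect: with `k1 = 0.5` the score barely moves between `tf = 5` and `tf = 20`, while `k1 = 2.0` keeps rewarding the extra occurrences.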
- Save the most recently crawled webpage for continuation.
- Create cancellation for crawling but still save the indexed pages up to that point.
- One click to clear the database.
- Let users delete a website from the sqlite database from the client side.
- Documentation.
- Create a list of known websites (websites that have been indexed) on the front-end.
As of now, if a user cancels a crawl or a system error occurs while crawling, the webpages crawled up to that point will NOT be saved, unfortunately. I'll have to implement a safety net for saving and resuming crawled webpages, sorry about that.
So for now, just crawl websites with fewer contents/pages for experimentation and testing.
- If the `Crawl list` sidebar is opened with ongoing crawls and zensearch is down, reset the UI by opening the browser's `devtools` and clearing the `cookies` and `local storage`.
Users will have to take full responsibility in the event that they are blocked by a website author, so make sure you're crawling a website that generally accepts web crawlers and has a rate-limiting mechanism in its services. I have implemented a rudimentary rate limiter for the crawler in the `crawler/page_navigator.go` file, in a function called `requestDelay()`.
/*
requestDelay uses the elapsed time from the start to the end of the previous
request (in milliseconds), compressed with log10 to smooth the values, so the
interval between requests increases without growing too much when multiplied.
multiplier values:
- 0 ignores all intervals
- 1 increases slowly but is still fast and might get blocked
- 2 sweet middle ground
*/
func (pn *PageNavigator) requestDelay(multiplier int) {
	max := 10000 // cap on the interval, in milliseconds
	base := int(math.Log10(float64(pn.mselapsed)))
	fmt.Printf("CURRENT ELAPSED TIME: %d\n", pn.mselapsed)
	if pn.interval < max {
		pn.interval = (pn.interval + base) * multiplier
		fmt.Printf("INCREASE INTERVAL: %d\n", pn.interval)
	} else if pn.interval > max {
		fmt.Printf("RESET INTERVAL: %d\n", pn.interval)
		pn.interval = 0
	}
	// pn.interval is in milliseconds; * 1000000 converts it to nanoseconds for time.Duration
	time.Sleep(time.Duration(pn.interval * 1000000))
}
So be careful and read the website's `robots.txt` file at `https://<website-hostname>/robots.txt`.
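If you want to check a site's crawl rules by hand before pointing zensearch at it, a small standalone sketch like the following (not part of zensearch) fetches and prints a site's robots.txt:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	host := "example.com" // placeholder: replace with the site you plan to crawl
	resp, err := http.Get("https://" + host + "/robots.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(body)) // read the Disallow / Crawl-delay rules before crawling
}
```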
# make sure you have docker and docker compose installed
# run these commands
cd path/to/zensearch/
docker compose up
The project uses an SQLite3 database stored at `database/dist/website_collection.db`. You can inspect it if you have `sqlite3` installed on your system (if not, go ahead and install it), then:
- `cd` into the `database/dist/` directory
- run `sqlite3 website_collection.db`
TABLES:
- indexed_sites
- known_sites (I know both of these mean the same thing)
- webpages
SCHEMAS:
known_sites (
id INTEGER PRIMARY KEY,
url TEXT NOT NULL,
last_added INTEGER NOT NULL
);
indexed_sites (
id INTEGER PRIMARY KEY,
primary_url TEXT NOT NULL UNIQUE,
last_indexed INTEGER NOT NULL
);
webpages (
parent INTEGER REFERENCES indexed_sites(id),
id INTEGER PRIMARY KEY,
url TEXT NOT NULL,
title TEXT,
contents TEXT
);
I have not yet implemented a way for users to delete individual crawled websites, so to remove an indexed website you will have to go into the running container:
- if you are in the zensearch directory, run `docker exec -ti db sh`; otherwise find the running docker container prefixed with `zensearch-` using `docker ps`, copy the container id of `zensearch-db`, and run `docker exec -ti <id of container> sh` (without the angle brackets)
- run `sqlite3 dist/website_collection.db`
- remove the indexed website's data, for example as sketched below
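A sketch of what that last step could look like, based on the schemas above (the `https://example.com` URL is a placeholder; check how the site is actually stored in `indexed_sites.primary_url` and `known_sites.url` first):

```sql
-- find out how the site you want to remove is stored
SELECT id, primary_url FROM indexed_sites;

-- delete its crawled pages, then the site rows themselves
DELETE FROM webpages
WHERE parent = (SELECT id FROM indexed_sites WHERE primary_url = 'https://example.com');
DELETE FROM indexed_sites WHERE primary_url = 'https://example.com';
DELETE FROM known_sites WHERE url = 'https://example.com';
```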