This repository contains code and materials for the web crawler implementation and data extraction algorithms.
- Configurable and multi-threaded crawler that crawls *.gov.si sites by default.
- Regular expressions and XPath queries for data extraction from rtvslo.si, overstock.com and themoviedb.org.
- Implementation of an automatic data extraction wrapper generator.
- Inverted index generation for HTML webpages.
- Data retrieval using queries.
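
To give a feel for the last two items, here is a minimal sketch of an inverted index over HTML pages and a simple query against it. It is not the code from this repository: the function names (`tokenize`, `build_index`, `search`), the sample pages, and the naive regex-based tag stripping are all assumptions made for illustration, and a real implementation would likely use a proper HTML parser and stop-word handling.

```python
import re
from collections import defaultdict

# Naive patterns (assumption): strip anything that looks like a tag, then split into words.
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"\w+")


def tokenize(html):
    """Remove tags with a simple regex and return lowercase word tokens."""
    text = TAG_RE.sub(" ", html)
    return TOKEN_RE.findall(text.lower())


def build_index(pages):
    """Map each token to {document name: [positions of the token in that document]}."""
    index = defaultdict(dict)
    for name, html in pages.items():
        for position, token in enumerate(tokenize(html)):
            index[token].setdefault(name, []).append(position)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)


# Hypothetical sample pages, used only to exercise the sketch.
pages = {
    "a.html": "<html><body><h1>Open data</h1><p>Public data portal</p></body></html>",
    "b.html": "<html><body><p>Crawler data notes</p></body></html>",
}
index = build_index(pages)
print(search(index, "data portal"))
```

Storing token positions rather than just document names keeps enough information to show snippets around each hit or to rank documents by frequency later.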