Crawler and data extraction

This repository contains code and materials for the web crawler implementation and the data extraction algorithms.

Crawler

A configurable, multi-threaded crawler that crawls *.gov.si sites by default.
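
As a rough illustration of how a multi-threaded crawler restricted to *.gov.si could be organised, here is a minimal sketch. It is not the code in the `crawler` directory: the names `in_scope`, `crawl`, and `fetch_links`, the `workers` and `max_pages` parameters, and the level-by-level breadth-first strategy are all assumptions made for this example. Page fetching is passed in as a function so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def in_scope(url, suffix="gov.si"):
    """Return True if the URL's host is the suffix itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == suffix or host.endswith("." + suffix)


def crawl(seeds, fetch_links, workers=4, max_pages=100, suffix="gov.si"):
    """Breadth-first crawl (hypothetical helper, not the repository's API).

    fetch_links(url) must return an iterable of absolute URLs found on the page.
    """
    seen = {u for u in seeds if in_scope(u, suffix)}
    frontier = sorted(seen)
    visited = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(visited) < max_pages:
            # Never fetch more pages than the remaining budget allows.
            batch = frontier[: max_pages - len(visited)]
            frontier = []
            # Fetch one BFS level in parallel; pool.map preserves input order.
            for url, links in zip(batch, pool.map(fetch_links, batch)):
                visited.append(url)
                for link in links:
                    if link not in seen and in_scope(link, suffix):
                        seen.add(link)
                        frontier.append(link)
    return visited


if __name__ == "__main__":
    # Toy link graph standing in for real HTTP requests (assumed data).
    graph = {
        "https://www.gov.si/": ["https://e-uprava.gov.si/a", "https://example.com/x"],
        "https://e-uprava.gov.si/a": ["https://www.gov.si/"],
    }
    print(crawl(["https://www.gov.si/"], lambda u: graph.get(u, []), workers=2))
```

Only the worker threads call `fetch_links`; the `seen` set and the frontier are updated in the main thread, so the sketch needs no locks. A real crawler would also have to respect robots.txt and per-host crawl delays, which are left out here.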

Data extraction

Regular expressions and XPath queries for data extraction from rtvslo.si, overstock.com and themoviedb.org.

Implementation of an automatic data extraction wrapper generator.
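
To make the difference between the two hand-written approaches above concrete, the sketch below extracts the same two fields from a small snippet, once with regular expressions and once with XPath-style queries. Everything in it is assumed for illustration: the `SAMPLE` markup, the field names, and the helper functions are invented and do not come from the repository or from any of the three sites. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which only handles well-formed markup; real pages would likely need a lenient HTML parser with fuller XPath support.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (not taken from any real site).
SAMPLE = """<div class="item">
  <h1 class="title">Example title</h1>
  <span class="price">$12.50</span>
</div>"""


def extract_with_regex(html):
    """Pull fields out of the raw text with patterns tied to the exact markup."""
    title = re.search(r'<h1 class="title">(.*?)</h1>', html, re.S)
    price = re.search(r'<span class="price">\$([\d.,]+)</span>', html)
    return {
        "title": title.group(1).strip() if title else None,
        "price": price.group(1) if price else None,
    }


def extract_with_xpath(html):
    """Parse the snippet into a tree and select nodes by tag and attribute."""
    root = ET.fromstring(html)
    title = root.find(".//h1[@class='title']")
    price = root.find(".//span[@class='price']")
    return {
        "title": title.text.strip() if title is not None else None,
        "price": price.text.lstrip("$") if price is not None else None,
    }


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE))
    print(extract_with_xpath(SAMPLE))
```

Both functions return the same dictionary for this snippet. The regular expressions depend on the exact attribute order and quoting in the text, while the XPath queries depend only on the element structure, which is usually the more robust choice when the markup varies slightly between pages.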

analysis
crawler
implementation-extraction
implementation-indexing
input-extraction
wrappers-extraction
.gitignore
README.md
crawldb.sql
crawler.md
extraction.md
indexing.md
report-extraction.pdf
report-indexing.pdf
report.pdf
requirements.txt