Creepy Crawler is a full-stack search engine application inspired by popular search engines. It lets the user submit queries, review their search history, and set a theme.
- Queries from the frontend are received asynchronously by Flask, which uses the Crochet library to process them and dispatch them to the Scrapy spiders:
```python
import re

import crochet
from pydispatch import dispatcher
from scrapy import signals

crochet.setup()


@crochet.wait_for(timeout=200.0)
def scrape_with_crochet(raw_query):
    """Run the crawl inside Crochet's reactor thread; block until it finishes."""
    partitioned_query = ...
    query_regex = re.compile(...)
    # Collect items from every spider as they are scraped.
    dispatcher.connect(_crawler_result, signal=signals.item_scraped)
    spiders = [...]
    if len(partitioned_query):
        for spider in spiders:
            crawl_runner.crawl(spider, query_regex=query_regex)
    # wait_for() needs the Deferred back so it can block on it.
    return crawl_runner.join()
```
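The `_crawler_result` callback connected above fires once per scraped item. A minimal sketch of what it might look like; the shared `output_data` buffer is an illustrative name, not necessarily the app's:

```python
# Shared buffer that Flask reads after scrape_with_crochet() returns
# (illustrative name; the real app may store results differently).
output_data = []


def _crawler_result(item, response, spider):
    """Receive each item_scraped signal and stash a plain-dict copy."""
    output_data.append(dict(item))
```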
- Settings are passed from the Flask backend to the Scrapy framework through a configuration object:
```python
import json

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# Start from the project's Scrapy settings, then overlay the user's values.
settings = get_project_settings()
with open('app/api/routes/settings.json') as f:
    settings_dict = json.load(f)
settings.update(settings_dict)
crawl_runner = CrawlerRunner(settings)
```
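Which settings the app exposes in `settings.json` is up to the backend; an illustrative guess at the kind of keys it might carry (the names are genuine Scrapy settings, the values and selection are assumptions):

```python
# Illustrative contents of settings.json once parsed -- the real file's
# keys and values are assumptions, but these are genuine Scrapy settings.
settings_dict = {
    'DEPTH_LIMIT': 2,              # how many links deep a broad crawl may go
    'CLOSESPIDER_PAGECOUNT': 100,  # stop each spider after this many pages
    'CONCURRENT_REQUESTS': 32,     # global request parallelism
}
```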
- Each spider runs a broad crawl through the web, starting from a seed URL.
```python
import re

import scrapy


class BroadCrawler2(scrapy.Spider):
    """Broad crawling spider."""
    name = 'broad_crawler_2'
    start_urls = ['https://example.com/']

    def parse(self, response):
        """Yield matching text nodes, then follow every link on the page."""
        try:
            # All visible text nodes, skipping script and style contents.
            all_text = response.css('*:not(script):not(style)::text')
            for text in all_text:
                if re.search(self.query_regex, text.get()):
                    yield {
                        'url': response.request.url,
                        'text': text.get(),
                    }
        except Exception:
            self.logger.warning('End of the line error for %s.', self.name)
        yield from response.follow_all(css='a::attr(href)', callback=self.parse)
```
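Because `crawl_runner.crawl(spider, query_regex=query_regex)` passes `query_regex` as a spider argument, Scrapy sets it as an instance attribute, which is how `self.query_regex` becomes available in `parse()`. A quick, self-contained way to exercise the parse logic against a canned response (the HTML body and query are made up):

```python
import re

from scrapy.http import HtmlResponse, Request

url = 'https://example.com/'
body = b'<html><body><p>creepy crawler test</p><a href="/next">next</a></body></html>'
response = HtmlResponse(url=url, body=body, request=Request(url))

spider = BroadCrawler2(query_regex=re.compile(r'crawler'))
for result in spider.parse(response):
    print(result)  # one scraped item dict, then a Request for /next
```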
- AWS integration lets users upload custom backgrounds and profile images of their choice.
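The AWS wiring isn't shown in this section; a minimal sketch assuming the images are stored in S3 via boto3 (the bucket name, key scheme, and helper are hypothetical):

```python
import boto3

s3 = boto3.client('s3')


def upload_profile_image(user_id, file_obj, content_type='image/png'):
    """Store a user-supplied image in S3 and return its key (hypothetical helper)."""
    key = f'profile-images/{user_id}.png'
    s3.upload_fileobj(
        file_obj,
        'creepy-crawler-uploads',  # hypothetical bucket name
        key,
        ExtraArgs={'ContentType': content_type},
    )
    return key
```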
- The user can conveniently switch between 24-hour and 12-hour time formats.
- Moreover, NATO timezone abbreviations are parsed specially for users whose native settings differ from the default; see the sketch below.
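A rough illustration of both time features: mapping NATO zone letters to UTC offsets and toggling between 24-hour and 12-hour display (the function name and exact behavior are assumptions, not the app's actual implementation):

```python
from datetime import datetime, timedelta, timezone

# NATO letters: Z is UTC, A-I are +1..+9, K-M are +10..+12 (J is skipped,
# meaning local time), and N-Y are -1..-12.
NATO_OFFSETS = {'Z': 0}
NATO_OFFSETS.update({chr(ord('A') + i): i + 1 for i in range(9)})
NATO_OFFSETS.update({'K': 10, 'L': 11, 'M': 12})
NATO_OFFSETS.update({chr(ord('N') + i): -(i + 1) for i in range(12)})


def format_time(dt_utc, nato_letter='Z', use_24_hour=True):
    """Shift a UTC datetime by a NATO zone letter, then format it."""
    tz = timezone(timedelta(hours=NATO_OFFSETS[nato_letter.upper()]))
    return dt_utc.astimezone(tz).strftime('%H:%M' if use_24_hour else '%I:%M %p')


print(format_time(datetime.now(timezone.utc), 'A', use_24_hour=False))
```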