Creepy Crawler is a full-stack search engine application inspired by popular search engines. Users can run queries, review their search history, and set a custom theme.

Python SQLAlchemy Flask JavaScript React Redux Scrapy HTML CSS AWS

Crawl the web 🕷

search

  • Queries from the frontend are received by Flask, processed, and passed asynchronously to the Scrapy spiders with help from the Crochet library.
    import re

    import crochet
    from scrapy import signals
    from scrapy.signalmanager import dispatcher

    crochet.setup()

    @crochet.wait_for(timeout=200.0)
    def scrape_with_crochet(raw_query):
        partitioned_query = ...
        query_regex = re.compile(...)
        # Call _crawler_result for every item a spider scrapes.
        dispatcher.connect(_crawler_result, signal=signals.item_scraped)
        spiders = [...]
        if len(partitioned_query):
            for spider in spiders:
                crawl_runner.crawl(spider, query_regex=query_regex)
            # Return the Deferred so wait_for blocks until every crawl finishes.
            return crawl_runner.join()
  • Settings are passed from the Flask backend to the Scrapy framework through a configuration object.
    ...
    from scrapy.utils.project import get_project_settings
    ...
    settings = get_project_settings()
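    # Open the file in a context manager so the handle is always closed.
    with open('app/api/routes/settings.json') as settings_file:
        settings_dict = json.load(settings_file)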
    settings.update(settings_dict)
    crawl_runner = CrawlerRunner(settings)
  • Each spider runs a broad crawl through the web, starting from a seed URL.
    import re

    import scrapy


    class BroadCrawler2(scrapy.Spider):
        """Broad crawling spider."""

        name = 'broad_crawler_2'
        start_urls = ['https://example.com/']

        def parse(self, response):
            """Yield text nodes that match the query, then follow every link."""
            try:
                # Select text nodes, skipping <script> and <style> elements.
                for node in response.css('*:not(script):not(style)::text'):
                    text = node.get()
                    if re.search(self.query_regex, text):
                        yield {'url': response.request.url, 'text': text}
            except Exception:
                print(f'End of the line error for {self.name}.')

            yield from response.follow_all(css='a::attr(href)', callback=self.parse)
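
The snippets above elide how the raw query is partitioned and compiled, so the following is only a minimal sketch of the matching step under stated assumptions. The helpers `build_query_regex` and `matching_texts` are hypothetical and not part of the project, and splitting the query on whitespace is an assumption. Plain strings stand in for Scrapy selectors so the sketch runs with the standard library alone.

```python
import re
from typing import Iterable, Iterator, Optional


def build_query_regex(raw_query: str) -> Optional[re.Pattern]:
    """Compile a case-insensitive pattern that matches any word of the query.

    Hypothetical helper: whitespace splitting is an assumption, not
    necessarily how the application partitions queries.
    """
    words = raw_query.split()
    if not words:
        # Nothing to search for, so the caller can skip the crawl.
        return None
    # re.escape keeps user input from being read as regex syntax.
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(f"(?:{alternatives})", re.IGNORECASE)


def matching_texts(
    url: str, texts: Iterable[str], query_regex: re.Pattern
) -> Iterator[dict]:
    """Yield one item per matching text, mirroring the spider's parse step."""
    for text in texts:
        if query_regex.search(text):
            yield {"url": url, "text": text}


if __name__ == "__main__":
    regex = build_query_regex("creepy crawler")
    page_texts = ["A creepy house", "nothing here", "web CRAWLER"]
    for item in matching_texts("https://example.com/", page_texts, regex):
        print(item)
```

Returning `None` for an empty query plays the same role as the `if len(partitioned_query)` guard: no spiders need to start when there is nothing to match.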

Create custom themes 🎨

custom themes

  • AWS integration allows users to add backgrounds and profile images of their choice.

Look over your search history 🔍

history

  • The user can conveniently switch between 24-hour and 12-hour time.
  • NATO time zone abbreviations are specially parsed for users with altered native settings.
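
As a rough illustration of the two points above, the sketch below formats a timestamp in either clock style and maps a NATO (military) zone letter to its UTC offset. It is written in Python for consistency with the other snippets. Where and how the application actually performs this parsing is not shown in this README, and the function names `nato_offset_hours`, `to_zone`, and `format_clock` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Military zone letters: A-I are UTC+1..+9, K-M are UTC+10..+12,
# N-Y are UTC-1..-12, and Z is UTC. J is skipped (it denotes local time).
_EAST = "ABCDEFGHIKLM"
_WEST = "NOPQRSTUVWXY"


def nato_offset_hours(letter: str) -> int:
    """Return the whole-hour UTC offset for a military zone letter."""
    letter = letter.strip().upper()
    if len(letter) != 1:
        raise ValueError(f"expected a single zone letter, got {letter!r}")
    if letter == "Z":
        return 0
    if letter in _EAST:
        return _EAST.index(letter) + 1
    if letter in _WEST:
        return -(_WEST.index(letter) + 1)
    raise ValueError(f"{letter!r} is not a fixed-offset zone letter")


def to_zone(moment: datetime, letter: str) -> datetime:
    """Convert an aware datetime to the zone named by a military letter."""
    offset = timedelta(hours=nato_offset_hours(letter))
    return moment.astimezone(timezone(offset))


def format_clock(moment: datetime, use_24_hour: bool) -> str:
    """Format the time of day as 24-hour 'HH:MM' or 12-hour 'H:MM AM/PM'."""
    if use_24_hour:
        return moment.strftime("%H:%M")
    hour = moment.hour % 12 or 12  # 0 and 12 both display as 12
    suffix = "AM" if moment.hour < 12 else "PM"
    return f"{hour}:{moment.minute:02d} {suffix}"


if __name__ == "__main__":
    searched_at = datetime(2021, 3, 4, 18, 5, tzinfo=timezone.utc)
    print(format_clock(searched_at, use_24_hour=True))
    print(format_clock(to_zone(searched_at, "R"), use_24_hour=False))
```

Building the AM/PM suffix by hand rather than with `%p` keeps the output independent of the process locale.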

Enjoy advanced interactions with your themes 🧮

theme interaction

Contact

Errors I encountered and conquered: