duke79/Gaveshak

Half assed but ambitious attempt at creating a search engine


Build & Install

Dependencies

1. Boost (1.64.0)

Boost is already compiled, zipped, and included in this repo. To use it directly:

  • Unzip boost from zip/boost_1_64_0_SHARED.7z into lib/boost

We currently use Boost DLLs for dynamic linking. If static linking is required, zip/boost_1_64_0.7z can be used instead.

To use a newer version of Boost:

  • Download Boost and install it to lib/boost
  • Run the following commands -
bootstrap.bat
b2.exe link=shared
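
To sanity-check the dynamic linking described above, a minimal program like the one below can be compiled against lib/boost (a sketch, not project code; BOOST_ALL_DYN_LINK is Boost's standard macro that makes MSVC auto-linking pick the DLL import libraries instead of the static ones):

// Link check for the shared Boost build.
#define BOOST_ALL_DYN_LINK
#include <boost/filesystem.hpp>
#include <iostream>

int main() {
    // Boost.Filesystem is a compiled (non-header-only) library, so this
    // only builds and runs if the import libs and DLLs are found.
    boost::filesystem::path p("lib/boost");
    std::cout << p.string()
              << (boost::filesystem::exists(p) ? " exists" : " is missing")
              << std::endl;
    return 0;
}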

2. CMake

  • Download and install CMake (minimum version 2.6)

3. cURL (curl-7.54.0)

{Pre-installed} - We may skip this step unless a newer version of cURL is required.

There are two ways to build cURL - cmake & nmake.

cmake

mkdir build  
cd build  
cmake ..
  • This builds a Visual Studio Solution. Open the solution and build it.
  • The following files are generated:
    • build\lib\Release\libcurl.dll
    • build\lib\Release\libcurl_imp.exp
    • build\lib\Release\libcurl_imp.lib

nmake

  • Change to /winbuild
  • Run the following command
nmake /f makefile.vc ENABLE_WINSSL=yes mode=dll MACHINE=x86 VC=14

nmake is available in the Visual Studio binaries (Microsoft Visual Studio 14.0\VC\bin). The following files are generated:

  • builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/bin/libcurl.dll
  • builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/bin/curl.exe
  • builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/lib/libcurl.lib
  • builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/lib/libcurl.exp

Copy files

  • Put these generated files in lib/libcURL/lib (the exe is not required :P)
  • Copy headers from
    • path_to_pulled_code/include/curl
    • to lib/libcURL/include/curl
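
Once the libraries and headers are in place, a minimal fetch like the one below can verify the setup (a sketch of what the crawler's fetcher needs; the URL is just an example):

// Fetch a page into a string with libcurl.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl invokes this callback for each chunk of the response body.
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    std::string body;
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow redirects
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        CURLcode res = curl_easy_perform(curl);
        if (res == CURLE_OK)
            std::cout << "fetched " << body.size() << " bytes" << std::endl;
        else
            std::cerr << curl_easy_strerror(res) << std::endl;
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return 0;
}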

4. Gumbo-Parser (0.10.1)

{Pre-installed} - We may skip this step unless a newer version of Gumbo-Parser is required.

  • Download repository from https://github.com/google/gumbo-parser

  • Open VS project in /visualc and build it to get gumbo.lib

  • Build two variants of gumbo.lib, i.e. debug & release (gumbo_debug.lib & gumbo_release.lib). Gumbo doesn't export symbols, so a DLL is not usable

  • Copy these libraries to Gaveshak/lib/gumbo/lib

  • Copy all .h files from gumbo/src to Gaveshak/lib/gumbo/include
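
A quick smoke test for the library (a sketch based on gumbo's own link-finding example) is to parse a page and print anchor hrefs, which is roughly what the crawler's Extractor has to do:

// Walk the gumbo parse tree and print every <a href>.
#include <gumbo.h>
#include <iostream>

static void find_links(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) return;
    if (node->v.element.tag == GUMBO_TAG_A) {
        GumboAttribute* href =
            gumbo_get_attribute(&node->v.element.attributes, "href");
        if (href) std::cout << href->value << std::endl;
    }
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i)
        find_links(static_cast<GumboNode*>(children->data[i]));
}

int main() {
    const char* html =
        "<html><body><a href='http://example.com'>link</a></body></html>";
    GumboOutput* output = gumbo_parse(html);
    find_links(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}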

5. Gumbo-Query

{Pre-installed} - We may skip this step unless a newer version of Gumbo-Query is required.

  • Download repository from https://github.com/lazytiger/gumbo-query
  • Copy the headers to Gaveshak/src/Parser/include
  • Copy the .cpp files to Gaveshak/src/Parser/src
  • Export all the classes of Gumbo-Query (Document, Node, ...)
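
Gumbo-Query wraps the gumbo tree in a jQuery-like CSS selector API. A minimal usage sketch, assuming the CDocument/CSelection/CNode classes as shown in the gumbo-query README:

// Select anchors with a CSS selector instead of walking the tree by hand.
#include "Document.h"
#include "Node.h"
#include <iostream>
#include <string>

int main() {
    std::string page = "<h1><a href='http://example.com'>hi</a></h1>";
    CDocument doc;
    doc.parse(page.c_str());
    CSelection links = doc.find("h1 a");
    for (unsigned int i = 0; i < links.nodeNum(); ++i) {
        CNode link = links.nodeAt(i);
        std::cout << link.attribute("href") << " -> " << link.text() << std::endl;
    }
    return 0;
}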

6. Storage

HBase

Cassandra

  • Docs
  • Installation
  • Getting started
  • Cassandra - DataStax downloads
    • datastax-community-64bit_3.0.9_2.msi
  • Windows binaries (32-bit) for Cpp-Driver can be downloaded from here along with dependencies.
    • cassandra-cpp-driver-2.7.0-win32-msvc140
  • DevCenter
    • DevCenter-1.5.0-win-x86_64.zip
  • Configuration
    • Open ports 22-62000 in the firewall, both inbound and outbound.
    • Cassandra.yaml
      • cluster_name must be the same for all the nodes
      • rpc_address & listen_address must be the IP of the current machine on the network
      • seeds must be a list of servers that provide information about the cluster to a new node. (One or two machines treated as hosts should be enough.)
    • Create a cqlshrc with hostname set to the current IP.
    • Add ../apache-cassandra/bin to PATH.
    • Run nodetool status to check the status of the nodes.
    • See reference1 & reference2 for multiple-node clusters if required
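
Once a node is up, connectivity from C++ can be checked with the Cpp-Driver (a minimal sketch; the contact point below is a placeholder for one of the seed nodes):

// Connect to a Cassandra cluster with the DataStax C/C++ driver.
#include <cassandra.h>
#include <cstdio>

int main() {
    CassCluster* cluster = cass_cluster_new();
    CassSession* session = cass_session_new();
    cass_cluster_set_contact_points(cluster, "127.0.0.1");  // a seed node IP

    CassFuture* connect_future = cass_session_connect(session, cluster);
    if (cass_future_error_code(connect_future) == CASS_OK) {
        printf("connected\n");
    } else {
        const char* message;
        size_t length;
        cass_future_error_message(connect_future, &message, &length);
        fprintf(stderr, "connect failed: %.*s\n", (int)length, message);
    }
    cass_future_free(connect_future);
    cass_session_free(session);
    cass_cluster_free(cluster);
    return 0;
}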

7. Testing

Google Test

Visual Studio solution setting for gtest before building:

VS Setting         Value
Runtime Library    Multi-threaded Debug DLL (/MDd)
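
With gtest built, a minimal test file like this one confirms the libraries link with the /MDd runtime (a sketch; the helper under test is a made-up stand-in, not project code):

// Smoke test: one assertion plus the standard gtest main.
#include <gtest/gtest.h>
#include <cctype>
#include <string>

// Hypothetical helper standing in for real crawler code.
static std::string lowercase_host(std::string host) {
    for (char& c : host)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return host;
}

TEST(CrawlerTest, LowercasesHostnames) {
    EXPECT_EQ("example.com", lowercase_host("EXAMPLE.com"));
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}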

Build

In the root directory -

mkdir build
cd build
cmake .. -Wno-dev

This builds a Visual Studio Solution. Open the solution and build it.

Planning & Information

ToDo

Crawler

How to build a crawler (Quora)

A bare minimum crawler needs at least these components:

  • Extractor: Minimal support to extract URLs from a page, e.g. anchor links.
  • Duplicate Eliminator: Makes sure the same content is not extracted twice unintentionally. Consider it a set-based data structure.
  • URL Frontier: Prioritizes the URLs that have to be fetched and parsed. Consider it a priority queue. (A sketch of the frontier and duplicate eliminator follows this list.)
  • Datastore: Stores retrieved pages, URLs, and other metadata.
  • Min. Delay
  • Bot ID
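
A bare-bones sketch of the Duplicate Eliminator and URL Frontier from the list above, as a hash set of seen URLs plus a priority queue (the integer priority score is a placeholder assumption):

// Seen-set plus priority queue for pending URLs.
#include <queue>
#include <string>
#include <unordered_set>

struct ScoredUrl {
    std::string url;
    int priority;  // higher = fetch sooner
    bool operator<(const ScoredUrl& other) const {
        return priority < other.priority;
    }
};

class UrlFrontier {
public:
    // Returns false if the URL was already seen (duplicate elimination).
    bool push(const std::string& url, int priority) {
        if (!seen_.insert(url).second) return false;
        queue_.push({url, priority});
        return true;
    }
    // Pops the highest-priority URL; returns false when the frontier is empty.
    bool pop(std::string& out) {
        if (queue_.empty()) return false;
        out = queue_.top().url;
        queue_.pop();
        return true;
    }
private:
    std::unordered_set<std::string> seen_;  // Duplicate Eliminator
    std::priority_queue<ScoredUrl> queue_;  // URL Frontier
};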

Crawler - ToDo

  • Relative path
  • MyID?
  • Bad URLs problem?
  • Robots.txt
  • Frequent proxy switch
  • Switch user agents once in a while (use the famous ones)
  • URL parsing (domain, http/https/...?, file-extension ...) - see the sketch after this list
  • Delay before fetching from same domain
  • Avoid traps
  • Max file size limit even if size is not known in advance
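
For the URL parsing item, the generic URI pattern from RFC 3986 Appendix B splits a URL into its parts; a rough sketch (real crawler code would need more normalization than this):

// Split a URL into scheme, domain, path, and extension via the RFC 3986 regex.
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string url = "https://example.com/docs/page.html?q=1#top";
    // Capture groups: 2 = scheme, 4 = authority (domain), 5 = path,
    // 7 = query, 9 = fragment.
    std::regex uri_re(R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (std::regex_match(url, m, uri_re)) {
        std::cout << "scheme: " << m[2] << "\n"
                  << "domain: " << m[4] << "\n"
                  << "path:   " << m[5] << "\n";
        // File extension, if any, is the path suffix after the last '.'.
        std::string path = m[5];
        std::string::size_type dot = path.rfind('.');
        if (dot != std::string::npos)
            std::cout << "ext:    " << path.substr(dot + 1) << "\n";
    }
    return 0;
}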

Scraping/Parsing

Indexing

Ranking

Query Processing

Search Algorithm

Data management

Distributed/Parallel Processing

Caching

Load balancing

Redundancy

Analytics

Links
