Boost is already compiled, zipped, and included in this repo. To use it directly:
- Unzip boost from zip/boost_1_64_0_SHARED.7z into lib/boost
We currently use Boost DLLs for dynamic linking. If static linking is required, zip/boost_1_64_0.7z can be used instead.
To use a newer version of Boost:
- Download boost and install it to lib/boost
- Run the following command:
b2.exe link=shared
- Download and install CMake (minimum version 2.6)
{Pre-installed} - We may skip this step unless a newer version of cURL is required.
There are two ways to build cURL: CMake and nmake.
- Download latest code from
- Clone and navigate to that directory. Run the following commands
mkdir build
cd build
cmake ..
- This generates a Visual Studio solution. Open the solution and build it.
- The following files are generated:
- build\lib\Release\libcurl.dll
- build\lib\Release\libcurl_imp.exp
- build\lib\Release\libcurl_imp.lib
- Change to /winbuild
- Run the following command
nmake /f Makefile.vc ENABLE_WINSSL=yes mode=dll MACHINE=x86 VC=14
nmake ships with the Visual Studio binaries (Microsoft Visual Studio 14.0\VC\bin).
- The following files are generated:
- builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/bin/libcurl.dll
- builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/bin/curl.exe
- builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/lib/libcurl.lib
- builds/libcurl-vc14-x86-release-dll-ipv6-sspi-winssl/lib/libcurl.exp
- Put these generated files in lib/libcURL/lib (exe is not required :P)
- Copy headers from
- path_to_pulled_code/include/curl
- to lib/libcURL/include/curl
{Pre-installed} - We may skip this step unless a newer version of Gumbo-Parser is required.
Download repository from
Open VS project in /visualc and build it to get gumbo.lib
Build two variants of gumbo.lib, i.e. debug and release (gumbo_debug.lib and gumbo_release.lib). Gumbo doesn't export symbols, so a DLL is not usable.
Copy these libraries to Gaveshak/lib/gumbo/lib
Copy all .h files from gumbo/src to Gaveshak/lib/gumbo/include
{Pre-installed} - We may skip this step unless a newer version of Gumbo-Query is required.
- Download repository from
- Copy the headers to Gaveshak/src/Parser/include
- Copy the .cpp files to Gaveshak/src/Parser/src
- Export all the classes of Gumbo-Query (Document, Node, ...)
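On MSVC, exporting a class from a DLL is usually done with `__declspec(dllexport)` when building the library and `__declspec(dllimport)` when consuming it. The sketch below shows that pattern. `PARSER_API`, `PARSER_EXPORTS`, and `ExampleNode` are hypothetical names used only for illustration; they are not taken from the Gumbo-Query sources, and the real classes (Document, Node, ...) would be annotated the same way.

```cpp
#include <string>
#include <utility>

// PARSER_API and PARSER_EXPORTS are hypothetical names for this sketch.
// Define PARSER_EXPORTS only in the project that builds the DLL.
#if defined(_MSC_VER)
#  if defined(PARSER_EXPORTS)
#    define PARSER_API __declspec(dllexport)  // building the DLL
#  else
#    define PARSER_API __declspec(dllimport)  // consuming the DLL
#  endif
#else
#  define PARSER_API  // expands to nothing on other compilers
#endif

// Stand-in for a Gumbo-Query class; the macro goes between
// the `class` keyword and the class name.
class PARSER_API ExampleNode {
public:
    explicit ExampleNode(std::string tag) : tag_(std::move(tag)) {}
    const std::string& tag() const { return tag_; }

private:
    std::string tag_;
};
```

MSVC may emit warning C4251 for exported classes that hold standard library members such as `std::string`; this is generally safe as long as the DLL and its consumers use the same runtime library setting.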
- Docs
- Installation
- Getting started
- Cassandra - DataStax downloads
- datastax-community-64bit_3.0.9_2.msi
- Windows binaries (32-bit) for Cpp-Driver can be downloaded from here along with dependencies.
- cassandra-cpp-driver-2.7.0-win32-msvc140
- DevCenter
- Configuration
- Open ports in firewall (22-62000). Inbound & Outbound both.
- Cassandra.yaml
- cluster_name must be same for all the nodes
- rpc_address & listen_address must be IP of current machine in network
- seeds must be a list of servers that give a new node information about the cluster. (One or two machines treated as hosts should be enough.)
- Create a cqlshrc with hostname set to current IP.
- Add ../apache-cassandra/bin to path.
- Run the following command to check the status of the nodes:
nodetool status
- Multiple node clusters reference1 & reference2 if required
Visual Studio solution setting for gtest before building:

| VS Setting | Value |
| --- | --- |
| Runtime Library | Multi-threaded Debug DLL (/MDd) |
In the root directory:
mkdir build
cd build
cmake .. -Wno-dev
This generates a Visual Studio solution. Open the solution and build it.
- Design/Architecture
- Testing
- GoogleTest :
- Smart Pointers
- What is the need for Globals.h?
- Environment variables may be used instead in the future!
- EXE for modules
- Maybe useful in testing too
- Classes, not modules, must be the units of testing
- Fetcher
- Download in parts
- Cookies to be stored in local files?
- User agents sortable/categorized by platform and browser
- May use a list of classes/structs representing a useragent
- Crawler
- Extract "text" from the page before storing. (or not? Google stores original pages!)
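The Fetcher note above about representing user agents as a list of structs, sortable by platform and browser, could look roughly like this. It is a minimal sketch: the `UserAgent` struct, its field names, and both helper functions are assumptions made for illustration, not existing project code.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record for one user agent; field names are illustrative.
struct UserAgent {
    std::string platform;  // e.g. "Windows", "Linux"
    std::string browser;   // e.g. "Firefox", "Chrome"
    std::string header;    // full User-Agent header value to send
};

// Sort by platform first, then by browser, so related agents sit together.
inline void sortUserAgents(std::vector<UserAgent>& agents) {
    std::sort(agents.begin(), agents.end(),
              [](const UserAgent& a, const UserAgent& b) {
                  return std::tie(a.platform, a.browser) <
                         std::tie(b.platform, b.browser);
              });
}

// Return a copy of every agent that belongs to the given platform.
inline std::vector<UserAgent> byPlatform(const std::vector<UserAgent>& agents,
                                         const std::string& platform) {
    std::vector<UserAgent> out;
    std::copy_if(agents.begin(), agents.end(), std::back_inserter(out),
                 [&](const UserAgent& ua) { return ua.platform == platform; });
    return out;
}
```

Keeping platform and browser as separate fields, rather than parsing them out of the header string each time, makes both sorting and categorized lookup a simple comparison.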
A bare minimum crawler needs at least these components:
- Extractor: Minimal support to extract URLs from a page, such as anchor links.
- Duplicate Eliminator: To make sure the same content is not extracted twice unintentionally. Consider it as a set-based data structure.
- URL Frontier: To prioritize the URLs that have to be fetched and parsed. Consider it as a priority queue.
- Datastore: To store and retrieve pages, URLs, and other metadata.
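As a rough illustration of two of the components above, the sketch below pairs a set-based duplicate eliminator with a priority-queue frontier. `FrontierEntry` and `UrlFrontier` are hypothetical names, and deduplication here is by exact URL string rather than by page content, which is a simplifying assumption.

```cpp
#include <queue>
#include <string>
#include <unordered_set>

// Hypothetical frontier entry: a larger priority value is fetched first.
struct FrontierEntry {
    int priority;
    std::string url;
    bool operator<(const FrontierEntry& other) const {
        return priority < other.priority;  // std::priority_queue is a max-heap
    }
};

class UrlFrontier {
public:
    // Adds the URL unless it was seen before. Returns false for duplicates.
    bool push(const std::string& url, int priority) {
        if (!seen_.insert(url).second) return false;  // already in the set
        queue_.push(FrontierEntry{priority, url});
        return true;
    }

    bool empty() const { return queue_.empty(); }

    // Removes and returns the highest-priority URL. Precondition: !empty().
    std::string pop() {
        std::string url = queue_.top().url;
        queue_.pop();
        return url;
    }

private:
    std::unordered_set<std::string> seen_;       // duplicate eliminator
    std::priority_queue<FrontierEntry> queue_;   // URL frontier
};
```

A URL stays in the seen set after it is popped, so it is never queued a second time. A real crawler would likely normalize URLs before the lookup so that trivially different spellings of the same address are treated as duplicates.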
- Min. Delay
- Bot ID
- Relative path
- MyID?
- Bad URLs problem?
- Robots.txt
- Frequent proxy switch
- Switch user agents once in a while (use the famous ones)
- URL parsing (domain, http/https/...?, file-extension ...)
- Delay before fetching from same domain
- Avoid traps
- Max file size limit even if the size is not known in advance
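Two items from the list above, URL parsing and the delay before fetching from the same domain, can be sketched together, since the delay is keyed on the host that parsing produces. This is a simplified sketch under stated assumptions: `UrlParts`, `splitUrl`, and `DomainThrottle` are hypothetical names, and the split is not a full RFC 3986 parser (it ignores user info and does not separate the port from the host).

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical result type for a simplified URL split.
struct UrlParts {
    std::string scheme;     // "http", "https", ... (empty if missing)
    std::string host;       // domain, may still include ":port"
    std::string path;       // always starts with '/', defaults to "/"
    std::string extension;  // e.g. "html", empty if the path has none
};

inline UrlParts splitUrl(const std::string& url) {
    UrlParts parts;
    std::string rest = url;
    const std::size_t schemeEnd = rest.find("://");
    if (schemeEnd != std::string::npos) {
        parts.scheme = rest.substr(0, schemeEnd);
        rest = rest.substr(schemeEnd + 3);
    }
    // Drop the query string and fragment before looking at the path.
    rest = rest.substr(0, rest.find_first_of("?#"));
    const std::size_t slash = rest.find('/');
    parts.host = rest.substr(0, slash);
    parts.path = (slash == std::string::npos) ? "/" : rest.substr(slash);
    // An extension only counts if the dot is in the last path segment.
    const std::size_t lastSlash = parts.path.rfind('/');
    const std::size_t dot = parts.path.rfind('.');
    if (dot != std::string::npos && dot > lastSlash) {
        parts.extension = parts.path.substr(dot + 1);
    }
    return parts;
}

// Enforces a minimum delay between two fetches from the same host.
class DomainThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit DomainThrottle(std::chrono::milliseconds minDelay)
        : minDelay_(minDelay) {}

    // Returns true and records the fetch time if the host may be fetched
    // at `now`; returns false if the minimum delay has not yet elapsed.
    bool tryAcquire(const std::string& host,
                    Clock::time_point now = Clock::now()) {
        const auto it = lastFetch_.find(host);
        if (it != lastFetch_.end() && now - it->second < minDelay_) {
            return false;
        }
        lastFetch_[host] = now;
        return true;
    }

private:
    std::chrono::milliseconds minDelay_;
    std::unordered_map<std::string, Clock::time_point> lastFetch_;
};
```

Passing the time point as a parameter keeps the throttle testable without sleeping. A URL that `tryAcquire` rejects would typically go back into the frontier rather than block the fetcher.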