Skip to content

Latest commit

 

History

History
267 lines (197 loc) · 14.9 KB

README.markdown

File metadata and controls

267 lines (197 loc) · 14.9 KB

Simple web-crawler for node.js

Simplecrawler is designed to provide the most basic possible API for crawling websites, while being as flexible and robust as possible. I wrote simplecrawler to archive, analyse, and search some very large websites. It has happily chewed through 50,000 pages and written tens of gigabytes to disk without issue.

What does simplecrawler do?

  • Provides a very simple event driven API using EventEmitter
  • Extremely configurable base for writing your own crawler
  • Provides some simple logic for autodetecting linked resources - which you can replace or augment
  • Has a flexible queue system which can be frozen to disk and defrosted
  • Provides basic statistics on network performance
  • Uses buffers for fetching and managing data, preserving binary data (except when discovering links)

Installation

npm install simplecrawler

Getting Started

Creating a new crawler is very simple. First you'll need to include it:

var Crawler = require("simplecrawler").Crawler;

Then create your crawler:

var myCrawler = new Crawler("www.example.com");

Nonstandard port? HTTPS? Want to start archiving a specific path? No problem:

myCrawler.initialPath = "/archive";
myCrawler.initialPort = 8080;
myCrawler.initialProtocol = "https";

// Or:
var myCrawler = new Crawler("www.example.com","/archive",8080);

And of course, you're probably wanting to ensure you don't take down your webserver. Decrease the concurrency from five simultaneous requests - and increase the request interval from the default 250ms like this:

myCrawler.interval = 10000; // Ten seconds
myCrawler.maxConcurrency = 1;

For brevity, you may also specify the initial path and request interval when creating the crawler:

var myCrawler = new Crawler("www.example.com","/",8080,300);

Running the crawler

First, you'll need to set up an event listener to get the fetched data:

myCrawler.on("fetchcomplete",function(queueItem, responseBuffer, response) {
	console.log("I just received %s (%d bytes)",queueItem.url,responseBuffer.length);
	console.log("It was a resource of type %s",response.headers['content-type']);
	
	// Do something with the data in responseBuffer
});

Then, when you're satisfied you're ready to go, start the crawler! It'll run through its queue finding linked resources on the domain to download, until it can't find any more.

myCrawler.start();

Of course, once you've got that down pat, there's a fair bit more you can listen for...

Events

  • queueadd ( queueItem ) Fired when a new item is automatically added to the queue (not when you manually queue an item yourself.)
  • queueerror ( errorData , URLData ) Fired when an item cannot be added to the queue due to error.
  • fetchstart ( queueItem ) Fired when an item is spooled for fetching.
  • fetchheaders ( queueItem , responseObject ) Fired when the headers for a resource are received from the server. The node http response object is returned for your perusal.
  • fetchcomplete ( queueItem , responseBuffer , response ) Fired when the resource is completely downloaded. The entire file data is provided as a buffer, as well as the response object.
  • fetchdataerror ( queueItem, response ) Fired when a resource can't be downloaded, because it exceeds the maximum size we're prepared to receive (16MB by default.)
  • fetchredirect ( queueItem, parsedURL, response ) Fired when a redirect header is encountered. The new URL is validated and returned as a complete canonical link to the new resource.
  • fetch404 ( queueItem, response ) Fired when a 404 HTTP status code is returned for a request.
  • fetcherror ( queueItem, response ) Fired when an alternate 400 or 500 series HTTP status code is returned for a request.
  • fetchclienterror ( queueItem, errorData ) Fired when a request dies locally for some reason. The error data is returned as the second parameter.
  • complete Fired when the crawler completes processing all the items in its queue, and does not find any more to add. This event returns no arguments.

####A note about HTTP error conditions By default, simplecrawler does not download the response body when it encounters an HTTP error status in the response. If you need this information, you can listen to simplecrawler's error events, and through node's native data event (response.on("data",function(chunk) {...})) you can save the information yourself.

If this is annoying, and you'd really like to retain error pages by default, let me know. I didn't include it because I didn't need it - but if it's important to people I might put it back in. :)

Configuring the crawler

Here's a complete list of what you can stuff with at this stage:

  • crawler.host - The domain to scan. By default, simplecrawler will restrict all requests to this domain.
  • crawler.initialPath - The initial path with which the crawler will formulate its first request. Does not restrict subsequent requests.
  • crawler.initialPort - The initial port with which the crawler will formulate its first request. Does not restrict subsequent requests.
  • crawler.initialProtocol - The initial protocol with which the crawler will formulate its first request. Does not restrict subsequent requests.
  • crawler.interval - The interval with which the crawler will spool up new requests (one per tick.) Defaults to 250ms.
  • crawler.maxConcurrency - The maximum number of requests the crawler will run simultaneously. Defaults to 5 - the default number of http agents nodejs will run.
  • crawler.timeout - The maximum time the crawler will wait for headers before aborting the request.
  • crawler.userAgent - The user agent the crawler will report. Defaults to Node/SimpleCrawler 0.1 (http://www.github.com/cgiffard/node-simplecrawler).
  • crawler.queue - The queue in use by the crawler (Must implement the FetchQueue interface)
  • crawler.filterByDomain - Specifies whether the crawler will restrict queued requests to a given domain/domains.
  • crawler.scanSubdomains - Enables scanning subdomains (other than www) as well as the specified domain. Defaults to false.
  • crawler.ignoreWWWDomain - Treats the www domain the same as the originally specified domain. Defaults to true.
  • crawler.stripWWWDomain - Or go even further and strip WWW subdomain from requests altogether!
  • crawler.discoverResources - Use simplecrawler's internal resource discovery function. Defaults to true. (switch it off if you'd prefer to discover and queue resources yourself!)
  • crawler.cache - Specify a cache architecture to use when crawling. Must implement SimpleCache interface.
  • crawler.useProxy - The crawler should use an HTTP proxy to make its requests.
  • crawler.proxyHostname - The hostname of the proxy to use for requests.
  • crawler.proxyPort - The port of the proxy to use for requests.
  • crawler.domainWhitelist - An array of domains the crawler is permitted to crawl from. If other settings are more permissive, they will override this setting.
  • crawler.supportedMimeTypes - An array of RegEx objects used to determine supported MIME types (types of data simplecrawler will scan for links.) If you're not using simplecrawler's resource discovery function, this won't have any effect.
  • crawler.allowedProtocols - An array of RegEx objects used to determine whether a URL protocol is supported. This is to deal with nonstandard protocol handlers that regular HTTP is sometimes given, like feed:. It does not provide support for non-http protocols (and why would it!?)
  • crawler.maxResourceSize - The maximum resource size, in bytes, which will be downloaded. Defaults to 16MB.
  • crawler.downloadUnsupported - Simplecrawler will download files it can't parse. Defaults to true, but if you'd rather save the RAM and GC lag, switch it off.
  • crawler.needsAuth - Flag to specify if the domain you are hitting requires basic authentication
  • crawler.authUser - Username provdied for needsAuth flag
  • crawler.authPass - Passowrd provided for needsAuth flag

The Simplecrawler Queue

Simplecrawler has a queue like any other web crawler. It can be directly accessed at crawler.queue (assuming you called your Crawler() object crawler.) It provides array access, so you can get to queue items just with array notation and an index.

crawler.queue[5];

For compatibility with different backing stores, it now provides an alternate interface which the crawler core makes use of:

crawler.queue.get(5);

It's not just an array though.

Adding to the queue

You could always just .push a new resource onto the queue, but you'd need to have it all in the correct format, and validate the URL yourself, and oh wouldn't that be a pain. Instead, use the queue.add function provided for your convenience:

crawler.queue.add(protocol,domain,port,path);

That's it! It's basically just a URL, but comma separated (that's how you can remember the order.)

Queue items

Because when working with simplecrawler, you'll constantly be handed queue items, it helps to know what's inside them. These are the properties every queue item is expected to have:

  • url - The complete, canonical URL of the resource.
  • protocol - The protocol of the resource (http, https)
  • domain - The full domain of the resource
  • port - The port of the resource
  • path - The bit of the URL after the domain - includes the querystring.
  • fetched - Has the request for this item been completed? You can monitor this as requests are processed.
  • status - The internal status of the item, always a string. This can be one of:
    • queued - The resource is in the queue to be fetched, but nothing's happened to it yet.
    • spooled - A request has been made to the remote server, but we're still waiting for a response.
    • headers - The headers for the resource have been received.
    • downloaded - The item has been entirely downloaded.
    • redirected - The resource request returned a 300 series response, with a Location header and a new URL.
    • notfound - The resource could not be found. (404)
    • failed - An error occurred when attempting to fetch the resource.
  • stateData - An object containing state data and other information about the request:
    • requestLatency - The time taken for headers to be received after the request was made.
    • requestTime - The total time taken for the request (including download time.)
    • downloadTime - The total time taken for the resource to be downloaded.
    • contentLength - The length (in bytes) of the returned content. Calculated based on the content-length header.
    • contentType - The MIME type of the content.
    • code - The HTTP status code returned for the request.
    • headers - An object containing the header information returned by the server. This is the object node returns as part of the response object.
    • actualDataSize - The length (in bytes) of the returned content. Calculated based on what is actually received, not the content-length header.
    • sentIncorrectSize - True if the data length returned by the server did not match what we were told to expect by the content-length header.

You can address these properties like you would any other object:

crawler.queue[52].url;
queueItem.stateData.contentLength;
queueItem.status === "queued";

As you can see, you can get a lot of meta-information out about each request. The upside is, the queue actually has some convenient functions for getting simple aggregate data about the queue...

Queue Statistics and Reporting

First of all, the queue can provide some basic statistics about the network performance of your crawl (so far.) This is done live, so don't check it thirty times a second. You can test the following properties:

  • requestTime
  • requestLatency
  • downloadTime
  • contentLength
  • actualDataSize

And you can get the maximum, minimum, and average values for each with the crawler.queue.max, crawler.queue.min, and crawler.queue.avg functions respectively. Like so:

console.log("The maximum request latency was %dms.",crawler.queue.max("requestLatency"));
console.log("The minimum download time was %dms.",crawler.queue.min("downloadTime"));
console.log("The average resource size received is %d bytes.",crawler.queue.avg("actualDataSize"));

You'll probably often need to determine how many items in the queue have a given status at any one time, and/or retreive them. That's easy with crawler.queue.countWithStatus and crawler.queue.getWithStatus.

crawler.queue.getwithStatus returns the number of queued items with a given status, while crawler.queue.getWithStatus returns an array of the queue items themselves.

var redirectCount = crawler.queue.countWithStatus("redirected");

crawler.queue.getWithStatus("failed").forEach(function(queueItem) {
	console.log("Whoah, the request for %s failed!",queueItem.url);
	
	// do something...
});

Then there's some even simpler convenience functions:

  • crawler.queue.complete - returns the number of queue items which have been completed (marked as fetched)
  • crawler.queue.errors - returns the number of requests which have failed (404s and other 400/500 errors, as well as client errors)

Saving and reloading the queue (freeze/defrost)

You'll probably want to be able to save your progress and reload it later, if your application fails or you need to abort the crawl for some reason. (Perhaps you just want to finish off for the night and pick it up tomorrow!) The crawler.queue.freeze and crawler.queue.defrost functions perform this task.

A word of warning though - they are not CPU friendly or set up to be asynchronous, as they rely on JSON.parse and JSON.stringify. Use them only when you need to save the queue - don't call them every request or your application's performance will be incredibly poor - they block like crazy. That said, using them when your crawler commences and stops is perfectly reasonable.

// Freeze queue
crawler.queue.freeze("mysavedqueue.json");

// Defrost queue
crawler.queue.defrost("mysavedqueue.json");

Licence

You may copy and use this library as you see fit (including commercial use) and modify it, as long as you retain my attribution comment (which includes my name, link to this github page, and library version) at the top of all script files. You may not, under any circumstances, claim you wrote this library, or remove my attribution. (Fair's fair!)

I'd appreciate it if you'd contribute patches back, but you don't have to. If you do, I'll be happy to credit your contributions!

Contributors

Thanks to Nick Crohn for the HTTP Basic auth support. Thanks also to Mike Moulton for fixing a bug in the URL discovery mechanism and Mike Iannacone for correcting a keyword naming collision with node 0.8's EventEmitter.