Semantic web crawler built in Rails using Mechanize, Nokogiri, RMagick, and Sidekiq.
ImageMagick must be installed, along with the other system packages the crawler depends on:
$ sudo apt-get install imagemagick libmagickwand-dev redis-server firefox xvfb
To try out the crawler, run the scrimper rake task:
$ bundle exec rake crawl:scrimper[]
Request example
curl "http://localhost:3000/v1" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "active": true,
    "api_usage_cap": 100000000,
    "api_last_used": "2015-02-24",
    "api_daily_usage": 17,
    "available": {
      "amazon-offers": "2.9M",
      "walmart-offers": "134.4K",
      "costco-offers": "83K",
      "target-offers": "38.4K"
    },
    "indexing": "180",
    "processing": "453.6K",
    "pending": "947.2M"
  }
}
Check the status of your current API usage, as well as the data sources available to pull from.
GET http://localhost:3000/v1
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
Status Type | Description |
active | API key is still active |
api_usage_cap | Maximum number of API calls allowed per day (can be unlimited) |
api_last_used | Date of the last call your key made to our API |
api_daily_usage | Total number of API calls made on the api_last_used date |
available | Name and total number of data containers available for you to research |
indexing | Collected data being added to search |
processing | Data still waiting to be processed |
pending | Data still waiting to be added to processing |
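
As a rough illustration of the request above, the following Ruby sketch builds the authenticated status request without sending it. The helper name `build_status_request` is hypothetical; the URL and the `Authorization` header format are taken from the curl example.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) an authenticated GET request.
# The default URL and the header format mirror the curl example.
def build_status_request(token, base_url: "http://localhost:3000/v1")
  uri = URI.parse(base_url)
  request = Net::HTTP::Get.new(uri)
  request["Authorization"] = "Token token=#{token}"
  request
end

request = build_status_request("YOUR-ACCESS-TOKEN")
puts request["Authorization"]
puts request.path

# To send it (requires the app to be running locally):
#   uri = request.uri
#   response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Building the request separately from sending it makes the header easy to inspect before any network call is made.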
Request example
curl "http://localhost:3000/v1.xml" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
<active type="boolean">true</active>
<api-usage-cap type="integer">1000</api-usage-cap>
<api-daily-usage type="integer">38</api-daily-usage>
Check the status of your current API usage, as well as the data sources available to pull from.
GET http://localhost:3000/v1.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
Status Type | Description |
active | API key is still active |
api_usage_cap | Maximum number of API calls allowed per day (can be unlimited) |
api_last_used | Date of the last call your key made to our API |
api_daily_usage | Total number of API calls made on the api_last_used date |
available | Name and total number of data containers available for you to research |
indexing | Collected data being added to search |
processing | Data still waiting to be processed |
pending | Data still waiting to be added to processing |
Request example
curl "http://localhost:3000/v1/amazon-offers/search/chromecast" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "results": [
      {
        "id": "B00DR0PDNE",
        "name": "Google Chromecast HDMI Streaming Media Player",
        "container": "amazon-offers"
      }
    ]
  }
}
Search Items in a container based on a query or keyword.
GET http://localhost:3000/v1/:CONTAINER/search/:QUERY
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container you are searching in |
:QUERY | true | What you are searching for |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
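
The search URL can be assembled as in the sketch below. The helper name `search_url` is hypothetical, and passing `fetch` and `social` as query-string parameters is an assumption: the table lists them as parameters but does not say how they are sent.

```ruby
require "erb"
require "uri"

# Hypothetical helper: builds the search URL for one container.
# Assumption: fetch and social are sent as query-string parameters.
def search_url(container, query, fetch: nil, social: nil, base_url: "http://localhost:3000/v1")
  # Percent-encode the query so spaces and slashes are safe in a path segment.
  path = "#{base_url}/#{container}/search/#{ERB::Util.url_encode(query)}"
  # Only include the optional flags that were explicitly set.
  params = { fetch: fetch, social: social }.compact
  params.empty? ? path : "#{path}?#{URI.encode_www_form(params)}"
end

puts search_url("amazon-offers", "chromecast")
puts search_url("amazon-offers", "hdmi streaming", social: true)
```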
Request example
curl "http://localhost:3000/v1/amazon-offers/search/chromecast.xml" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
<results type="array">
<name>Google Chromecast HDMI Streaming Media Player</name>
Search Items in a container based on a query or keyword.
GET http://localhost:3000/v1/:CONTAINER/search/:QUERY.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container you are searching in |
:QUERY | true | What you are searching for |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
Request example
curl "http://localhost:3000/v1/walmart-offers/match?model=86002596-01" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "results": [
      {
        "url": "",
        "date": "2015-02-24",
        "open_graph": true,
        "type": "Offer",
        "id": "811571013579",
        "image": "",
        "site_name": "",
        "schema_org": true,
        "tags": [],
        "name": "Google Chromecast HDMI Streaming Media Player",
        "ItemID": "811571013579",
        "screenshot": "811571013579/2015-02-23.jpg",
        "price": "30.07",
        "priceCurrency": "USD",
        "availability": "InStock",
        "title": "Media Streaming Players",
        "sku": "811571013579",
        "mpn": "86002596-01",
        "brand": "Google",
        "model": "86002596-01",
        "facebook_shares": 42,
        "google_shares": 47,
        "twitter_shares": 12,
        "pinterest_shares": 10,
        "stumbleupon_shares": 1,
        "total_shares": 112,
        "container": "walmart-offers"
      }
    ]
  }
}
Match Items in a container based on a specific set of known parameters and values.
GET http://localhost:3000/v1/:CONTAINER/match
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container you are searching in |
:QUERY | true | What you are searching for |
results | false | Number of Results you want back (default: 1) |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
url | false | Unique URL for Item |
date | false | When data was last gathered |
id | false | Unique ID for Item |
tags | false | Tags associated with Item |
name | false | Unique name of Item |
description | false | Given description for Item |
type | false | Item Type |
image | false | Unique Item image |
facebook_shares | false | Number of times Item has been shared on Facebook |
google_shares | false | Number of times Item has been shared on Google Plus |
twitter_shares | false | Number of times Item has been shared on Twitter |
reddit_shares | false | Number of times Item has been shared on Reddit |
linkedin_shares | false | Number of times Item has been shared on LinkedIn |
pinterest_shares | false | Number of times Item has been shared on Pinterest |
stumbleupon_shares | false | Number of times Item has been shared on StumbleUpon |
total_shares | false | Number of times Item has been shared on Social Media |
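
A match request is a set of field/value pairs in the query string, as in the `model=86002596-01` example. The sketch below builds such a URL from a hash; the helper name `match_url` is hypothetical, and the check for at least one criterion is an assumption rather than documented API behaviour.

```ruby
require "uri"

# Hypothetical helper: builds a match URL from a hash of known fields.
# Assumption: every criterion is passed as a query-string parameter.
def match_url(container, criteria, base_url: "http://localhost:3000/v1")
  raise ArgumentError, "at least one criterion is required" if criteria.empty?
  # encode_www_form escapes keys and values and joins the pairs with "&".
  "#{base_url}/#{container}/match?#{URI.encode_www_form(criteria)}"
end

puts match_url("walmart-offers", { "model" => "86002596-01" })
puts match_url("walmart-offers", { "brand" => "Google", "results" => 3 })
```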
Request example
curl "http://localhost:3000/v1/walmart-offers/match.xml?mpn=86002596-01" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
<results type="array">
<open-graph type="boolean">true</open-graph>
<schema-org type="boolean">true</schema-org>
<tags type="array">
<name>Google Chromecast HDMI Streaming Media Player</name>
<title>Media Streaming Players</title>
<facebook-shares type="integer">42</facebook-shares>
<google-shares type="integer">47</google-shares>
<twitter-shares type="integer">12</twitter-shares>
<pinterest-shares type="integer">10</pinterest-shares>
<stumbleupon-shares type="integer">1</stumbleupon-shares>
<total-shares type="integer">112</total-shares>
Match Items in a container based on a specific set of known parameters and values.
GET http://localhost:3000/v1/:CONTAINER/match.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container you are searching in |
:QUERY | true | What you are searching for |
results | false | Number of Results you want back (default: 1) |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
url | false | Unique URL for Item |
date | false | When data was last gathered |
id | false | Unique ID for Item |
tags | false | Tags associated with Item |
name | false | Unique name of Item |
description | false | Given description for Item |
type | false | Item Type |
image | false | Unique Item image |
facebook_shares | false | Number of times Item has been shared on Facebook |
google_shares | false | Number of times Item has been shared on Google Plus |
twitter_shares | false | Number of times Item has been shared on Twitter |
reddit_shares | false | Number of times Item has been shared on Reddit |
linkedin_shares | false | Number of times Item has been shared on LinkedIn |
pinterest_shares | false | Number of times Item has been shared on Pinterest |
stumbleupon_shares | false | Number of times Item has been shared on StumbleUpon |
total_shares | false | Number of times Item has been shared on Social Media |
Request example
curl "http://localhost:3000/v1/amazon-offers/B00DR0PDNE" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "url": "",
    "date": "2015-02-24",
    "id": "B00DR0PDNE",
    "tags": [],
    "name": "Google Chromecast HDMI Streaming Media Player",
    "description": " Google Chromecast HDMI Streaming Media Player: Electronics",
    "type": "Offer",
    "image": "",
    "sku": "B00DR0PDNE",
    "screenshot": "B00DR0PDNE/2015-02-24.jpg",
    "price": "30.07",
    "original_price": "35.00",
    "facebook_shares": 48466,
    "google_shares": 5776,
    "twitter_shares": 177,
    "reddit_shares": 149,
    "linkedin_shares": 364,
    "pinterest_shares": 106,
    "stumbleupon_shares": 23,
    "total_shares": 55061
  }
}
Get the most up-to-date Item information by Item ID.
GET http://localhost:3000/v1/:CONTAINER/:ID
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container |
:ID | true | Item ID |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
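
The share counts in an Item response can be cross-checked on the client. The sketch below assumes that `total_shares` is the plain sum of the per-network counts; this holds for the sample response, but the source does not state it as a rule. The helper name `summed_shares` is hypothetical.

```ruby
require "json"

# Share fields copied from the sample Item response.
item = JSON.parse(<<~JSON)
  {
    "facebook_shares": 48466,
    "google_shares": 5776,
    "twitter_shares": 177,
    "reddit_shares": 149,
    "linkedin_shares": 364,
    "pinterest_shares": 106,
    "stumbleupon_shares": 23,
    "total_shares": 55061
  }
JSON

# Hypothetical helper: adds up every *_shares field except the total itself.
def summed_shares(item)
  item.select { |key, _| key.end_with?("_shares") && key != "total_shares" }.values.sum
end

puts summed_shares(item)                          # 55061
puts summed_shares(item) == item["total_shares"]  # true
```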
Request example
curl "http://localhost:3000/v1/amazon-offers/B00DR0PDNE.xml" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
<tags type="array">
<name>Google Chromecast HDMI Streaming Media Player</name>
<description> Google Chromecast HDMI Streaming Media Player: Electronics</description>
<facebook-shares type="integer">48466</facebook-shares>
<google-shares type="integer">5776</google-shares>
<twitter-shares type="integer">177</twitter-shares>
<reddit-shares type="integer">149</reddit-shares>
<linkedin-shares type="integer">364</linkedin-shares>
<pinterest-shares type="integer">106</pinterest-shares>
<stumbleupon-shares type="integer">23</stumbleupon-shares>
<total-shares type="integer">55061</total-shares>
Get the most up-to-date Item information by Item ID.
GET http://localhost:3000/v1/:CONTAINER/:ID.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container |
:ID | true | Item ID |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
Request example
curl "http://localhost:3000/v1/amazon-offers/B00DR0PDNE/history" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "id": "B00DR0PDNE",
    "name": "Google Chromecast HDMI Streaming Media Player",
    "sku": {
      "2015-02-09": "B00DR0PDNE"
    },
    "screenshot": {
      "2015-02-09": "B00DR0PDNE/2015-02-09.jpg",
      "2015-02-10": "B00DR0PDNE/2015-02-10.jpg",
      "2015-02-21": "B00DR0PDNE/2015-02-21.jpg",
      "2015-02-24": "B00DR0PDNE/2015-02-24.jpg"
    },
    "price": {
      "2015-02-09": "32.49",
      "2015-02-10": "31.78",
      "2015-02-21": "30.07"
    },
    "original_price": {
      "2015-02-09": "35.00"
    }
  }
}
Get the most up-to-date Item history by Item ID.
GET http://localhost:3000/v1/:CONTAINER/:ID/history
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container |
:ID | true | Item ID |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
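
Each history field maps a date to the value recorded on that date, so a price change falls out of the first and last entries. The sketch below works on the `price` map from the sample response; the helper names `to_cents` and `price_change_cents` are hypothetical, and prices are converted to integer cents to avoid floating-point rounding.

```ruby
require "json"

# The "price" history copied from the sample response.
prices = JSON.parse(<<~JSON)
  {
    "2015-02-09": "32.49",
    "2015-02-10": "31.78",
    "2015-02-21": "30.07"
  }
JSON

# Hypothetical helper: converts a decimal price string to integer cents.
def to_cents(price)
  (price.to_r * 100).round
end

# Hypothetical helper: difference between the latest and earliest price.
def price_change_cents(prices)
  dates = prices.keys.sort # ISO 8601 dates sort chronologically as strings
  to_cents(prices[dates.last]) - to_cents(prices[dates.first])
end

puts price_change_cents(prices) # -242, i.e. the price dropped by $2.42
```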
Request example
curl "http://localhost:3000/v1/amazon-offers/B00DR0PDNE/2015-02-24.jpg" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command redirects to a URL like the one below:
You are being
<a href="">redirected</a>.
Get a redirect to the Item's screenshot image.
GET http://localhost:3000/v1/:CONTAINER/:ID/:SCREENSHOT_DATE.jpg
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container |
:ID | true | Item ID |
:SCREENSHOT_DATE | true | Date screenshot was captured |
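
A screenshot URL is made up of the container, the Item ID, and the capture date. The sketch below builds one and rejects dates that are not in the `YYYY-MM-DD` form used in the examples; the helper name `screenshot_url` is hypothetical, and the client-side date validation is an assumption, not documented API behaviour.

```ruby
require "date"

# Hypothetical helper: builds the .jpg screenshot URL for an Item.
# Accepts a Date or an ISO 8601 date string; anything else raises an error.
def screenshot_url(container, id, date, base_url: "http://localhost:3000/v1")
  day = date.is_a?(Date) ? date : Date.iso8601(date)
  "#{base_url}/#{container}/#{id}/#{day.iso8601}.jpg"
end

puts screenshot_url("amazon-offers", "B00DR0PDNE", "2015-02-24")
puts screenshot_url("amazon-offers", "B00DR0PDNE", Date.new(2015, 2, 21))
```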
Request example
curl "http://localhost:3000/v1/amazon-offers/B00DR0PDNE/2015-02-24" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "id": "B00DR0PDNE",
    "redirect_url": ""
  }
}
Get Item screenshot in JSON form.
GET http://localhost:3000/v1/:CONTAINER/:ID/:SCREENSHOT_DATE
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container |
:ID | true | Item ID |
:SCREENSHOT_DATE | true | Date screenshot was captured |
Request example
curl "http://localhost:3000/v1/amazon-offers/B00DR0PDNE/2015-02-24.xml" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
Get Item screenshot in XML form.
GET http://localhost:3000/v1/:CONTAINER/:ID/:SCREENSHOT_DATE.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container |
:ID | true | Item ID |
:SCREENSHOT_DATE | true | Date screenshot was captured |
Request example
curl "http://localhost:3000/v1/search/chromecast" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "results": [
      {
        "id": "B00DR0PDNE",
        "name": "Google Chromecast HDMI Streaming Media Player",
        "container": "amazon-offers"
      },
      {
        "id": "811571013579",
        "name": "Google Chromecast HDMI Streaming Media Player",
        "container": "walmart-offers"
      },
      {
        "name": "Google Chromecast HDMI Streaming Media Player",
        "id": "15460778",
        "container": "target-offers"
      },
      {
        "id": "945132",
        "name": "Google Chromecast HDMI Streaming Media Player with $10 Google Play Credit",
        "container": "costco-offers"
      }
    ]
  }
}
Search all available Items based on a query or keyword.
GET http://localhost:3000/v1/search/:QUERY
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:QUERY | true | What you are searching for |
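
Because a cross-container search returns results from several containers in one list, it can help to group them by their `container` field. The sketch below does this with the sample results; the variable names are illustrative only.

```ruby
require "json"

# Results copied from the sample response (ids and containers only).
results = JSON.parse(<<~JSON)
  [
    { "id": "B00DR0PDNE",   "container": "amazon-offers" },
    { "id": "811571013579", "container": "walmart-offers" },
    { "id": "15460778",     "container": "target-offers" },
    { "id": "945132",       "container": "costco-offers" }
  ]
JSON

# Map each container name to the list of Item IDs found in it.
ids_by_container = results
  .group_by { |result| result["container"] }
  .transform_values { |items| items.map { |item| item["id"] } }

ids_by_container.each do |container, ids|
  puts "#{container}: #{ids.join(', ')}"
end
```

Each ID can then be used with the per-container Item endpoint, for example `/v1/amazon-offers/B00DR0PDNE`.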
Request example
curl "http://localhost:3000/v1/search/chromecast.xml" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
<results type="array">
<name>Google Chromecast HDMI Streaming Media Player</name>
<name>Google Chromecast HDMI Streaming Media Player</name>
<name>Google Chromecast HDMI Streaming Media Player</name>
<name>Google Chromecast HDMI Streaming Media Player with $10 Google Play Credit</name>
Search all available Items based on a query or keyword.
GET http://localhost:3000/v1/search/:QUERY.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:QUERY | true | What you are searching for |
Request example
curl "http://localhost:3000/v1/match?name=chromecast" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns JSON structured like this:
{
  "response": {
    "status": 200,
    "results": [
      {
        "id": "B00DR0PDNE",
        "name": "Google Chromecast HDMI Streaming Media Player",
        "container": "amazon-offers"
      },
      {
        "id": "811571013579",
        "name": "Google Chromecast HDMI Streaming Media Player",
        "container": "walmart-offers"
      },
      {
        "name": "Google Chromecast HDMI Streaming Media Player",
        "id": "15460778",
        "container": "target-offers"
      },
      {
        "id": "945132",
        "name": "Google Chromecast HDMI Streaming Media Player with $10 Google Play Credit",
        "container": "costco-offers"
      }
    ]
  }
}
Match all Items based on a specific set of known parameters and values.
GET http://localhost:3000/v1/match
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container you are searching in |
:QUERY | true | What you are searching for |
results | false | Number of Results you want back (default: 10) |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
url | false | Unique URL for Item |
date | false | When data was last gathered |
id | false | Unique ID for Item |
tags | false | Tags associated with Item |
name | false | Unique name of Item |
description | false | Given description for Item |
type | false | Item Type |
image | false | Unique Item image |
facebook_shares | false | Number of times Item has been shared on Facebook |
google_shares | false | Number of times Item has been shared on Google Plus |
twitter_shares | false | Number of times Item has been shared on Twitter |
reddit_shares | false | Number of times Item has been shared on Reddit |
linkedin_shares | false | Number of times Item has been shared on LinkedIn |
pinterest_shares | false | Number of times Item has been shared on Pinterest |
stumbleupon_shares | false | Number of times Item has been shared on StumbleUpon |
total_shares | false | Number of times Item has been shared on Social Media |
Request example
curl "http://localhost:3000/v1/match.xml?name=chromecast" \
  -H "Authorization: Token token=YOUR-ACCESS-TOKEN"
The above command returns XML structured like this:
<status type="integer">200</status>
<results type="array">
<name>Google Chromecast HDMI Streaming Media Player</name>
<name>Google Chromecast HDMI Streaming Media Player</name>
<name>Google Chromecast HDMI Streaming Media Player</name>
<name>Google Chromecast HDMI Streaming Media Player with $10 Google Play Credit</name>
Match all Items based on a specific set of known parameters and values.
GET http://localhost:3000/v1/match.xml
Parameter | Required | Description |
access_token | true | Access token used to authenticate |
:CONTAINER | true | The available container you are searching in |
:QUERY | true | What you are searching for |
results | false | Number of Results you want back (default: 10) |
fetch | false | Automatically crawl new data (default: true) |
social | false | Automatically fetch new social data (default: false) |
url | false | Unique URL for Item |
date | false | When data was last gathered |
id | false | Unique ID for Item |
tags | false | Tags associated with Item |
name | false | Unique name of Item |
description | false | Given description for Item |
type | false | Item Type |
image | false | Unique Item image |
facebook_shares | false | Number of times Item has been shared on Facebook |
google_shares | false | Number of times Item has been shared on Google Plus |
twitter_shares | false | Number of times Item has been shared on Twitter |
reddit_shares | false | Number of times Item has been shared on Reddit |
linkedin_shares | false | Number of times Item has been shared on LinkedIn |
pinterest_shares | false | Number of times Item has been shared on Pinterest |
stumbleupon_shares | false | Number of times Item has been shared on StumbleUpon |
total_shares | false | Number of times Item has been shared on Social Media |