cabal-version:       2.4
name:                scrappy
-- PVP summary:      +-+------- breaking API changes
--                   | | +----- non-breaking API additions
--                   | | | +--- code changes with no API change
version:             0.1.0.4
synopsis:            HTML pattern-matching library and high-level interface for concurrent requests, aimed at web scraping
description:
  Scrappy is meant to be a fully expressive library for all aspects of web scraping.
  It aims to be as undetectable as Selenium, but with a design made specifically for
  web scraping (Selenium was never intended for web scraping, nor do its maintainers
  seek to provide features that allow more expressive scraping unless they also help
  testing). The Scrappy.Elem.* modules provide a wide range of expressive pattern-matching
  functions without making this library any harder to learn than parsing itself.
  .
  In addition to expressive patterns that fit the specific content you hope to scrape, scrappy
  provides helper functions for complex control flows, such as running multiple different
  parser-scrapers on their respective sites based on a rotating ConcurrentStream. This suits users
  with many target sites: rotating between them avoids overloading any one site and thus helps
  avoid detection.
  .
  For simpler control flows, such as scraping a large number of pages on a single site, scrappy currently
  provides functions like getHtml, which is not only a very simple interface to HTTP requests
  but also a persistent function that guarantees retrieval of the HTML document to be scraped.
  .
  This package is labelled as uncurated, so suggestions are very welcome, and the package is
  expected to grow based on feedback. The primary focus will be on running JavaScript to allow
  greater access to information on sites.
-- URL for the project homepage or repository.
homepage:            https://github.com/Ace-Interview-Prep/scrappy
license:             BSD-3-Clause
license-file:        LICENSE
author:              Galen Sprout
maintainer:          galen.sprout@gmail.com
bug-reports:         https://github.com/Ace-Interview-Prep/scrappy/issues
x-curated:           uncurated-seeking-adoption
stability:           Experimental
category:            Webscraping
build-type:          Simple
extra-source-files:  README.MD
library
  exposed-modules:     Scrappy.Find
                     , Scrappy.Scrape
                     , Scrappy.BuildActions
                     , Scrappy.Elem
                     , Scrappy.Elem.Types
                     , Scrappy.Elem.TreeElemParser
                     , Scrappy.Elem.SimpleElemParser
                     , Scrappy.Elem.ElemHeadParse
                     , Scrappy.Elem.ITextElemParser
                     , Scrappy.Elem.ChainHTML
                     , Scrappy.Links
                     , Scrappy.Requests
                     -- Needs a lot of work, is more like an idea rn
                     -- , Concurrencies
                     , Scrappy.Proxies
                     , Scrappy.Types
                     , Scrappy.JS
                     , Scrappy.Template
  build-depends:       base
                     , aeson
                     , bytestring
                     , containers
                     , directory
                     , http-client
                     , http-client-tls
                     , http-types
                     , lens
                     , modern-uri
                     -- ^ less maintained?
                     , network-uri
                     , parsec
                     , parser-combinators
                     , text
                     , time
                     , transformers
  hs-source-dirs:      src
  default-language:    Haskell2010
test-suite scrappy-test
  default-language:    Haskell2010
  type:                exitcode-stdio-1.0
  hs-source-dirs:      tests
  main-is:             MyLibTest.hs
  build-depends:
      base
    , containers
    , scrappy
    , webdriver
    , process
    , which
    , transformers
    , parsec
    , text
    , exceptions
    , lifted-base
    , monad-control
    , random
    , aeson
    , bytestring
    , http-client
    , http-types
    , http-client-tls