- Install the latest JDK and Maven
- Clone the repository and step into the project directory
- Run mvn clean install in the terminal
- You can find the JAR file in the target directory and also in your local Maven repository
- From this point you can include it in your Maven project:
<dependency>
    <groupId>com.github.juzraai</groupId>
    <artifactId>cordis-projects-crawler</artifactId>
    <version>VERSION</version>
</dependency>
Thanks to JitPack, you don't need to clone and build the project to use it as a dependency. Follow the link, click on the green "Get it" button next to the latest version and follow the instructions listed there.
Create a CordisCrawlerConfiguration object and set its fields via the builder methods. See the User guide for descriptions of each option.
val configuration = CordisCrawlerConfiguration().seed("...")
// .crawlEverything()
// .crawlPublications()
// .forceDownload()
// .mysqlExport("user", "host:port", "schema")
// .outputDirectory("cordis-data")
// .password("mysql password")
// .quiet()
// .tsvExport()
// .verbose()
Tasks inside the crawler are split across different types of modules (e.g. processors, caches, exporters). Technically they are stored in a single list, but they are encapsulated in a registry (CordisCrawlerModuleRegistry). This way we can easily perform operations on all of the modules.
In order to use the crawler you must create a registry object:
val modules = CordisCrawlerModuleRegistry()
The registry provides the following methods:
- close(): calls close() on each Closeable module. This method is called by the crawler at the end of the crawl process.
- initialize(CordisCrawlerConfiguration): calls the initialize method of each module and passes the configuration and the registry itself. This method is called by the crawler before the crawl process.
- ofType(Class): returns all modules of the given type. This method is called by the crawler and some modules to reach specific modules.
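The three registry methods can be pictured with a small self-contained sketch. All types here (SketchConfiguration, SketchModule, SketchRegistry, LoggingModule) are hypothetical stand-ins written for illustration, not the library's actual classes, and the real implementation may differ.

```kotlin
import java.io.Closeable

// Stand-in types for illustration only; they are NOT the crawler's real classes.
class SketchConfiguration(val seed: String? = null)

interface SketchModule {
    // Default no-op, so implementing it is optional in Kotlin.
    fun initialize(configuration: SketchConfiguration, modules: SketchRegistry) {}
}

class SketchRegistry(val modules: MutableList<SketchModule> = mutableListOf()) {

    // Passes the configuration and the registry itself to every module.
    fun initialize(configuration: SketchConfiguration) {
        modules.forEach { it.initialize(configuration, this) }
    }

    // Returns the modules of the given type, preserving list order.
    fun <T : Any> ofType(type: Class<T>): List<T> = modules.filterIsInstance(type)

    // Closes only those modules that are Closeable.
    fun close() {
        modules.filterIsInstance<Closeable>().forEach { it.close() }
    }
}

// Hypothetical module that records which lifecycle calls it received.
class LoggingModule : SketchModule, Closeable {
    val events = mutableListOf<String>()

    override fun initialize(configuration: SketchConfiguration, modules: SketchRegistry) {
        events += "initialized"
    }

    override fun close() {
        events += "closed"
    }
}
```

Because everything lives in one ordered list, the position of a module in that list decides the order in which ofType returns it.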
After you have created the necessary objects, create the crawler itself:
val crawler = CordisCrawler(configuration, modules)
You can start crawling by calling the crawlProjects method of this object. This method has 3 different signatures:
- crawlProjects() - uses the configuration passed to the crawler constructor
- crawlProjects(args: Array<String>) - parses command line arguments, which overwrite the crawler object's internal configuration
- crawlProjects(seed: Iterator<Long>) - use this if you want to override only the RCN seed; the internal configuration object remains untouched
The method does the following:
- calls modules.initialize() to initialize modules
- parses seed string using ICordisProjectRcnSeed modules
- iterates through each seed RCN (Long)
  - maps it to a CordisCrawlerRecord object, this is the record type of the batch processing
  - runs every processor (ICordisCrawlerRecordProcessor) module on it
- creates chunks with at most 100 RCNs
  - runs exporter (ICordisCrawlerRecordExporter) modules on each
- closes Closeable modules with modules.close()
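The processing and export steps above can be sketched roughly as follows. This is a simplified, self-contained illustration: SketchRecord, runProcessors and crawlSketch are hypothetical names, processors and exporters are modeled as plain functions, and it assumes that records filtered out by a processor are dropped before chunking. The real crawler may order these steps differently.

```kotlin
// Stand-in record type; the real CordisCrawlerRecord carries more than the RCN.
data class SketchRecord(val rcn: Long)

// Runs the processors in order and stops as soon as one returns null.
fun runProcessors(
    record: SketchRecord,
    processors: List<(SketchRecord) -> SketchRecord?>
): SketchRecord? {
    var current: SketchRecord? = record
    for (processor in processors) {
        current = processor(current ?: return null)
    }
    return current
}

fun crawlSketch(
    seed: Iterator<Long>,
    processors: List<(SketchRecord) -> SketchRecord?>,
    exporters: List<(List<SketchRecord>) -> Unit>,
    chunkSize: Int = 100
) {
    seed.asSequence()
        .map { SketchRecord(it) }                     // RCN -> record
        .mapNotNull { runProcessors(it, processors) } // null means "filtered out"
        .chunked(chunkSize)                           // at most chunkSize records per chunk
        .forEach { chunk -> exporters.forEach { export -> export(chunk) } }
}
```

Using a sequence keeps the sketch lazy: each chunk is exported as soon as it is full, so the whole seed never has to be held in memory at once.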
The crawler consists of many kinds of modules which are used for different tasks. They are defined by interfaces. In some cases a module may call other modules too.
This is the root of all ev... ehm... modules :), all modules implement this interface.
Method:
Receives the configuration and modules. For Kotlin developers, overriding this method is optional because it has a default no-operation implementation. Java developers must implement the method.
fun initialize(
    configuration: CordisCrawlerConfiguration,
    modules: CordisCrawlerModuleRegistry
) {}
Call:
All modules of this type will be called at the beginning of the crawl.
Method:
It should parse configuration.seed and return an iterator of CORDIS project RCN numbers. If it can't parse the seed string, it should return null.
fun projectRcns(): Iterator<Long>?
Call:
The first module of this type which returns a non-null value will be used.
Implementations:
There are many implementations, covering almost every seed option. In the order of calling:
- CordisProjectRcnRangeSeed - parses RCN range seed
- CordisProjectRcnListSeed - parses RCN list or single RCN seed
- CordisProjectUrlSeed - parses RCN URL seed
- AllCordisProjectRcnSeed - rewrites the seed to a CORDIS search URL which returns all projects
- CordisProjectSearchUrlSeed - crawls CORDIS search URL
- CordisProjectRcnDirectorySeed - reads RCNs from output directory
Method:
It can modify the CordisCrawlerRecord object. If it returns null, then the record is filtered out and will not reach further processors or exporters.
fun process(cordisCrawlerRecord: CordisCrawlerRecord): CordisCrawlerRecord?
Call:
All modules of this type will be called.
Implementations:
- CordisProjectCrawler - crawls CORDIS project metadata
- OpenAirePublicationsCrawler - crawls publication list for the project from OpenAIRE
Method:
Receives chunks of CordisCrawlerRecord objects, and should export them somewhere.
fun export(cordisCrawlerRecords: List<CordisCrawlerRecord>)
Call:
All modules of this type will be called after the processing phase.
Implementations:
- MysqlExporter - exports all data into a MySQL database
- ProjectsTsvExporter - exports projects' metadata into a TSV file
- PublicationsTsvExporter - exports publications' metadata into a TSV file
They are all used by the CordisProjectCrawler processor.
Method:
Receives an RCN and should return an XML string.
fun projectXmlByRcn(rcn: Long): String?
Call:
The first module which returns a non-null value will be used.
Implementations:
- CordisCrawlerFileCache - reads from output directory
- CordisProjectXmlDownloader - downloads from CORDIS
Method:
Receives an XML string and an RCN, and should write the XML somewhere (e.g. into a file) from where the module can read it back, because cache modules are also readers.
fun cacheProjectXml(xml: String, rcn: Long)
Call:
All modules of this type will be called.
Implementations:
- CordisCrawlerFileCache - writes to output directory
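The two calling rules (readers: first non-null result wins; caches: all are called) can be combined as in the following sketch. The interfaces and the in-memory cache are hypothetical stand-ins that only mirror the method signatures shown above. How the project crawler actually wires readers and caches together is an assumption here, including the detail that a cache hit is written back to the caches again.

```kotlin
// Stand-in interfaces mirroring the signatures above; NOT the library's real types.
fun interface XmlReaderSketch {
    fun projectXmlByRcn(rcn: Long): String?
}

fun interface XmlCacheSketch {
    fun cacheProjectXml(xml: String, rcn: Long)
}

// Hypothetical in-memory cache: a cache is also a reader.
class InMemoryCacheSketch : XmlReaderSketch, XmlCacheSketch {
    private val store = mutableMapOf<Long, String>()

    override fun projectXmlByRcn(rcn: Long): String? = store[rcn]

    override fun cacheProjectXml(xml: String, rcn: Long) {
        store[rcn] = xml
    }
}

fun fetchProjectXml(
    rcn: Long,
    readers: List<XmlReaderSketch>,
    caches: List<XmlCacheSketch>
): String? {
    // Readers are tried in list order; the first non-null result is used.
    val xml = readers.firstNotNullOfOrNull { it.projectXmlByRcn(rcn) } ?: return null
    // Every cache receives the XML, so a later read can be served locally.
    caches.forEach { it.cacheProjectXml(xml, rcn) }
    return xml
}
```

Placing the cache before the downloader in the reader list means the downloader is only reached when the cache returns null, which is why module order matters for readers.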
Method:
Receives an XML string and should parse it into a Project object.
fun parseProjectXml(xml: String): Project?
Call:
The first module which returns a non-null value will be used.
Implementations:
- CordisProjectXmlParser - uses Simple framework to parse XML
They are all used by the OpenAirePublicationsCrawler processor.
Method:
Receives a Project object and should return an XML string.
fun publicationsXmlByProject(project: Project): String?
Call:
The first module which returns a non-null value will be used.
Implementations:
- CordisCrawlerFileCache - reads from output directory
- OpenAirePublicationsXmlDownloader - downloads from OpenAIRE
Method:
Receives an XML string and a Project object, and should write the XML somewhere (e.g. into a file) from where the module can read it back, because cache modules are also readers.
fun cachePublicationsXml(xml: String, project: Project)
Call:
All modules of this type will be called.
Implementations:
- CordisCrawlerFileCache - writes to output directory
Method:
Receives an XML string and should parse it into List<Publication>.
fun parsePublicationsXml(xml: String): List<Publication>?
Call:
The first module which returns a non-null value will be used.
Implementations:
- OpenAirePublicationsXmlParser - uses Simple framework to parse XML
Create a module class which implements one of the interfaces listed above, then add its instance to the module registry:
class MyModule : ICordisCrawlerRecordProcessor {
    override fun process(cordisCrawlerRecord: CordisCrawlerRecord): CordisCrawlerRecord? {
        println("Hello World, I'm processing project ${cordisCrawlerRecord.rcn}!")
        return cordisCrawlerRecord
    }
}

fun main(args: Array<String>) {
    val configuration = CordisCrawlerConfiguration()
    val registry = CordisCrawlerModuleRegistry()
    val myModule = MyModule() // instantiating module
    registry.modules.add(myModule) // adding module instance
    CordisCrawler(configuration, registry).crawlProjects(args)
}
In some cases (e.g. when implementing seeds, readers or parsers) you may want to add your module with higher priority, to be called before other modules of the same type. You can pass an index as the first argument of add:
registry.modules.add(0, myModule) // adding as first module
registry.modules is a simple List<ICordisCrawlerModule>.
Extend the CordisCrawlerConfiguration class and use the instance of your new class to initialize the crawler:
class MyConfiguration : CordisCrawlerConfiguration() {
    @Parameter(names = ["-X", "--extra"], description = "...") // JCommander annotation
    var extraParameter: String? = null
}

fun main(args: Array<String>) {
    val myConfiguration = MyConfiguration() // using custom class
    // you can still use builder methods like .seed("...") and others
    val registry = CordisCrawlerModuleRegistry()
    CordisCrawler(myConfiguration, registry).crawlProjects(args)
}
Command line arguments are parsed by JCommander and your new field will be filled too. You can then use your custom configuration field in your custom modules. Note that inside your custom module, you have to cast the configuration object into MyConfiguration to use the extra field.
Run the above program with the following arguments:
java -jar custom-cordis-crawler.jar -s ... -X ...
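The cast mentioned above could look roughly like the sketch below. To keep it self-contained it uses hypothetical stand-in classes (ConfigSketch, MyConfigSketch, MyModuleSketch) instead of the real CordisCrawlerConfiguration and module interfaces. In a real module the same cast would go into the initialize(configuration, modules) override.

```kotlin
// Stand-ins for the base configuration and the custom subclass; illustration only.
open class ConfigSketch {
    var seed: String? = null
}

class MyConfigSketch : ConfigSketch() {
    var extraParameter: String? = null
}

class MyModuleSketch {
    private var extra: String? = null

    fun initialize(configuration: ConfigSketch) {
        // Safe cast: the result is null when a plain base configuration is passed,
        // so the module keeps working without the custom configuration class.
        extra = (configuration as? MyConfigSketch)?.extraParameter
    }

    fun describe(): String = "extra=${extra ?: "(not set)"}"
}
```

A safe cast (as?) is used here instead of a hard cast (as) so that the module does not throw a ClassCastException when it is registered in a crawler that was created with the default configuration class.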