To run these examples, download and install Apache Spark version 1.6.1.

The Apache Spark example is an SBT subproject. To run it, complete the following steps:
- build a fat-jar of the example with `sbt ";++2.10.6;exampleSpark/assembly"`
- run it with Spark by executing

```bash
$SPARK_HOME/bin/spark-submit ./examples/spark/target/scala-2.10/gnparser-example-spark-assembly-1.0.2.jar
```
- build a fat-jar of gnparser's spark-python project with `sbt ";++2.10.6;sparkPython/assembly"`. The project provides a thin wrapper that transforms an input `RDD[String]` of scientific names into an output `RDD[String]` of parsed results in compact JSON format.
- run `pyspark` with the command:
```bash
$SPARK_HOME/bin/pyspark \
  --jars "`pwd`/spark-python/target/scala-2.10/gnparser-spark-python-assembly-1.0.2.jar" \
  --driver-class-path="`pwd`/spark-python/target/scala-2.10/gnparser-spark-python-assembly-1.0.2.jar"
```
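Once the shell is up, a quick way to confirm the jar was picked up is to reach the wrapper class through the py4j gateway (a minimal sanity check, assuming the shell was launched with the command above):

```python
# If the assembly jar is on the classpath, instantiating the wrapper
# class through the JVM gateway succeeds; otherwise py4j raises an error.
parser = sc._jvm.org.globalnames.parser.spark.Parser()
print parser  # prints a JavaObject reference instead of raising
```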
- add a Python snippet that defines a helper for calling the wrapper:

```python
def parse(names):
    from pyspark.mllib.common import _py2java, _java2py
    # instantiate the JVM-side wrapper through the py4j gateway
    parser = sc._jvm.org.globalnames.parser.spark.Parser()
    # hand the names over as a Java RDD; the two booleans are parser options
    result = parser.parse(_py2java(sc, names), False, False)
    # wrap the returned Java RDD back into a Python RDD of JSON strings
    return _java2py(sc, result)
```
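Here `_py2java` and `_java2py` are the PySpark helpers that convert between Python-side and JVM-side objects: the Python `RDD` of names is handed to the Scala wrapper as a Java RDD, and the result is wrapped back into a Python `RDD`.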
- now scientific name strings can be parsed in your program as follows:
```python
names = sc.parallelize(["Homo sapiens Linnaeus 1758",
                        "Salinator solida (Martens, 1878)",
                        "Taraxacum officinale F. H. Wigg."])

import json
canonical_names = parse(names) \
    .map(lambda r: json.loads(r)) \
    .map(lambda j: (j["name_string_id"], j["canonical_name"]["value"])) \
    .collect()
print canonical_names
# [(u'208eb0ea-40e3-5894-9b7d-664721bd24e6', u'Homo sapiens'),
#  (u'b0f8459f-8b73-514c-b6f3-568d54d99ded', u'Salinator solida'),
#  (u'c2ab9908-ea25-57e1-835a-06b9d1ade53b', u'Taraxacum officinale')]
```
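If some input strings fail to parse, their JSON records may not carry a `canonical_name` key, so a slightly more defensive extraction can filter those records out first. This is a sketch under that assumption; `canonical_pairs` is a hypothetical helper, not part of gnparser:

```python
import json

def canonical_pairs(parsed_rdd):
    # keep only records that produced a canonical form (assumed: unparsed
    # records omit the "canonical_name" key), then map each record
    # to an (id, canonical name) pair
    return parsed_rdd \
        .map(lambda r: json.loads(r)) \
        .filter(lambda j: "canonical_name" in j) \
        .map(lambda j: (j["name_string_id"], j["canonical_name"]["value"]))

canonical_names = canonical_pairs(parse(names)).collect()
```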