This repository has been archived by the owner on Jul 25, 2024. It is now read-only.

README added
Alex Hebing committed May 24, 2019
1 parent bc28009 commit b3841ca
Showing 1 changed file with 46 additions and 0 deletions.
46 changes: 46 additions & 0 deletions README.md
@@ -0,0 +1,46 @@
# Place Name Disambiguation

This repo contains scripts for collecting named entities (via NER) and geocodes for a corpus stored in a folder. The main entry point is `extract.py`; `icab_parser.py` is a helper script that is not part of `extract.py`'s workflow.

## Setup

Set up a virtualenv with Python 3.4 (or higher, though that is untested):

```bash
virtualenv .env -p python3.4 --prompt "(pnd) "
```

Activate the virtualenv:

```bash
source .env/bin/activate  # on Windows: .env\Scripts\activate
```

Install the requirements:

```bash
pip install -r requirements.txt
```

## `extract.py`

`extract.py` expects certain parameters. To get an overview of them, run:

```bash
python extract.py --help
```

Note to self: document these parameters here once the script is in a more final state.

One important note: before `extract.py` can do anything for you, you need to export the authentication details required to collect geocodes. In particular, you need an API key for Google and [a username for GeoNames](http://www.geonames.org/export/web-services.html). More information on exactly what is needed can be found in the documentation of [the magnificent geocoder library](https://geocoder.readthedocs.io/index.html), which is used here (in `geocoding.py`) to collect geocode data.

Make sure you have the accounts and keys, and export them as environment variables before calling `extract.py`. For example:

```bash
export GOOGLE_API_KEY=<your_api_key>
```
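
To catch a missing export early, a small check like the one below can be run before starting a long extraction. This is only a sketch and not part of the repo: `missing_credentials` is a hypothetical helper, and `GEONAMES_USERNAME` is the variable name described in the geocoder documentation for GeoNames. Whether `geocoding.py` reads exactly these variables is an assumption.

```python
import os

# Variable names as described in the geocoder documentation. Whether
# geocoding.py in this repo reads exactly these names is an assumption.
REQUIRED_VARIABLES = ("GOOGLE_API_KEY", "GEONAMES_USERNAME")


def missing_credentials(environ=None, required=REQUIRED_VARIABLES):
    """Return the names in `required` that are unset or empty in `environ`."""
    if environ is None:
        # Default to the environment of the current process.
        environ = os.environ
    return [name for name in required if not environ.get(name)]
```

Calling `missing_credentials()` without arguments inspects the current environment; an empty list means both variables are set. Passing a plain dict instead makes the check easy to try out without touching your shell.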

## `icab_parser.py`

`icab_parser.py` is a very basic parser made to extract the text from the `.sgm` (XML-like) files of the [I-CAB](http://ontotext.fbk.eu/icab.html) corpus.
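
To give an idea of what extracting text from XML-like markup involves, here is a minimal sketch using only the standard library. It is not the code in `icab_parser.py`: `strip_sgml_tags` is a hypothetical function, and the approach (dropping everything between `<` and `>`) assumes the markup contains no `>` inside attribute values and no comments or CDATA sections.

```python
import html
import re

# Matches a single tag such as <DOC>, </TEXT> or <b class="x">.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_sgml_tags(markup):
    """Return the plain text of an SGML/XML-like string (illustrative only)."""
    # Replace each tag with a space so text from adjacent elements stays
    # separated. Note that this also splits a word interrupted by a tag.
    without_tags = TAG_PATTERN.sub(" ", markup)
    # Turn character references such as &amp; back into characters.
    unescaped = html.unescape(without_tags)
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(unescaped.split())
```

For a real corpus, a proper SGML or XML parser is usually the safer choice; the regular expression above is just the shortest way to show the idea.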
