GitHub - vered1986/NC_embeddings: Comparison between various noun compound embeddings

A Systematic Comparison of English Noun Compound Representations

This repository contains the code used in the paper "A Systematic Comparison of English Noun Compound Representations", published at the MWE workshop @ ACL 2019. It can be used to train various types of noun compound embeddings and evaluate them on several tasks.

Dependencies

Python 3
argparse
allennlp (0.8.2)
gensim
word_forms

Training Noun Compound Embeddings

Noun Compound Vocabulary

The vocabulary is based on the Tratz (2011) dataset and is found under data/nc_vocab.txt. Recognizing noun compounds in text is outside the scope of this repository.
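As a rough illustration of how the vocabulary file might be read, here is a minimal sketch. It assumes the file lists one noun compound per line with constituents separated by a space; that layout is an assumption, not something documented here, and `load_nc_vocab` is a hypothetical helper rather than a function from this repository.

```python
from pathlib import Path


def load_nc_vocab(path):
    """Read noun compounds from a text file.

    Assumption: one compound per line (e.g. "olive oil"); blank lines are skipped.
    """
    vocab = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        compound = line.strip()
        if compound:
            vocab.add(compound)
    return vocab


# Hypothetical usage, run from the repository root:
# vocab = load_nc_vocab("data/nc_vocab.txt")
```

A set is used so that membership checks stay cheap when scanning a large corpus.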

Corpus

I used the English Wikipedia dump from January 2018 as the corpus. To download:

wget https://dumps.wikimedia.org/enwiki/20180101/enwiki-20180101-pages-meta-current.xml.bz2;

Then, extract and clean the text using WikiExtractor:

python WikiExtractor.py --processes 20 -o corpora/text/ ~/corpora/enwiki-20180101-pages-meta-current.xml.bz2;
cat corpora/text/*/* > output/en_corpus;

Finally, tokenize the corpus using spaCy:

python training/distributional/preprocessing/tokenize_corpus.py output/en_corpus;

Training Noun Compound Embeddings

You can train embeddings with several training objectives:

Distributional - learning vectors by treating noun compounds as single tokens. See here.
Compositional - learning vectors by composing the embeddings of the constituent words. See here.
Paraphrase Based - learning vectors by minimizing the distance to the embeddings of paraphrases. See here.
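To make the first two objectives more concrete, the sketch below shows (a) rewriting a tokenized sentence so that known two-word compounds become single tokens, and (b) a simple additive composition of constituent vectors. Both functions are hypothetical illustrations, not the repository's code: the underscore joiner, the restriction to two-word compounds, and plain vector addition as the composition function are all assumptions.

```python
def merge_compounds(tokens, vocab, joiner="_"):
    """Replace each adjacent word pair found in vocab with one joined token.

    Assumption: vocab holds lowercase two-word compounds such as "olive oil".
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and pair in vocab:
            merged.append(pair.replace(" ", joiner))
            i += 2  # both constituents were consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def compose_additive(word_vectors, compound):
    """Build a compound vector by summing its constituents' vectors element-wise."""
    vectors = [word_vectors[word] for word in compound.split()]
    return [sum(dims) for dims in zip(*vectors)]


sentence = ["I", "bought", "olive", "oil", "today"]
print(merge_compounds(sentence, {"olive oil"}))
```

After merging, any standard word embedding trainer would see `olive_oil` as an ordinary vocabulary item, which is the idea behind the distributional objective.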

Evaluation

The noun compound embeddings were evaluated on the following tasks:

Top K - a qualitative analysis of the neighbours of each noun compound. See here.
Classification - multiclass classification of the noun compounds according to the relationship between the head and the modifier. See here.
Attributes - binary classification of whether a noun compound has a certain property. See here.
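The Top K analysis amounts to a nearest-neighbour lookup in the embedding space. The following is a minimal sketch assuming cosine similarity as the measure and a plain dictionary of vectors; the function names and the toy vectors are hypothetical and not taken from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbours(target, embeddings, k=3):
    """Return the k entries most similar to target, as (name, similarity) pairs."""
    target_vec = embeddings[target]
    scored = [
        (other, cosine(target_vec, vec))
        for other, vec in embeddings.items()
        if other != target
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors for illustration only.
toy = {
    "olive_oil": [1.0, 0.0],
    "cooking_oil": [0.9, 0.1],
    "street_light": [0.0, 1.0],
}
print(top_k_neighbours("olive_oil", toy, k=2))
```

For vocabularies of realistic size, a vectorized or indexed search would be preferable to this linear scan.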

Citation

If you use this code in your publication, please cite the following paper:

@inproceedings{shwartz-2019-nc_embeddings,
    title = "A Systematic Comparison of English Noun Compound Representations",
    author = "Shwartz, Vered",
    booktitle = "Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) @ ACL 2019",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
}
