Commit

Analysis, Annotation, Regression, RWQ, other updates
JanOdijk committed Mar 13, 2024
1 parent 8788d2b commit 162769f
Showing 1,785 changed files with 1,039,092 additions and 214 deletions.
191 changes: 191 additions & 0 deletions mwe_query/MWE-statistics.md
@@ -0,0 +1,191 @@
# Dedicated MWE statistics

## Introduction

This section describes the dedicated MWE analysis component of GrETEL 5.

## HTML pages with pivot tables

For each statistic, the functions that generate the dedicated MWE statistics produce
multiple html files, each containing a pivot table based on
[*pivottablejs*](https://github.com/nicolaskruchten/pivottable). In addition, an html file
called *MWE-Analysis.html* is generated, which contains an overview of all the
different statistics and links to the relevant html files.

The html files are all generated in the folder *html* in the *mwe_query* folder
that corresponds to the git repository.

## HTML template files

The html files are generated on the basis of html template files in which
variables are marked with a preceding $. The
[Template class](https://docs.python.org/3/library/string.html#template-strings)
of the python string module
is used to substitute actual values for these variables.

There are two html template files, residing in the folder
*htmltemplates*.

The two template files are:

- overviewtemplate.html (for the overview)
- pivottemplate.html (for the specific statistics)
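
The substitution mechanism described in this section can be illustrated with a minimal sketch. The template text and the variable names `$title` and `$description` are hypothetical; the actual variables used in *overviewtemplate.html* and *pivottemplate.html* are not shown here.

```python
from string import Template

# Hypothetical template text; the real templates live in the htmltemplates folder.
TEMPLATE_TEXT = """<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<p>$description</p>
<script>$$("#output").text("$title");</script>
</body>
</html>"""

template = Template(TEMPLATE_TEXT)

# substitute() raises KeyError when a variable in the template gets no value.
page = template.substitute(
    title="Arguments",
    description="Arguments of the MWE",
)
```

Note that a literal dollar sign must be written as `$$` in a `Template` string. This may matter for templates that contain jQuery calls, which is an assumption about the real template files.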

## Generation of the analysis data

The data for the statistics is generated in Python (this functionality already existed a year ago).
There are two relevant functions:

- getstats (in the module *mwestats.py*)
- getgramconfigstats (in the module *gramconfig.py*)

### The *getstats* function

The *getstats* function generates data of type MWEcsv, which is a class with two elements:

- header (of type List\[str\])
- data (of type List\[List\[str\]\])

and thus simply implements an internal representation for simple tables with a header.
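
A table of this kind could be represented as follows. This is only a sketch of the described data structure: the class name `MWEcsvSketch` is hypothetical and the actual definition of MWEcsv may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MWEcsvSketch:
    """Hypothetical stand-in for MWEcsv: a simple table with a header."""
    header: List[str]
    data: List[List[str]]


table = MWEcsvSketch(
    header=["rel", "arglemma", "argword"],
    data=[["su", "blik", "blikjes"]],
)

# Pair each cell with its column name to obtain one dictionary per row.
rows_as_dicts: List[Dict[str, str]] = [
    dict(zip(table.header, row)) for row in table.data
]
```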

The function *getstats* generates data for:

- Arguments of the MWE
- Modifiers of the MWE components
- Determiners of the MWE components
- Argument Frames
- Arguments with the relation and categories
- Component Sequences

It does so for the results of the MEQ and the NMQ,
and for the difference between the NMQ and MEQ results (NMQ-MEQ).
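
The notion of a difference between two result sets can be sketched with plain set subtraction. The assumption here is that results are identified by sentence identifiers; the identifiers below are made up, and the real query results may be organised differently.

```python
# Made-up sentence identifiers, for illustration only.
meq_ids = {"s1", "s2"}
nmq_ids = {"s1", "s2", "s3", "s4"}

# NMQ-MEQ: sentences matched by the NMQ but not by the MEQ.
nmq_minus_meq = nmq_ids - meq_ids
```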

A possible extension is to derive statistics for the governor of the MWE
(e.g., to answer the question: which verbs can occur as the governor
of the MWE *in de war*, and in which frequencies).

### The *getgramconfigstats* function

The function *getgramconfigstats* obtains data about the
grammatical configurations in which the major lemmas of the MWE occur.
The major lemmas are sorted alphabetically and the grammatical configuration
describes the path through the syntactic structure of a sentence
that connects the first major lemma to the second, and so on, until
the last major lemma.
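
The idea of a path connecting the sorted major lemmas can be illustrated on a toy parse. Everything in this sketch is an assumption: the toy tree, the helper names and the way the path is written down (relations going up, the category of the lowest common ancestor, relations going down) are chosen for illustration and need not match what *gramconfig.py* produces.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Toy Alpino-style parse, invented for this example.
TOY_PARSE = """
<node cat="smain" rel="--">
  <node rel="su" lemma="zij"/>
  <node rel="hd" lemma="breken"/>
  <node cat="np" rel="obj1">
    <node rel="det" lemma="zijn"/>
    <node rel="hd" lemma="hart"/>
  </node>
</node>
"""


def ancestors(node: ET.Element, parent_of: Dict[ET.Element, ET.Element]) -> List[ET.Element]:
    """Return the node followed by all its ancestors, up to the root."""
    chain = [node]
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


def connecting_path(root: ET.Element, lemma_a: str, lemma_b: str) -> List[str]:
    """Labels on the path from the word with lemma_a to the word with lemma_b."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # If a lemma occurs more than once, the last occurrence is used.
    by_lemma = {n.get("lemma"): n for n in root.iter("node") if n.get("lemma")}
    up = ancestors(by_lemma[lemma_a], parent_of)
    down = ancestors(by_lemma[lemma_b], parent_of)
    common = next(n for n in up if n in down)  # lowest common ancestor
    labels = [n.get("rel") for n in up[:up.index(common)]]
    labels.append(common.get("cat"))
    labels.extend(n.get("rel") for n in reversed(down[:down.index(common)]))
    return labels


root = ET.fromstring(TOY_PARSE)
major_lemmas = sorted(["hart", "breken"])  # alphabetical order
configuration = [
    connecting_path(root, first, second)
    for first, second in zip(major_lemmas, major_lemmas[1:])
]
```

With more than two major lemmas, the list would contain one path per pair of consecutive lemmas.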

This is done for the results of the MLQ, and for
the difference between the MLQ and the NMQ results (MLQ-NMQ).

This function can also be used to generate statistics
for the newly proposed (and partially implemented) *Related Word Query (RWQ)*,
the results of which are a superset of the results of the MLQ,
and for the difference between the RWQ and MLQ results.

## The *mkpivothtmls.py* module

The html pages are generated by the function *queryresults2statshtml*:

    queryresults2statshtml(mwe: str, mweparse: SynTree, treebank: Dict[str, SynTree],
                           fulltreebankname: str, queryresults: AllQueriesResult)

which takes as parameters:

- *mwe*: the MWE canonical form
- *mweparse*: the syntactic structure of the MWE canonical form (with annotations removed)
- *treebank*: this is a dictionary with the identifier of a sentence as key
and the syntactic structure of the sentence as value. This treebank contains the syntactic parses for
which a match has been found by any of the queries.
- *fulltreebankname*: the name of the full treebank that has been queried
- *queryresults*: result for the application of all queries (by the function
*applyqueries* of the module *canonicalform.py*)


## Experiments

The component was developed and tested with three treebanks that contain the results of the
MLQ query applied to the Lassy-Groot/Kranten corpus. These files were obtained
in PaQu with the help of Peter Kleiweg.
The treebanks are stored as xml files in the following subfolders of *mwe-query\tests\data\mwetreebanks*:

- dansontspringena
- hartbreken/data
- pogingdoen

The MWEs dealt with are:

- *iemand zal de dans ontspringen*
- *iemand zal iemands hart breken*
- *iemand zal een poging doen*

For the experiments the module *trystats* was used.
Since the queries still have to be executed for these treebanks,
this module executes the function *test()*, which reads the xml
treebank into a treebank dictionary and calls the function
*createstatshtmlpages*. This function takes the mwe, the
treebank dictionary and the treebank name as input,
generates the queries, applies the queries and calls the function
*queryresults2statshtml* described above.
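
Reading a folder of xml files into a treebank dictionary can be sketched as follows. This is not the code of *trystats*: the function name `read_treebank` is hypothetical, the standard library `xml.etree.ElementTree` stands in for the SynTree type, and it is assumed that each file holds one parse and that the file name (without extension) serves as the sentence identifier.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict


def read_treebank(folder: Path) -> Dict[str, ET.Element]:
    """Map each sentence identifier (assumed: the file stem) to its parse."""
    treebank: Dict[str, ET.Element] = {}
    for path in sorted(folder.rglob("*.xml")):
        treebank[path.stem] = ET.parse(str(path)).getroot()
    return treebank


# Demonstration on two tiny, invented files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for sent_id, text in [("s1", "een poging"), ("s2", "de dans")]:
        xml_text = f"<alpino_ds><sentence>{text}</sentence></alpino_ds>"
        (folder / f"{sent_id}.xml").write_text(xml_text, encoding="utf8")
    demo_treebank = read_treebank(folder)
```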

## Pivot html pages

For each analysis multiple html pages are generated.
Each html page embeds the full data relevant
for the analysis in the html file itself.

Each analysis result involves a number of properties. For example, for the analysis of the MWE arguments
the following properties are involved:

- *rel*: the (extended) relation of the argument
- *arglemma*: the lemma of the head of the argument
- *argword*: the word of the head of the argument
- *arg*: the whole argument
- *utt*: the utterance in which this match was found (actually
with each word of the argument marked with html bold codes,
but these do not work here)
- *id*: the identifier of the sentence

Example (the bold marking does work here):

- *rel*: su
- *arglemma*: *blik*
- *argword*: *blikjes*
- *arg*: *achtergelaten blikjes*
- *utt*: *De idee dat Suwarrow het mooiste eiland ter wereld is -
niet alleen door de Frisbies uitgedragen , maar vooral door Tom Neale , een Nieuw-Zeelander die er ruim tien jaar doorbracht en ook over zijn wedervaren publiceerde - lokt tal van zeilers en de vervuiling en <b>achtergelaten</b> <b>blikjes</b> breken Johnny's hart .*
- *id*: WR-P-P-G_part00477__WR-P-P-G-0000206974.p.12.s.2

> Side note on this example: the sentence has been wrongly
> parsed by Alpino; the whole subject argument is of course *de vervuiling en achtergelaten blikjes*.
> But even if Alpino had parsed this sentence correctly, this would still be one of the results, because in
> coordinations all the heads of the conjuncts and the coordinator(s) count as the head of the subject argument.
> Here we follow the strategy adopted in PaQu.

The overview contains a link to an html file
in which the first property of this property list has
been put in the pivot row field.
The pivot table is sorted by row totals descending,
so the user sees (for the treebank *hartbreken/data*)
the following table:

| rel | Totals |
|:------------|--------:|
| su | 40 |
| obj1/det | 33 |
| **Totals** | **73** |


The user can now add other properties, remove the *rel* property,
filter for values etc., as is usual with these tables.
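
The row totals in a table like the one above amount to counting how often each value of the row property occurs and sorting the counts in descending order. The sketch below shows this on a few invented rows; in the generated pages the counting is done by the pivot table in the browser, not by Python code.

```python
from collections import Counter

# Invented (rel, arglemma) rows, for illustration only.
rows = [
    ("su", "blik"),
    ("obj1/det", "zijn"),
    ("su", "vervuiling"),
]

# Count the rows per value of the first property (rel).
totals = Counter(rel for rel, _arglemma in rows)

# most_common() orders the values by count, highest first.
sorted_totals = totals.most_common()
grand_total = sum(totals.values())
```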

Since it is expected that users will often want to select multiple properties,
in the order in which they are given, additional links are given in the html page, e.g., for arguments:

rel > arglemma > argword > arg > utt > id

Clicking on *arg* leads the user to an html page that contains
a pivot table in which *arg* and all the properties preceding it are included,
in this order, in the pivot table row field. The same holds for each of the other properties.

Doing this, however, undoes any filtering done in an earlier html page:
one starts with a fresh new page.
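
The row fields behind each of these links are the successive prefixes of the property list. A minimal sketch, in which the variable names are hypothetical:

```python
from typing import Dict, List

properties: List[str] = ["rel", "arglemma", "argword", "arg", "utt", "id"]

# For each property, the row fields are that property and all properties before it.
row_fields: Dict[str, List[str]] = {
    prop: properties[: index + 1] for index, prop in enumerate(properties)
}
```

For the link labelled *arg* this gives the row fields rel, arglemma, argword and arg, in that order.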
67 changes: 67 additions & 0 deletions mwe_query/adpositions.py
@@ -34,3 +34,70 @@
]

vzazindex = {vz+az: (vz, az) for (vz, az) in circumpositions}

# list of prepositions that can also occur (possibly in a variant) as a separable particle,
# e.g. 'aan' and 'met' (because of 'mee'), but not 'van'

vzandprts = {
    'aan',
    'achter',
    'af',
    'bij',
    'binnen',
    'buiten',
    'door',
    'in',
    'langs',
    'mee',
    'met',
    'na',
    'naar',
    'om',
    'onder',
    'op',
    'over',
    'rond',
    'tegen',
    'toe',
    'tot',
    'uit',
    'voor',
    'voorbij',
}

# Source e-ANS


informal_locative_prepositions = {'op', 'aan', 'tegen', 'in', 'binnen', 'buiten', 'onder', 'boven', 'voor',
                                  'achter', 'naast', 'tussen', 'halverwege', 'tegenover', 'bij', 'beneden'}

formal_locative_prepositions = {'nabij', 'te', 'benoorden', 'beoosten', 'bewesten', 'bezuiden'}
informal_directional_prepositions = {'van', 'uit', 'vanaf', 'vanuit', 'vanonder', 'door', 'om', 'over',
                                     'langs', 'voorbij', 'via', 'rond', 'rondom', 'naar', 'tot', 'richting'}
informal_temporal_prepositions = {'na', 'sinds', 'tijdens'}
formal_temporal_prepositions = {'sedert', 'omstreeks', 'gedurende', 'hangende', 'staande', 'gaande'}
informal_other_prepositions = {'met', 'zonder', 'per', 'volgens', 'dankzij', 'ondanks', 'vanwege'}
formal_other_prepositions = {'blijkens', 'conform', 'gegeven', 'getuige', 'gezien', 'ingevolge', 'krachtens',
                             'luidens', 'middels', 'namens', 'naargelang', 'overeenkomstig', 'wegens',
                             'behoudens', 'bezijden', 'exclusief', 'niettegenstaande', 'ongeacht',
                             'onverminderd', 'uitgezonderd', 'aangaande', 'betreffende', 'inzake',
                             'jegens', 'nopens', 'omtrent', 'qua', 'benevens', 'inclusief', 'contra', 'versus', 'à'}

# Note: informal_directional_prepositions is not included in any of the unions below.
informalprepositions = informal_locative_prepositions.union(informal_temporal_prepositions, informal_other_prepositions)
formalprepositions = formal_locative_prepositions.union(formal_temporal_prepositions, formal_other_prepositions)

locative_prepositions = informal_locative_prepositions.union(formal_locative_prepositions)
temporal_prepositions = informal_temporal_prepositions.union(formal_temporal_prepositions)
other_prepositions = informal_other_prepositions.union(formal_other_prepositions)

portmanteauprepositions = {'ter', 'ten'}


allsimpleprepositions = locative_prepositions.union(temporal_prepositions, other_prepositions)

allprepositions = allsimpleprepositions.union(portmanteauprepositions)


postpositions = {'in', 'binnen', 'op', 'uit', 'af', 'door', 'over', 'voorbij', 'langs', 'rond', 'om'}