Analysis, Annotation, Regression, RWQ, other updates

UUDigitalHumanitieslab · Mar 13, 2024 · 162769f
1 parent 8788d2b
commit 162769f

Showing 1,785 changed files with 1,039,092 additions and 214 deletions.
diff --git a/mwe_query/MWE-statistics.md b/mwe_query/MWE-statistics.md
@@ -0,0 +1,191 @@
+# Dedicated MWE statistics
+
+## Introduction
+
+This section describes the dedicated MWE analysis component of GrETEL 5.
+
+## HTML pages with pivot tables
+
+For each statistic, the functions that compute the dedicated MWE statistics generate
+multiple html files, each containing a pivot table based on
+[*pivottablejs*](https://github.com/nicolaskruchten/pivottable). In addition, an html file
+called *MWE-Analysis.html* is generated with an overview of all the different statistics
+and links to the relevant html files.
+
+The html files are all generated in the folder *html* in the *mwe_query* folder
+that corresponds to the git repository.
+
+## HTML template files
+
+The html files are generated on the basis of html template files in which 
+variables are marked with a preceding $. The
+[Template class](https://docs.python.org/3/library/string.html#template-strings) 
+of the python string module
+is used to substitute actual values for these variables.
+
+There are two html template files, residing in the folder
+*htmltemplates*.
+
+The two template files are:
+
+- overviewtemplate.html (for the overview)
+- pivottemplate.html (for the specific statistics)
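
A minimal sketch of the substitution mechanism described above, using `string.Template` from the Python standard library. The template text and the variable names `$title` and `$body` are hypothetical; the actual template files in *htmltemplates* may use different variables.

```python
from string import Template

# Hypothetical stand-in for the contents of a template file.
page_template = Template(
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1>\n$body\n</body></html>"
)


def fill_template(template: Template, **values: str) -> str:
    """Substitute actual values for the $-variables in the template."""
    # substitute() raises KeyError when a variable has no value, so a
    # missing value is noticed instead of leaving '$name' in the page.
    return template.substitute(**values)


html = fill_template(page_template, title="MWE-Analysis", body="<p>overview</p>")
print(html)
```

`Template.safe_substitute` could be used instead if unfilled variables should be left in place rather than raising an error.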
+
+## Generation of the analysis data
+
+The data for the statistics is generated in Python (this functionality already existed a year ago).
+There are two relevant functions:
+
+- getstats (in the module *mwestats.py*)
+- getgramconfigstats (in the module *gramconfig.py*)
+
+### The *getstats* function
+
+The *getstats* function generates data of type MWEcsv, which is a class with two elements: 
+
+- header (of type List\[str\])
+- data (of type List\[List\[str\]\])
+
+and thus simply implements an internal representation for simple tables with a header.
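
A table-with-header representation of this kind could look roughly as follows. This is only a sketch: the class name `MWEcsvSketch` and the helper `to_csv_text` are hypothetical stand-ins, not the project's actual `MWEcsv` implementation.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class MWEcsvSketch:
    """Hypothetical stand-in for MWEcsv: a simple table with a header."""
    header: List[str]
    data: List[List[str]] = field(default_factory=list)


def to_csv_text(table: MWEcsvSketch, delimiter: str = "\t") -> str:
    """Serialise the table as delimiter-separated text, header first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(table.header)
    writer.writerows(table.data)
    return buffer.getvalue()


table = MWEcsvSketch(header=["rel", "arglemma"], data=[["su", "blik"]])
print(to_csv_text(table))
```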
+
+The function *getstats* generates data for:
+
+- Arguments of the MWE
+- Modifiers of the MWE components
+- Determiners of the MWE components 
+- Argument Frames
+- Arguments with the relation and categories
+- Component Sequences
+
+and it does so for the results of the MEQ, the NMQ,
+and the difference between the NMQ and MEQ results (NMQ-MEQ).
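
One way to read a difference such as NMQ-MEQ is "the NMQ results that are not also MEQ results". The sketch below assumes that results are keyed by sentence identifier and that the difference is taken on those identifiers. Both are assumptions made for illustration, and the function name `result_difference` is hypothetical.

```python
from typing import Dict, List


def result_difference(larger: Dict[str, List[str]],
                      smaller: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep the entries of `larger` whose sentence id does not occur in `smaller`."""
    return {sid: matches for sid, matches in larger.items() if sid not in smaller}


# Made-up identifiers and match labels, for illustration only.
nmq_results = {"s1": ["match-a"], "s2": ["match-b"], "s3": ["match-c"]}
meq_results = {"s1": ["match-a"]}

nmq_minus_meq = result_difference(nmq_results, meq_results)
print(sorted(nmq_minus_meq))
```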
+
+A possible extension is to derive statistics for the governor of the MWE 
+(e.g., to answer the question: which verbs can occur as the governor 
+of the MWE *in de war*, and in which frequencies).
+
+### The *getgramconfigstats* function
+
+The function *getgramconfigstats* obtains data about the 
+grammatical configurations in which the major lemmas of the MWE occur.
+The major lemmas are sorted alphabetically and the grammatical configuration
+describes the path through the syntactic structure of a sentence
+that connects the first major lemma to the second, and so on until
+the last major lemma.
+
+This is done for the results of the MLQ, and for
+the difference between the MLQ and NMQ results (MLQ-NMQ).
+
+This function can also be used to generate statistics
+for the newly proposed (and partially implemented) *Related Word Query (RWQ)*,
+the results of which are a superset of the results of the MLQ,
+and for the difference between the RWQ and MLQ results.
+
+## The *mkpivothtmls.py* module
+
+The html pages are generated by the function *queryresults2statshtml*:
+
+    queryresults2statshtml(mwe: str, mweparse: SynTree, treebank: Dict[str, SynTree],
+                           fulltreebankname: str, queryresults: AllQueriesResult)
+
+which takes as parameters:
+
+- *mwe*: the MWE canonical form
+- *mweparse*: the syntactic structure of the MWE canonical form (with annotations removed)
+- *treebank*: a dictionary with the identifier of a sentence as key
+and the syntactic structure of the sentence as value. This treebank contains the syntactic parses for
+which a match has been found by any of the queries.
+- *fulltreebankname*: the name of the full treebank that has been queried
+- *queryresults*: the result of applying all queries (by the function
+*applyqueries* of the module *canonicalform.py*)
+
+
+## Experiments
+
+This component was developed and tested with three treebanks that contain the results of the
+MLQ applied to the Lassy-Groot/Kranten corpus. These files were obtained
+in PaQu with the help of Peter Kleiweg.
+The treebanks are stored in xml files in the folder *mwe-query\tests\data\mwetreebanks*, in the subfolders:
+
+- dansontspringena
+- hartbreken/data
+- pogingdoen
+
+The MWEs dealt with are:
+
+- *iemand zal de dans ontspringen*
+- *iemand zal iemands hart breken*
+- *iemand zal een poging doen*
+
+For the experiments the module *trystats* was used. 
+Since the queries still have to be executed for these treebanks,
+this module executes the function *test()*, which reads the xml 
+treebank into a treebank dictionary and calls the function 
+*createstatshtmlpages*. This function takes the mwe, the 
+treebank dictionary and the treebank name as input, 
+generates the queries, applies the queries and calls the function 
+*queryresults2statshtml* described above.
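
Reading a folder of xml files into a treebank dictionary could be sketched as follows. This is not the code of *trystats*: the function name `read_treebank` is hypothetical, and the sketch assumes one parse per xml file, with the file name (minus its extension) serving as the sentence identifier. It uses the standard-library `xml.etree.ElementTree`, whereas the project's `SynTree` type may be based on a different XML library.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict


def read_treebank(folder: Path) -> Dict[str, ET.Element]:
    """Map each sentence identifier to the root element of its parse."""
    treebank: Dict[str, ET.Element] = {}
    # sorted() makes the reading order reproducible across platforms.
    for path in sorted(folder.glob("*.xml")):
        # Assumption: the file stem is the sentence identifier.
        treebank[path.stem] = ET.parse(path).getroot()
    return treebank
```

A dictionary built this way has the shape described for the *treebank* parameter above: sentence identifier as key, syntactic structure as value.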
+
+## Pivot html pages
+
+For each analysis multiple html pages are generated. 
+Each html page contains the full data relevant 
+for this analysis inside the html file.
+
+Each analysis result involves a number of properties. For example, for the analysis of the MWE arguments 
+the following properties are involved:
+
+- *rel*: the (extended) relation of the argument
+- *arglemma*: the lemma of the head of the argument
+- *argword*: the word of the head of the argument
+- *arg*: the whole argument
+- *utt*: the utterance in which this match was found (actually 
+with each word of the argument marked with html bold codes, 
+but these do not work here)
+- *id*: the identifier of the sentence
+
+Example (the bold marking does work here):
+
+- *rel*: su
+- *arglemma*: *blik*
+- *argword*: *blikjes*
+- *arg*: *achtergelaten blikjes*
+- *utt*: *De idee dat Suwarrow het mooiste eiland ter wereld is - 
+niet alleen door de Frisbies uitgedragen , maar vooral door Tom Neale , een Nieuw-Zeelander die er ruim tien jaar doorbracht en ook over zijn wedervaren publiceerde - lokt tal van zeilers en de vervuiling en <b>achtergelaten</b> <b>blikjes</b> breken Johnny's hart .*
+- *id*: WR-P-P-G_part00477__WR-P-P-G-0000206974.p.12.s.2
+
+> Side note on this example: the sentence has been wrongly
+> parsed by Alpino; the whole subject argument is, of course, *de vervuiling en achtergelaten blikjes*.
+> But even if Alpino had parsed this sentence correctly, this would still be one of the results, because in
+> coordinations all the heads of the conjuncts and the coordinator(s) count as the head of the subject argument.
+> Here we follow the strategy adopted in PaQu.
+
+
+The overview contains a link to an html file
+in which the first property of this property list has 
+been put in the pivot row field. 
+The pivot table is sorted by row totals descending, 
+so the user sees (for the treebank *hartbreken/data*)
+the following table:
+
+| rel        | Totals |
+|:-----------|-------:|
+| su         |     40 |
+| obj1/det   |     33 |
+| **Totals** | **73** |
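
The totals in such a table are counts per value of the row property, sorted by count in descending order. In the html pages the pivot table itself does this. The sketch below only reproduces the computation in Python on a table with a header, using made-up rows and a hypothetical function name `pivot_row_totals`.

```python
from collections import Counter
from typing import List, Tuple


def pivot_row_totals(header: List[str], data: List[List[str]],
                     prop: str) -> List[Tuple[str, int]]:
    """Count the rows per value of `prop`, highest count first."""
    column = header.index(prop)
    counts = Counter(row[column] for row in data)
    # most_common() orders by count, descending.
    return counts.most_common()


# Made-up rows, for illustration only.
header = ["rel", "arglemma"]
data = [["su", "blik"], ["obj1/det", "hij"], ["su", "vervuiling"]]
for value, total in pivot_row_totals(header, data, "rel"):
    print(value, total)
```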
+
+
+The user can now add other properties, remove the *rel* property,
+filter for values etc., as is usual with these tables.
+
+Since it is expected that users will often want to select multiple properties,
+in the order in which they are given, additional links are given in the html page, e.g., for arguments:
+
+rel > arglemma > argword > arg > utt > id 
+
+Clicking on *arg* leads the user to an html page that contains
+a pivot table in which *arg* and all the properties preceding it are included,
+in this order, in the pivot table row field (and similarly for each other property).
+
+Doing this, however, undoes any filtering done in an earlier html page:
+one starts with a fresh page.
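
The row field behind each of the links described above is a prefix of the property list: the clicked property and everything preceding it. This can be sketched as follows; the function name `row_properties_for` is hypothetical and not taken from *mkpivothtmls.py*.

```python
from typing import List

# Property order for the argument analysis, as listed in the document.
ARGUMENT_PROPERTIES = ["rel", "arglemma", "argword", "arg", "utt", "id"]


def row_properties_for(clicked: str, properties: List[str]) -> List[str]:
    """Return the clicked property preceded by all properties before it."""
    # index() raises ValueError for an unknown property name.
    return properties[: properties.index(clicked) + 1]


print(row_properties_for("arg", ARGUMENT_PROPERTIES))
```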
diff --git a/mwe_query/adpositions.py b/mwe_query/adpositions.py
@@ -34,3 +34,70 @@
 ]
 
 vzazindex = {vz+az: (vz, az) for (vz, az) in circumpositions}
+
+# list of prepositions that can also occur (possibly in a variant) as a separable particle,
+# e.g. 'aan', 'met' (because of 'mee'), but not 'van'
+
+vzandprts = {
+    'aan',
+    'achter',
+    'af',
+    'bij',
+    'binnen',
+    'buiten',
+    'door',
+    'in',
+    'langs',
+    'mee',
+    'met',
+    'na',
+    'naar',
+    'om',
+    'onder',
+    'op',
+    'over',
+    'rond',
+    'tegen',
+    'toe',
+    'tot',
+    'uit',
+    'voor',
+    'voorbij',
+
+
+}
+
+# Source: e-ANS
+
+
+informal_locative_prepositions = {'op', 'aan', 'tegen', 'in', 'binnen', 'buiten', 'onder', 'boven', 'voor',
+                                   'achter', 'naast', 'tussen', 'halverwege', 'tegenover', 'bij', 'beneden'}
+
+formal_locative_prepositions = {'nabij', 'te', 'benoorden', 'beoosten', 'bewesten', 'bezuiden'}
+informal_directional_prepositions = {'van', 'uit', 'vanaf', 'vanuit', 'vanonder', 'door', 'om', 'over',
+                                     'langs', 'voorbij', 'via', 'rond', 'rondom', 'naar', 'tot', 'richting'}
+informal_temporal_prepositions = {'na', 'sinds', 'tijdens'}
+formal_temporal_prepositions = {'sedert', 'omstreeks', 'gedurende', 'hangende', 'staande', 'gaande'}
+informal_other_prepositions = {'met', 'zonder', 'per', 'volgens', 'dankzij', 'ondanks', 'vanwege'}
+formal_other_prepositions = {'blijkens', 'conform', 'gegeven', 'getuige', 'gezien', 'ingevolge', 'krachtens',
+                             'luidens', 'middels', 'namens', 'naargelang', 'overeenkomstig', 'wegens',
+                             'behoudens', 'bezijden', 'exclusief', 'niettegenstaande', 'ongeacht',
+                             'onverminderd', 'uitgezonderd', 'aangaande', 'betreffende', 'inzake',
+                             'jegens', 'nopens', 'omtrent', 'qua', 'benevens', 'inclusief', 'contra', 'versus', 'à'}
+
+informalprepositions = informal_locative_prepositions.union(informal_directional_prepositions,
+                                                            informal_temporal_prepositions, informal_other_prepositions)
+formalprepositions = formal_locative_prepositions.union(formal_temporal_prepositions, formal_other_prepositions)
+
+locative_prepositions = informal_locative_prepositions.union(formal_locative_prepositions)
+temporal_prepositions = informal_temporal_prepositions.union(formal_temporal_prepositions)
+other_prepositions = informal_other_prepositions.union(formal_other_prepositions)
+
+portmanteauprepositions = {'ter', 'ten'}
+
+
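+# the directional prepositions have no formal counterpart set, so they are added here directly
+allsimpleprepositions = locative_prepositions.union(informal_directional_prepositions, temporal_prepositions, other_prepositions)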
+
+allprepositions = allsimpleprepositions.union(portmanteauprepositions)
+
+
+postpositions = {'in', 'binnen', 'op', 'uit', 'af', 'door', 'over', 'voorbij', 'langs', 'rond', 'om'}