A dramatic text can also be understood as an information network, in which characters can play different roles depending on the extent to which they add new themes and ideas to an already established discourse or repeat what has already been said. From this point of view, within the field of quantitative drama analysis, we can apply an information-theoretic approach and measure how innovative one character is compared to another. Furthermore, such pairwise comparisons make it possible to construct a network that presents the relationships between characters in a new way, complementing the well-established practice of creating character networks based on co-occurrences in scenes. To achieve this, we have developed a methodology for calculating the semantic similarity and novelty between characters’ speeches, also taking into account the time of the utterances.
The method consists of two parts. In the first part (preprocessing), a table is created from the TEI files of the dramas, in which each sentence is assigned the name of the character who utters it, the act in which it is uttered, and a sentence-level embedding (based on S-BERT). In the second part (calculation), the novelty of a character's utterances is calculated from the cosine similarity of the embeddings associated with the sentences.
Data from the Shakespeare subcorpus of the Drama Corpus Project: https://github.com/dracor-org/shakedracor
Preprocessing:
- In folder drama_conversion you can find the steps to create TSV files from the TEI files. In the output TSV, every sentence of the drama is assigned the name of the speaker, a timestamp, and the act in which the sentence is uttered.
- In folder embedding_test you can find code to calculate embeddings of the sentences created in drama_conversion with any HuggingFace model, using HuggingFaceEmbeddings and Sentence-Transformers.
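The conversion step can be pictured with the following minimal sketch. It is not the code from drama_conversion: the function names `tei_to_rows` and `rows_to_tsv`, the column names, the regex-based sentence splitting, and the use of a running sentence counter as the timestamp are all assumptions made for illustration. The TEI fragment is a toy input, not a corpus file.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}
# Verse lines (<l>) and prose paragraphs (<p>) carry the spoken text.
TEXT_TAGS = {f"{{{TEI_URI}}}l", f"{{{TEI_URI}}}p"}

# Toy TEI fragment for illustration only (not an actual corpus file).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="act" n="1">
      <sp who="#Bernardo"><speaker>BERNARDO</speaker><l>Who's there?</l></sp>
      <sp who="#Francisco"><speaker>FRANCISCO</speaker>
        <l>Nay, answer me. Stand and unfold yourself.</l></sp>
    </div>
    <div type="act" n="2">
      <sp who="#Bernardo"><speaker>BERNARDO</speaker><l>Long live the King!</l></sp>
    </div>
  </body></text>
</TEI>"""


def tei_to_rows(tei_xml):
    """Return one dict per sentence: speaker, act, timestamp, sentence."""
    root = ET.fromstring(tei_xml)
    rows = []
    timestamp = 0  # assumption: a running sentence index stands in for time
    acts = root.iterfind(".//tei:div[@type='act']", TEI_NS)
    for act_no, act in enumerate(acts, start=1):
        for sp in act.iterfind(".//tei:sp", TEI_NS):
            speaker = sp.get("who", "").lstrip("#")
            # Collect only spoken text, skipping the <speaker> label.
            parts = [" ".join("".join(el.itertext()).split())
                     for el in sp.iter() if el.tag in TEXT_TAGS]
            text = " ".join(parts)
            # Naive split after sentence-final punctuation (an assumption).
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if sentence:
                    timestamp += 1
                    rows.append({"speaker": speaker, "act": act_no,
                                 "timestamp": timestamp, "sentence": sentence})
    return rows


def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["speaker", "act", "timestamp", "sentence"],
        delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_tsv(tei_to_rows(SAMPLE_TEI)))
```

An embedding column would then be added to each row by encoding the `sentence` field with the chosen model, as done in embedding_test.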
Calculation:
- In suprise-pairwise-from-embedding.R you can find the main calculations: the pairwise comparison of characters based on the maximum cosine similarities of their sentences, taking into account the time of the utterance; the weighting procedure; and the network normalization.
- In network-from-embedding.R you can find the R code for visualizing the results from suprise-pairwise-from-embedding.R.
- In pairwise_sentences you can compare two characters and see their most and least similar sentences.
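The idea behind the pairwise comparison can be illustrated with a small Python sketch (the repository's own calculation is in R). It rests on assumptions that are not taken from the scripts: novelty is taken to be one minus the maximum cosine similarity to the other character's earlier sentences, averaged over the target character's sentences, and the weighting and network normalization steps are left out. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_novelty(rows, target, reference):
    """Mean novelty of `target` relative to `reference`.

    Assumed definition: for each sentence of `target`, take the maximum
    cosine similarity to any sentence `reference` uttered earlier, and
    score the sentence as 1 - that maximum. Sentences with no earlier
    `reference` sentence are skipped. Returns None if nothing is scored.
    """
    target_rows = [r for r in rows if r["speaker"] == target]
    reference_rows = [r for r in rows if r["speaker"] == reference]
    scores = []
    for t in target_rows:
        earlier = [r for r in reference_rows if r["timestamp"] < t["timestamp"]]
        if not earlier:
            continue
        best = max(cosine_similarity(t["embedding"], r["embedding"])
                   for r in earlier)
        scores.append(1.0 - best)
    return sum(scores) / len(scores) if scores else None


# Toy data: 2-D vectors stand in for real sentence embeddings.
TOY_ROWS = [
    {"speaker": "A", "timestamp": 1, "embedding": [1.0, 0.0]},
    {"speaker": "B", "timestamp": 2, "embedding": [1.0, 0.0]},  # repeats A
    {"speaker": "B", "timestamp": 3, "embedding": [0.0, 1.0]},  # new direction
    {"speaker": "A", "timestamp": 4, "embedding": [0.0, 1.0]},  # repeats B
]

if __name__ == "__main__":
    print(pairwise_novelty(TOY_ROWS, "B", "A"))  # 0.5
    print(pairwise_novelty(TOY_ROWS, "A", "B"))  # 0.0
```

Because only earlier sentences of the other character are considered, the measure is asymmetric: in the toy data, B is partly novel relative to A, while A's only scored sentence repeats B. This asymmetry is what would allow the pairwise values to be read as directed edges in a character network.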
Example sentences (folder sentence-example):
- hamlet-name_all-MiniLM-L6-v2.tsv is an example output of the preprocessing (of Shakespeare's Hamlet), which can be used as input for the R scripts.
- In the other folders you can find the most and least similar sentences from some pairwise comparisons, produced by pairwise_sentences.R.
- You can also find here the results of comparing different SBERT models (in folder model-compare).