DaNews
Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
License: Not publicly available.
DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to 30 April 2021. The articles stem from multiple news sources, including both online and physical newspapers.
+Following the recommendation and framework of [5] we add the following datasheet.
+For what purpose was the dataset created? Who created the dataset? Who funded the +creation of the dataset?
DANews was collected as a part of the HOPE project, examining news coverage during the COVID-19 pandemic. The purpose was to train a model to examine how the novelty and resonance imprint of COVID-19, as a case of crisis, compared to that of non-crisis news.
+Any other comments?
+No.
+What do the instances that comprise the dataset represent (e.g., documents, photos, +people, countries)?
+Instances of the dataset are Danish articles derived from Danish tabloids or news media.
+Does the dataset contain all possible instances or is it a sample (not necessarily +random) of instances from a larger set?
Prior to filtering, the DaNews dataset contains all digitized news articles from the given period across the sources.
+What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) +or features? In either case, please provide a description.
Each instance consists of the following columns:

'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate', 'BodyText',
'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId', 'PageIds', 'Section', 'text'
The text column is constructed by joining the Heading and SubHeading using a newline. If a field is empty, it is ignored and no newline is added. We then join the resulting string with the BodyText using two newlines.
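As a concrete illustration, this construction can be sketched in Python; the helper below is illustrative only, not the project's actual code:

```python
def build_text(heading: str, sub_heading: str, body_text: str) -> str:
    # Join Heading and SubHeading with a single newline, skipping empty fields
    # so no stray newline is added.
    header = "\n".join(part for part in (heading, sub_heading) if part)
    # Join the result with BodyText using two newlines, again skipping empties.
    return "\n\n".join(part for part in (header, body_text) if part)

print(build_text("A heading", "", "The body text."))
# -> "A heading\n\nThe body text."
```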
During the quality filtering, we add the following indicator columns:

'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length',
'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word',
'filtered_by_symbol_2_word_hashtag', 'filtered_by_symbol_2_word_ellipsis',
'filtered_by_line_bullets_or_ellipsis', 'filtered_by_duplicate_lines_chr_fraction',
'filtered_by_duplicate_paragraph_chr_fraction', 'filtered_by_top_ngram_chr_fraction',
'filtered_by_duplicate_ngram_chr_fraction', 'is_duplicate'
Is there a label or target associated with each instance? If so, please provide a +description.
+No.
+Is any information missing from individual instances? If so, please provide a +description, explaining why this information is missing (e.g., because it was +unavailable). This does not include intentionally removed information but might +include, e.g., redacted text.
The team of researchers at the Center for Humanities Computing Aarhus (CHCAA) has not removed any information from the instances.
+Are relationships between individual instances made explicit (e.g., users’ movie +ratings, and social network links)? If so, please describe how these relationships are made +explicit.
+The metadata columns denote the relationship between articles including the date of +publication, sections, and authors.
+Are there recommended data splits (e.g., training, development/validation, testing)? +If so, please provide a description of these splits, explaining the rationale behind +them.
There are no splits performed on this dataset.
+Are there any errors, sources of noise, or redundancies in the dataset? If so, please +provide a description.
News sources can publish their content in both online and printed formats, which can lead to similar instances in the dataset. We alleviate this redundancy by removing near-duplicates (see Preprocessing/cleaning/labeling).
+Is the dataset self-contained, or does it link to or otherwise rely on external +resources (e.g., websites, tweets, other datasets)?
+Articles are intended to tell a self-contained story but can include external +references such as tweets or website URLs.
+Does the dataset contain data that, if viewed directly, might be offensive, insulting, +threatening, or might otherwise cause anxiety?
+Articles often describe content that is considered offensive, insulting, or threatening.
+If the dataset is a sample from a larger set, what was the sampling strategy?
+The dataset is not a sample, but is a filtered version of the full dataset, see +Preprocessing/cleaning/labeling for more on this.
+Over what timeframe was the data collected?
The dataset includes articles from 1 December 2019 to 30 April 2021.
+Were any ethical review processes conducted?
+No.
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
+DaNews has been filtered using a series of heuristic filters as well as removing +repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and +near-duplicates.
Of all documents, 9% were filtered due to low quality and 4% because they were near-duplicates.
For quality filtering, DaNews applies a filter akin to [2], which keeps text that:

has less than 30% of lines ending with an ellipsis;
has a low degree of repetitious text.
+The deduplication removed all documents with a 13-gram Jaccard similarity higher than 80% +following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a +probabilistic data structure for approximating the Jaccard similarity between two sets.
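For illustration, a near-duplicate check of this kind could be sketched with the datasketch library; the library choice and the word-level shingling are assumptions, as the datasheet only specifies 13-grams, a 0.8 threshold, and 128 permutations:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, n: int = 13, num_perm: int = 128) -> MinHash:
    # Build a MinHash signature over the document's 13-gram word shingles.
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - n + 1, 1)):
        m.update(" ".join(words[i:i + n]).encode("utf-8"))
    return m

# Toy corpus: the second document is a near-copy of the first.
corpus = ["en kort dansk artikel om nyheder " * 10,
          "en kort dansk artikel om nyheder " * 10 + "med en lille tilføjelse"]

# Flag documents whose estimated Jaccard similarity with an already indexed
# document exceeds the 0.8 threshold.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for doc_id, text in enumerate(corpus):
    m = minhash(text)
    if lsh.query(m):
        print(f"document {doc_id} is a near-duplicate")
    else:
        lsh.insert(str(doc_id), m)
```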
+Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
+Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train Danish language models. Parts of the dataset have also been used in [3] and [4].
+Is there a repository that links to any or all papers or systems that use the dataset?
+No.
+What (other) tasks could the dataset be used for?
+The scale of the dataset makes it suitable for NLP tasks such as language modeling. +Similarly, the structure of the articles makes it a suitable dataset for training text +summarisation models.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
+This dataset is static and thus does not evolve over time with the language. +A consequence of this is that it will become increasingly outdated over time.
+Are there tasks for which the dataset should not be used?
+This dataset contains Danish articles and thus should not be used for non-Danish +language tasks.
+As the writers of the content are predominantly journalists, it reflects a certain +writing style which is unlikely to reflect the Danish language as a whole.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. If you wish to access the dataset, you will have to come to an agreement with the individual Danish newspapers.
+If you wish to cite this work please see our GitHub page for an up-to-date citation: +https://github.com/centre-for-humanities-computing/danish-foundation-models
DaRadio

Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
+License: Not publicly available.
DaRadio consists of radio broadcasts from the Danish radio stations DR P1 and Radio24Syv, and contains approximately 140,000 hours of speech. DaRadio includes all shows aired on DR P1 from 2005 to 2021, and all shows aired on Radio24Syv from 2011 to 2019.
+DaRadio has been deduplicated using a series of heuristics based on metadata. For more on deduplication, see the data cleaning section further below.
+Following the recommendation and framework of [1], we add the following datasheet.
+For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
+Data included in DaRadio was collected following the Danish Legal Deposit Act by the Royal Danish Library (RDL). From this, a dataset of Danish speech-only radio was derived by RDL. The dataset was created for research purposes, including training a Danish wav2vec2.0 model.
+The dataset was preprocessed to remove duplicates by a team of researchers at the Center for Humanities Computing, Aarhus University (CHC) with collaborators from the Danish speech-processing company Alvenir.
+What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset include an mp3 file for each show aired on the two stations within the period. Further metadata include information on the date and time of airing, the title, a short description of the show, and various internal identifiers used by RDL.
+How many instances are there in total (of each type, if appropriate)?
DaRadio consists of a total of 215,582 hours of unprocessed Danish speech radio shows across two stations, DR P1 and Radio24syv. The table below shows the distribution over the stations with and without heuristic rerun removal.

| Source | Duration (hours) | Reruns removed |
|---|---|---|
| P1 | 145,160 | False |
| P1 | 97,401 | True |
| Radio24syv | 70,422 | False |
| Radio24syv | 44,569 | True |
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
+The dataset contains all shows from the two stations in the time period (2005-2021 for DR P1 and 2011-2019 for Radio24syv).
+If the dataset is a sample from a larger set, what was the sampling strategy?
+The dataset is a subset of all Danish radio. The two stations were chosen for the dataset as they are talk-radio only.
+Who was involved in the data collection process?
The RDL collects Danish radio shows and constructed DaRadio to hand over to researchers at CHC.
+Over what timeframe was the data collected?
+The dataset includes radio shows from the period 2005 to 2021.
+Were any ethical review processes conducted?
+The RDL collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
DaRadio has been deduplicated using a series of heuristic filters, and all files have been converted to 16 kHz .wav files.
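Such a conversion step could, for example, look like the following sketch using ffmpeg; the actual tool and settings used by CHC are not documented here, and the mono channel layout is an assumption:

```python
import subprocess
from pathlib import Path

def to_wav_16khz(src: Path, dst: Path) -> None:
    # Resample to 16 kHz (-ar) and downmix to mono (-ac 1) with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)],
        check=True,
    )

to_wav_16khz(Path("show.mp3"), Path("show.wav"))
```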
+Reruns/duplicates were identified by the following rules:
+The deduplication was coded and conducted by researchers at CHC.
+Is the software used to preprocess/clean/label the instances available?
+The scripts are available at the following GitHub repository: link.
+Has the dataset been used for any tasks already?
+Yes, the dataset has been used to pre-train a Danish wav2vec2.0 model.
+Is there a repository that links to any or all papers or systems that use the dataset?
+No, but as of 23/10/16 no others have used the dataset.
+What (other) tasks could the dataset be used for?
As the dataset only contains unlabelled data, i.e. no transcriptions, it is mainly designed for pre-training language models. However, given the metadata and recurring hosts, further processing might make it possible to train e.g. text-to-speech systems.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
+This dataset is static and does not evolve over time with the language, thus will become increasingly outdated over time.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
+Data will only be available at the entity during the project. An equivalent or updated dataset can be requested at the Royal Danish Library.
+If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
HopeTwitter

Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
License: Not publicly available.
+HopeTwitter consists of tweets collected from the Twitter API using a stopword list +and consists of 32.5 million tweets across 538,398 unique users. HopeTwitter includes +tweets from 2019-01-01 to 2021-04-30.
HopeTwitter has been filtered to only include Danish tweets, based on the language tag from the Twitter API. HopeTwitter has then had low-quality tweets removed and been deduplicated to remove exact and near-duplicates. For more on data cleaning, see the section "Preprocessing/cleaning/labeling".
+HopeTwitter includes a total of 0.97 billion tokens before filtering and includes 0.48 +billion (50%) after.
+Following the recommendation and framework of [3] we add the following datasheet.
For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
+HopeTwitter was initially collected as a part of the +HOPE project, examining societal behaviour during the +COVID-19 pandemic. Next, HopeTwitter was cleaned in preparation for pre-training Danish language +models by a team of researchers at Center for Humanities Computing Aarhus +(CHCAA), using +a codebase jointly developed with partners from academia and industry, including KMD, +Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this +project see the +GitHub repository.
+Any other comments?
+No.
+What do the instances that comprise the dataset represent (e.g., documents, photos, +people, countries)?
+HopeTwitter consists of tweets containing at least one of a series of stopwords, +collected through the Twitter API. See "If the dataset is a sample from a larger set, +what was the sampling strategy?" for the stopword list.
+How many instances are there in total (of each type, if appropriate)?
The dataset consists of 32,499,019 documents, of which 14,399,284 (44%) were considered duplicates.
+Does the dataset contain all possible instances or is it a sample (not necessarily +random) of instances from a larger set?
No. It does not contain all instances of Danish Twitter, as there are likely some Danish tweets which do not include a stopword.
+Is there a label or target associated with each instance? If so, please provide a +description.
+No.
+Are there recommended data splits (e.g., training, development/validation, testing)? +If so, please provide a description of these splits, explaining the rationale behind +them.
+No splits are performed on this dataset.
+If the dataset is a sample from a larger set, what was the sampling strategy?
Tweets were streamed continuously by querying a set of the highest-frequency Scandinavian-specific keywords from Danish, Norwegian (Bokmål), and Swedish, resulting in the following list:

aften, aldrig, alltid, altid, andet, arbejde, bedste, behöver, behøver, beklager,
berätta, betyr, blev, blevet, blir, blitt, blive, bliver, bruge, burde, bättre, båe
bør, deim, deires, ditt, drar, drepe, dykk, dykkar, där, död, döda, død, døde, efter,
elsker, endnu, faen, fandt, feil, fikk, finner, flere, forstår, fortelle, fortfarande,
fortsatt, fortælle, från, få, fået, får, fått, förlåt, första, försöker, før, først,
første, gick, gikk, gillar, gjennom, gjerne, gjorde, gjort, gjør, gjøre, godt, gå, gång,
går, göra, gør, gøre, hadde, hallå, havde, hedder, helt, helvete, hende, hendes, hennes,
herregud, hjelp, hjelpe, hjem, hjälp, hjå, hjælp, hjælpe, honom, hossen, hvem, hvis,
hvordan, hvorfor, händer, här, håll, håller, hør, høre, hører, igjen, ikkje, ingenting,
inkje, inte, intet, jeres, jävla, kanske, kanskje, kender, kjenner, korleis, kvarhelst,
kveld, kven, kvifor, känner, ledsen, lenger, lidt, livet, längre, låt, låter, længe,
meget, menar, mycket, mykje, må, måde, många, mår, måske, måste, måtte, navn, nogen,
noget, nogle, noko, nokon, nokor, nokre, någon, något, några, nån, når, nåt, nødt,
också, også, pengar, penger, pratar, prøver, på, redan, rundt, rätt, sagde, saker,
samma, sammen, selv, selvfølgelig, sidan, sidste, siger, sikker, sikkert, själv, skete,
skjedde, skjer, skulle, sluta, slutt, snakke, snakker, snill, snälla, somt, stadig,
stanna, sted, står, synes, säger, sätt, så, sådan, såg, sånn, tager, tiden, tilbage,
tilbake, tillbaka, titta, trenger, trodde, troede, tror, två, tycker, tänker, uden,
undskyld, unnskyld, ursäkta, uten, varför, varit, varte, veldig, venner, verkligen,
vidste, vilken, virkelig, visste, väg, väl, väldigt, vän, vår, våra, våre, væk, vær,
være, været, älskar, åh, år, åt, över
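As a sketch, such a keyword-tracked stream could be set up with tweepy against the v1.1 streaming endpoint in use during the collection period (since retired); the credentials and the client library are assumptions, as the datasheet does not document the collection client:

```python
import tweepy

KEYWORDS = ["aften", "aldrig", "alltid", "altid", "andet"]  # truncated; full list above

class KeywordStream(tweepy.Stream):
    def on_status(self, status):
        # Each matching tweet is handed to this callback; store it here.
        print(status.id, status.text)

# Placeholder credentials.
stream = KeywordStream("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
# `track` matches tweets containing any of the keywords; the restriction to
# Danish was applied afterwards using Twitter's language tag.
stream.filter(track=KEYWORDS)
```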
Who was involved in the data collection process?
A team of researchers at the Center for Humanities Computing Aarhus (CHCAA), including Kristoffer Nielbo and Peter Bjerregaard Vahlstrup, in collaboration with Rebekah Baglini at the School of Communication and Culture at Aarhus University.
+Over what timeframe was the data collected?
+The dataset includes tweets from the period 2019-01-01 to 2021-04-30.
+Were any ethical review processes conducted?
No.
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
+Firstly, HopeTwitter had non-Danish tweets removed, after which a series of +heuristic filters were applied, including the removal of repetitious texts. Following the filtering, +HopeTwitter was deduplicated, removing both exact duplicates and near-duplicates.
Of all documents, 3,023,427 (9%) were filtered due to low quality and 14,399,284 (33%) because they were near-duplicates.
For the quality filtering, HopeTwitter applies a filter akin to [2], which keeps text that:

has less than 60% of words containing an alphabetic character;
has a low degree of repetitious text.
+The deduplication removed all documents with a 10-gram Jaccard similarity higher than 80% +following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a +probabilistic data structure for approximating the Jaccard similarity between two sets.
+Is the software used to preprocess/clean/label the instances available?
+Yes, the scripts are available +here. +The scripts use version 0.0.2 of the +dfm package.
+Has the dataset been used for any tasks already?
+Yes, the dataset has been used to pre-train Danish language models. +Parts of the dataset have also been used in HOPE project reports +and in [4].
+Is there a repository that links to any or all papers or systems that use the dataset?
There is a website for the HOPE project, for which the dataset was initially collected. This website contains reports and articles regarding the dataset.
+What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modelling. Similarly, one could imagine using the conversation structure to train conversational chatbots.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and thus does not evolve over time with the language. A consequence of this is that it will become increasingly outdated over time. However, it is possible to extend the dataset by a continual collection of tweets.
+Are there tasks for which the dataset should not be used?
+HopeTwitter contains Danish tweets and thus should not be used for non-Danish language tasks.
As the writers of the content are predominantly journalists, politicians, influencers, and academics, it reflects a certain social group which is unlikely to reflect the Danish population as a whole.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. After the project, the data will be archived for a period of five years to comply with the university policy for research integrity. After the five years, the data will be registered at the national archives, as required by executive order 514, for potential long-term deposit.
+If you wish to cite this work please see our GitHub page for an up to date citation: +https://github.com/centre-for-humanities-computing/danish-foundation-models
Netarkivet Text

Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
License: Not publicly available.
+Netarkivet Text (NAT) consists of a subsection of Netarkivet and +contains 2,332 million sites across 1.6 million domains. +Netarkivet includes sites from the period 2006 to 2016.
NAT has been filtered using a series of heuristic filters and by removing repetitious texts. Following the filtering, NAT is further deduplicated to remove exact and near-duplicates. For more on data cleaning, see the post-processing section below.
The sites which passed the quality filter were deduplicated per year. NAT consists of 865 billion tokens, of which 134 billion (15%) were left after filtering and deduplication.
+Following the recommendation and framework of [3], we add the following datasheet.
+For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
+Netarkivet was created following the Danish Legal Deposit Act, +from which a text-only corpus was derived for research purposes, see [4,5]. This is the +part from which this dataset is derived. +This part has then been filtered with the intention of training Danish language
+models by a team of researchers at the Center for Humanities Computing Aarhus (CHCAA) using +a codebase jointly developed with partners from industry (e.g. KMD, Ekstra Bladet) and +other research institutions (e.g. Bristol University, Alexandra Institute). +For more on collaborators on this project see the GitHub repository.
+Any other comments?
+No.
+What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset are Danish domain sites, which further include metadata such as:

|   | Column | Dtype |
|---|---|---|
| 0 | harvest_id | int32 |
| 1 | job_id | int32 |
| 2 | sha1 | object |
| 3 | mime_served | object |
| 4 | language | object |
| 5 | mime_droid | object |
| 6 | timestamp | object |
| 7 | uri | object |
| 9 | domain_key | object |
Here, harvest_id is the ID of the associated Netarkivet web harvest. Each web harvest consists of jobs, each with an associated job_id. language is the language classified using the following language detection library. uri is the URI of the site, e.g. "http://www.apple.com/podcasting". timestamp is the date given in the format "20060612105533", indicating year, month, day, and time. sha1 is the website hash. The mime_* columns indicate the MIME/media type: mime_served could, for instance, be "text/html; charset=iso-8859-1" and mime_droid could be "text/html; version=2.0"; these are the MIME types identified by the server and by DROID, respectively.
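For example, the timestamp format can be parsed as follows:

```python
from datetime import datetime

# "20060612105533" encodes year, month, day, hour, minute, second.
ts = datetime.strptime("20060612105533", "%Y%m%d%H%M%S")
print(ts.isoformat())  # 2006-06-12T10:55:33
```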
+How many instances are there in total (of each type, if appropriate)?
NAT contains a total of 2,332 million sites distributed over 1.6 million domains.
+1,370 million of these sites are Danish, with the largest secondary language being English
+with 718 million sites.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
These domains are a subset of Netarkivet, which in turn is a sample of all the Danish content on the internet.
+If the dataset is a sample from a larger set, what was the sampling strategy?
+Netarkivet has been scraped from the internet using the following procedures:
A selective subset of Netarkivet is then extracted per year from 2006 to 2016 such that it contains no duplicate sites. Apache Tika (v. 1.15) is then used to extract the text from the sites. During extraction, all HTML markup is removed, along with JavaScript and CSS code. The text of textual HTML elements, such as <P> and <H1>, is concatenated into one piece of text.
Who was involved in the data collection process?
The Royal Danish Library collects Netarkivet; Brügger et al. [4,5] helped with the construction of NAT.
+Over what timeframe was the data collected?
The dataset includes sites from the period 2006 to 2016.
+Were any ethical review processes conducted?
Netarkivet is collected in adherence to an update to the Danish archival law in 2005, which extended the law to also include internet domains.

Our text subset was constructed for a research project, and thus a project proposal has been accepted by the Royal Danish Library. Besides this, the author is not aware of any ethical approvals.
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
+NAT has been filtered using a series of heuristic filters as well as removing +repetitious texts. Following the filtering, the corpus was deduplicated to remove exact and +near-duplicates.
For quality filtering, NAT applies a filter akin to [2], which keeps text that:

has less than 30% of lines ending with an ellipsis;
has a low degree of repetitious text.
+The deduplication removed all documents with a 13-gram Jaccard similarity higher than 80% +following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a +probabilistic data structure for approximating the Jaccard similarity between two sets.
+Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
+Has the dataset been used for any tasks already?
+Yes, the dataset has been used to pre-train Danish language models. +Furthermore, the unfiltered dataset has also been used in [4] and [5], for examining +the development of the Danish web.
+Is there a repository that links to any or all papers or systems that use the dataset?
+No.
+What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modelling. It is likely possible to extract reviews, social media posts, and similar semi-labelled datasets from the dataset, which can be used for NLP tasks such as sentiment analysis or hate-speech detection.
The content of the dataset makes it usable in a wide range of other applications in media studies, social science, or the humanities, including the development of written Danish, emerging conspiracy theories, and online information dynamics.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and does not evolve over time with the language; it will thus become increasingly outdated over time. Netarkivet, from which it is derived, is not static, however, and is likely to develop further, which will allow us to update the dataset going forward.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
+Data will only be available at the entity during the project. An equivalent or updated dataset can be requested at the Royal Danish Library.
+If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
DCC v1

The DCC is a composite corpus consisting of the following subcorpora.
This website is under construction 🛠️
Results from corpus tagging

Each user tagged 100 documents unless otherwise specified. Documents were split by newlines into text-blocks, and each block was rated. Text-blocks longer than 1000 characters were split into multiple blocks of 1000 characters or less.
This tagging scheme is similar to that of Kreutzer et al. (2022).
Each block was put into one of the following categories:
wrong_language: Not Danish.
skipped: Unsure of category.
correct_language: Danish text where at least 80% of the text is reasonable.
not_language: Text where less than 80% of the text is reasonable. Takes priority over wrong_language.

Additionally, each block was tagged for pornography (yes/no) and offensiveness (yes/no).
+Kenneth (Session: test)
+Proportions:
+correct_language
not_language
skipped
wrong_language
Kenneth (Session: 1)
+Proportions:
+correct_language
not_language
skipped
wrong_language
Lasse (Session: 1)
+Proportions:
+correct_language
not_language
wrong_language
Kenneth (Session: test) vs Kenneth - (Session: 1)
+Cohen's Kappa (all categories): 0.8242 (Overlap in sentences: 98)
+Cohen's Kappa (correct_language vs not correct_language): 0.9075 (Overlap in sentences: 98)
+Kenneth (Session: test) vs Lasse - (Session: 1)
+Cohen's Kappa (all categories): 0.8140 (Overlap in sentences: 95)
+Cohen's Kappa (correct_language vs not correct_language): 0.8389 (Overlap in sentences: 95)
+Kenneth (Session: 1) vs Lasse - (Session: 1)
+Cohen's Kappa (all categories): 0.6767 (Overlap in sentences: 245)
+Cohen's Kappa (correct_language vs not correct_language): 0.7259 (Overlap in sentences: 245)
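Agreement scores of this kind can be computed with scikit-learn; the annotations below are made up for illustration, since the underlying ratings are not shown here:

```python
from sklearn.metrics import cohen_kappa_score

# Toy annotations; the real ratings are per text-block, aligned across raters.
kenneth = ["correct_language", "not_language", "correct_language", "skipped"]
lasse = ["correct_language", "not_language", "wrong_language", "skipped"]

# Kappa over all categories.
print(cohen_kappa_score(kenneth, lasse))

# Kappa for correct_language vs. not correct_language (binarised labels).
to_binary = lambda labels: [label == "correct_language" for label in labels]
print(cohen_kappa_score(to_binary(kenneth), to_binary(lasse)))
```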
+Comparison with mC4
Note: mC4 did have a high degree of repetitious texts. Similarly, when text blocks were not language, they were often something like:
+2lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22nowiki%22%7D"" class=""placeholder placeholder-ext"" contenteditable=""false"">]​</span></a></sup>​</span>, at en lurifaks som Jimmy page, bruger MIT navn til opfindelsen! SV<span data-rte-instance=""1524-12953202845f3523698f3f1"" data-rte-meta=""%7B%22type%22%3A%22ext%22%2C%22wikitext%22%3A%22%3Cref%3ESVIN%3C%5C%2Fref%3E%22%2C%22lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22ref%22%7D"" class=""placeholder placeholder-ext"" contenteditable=""false""><sup data-rte-washtml=""1"" id=""cite_ref-2"" class=""reference"" data-rte-attribs=""
While non-language texts in NAT were often menu bars, contact information, or navigation.
+Kenneth (Session: 1)
+Proportions:
+correct_language
not_language
skipped
wrong_language
This section contains references to models trained on speech.
| Model | Model type |
|---|---|
| xls-r-300m-danish | Pretrained wav2vec2.0 model |
| xls-r-300m-danish-nst-cv9 | Automatic speech recognition |
| chcaa/xls-r-300m-nst-cv9-da | Automatic speech recognition |
This section contains references to models trained on text.
| Model | Model type | Size (parameters) |
|---|---|---|
| dfm-encoder-large-v1 | Encoder | large (355M) |
| dfm-encoder-medium-v1 | Encoder | medium (110M) |
| dfm-encoder-small-v1 | Encoder | small (22M) |
This website is under construction \ud83d\udee0\ufe0f
"},{"location":"dcc/","title":"DCC v1","text":"The DCC is a composite corpus consisting of the following subcorpora.
"},{"location":"intercoder_reliability/","title":"Results from corpus tagging","text":"Each user tagged 100 documents unless otherwise specified. Documents were split by newlines into text-blocks, block was rated. Text-blocks longer than 1000 characters were split into multiple blocks of 1000 characters or less.
This tagging scheme is similar to (Kreutzer et al., 2022).
Each block was put into one of the following categories: Each user tagged 100 documents (unless otherwise specified). Each document were tagged
wrong_language
: Not Danishskipped
: Unsure of categorycorrect_language
: Danish text where at least 80% of the text is reasonable.not_language
: Text where less than 80% of the text is reasonable. Takes priority over wrong_language
.Additionally, each block was tagged for pornography (yes/no) and offensiveness (yes/no).
"},{"location":"intercoder_reliability/#text-proportions","title":"Text proportions","text":"Kenneth (Session: test)
Proportions:
correct_language
not_language
skipped
wrong_language
Kenneth (Session: 1)
Proportions:
correct_language
not_language
skipped
wrong_language
Lasse (Session: 1)
Proportions:
correct_language
not_language
wrong_language
Kenneth (Session: test) vs Kenneth - (Session: 1)
Cohen's Kappa (all categories): 0.8242 (Overlap in sentences: 98)
Cohen's Kappa (correct_language vs not correct_language): 0.9075 (Overlap in sentences: 98)
Kenneth (Session: test) vs Lasse - (Session: 1)
Cohen's Kappa (all categories): 0.8140 (Overlap in sentences: 95)
Cohen's Kappa (correct_language vs not correct_language): 0.8389 (Overlap in sentences: 95)
Kenneth (Session: 1) vs Lasse - (Session: 1)
Cohen's Kappa (all categories): 0.6767 (Overlap in sentences: 245)
Cohen's Kappa (correct_language vs not correct_language): 0.7259 (Overlap in sentences: 245)
Comparison with mC4
Note: mC4 did have a high degree of repititious texts. Similarly it did when texts blocks where not language they were often something like:
2lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22nowiki%22%7D\"\" class=\"\"placeholder placeholder-ext\"\" contenteditable=\"\"false\"\">]​</span></a></sup>​</span>, at en lurifaks som Jimmy page, bruger MIT navn til opfindelsen! SV<span data-rte-instance=\"\"1524-12953202845f3523698f3f1\"\" data-rte-meta=\"\"%7B%22type%22%3A%22ext%22%2C%22wikitext%22%3A%22%3Cref%3ESVIN%3C%5C%2Fref%3E%22%2C%22lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22ref%22%7D\"\" class=\"\"placeholder placeholder-ext\"\" contenteditable=\"\"false\"\"><sup data-rte-washtml=\"\"1\"\" id=\"\"cite_ref-2\"\" class=\"\"reference\"\" data-rte-attribs=\"\"\n
While non-language texts in NAT was often menu bars, contact information, or navigation.
Kenneth (Session: 1)
Proportions:
correct_language
not_language
skipped
wrong_language
This section contain references to models trained on speech
Model Model type xls-r-300m-danish Pretrained wav2vec2.0 model xls-r-300m-danish-nst-cv9 Automatic speech recognition chcaa/xls-r-300m-nst-cv9-da Automatic speech recognition"},{"location":"models_text/","title":"Text","text":"This section contain references to models trained on text
Model Model type Size (parameters) dfm-encoder-large-v1 Encoder large (355M) dfm-encoder-medium-v1 Encoder medium (110M) dfm-encoder-small-v1 Encoder small (22M)"},{"location":"datasheets/danews/","title":"DaNews","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
license: Not publicly available.
DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to 30 April 2021. The articles stem from multiple news sources, including both online of physical newspapers.
"},{"location":"datasheets/danews/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [5] we add the following datasheet.
"},{"location":"datasheets/danews/#motivation","title":"Motivation","text":"For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
DANews was collected as a part of the HOPE project, examining news coverage during the COVID-19 pandemic. The purpose was to train a model to understand how the novelty and resonance imprint of COVID-19 as a case of crisis compared to non-crises news imprints.
Any other comments?
No.
"},{"location":"datasheets/danews/#composition","title":"Composition","text":"What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset are Danish articles derived from Danish tabloids or news media.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
Prior to filtering DaNews dataset contains all digitized news articles from the given period across the sources.
What data does each instance consist of? \u201cRaw\u201d data (e.g., unprocessed text or images) or features? In either case, please provide a description.
Each instance consists of the following columns
'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate', 'BodyText', \n'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId', 'PageIds', 'Section', 'text'\n
Where we constructed the columns text
column by joining the Heading
, SubHeading
using newline. If the text field is empty it is ignored and no newline is added. The we join the resulting string with the BodyText
using two newlines.
During the quality filtering, we add the following indicator columns:
'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length', \n'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word', \n'filtered_by_symbol_2_word_hashtag', 'filtered_by_symbol_2_word_ellipsis',\n'filtered_by_line_bullets_or_ellipsis', 'filtered_by_duplicate_lines_chr_fraction',\n'filtered_by_duplicate_paragraph_chr_fraction', 'filtered_by_top_ngram_chr_fraction',\n'filtered_by_duplicate_ngram_chr_fraction', 'is_duplicate'\n
Is there a label or target associated with each instance? If so, please provide a description.
No.
Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information but might include, e.g., redacted text.
The team of researchers at the Humanities Computing Aarhus (CHCAA) have not removed any information from the instances.
Are relationships between individual instances made explicit (e.g., users\u2019 movie ratings, and social network links)? If so, please describe how these relationships are made explicit.
The metadata columns denote the relationship between articles including the date of publication, sections, and authors.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
There are not splits performed on this dataset.
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
News sources can publish their content both in an online and printed format which would lead to similar instances in the dataset. To alleviate this redundancy by removing near-duplicates (see Preprocessing/cleaning/labeling).
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
Articles are intended to tell a self-contained story but can include external references such as tweets or website URLs.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
Articles often describe content that is considered offensive, insulting, or threatening.
"},{"location":"datasheets/danews/#collection-process","title":"Collection Process","text":"If the dataset is a sample from a larger set, what was the sampling strategy?
The dataset is not a sample, but is a filtered version of the full dataset, see Preprocessing/cleaning/labeling for more on this.
Over what timeframe was the data collected?
The dataset includes articles from 1 December 2019 to 30 April 2021.
Were any ethical review processes conducted?
No.
"},{"location":"datasheets/danews/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
DaNews has been filtered using a series of heuristic filters as well as removing repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and near-duplicates.
Of all documents, 9% were filtered based due to low-quality and 4% because they were near-duplicates.
For quality filtering, DaNews applies a filter akin to [2] which contains text that:
have less than 30% of lines ending with an ellipsis.
Have a low high degree of repetitious text:
The deduplication removed all documents with a 13-gram Jaccard similarity higher than 80% following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a probabilistic data structure for approximating the Jaccard similarity between two sets.
Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. the scripts use version 0.0.2 of the dfm package.
"},{"location":"datasheets/danews/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train Danish language models. Parts of the dataset have also been used in [3] and [4]
Is there a repository that links to any or all papers or systems that use the dataset?
No.
What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modeling. Similarly, the structure of the articles makes it a suitable dataset for training text summarisation models.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and thus does not evolve over time with the language. A consequence of this is that it will become increasingly outdated over time.
Are there tasks for which the dataset should not be used?
This dataset contains Danish articles and thus should not be used for non-Danish language tasks.
As the writers of the content are predominantly journalists, it reflects a certain writing style which is unlikely to reflect the Danish language as a whole.
"},{"location":"datasheets/danews/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. If you wish access to the dataset you will have to come to an agreement with the individuals Danish newspapers.
"},{"location":"datasheets/danews/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up-to-date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/danews/#references","title":"References:","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
License: Not publicly available.
DaRadio consists of radio broadcasts from the Danish radio stations DR P1 and Radio24Syv, and contains approximately 140.000 hours of speech. DaRadio includes all shows aired on DR P1 from 2005 to 2021, and all shows aired on Radio24Syv from 2011 to 2019.
DaRadio has been deduplicated using a series of heuristics based on metadata. For more on deduplication, see the data cleaning section further below.
"},{"location":"datasheets/daradio/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [1], we add the following datasheet.
"},{"location":"datasheets/daradio/#motivation","title":"Motivation:","text":"For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
Data included in DaRadio was collected following the Danish Legal Deposit Act by the Royal Danish Library (RDL). From this, a dataset of Danish speech-only radio was derived by RDL. The dataset was created for research purposes, including training a Danish wav2vec2.0 model.
The dataset was preprocessed to remove duplicates by a team of researchers at the Center for Humanities Computing, Aarhus University (CHC) with collaborators from the Danish speech-processing company Alvenir.
"},{"location":"datasheets/daradio/#composition","title":"Composition","text":"What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset include an mp3 file for each show aired on the two staions within the period. Further metadata include information on date and time of airing, title, short description of the show, and various internal identifiers used by RDL.
How many instances are there in total (of each type, if appropriate)?
DaRadio consists of a total of 215.582 hours of unprocessed Danish speech radio shows across two stations, DR P1 and Radio24syv. The table below shows the distribution over the stations with and without heuristic rerun removal.
Source Duration (hours) Reruns removed P1 145.160 False 97.401 True Radio24syv 70.422 False 44.569 TrueDoes the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
The dataset contains all shows from the two stations in the time period (2005-2021 for DR P1 and 2011-2019 for Radio24syv).
If the dataset is a sample from a larger set, what was the sampling strategy?
The dataset is a subset of all Danish radio. The two stations were chosen for the dataset as they are talk-radio only.
Who was involved in the data collection process?
The RDL collects Danish radio shows and constructed DaRadio for handing to researchers at CHC.
Over what timeframe was the data collected?
The dataset includes radio shows from the period 2005 to 2021.
Were any ethical review processes conducted?
The RDL collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
"},{"location":"datasheets/daradio/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
DaRadio has been deduplicated using a series of heuristic filters and all files have been converted to 16 Khz .wav files.
Reruns/duplicates were identified by the following rules:
The deduplication was coded and conducted by researchers at CHC.
Is the software used to preprocess/clean/label the instances available?
The scripts are available at the following GitHub repository: link.
"},{"location":"datasheets/daradio/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train a Danish wav2vec2.0 model.
Is there a repository that links to any or all papers or systems that use the dataset?
No, but as of 23/10/16 no others have used the dataset.
What (other) tasks could the dataset be used for?
As the dataset only contains un-labelled data, i.e. no transcriptions, it is mainly designed for pre-training language models. However, given the metadata and re-occuring hosts, further processing might make it possible to train e.g. text-to-speech systems.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and does not evolve over time with the language, thus will become increasingly outdated over time.
"},{"location":"datasheets/daradio/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. An equivalent or updated dataset can be requested at the Royal Danish Library.
"},{"location":"datasheets/daradio/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/daradio/#references","title":"References:","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
license: Not publicly available.
HopeTwitter consists of tweets collected from the Twitter API using a stopword list and consists of 32.5 million tweets across 538,398 unique users. HopeTwitter includes tweets from 2019-01-01 to 2021-04-30.
HopeTwitter, have been filtered to only include Danish tweets, based on language tag from Twitter API. Similarly, HopeTwitter have had low-quality tweets have removed and then deduplicated to remove exact and near-duplicates. For more on data cleaning see section; \"Preprocessing/cleaning/labeling\".
HopeTwitter includes a total of 0.97 billion tokens before filtering and includes 0.48 billion (50%) after.
"},{"location":"datasheets/hopetwitter/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [3] we add the following datasheet.
"},{"location":"datasheets/hopetwitter/#motivation","title":"Motivation","text":"**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset? **
HopeTwitter was initially collected as a part of the HOPE project, examining societal behaviour during the COVID-19 pandemic. Next, HopeTwitter was cleaned in preparation for pre-training Danish language models by a team of researchers at Center for Humanities Computing Aarhus (CHCAA), using a codebase jointly developed with partners from academia and industry, including KMD, Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this project see the GitHub repository.
Any other comments?
No.
"},{"location":"datasheets/hopetwitter/#composition","title":"Composition","text":"What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
HopeTwitter consists of tweets containing at least one of a series of stopwords, collected through the Twitter API. See \"If the dataset is a sample from a larger set, what was the sampling strategy?\" for the stopword list.
How many instances are there in total (of each type, if appropriate)?
The dataset consist of 32,499,019 documents where 14,399,284 (44%) were considered duplicates.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
No. It does not contain all instances of Danish Twitter as there are likely some Danish tweets which does not include a stopword.
Is there a label or target associated with each instance? If so, please provide a description.
No.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
No splits are performed on this dataset.
If the dataset is a sample from a larger set, what was the sampling strategy?
Tweets are streamed continuously using queried a set of the highest frequency Scandinavian-specific keywords from Danish, Norwegian (Bokm\u00e5l) and Swedish, resulting in the following list:
aften, aldrig, alltid, altid, andet, arbejde, bedste, beh\u00f6ver, beh\u00f8ver, beklager,\nber\u00e4tta, betyr, blev, blevet, blir, blitt, blive, bliver, bruge, burde, b\u00e4ttre, b\u00e5e\nb\u00f8r, deim, deires, ditt, drar, drepe, dykk, dykkar, d\u00e4r, d\u00f6d, d\u00f6da, d\u00f8d, d\u00f8de, efter,\nelsker, endnu, faen, fandt, feil, fikk, finner, flere, forst\u00e5r, fortelle, fortfarande,\nfortsatt, fort\u00e6lle, fr\u00e5n, f\u00e5, f\u00e5et, f\u00e5r, f\u00e5tt, f\u00f6rl\u00e5t, f\u00f6rsta, f\u00f6rs\u00f6ker, f\u00f8r, f\u00f8rst,\nf\u00f8rste, gick, gikk, gillar, gjennom, gjerne, gjorde, gjort, gj\u00f8r, gj\u00f8re, godt, g\u00e5, g\u00e5ng,\ng\u00e5r, g\u00f6ra, g\u00f8r, g\u00f8re, hadde, hall\u00e5, havde, hedder, helt, helvete, hende, hendes, hennes,\nherregud, hjelp, hjelpe, hjem, hj\u00e4lp, hj\u00e5, hj\u00e6lp, hj\u00e6lpe, honom, hossen, hvem, hvis,\nhvordan, hvorfor, h\u00e4nder, h\u00e4r, h\u00e5ll, h\u00e5ller, h\u00f8r, h\u00f8re, h\u00f8rer, igjen, ikkje, ingenting,\ninkje, inte, intet, jeres, j\u00e4vla, kanske, kanskje, kender, kjenner, korleis, kvarhelst,\nkveld, kven, kvifor, k\u00e4nner, ledsen, lenger, lidt, livet, l\u00e4ngre, l\u00e5t, l\u00e5ter, l\u00e6nge,\nmeget, menar, mycket, mykje, m\u00e5, m\u00e5de, m\u00e5nga, m\u00e5r, m\u00e5ske, m\u00e5ste, m\u00e5tte, navn, nogen,\nnoget, nogle, noko, nokon, nokor, nokre, n\u00e5gon, n\u00e5got, n\u00e5gra, n\u00e5n, n\u00e5r, n\u00e5t, n\u00f8dt,\nocks\u00e5, ogs\u00e5, pengar, penger, pratar, pr\u00f8ver, p\u00e5, redan, rundt, r\u00e4tt, sagde, saker,\nsamma, sammen, selv, selvf\u00f8lgelig, sidan, sidste, siger, sikker, sikkert, sj\u00e4lv, skete,\nskjedde, skjer, skulle, sluta, slutt, snakke, snakker, snill, sn\u00e4lla, somt, stadig,\nstanna, sted, st\u00e5r, synes, s\u00e4ger, s\u00e4tt, s\u00e5, s\u00e5dan, s\u00e5g, s\u00e5nn, tager, tiden, tilbage,\ntilbake, tillbaka, titta, trenger, trodde, troede, tror, tv\u00e5, tycker, t\u00e4nker, uden,\nundskyld, unnskyld, urs\u00e4kta, uten, varf\u00f6r, varit, varte, veldig, venner, verkligen,\nvidste, vilken, virkelig, visste, v\u00e4g, v\u00e4l, v\u00e4ldigt, v\u00e4n, v\u00e5r, v\u00e5ra, v\u00e5re, v\u00e6k, v\u00e6r, \nv\u00e6re, v\u00e6ret, \u00e4lskar, \u00e5h, \u00e5r, \u00e5t, \u00f6ver\n
Who was involved in the data collection process?
A team of researchers at the Center for Humanities Computing Aarhus (CHCAA), including Kristoffer Nielbo and Peter Bjerregaard Vahlstrup, in collaboration with Rebekah Baglini, at the School of Communcation and Culture at Aarhus university.
Over what timeframe was the data collected?
The dataset includes tweets from the period 2019-01-01 to 2021-04-30.
Were any ethical review processes conducted?
No
"},{"location":"datasheets/hopetwitter/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
Firstly, HopeTwitter had non-Danish tweets removed, after which a series of heuristic filters were applied, including the removal of repetitious texts. Following the filtering, HopeTwitter was deduplicated, removing both exact duplicates and near-duplicates.
Of all documents, 3,023,427 (9%) were filtered due to low-quality and 14,399,284 (33%) because they were near-duplicates.
For the quality filtering, HopeTwitter applies a filter akin to [2] which contains text that:
Have less than 60% of words containing an alphabetic character.
Have low high degree of repetitious text:
The deduplication removed all documents with a 10-gram Jaccard similarity higher than 80% following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a probabilistic data structure for approximating the Jaccard similarity between two sets.
Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
"},{"location":"datasheets/hopetwitter/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train Danish language models. Parts of the dataset have also been used in HOPE project reports and in [4].
Is there a repository that links to any or all papers or systems that use the dataset?
There is a website for the HOPE project for which the dataset was initially collected. This website contains report and articles regarding the dataset.
What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modelling. Similarly, one could imagine using the conversation structure could be used to train conversational chatbots.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and thus does not evolve over time with the language. A consequence of this is that it will become increasingly outdated over time. However, it possible to extend the dataset by a continual collection of tweets.
Are there tasks for which the dataset should not be used?
HopeTwitter contains Danish tweets and thus should not be used for non-Danish language tasks.
As the writers of the content are predominantly journalists, politicians, influencers, and academics, the dataset reflects a particular social group and is unlikely to be representative of the Danish population as a whole.
"},{"location":"datasheets/hopetwitter/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. After the project, the data will be archived for a period of five years to comply with the university policy for research integrity. After the five years, the data will be registered at the national archives, as required by executive order 514, for potential long-term deposit.
"},{"location":"datasheets/hopetwitter/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/hopetwitter/#references","title":"References:","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
license: Not publicly available.
Netarkivet Text (NAT) consists of a subsection of Netarkivet and contains 2,332 million sites across 1.6 million domains. Netarkivet includes sites from the period 2006 to 2016.
NAT has been filtered using a series of heuristic filters and by removing repetitious texts. Following the filtering, NAT is further deduplicated to remove exact and near-duplicates. For more on data cleaning, see the post-processing section below.
The sites which passed the quality filter were deduplicated per year. NAT consists of 865 billion tokens, of which 134 billion (15%) were left after filtering and deduplication.
"},{"location":"datasheets/netarkivet_text/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [3], we add the following datasheet.
"},{"location":"datasheets/netarkivet_text/#motivation","title":"Motivation:","text":"For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
Netarkivet was created following the Danish Legal Deposit Act, and a text-only corpus was derived from it for research purposes, see [4,5]; this dataset is derived from that corpus. The corpus was then filtered, with the intention of training Danish language models, by a team of researchers at the Center for Humanities Computing Aarhus (CHCAA) using a codebase jointly developed with partners from industry (e.g. KMD, Ekstra Bladet) and other research institutions (e.g. Bristol University, Alexandra Institute). For more on collaborators on this project, see the GitHub repository.
Any other comments?
No.
"},{"location":"datasheets/netarkivet_text/#composition","title":"Composition","text":"What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset are Danish domain sites, which further include metadata such as:
| # | Column | Dtype |
|---|--------|-------|
| 0 | harvest_id | int32 |
| 1 | job_id | int32 |
| 2 | sha1 | object |
| 3 | mime_served | object |
| 4 | language | object |
| 5 | mime_droid | object |
| 6 | timestamp | object |
| 7 | uri | object |
| 9 | domain_key | object |

Where `harvest_id` is the id of the associated Netarkivet web harvest. Each web harvest consists of jobs, each with their associated `job_id`. `language` is the language classified using the following language detection library. `uri` is the URI of the site, e.g. `"http://www.apple.com/podcasting"`. `timestamp` is the date given in the format `"20060612105533"`, indicating year, month, day, and time.
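Such a timestamp can, for example, be parsed with Python's standard datetime module:

```python
from datetime import datetime

# Netarkivet timestamps are 14-digit strings: year, month, day, hour, minute, second.
parsed = datetime.strptime("20060612105533", "%Y%m%d%H%M%S")
print(parsed)  # 2006-06-12 10:55:33
```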
`sha1` is the hash of the website. `mime_*` indicates the mime/media type: `mime_served` could for instance be `"text/html; charset=iso-8859-1"` and `mime_droid` could be `"text/html; version=2.0"`, the mime types as identified by the server and by DROID, respectively.

How many instances are there in total (of each type, if appropriate)?
NAT contains a total of 2,332 million sites distributed over 1.6 million domains. 1,370 million of these sites are Danish, with the largest secondary language being English with 718 million sites.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
These domains are a subset of Netarkivet, which in turn is a sample of all Danish content on the internet.
If the dataset is a sample from a larger set, what was the sampling strategy?
Netarkivet has been scraped from the internet using the following procedures:
A selective subset of Netarkivet is then extracted per year from 2006 to 2016 such that it contains no duplicate sites. Apache Tika (v. 1.15) is then used to extract the text from the sites. During extraction, all HTML markup is removed, along with JavaScript and CSS code. The content of textual HTML elements, such as `<p>` and `<h1>`, is concatenated into one piece of text.
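Purely as an illustration of this step: the pipeline used Apache Tika, but an equivalent extraction can be sketched with BeautifulSoup, which here is an assumed stand-in rather than the actual tooling.

```python
# Illustrative sketch only; the actual pipeline used Apache Tika (v. 1.15).
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Strip markup, JavaScript and CSS, and join textual elements into one string."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # remove JavaScript and CSS code
        tag.decompose()
    # Concatenate the content of textual elements such as <p> and <h1>.
    parts = [el.get_text(" ", strip=True) for el in soup.find_all(["h1", "h2", "p", "li"])]
    return "\n".join(p for p in parts if p)


html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Overskrift</h1><p>Noget tekst.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(html))
```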
Who was involved in the data collection process?
Netarkivet is collected by the Royal Danish Library, and Brügger et al. [4,5] helped with the construction of NAT.
Over what timeframe was the data collected?
The dataset includes sites from the period 2006 to 2016.
Were any ethical review processes conducted?
Netarkivet is collected in adherence to an update to the Danish archival law in 2005, which extended the law to also include internet domains.
Our text subset was constructed for a research project, and thus a project proposal was accepted by the Royal Danish Library. Besides this, the authors are not aware of any further ethical review processes.
"},{"location":"datasheets/netarkivet_text/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
NAT has been filtered using a series of heuristic filters as well as by removing repetitious texts. Following the filtering, the corpus was deduplicated to remove exact and near-duplicates.
For quality filtering, NAT applies a filter akin to [2], which keeps only text that:

Has less than 30% of lines ending with an ellipsis (a sketch of this check follows below).

Has a low degree of repetitious text.
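A minimal sketch of the ellipsis criterion, using the 30% threshold from the list above (the repetition sub-criteria are analogous to the HopeTwitter sketch earlier):

```python
def ellipsis_line_fraction(text: str) -> float:
    """Fraction of non-empty lines ending with an ellipsis ('...' or '…')."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.endswith(("...", "…")) for line in lines) / len(lines)


def passes_ellipsis_filter(text: str) -> bool:
    # Keep documents where less than 30% of lines end with an ellipsis.
    return ellipsis_line_fraction(text) < 0.30
```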
The deduplication removed all documents with a 13-gram Jaccard similarity higher than 80%, following the MinHash algorithm [1] with 128 permutations. MinHash is a probabilistic technique for approximating the Jaccard similarity between two sets.
Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
"},{"location":"datasheets/netarkivet_text/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train Danish language models. Furthermore, the unfiltered dataset has also been used in [4] and [5], for examining the development of the Danish web.
Is there a repository that links to any or all papers or systems that use the dataset?
No.
What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modelling. It is likely possible to extract reviews, social media posts, and similar semi-labelled datasets from it, which could be used for NLP tasks such as sentiment analysis or hate-speech detection.
The content of the dataset also makes it usable in a wide range of other applications in media studies, social science, or the humanities, including the development of written Danish, emerging conspiracy theories, and online information dynamics.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and thus does not evolve with the language; as a consequence, it will become increasingly outdated. Netarkivet, from which it is derived, is not static, however, and will continue to develop, which will allow the dataset to be updated going forward.
"},{"location":"datasheets/netarkivet_text/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. An equivalent or updated dataset can be requested at the Royal Danish Library.
"},{"location":"datasheets/netarkivet_text/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/netarkivet_text/#references","title":"References:","text":"