diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/404.html b/404.html new file mode 100644 index 00000000..b0a4b4f3 --- /dev/null +++ b/404.html @@ -0,0 +1,518 @@ + + + +
+Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
+License: Not publicly available.
+DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to 30 April 2021. The articles stem from multiple news sources, including both online and physical newspapers.
+DaNews consists of 403 million tokens, of which 93% were left after quality filtering and deduplication.
+Following the recommendation and framework of [5] we add the following datasheet.
+For what purpose was the dataset created? Who created the dataset? Who funded the +creation of the dataset?
+DaNews was collected as a part of the HOPE project, examining news coverage during the COVID-19 pandemic. The purpose was to train a model to understand how the novelty and resonance imprint of COVID-19, as a case of crisis, compared to non-crisis news imprints.
+Any other comments?
+No.
+How many instances are there in total (of each type, if appropriate)?
+The unfiltered dataset consists of 713 429 documents including a total of 403 089 625 tokens.
+What do the instances that comprise the dataset represent (e.g., documents, photos, +people, countries)?
+Instances of the dataset are Danish articles derived from Danish tabloids or news media.
+Does the dataset contain all possible instances or is it a sample (not necessarily +random) of instances from a larger set?
+Prior to filtering, the DaNews dataset contains all digitized news articles from the given period across the sources.
+What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) +or features? In either case, please provide a description.
+Each instance consists of the following columns +
'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate', 'BodyText',
+'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId', 'PageIds', 'Section', 'text'
+
where we constructed the 'text' column by joining the 'Heading' and 'SubHeading' using a newline. If a field is empty, it is ignored and no newline is added. We then join the resulting string with the 'BodyText' using two newlines.
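The construction of the 'text' column described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `build_text` is hypothetical, and skipping an empty title or body at the final join is an assumption, since the source only describes the behaviour for empty heading fields.

```python
def build_text(heading: str, sub_heading: str, body_text: str) -> str:
    """Join heading fields with one newline, then append the body after two newlines."""
    # Empty heading fields are skipped so that no stray newline is added.
    title = "\n".join(part for part in (heading, sub_heading) if part)
    # Assumption: an empty title or body is also skipped at this stage.
    return "\n\n".join(part for part in (title, body_text) if part)


article = {
    "Heading": "Storm rammer Jylland",
    "SubHeading": "",
    "BodyText": "Vinden tog til i nat.",
}
text = build_text(article["Heading"], article["SubHeading"], article["BodyText"])
print(repr(text))
```

With an empty 'SubHeading', the heading is followed directly by the two newlines that separate it from the body.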
During the quality filtering, we add the following indicator columns: +
'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length',
+'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word',
+'filtered_by_symbol_2_word_hashtag', 'filtered_by_symbol_2_word_ellipsis',
+'filtered_by_line_bullets_or_ellipsis', 'filtered_by_duplicate_lines_chr_fraction',
+'filtered_by_duplicate_paragraph_chr_fraction', 'filtered_by_top_ngram_chr_fraction',
+'filtered_by_duplicate_ngram_chr_fraction', 'is_duplicate'
+
Is there a label or target associated with each instance? If so, please provide a +description.
+No.
+Is any information missing from individual instances? If so, please provide a +description, explaining why this information is missing (e.g., because it was +unavailable). This does not include intentionally removed information but might +include, e.g., redacted text.
+The team of researchers at the Center for Humanities Computing Aarhus (CHCAA) has not removed any information from the instances.
+Are relationships between individual instances made explicit (e.g., users’ movie +ratings, and social network links)? If so, please describe how these relationships are made +explicit.
+The metadata columns denote the relationship between articles including the date of +publication, sections, and authors.
+Are there recommended data splits (e.g., training, development/validation, testing)? +If so, please provide a description of these splits, explaining the rationale behind +them.
+No splits are performed on this dataset.
+Are there any errors, sources of noise, or redundancies in the dataset? If so, please +provide a description.
+News sources can publish their content in both online and printed formats, which leads to similar instances in the dataset. We alleviate this redundancy by removing near-duplicates (see Preprocessing/cleaning/labeling).
+Is the dataset self-contained, or does it link to or otherwise rely on external +resources (e.g., websites, tweets, other datasets)?
+Articles are intended to tell a self-contained story but can include external +references such as tweets or website URLs.
+Does the dataset contain data that, if viewed directly, might be offensive, insulting, +threatening, or might otherwise cause anxiety?
+Articles often describe content that is considered offensive, insulting, or threatening.
+What mechanisms or procedures were used to collect the data (e.g., hardware + apparatuses or sensors, manual human curation, software programs, software APIs)?
+A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this + dataset using a third-party API as well as a manual transfer from one of the parties. The API was limited + to only a subset of articles agreed upon within the agreements.
+If the dataset is a sample from a larger set, what was the sampling strategy?
+The dataset is not a sample, but is a filtered version of the full dataset, see +Preprocessing/cleaning/labeling for more on this.
+Who was involved in the data collection process?
+A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this dataset using a third-party API as well as a manual transfer from some of the parties. The team would like to thank the dataset owners for access to their articles.
+Over what timeframe was the data collected?
+The dataset includes articles from 1 December 2019 to +30 April 2021.
+Were any ethical review processes conducted?
+No.
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
+DaNews has been filtered using a series of heuristic filters as well as removing +repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and +near-duplicates.
+Of all documents, 9% were filtered out due to low quality and 4% because they were near-duplicates.
+For quality filtering, DaNews applies a filter akin to [2], which keeps texts that:
+Have less than 30% of lines ending with an ellipsis.
+Have a low degree of repetitious text:
+The deduplication removed all documents with a 13-gram Jaccard similarity higher than 80% +following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a +probabilistic data structure for approximating the Jaccard similarity between two sets.
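As a rough illustration of the deduplication step above, the sketch below builds 13-gram shingles, computes the exact Jaccard similarity, and approximates it with a MinHash signature of 128 hash functions. It uses only the Python standard library; all function names are hypothetical, and the choice of word-level (rather than character-level) n-grams and of this particular hash family are assumptions, not details confirmed by the source or by the dfm package.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1  # Mersenne prime used as modulus for the hash family


def word_ngrams(text: str, n: int = 13) -> set:
    """Return the set of word n-grams (shingles) of a text."""
    words = text.split()
    if len(words) < n:
        # Assumption: a text shorter than n words becomes a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def make_permutations(num_perm: int = 128, seed: int = 1) -> list:
    """Draw (a, b) pairs defining hash functions h -> (a * h + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(num_perm)]


def minhash_signature(shingles: set, perms: list) -> list:
    """For each hash function, keep the minimum hash value over all shingles."""
    if not shingles:
        raise ValueError("cannot build a signature for an empty shingle set")
    base = [
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    ]
    return [min((a * h + b) % _PRIME for h in base) for a, b in perms]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


perms = make_permutations()
doc_a = ("regeringen præsenterede i dag en ny plan for genåbningen "
         "af landet efter nedlukningen i foråret")
doc_b = doc_a + " ifølge statsministeren"
sig_a = minhash_signature(word_ngrams(doc_a), perms)
sig_b = minhash_signature(word_ngrams(doc_b), perms)
# A pair would be flagged as near-duplicates above the 80% threshold.
is_near_duplicate = estimate_jaccard(sig_a, sig_b) > 0.8
```

The estimate is probabilistic: with 128 hash functions it fluctuates around the exact Jaccard value, which is why pairs close to the 80% threshold can fall on either side.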
+Is the software used to preprocess/clean/label the instances available?
+Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
+Has the dataset been used for any tasks already?
+Yes, the dataset has been used to pre-train Danish language models. +Parts of the dataset have also been used in [3] and [4]
+Is there a repository that links to any or all papers or systems that use the dataset?
+No.
+What (other) tasks could the dataset be used for?
+The scale of the dataset makes it suitable for NLP tasks such as language modeling. +Similarly, the structure of the articles makes it a suitable dataset for training text +summarisation models.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
+This dataset is static and thus does not evolve over time with the language. +A consequence of this is that it will become increasingly outdated over time.
+Are there tasks for which the dataset should not be used?
+This dataset contains Danish articles and thus should not be used for non-Danish +language tasks.
+As the writers of the content are predominantly journalists, it reflects a certain +writing style which is unlikely to reflect the Danish language as a whole.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
+Data will only be available at the entity during the project. If you wish to access the dataset, you will have to come to an agreement with the individual Danish newspapers.
+If you wish to cite this work please see our GitHub page for an up-to-date citation: +https://github.com/centre-for-humanities-computing/danish-foundation-models
+Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
+License: Not publicly available.
+DaRadio consists of radio broadcasts from the Danish radio stations DR P1 and Radio24syv, and contains approximately 140,000 hours of speech. DaRadio includes all shows aired on DR P1 from 2005 to 2021, and all shows aired on Radio24syv from 2011 to 2019.
+DaRadio has been deduplicated using a series of heuristics based on metadata. For more on deduplication, see the data cleaning section further below.
+Following the recommendation and framework of [1], we add the following datasheet.
+For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
+Data included in DaRadio was collected following the Danish Legal Deposit Act by the Royal Danish Library (RDL). From this, a dataset of Danish speech-only radio was derived by RDL. The dataset was created for research purposes, including training a Danish wav2vec2.0 model.
+The dataset was preprocessed to remove duplicates by a team of researchers at the Center for Humanities Computing, Aarhus University (CHC) with collaborators from the Danish speech-processing company Alvenir.
+What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
+Instances of the dataset include an mp3 file for each show aired on the two stations within the period. Further metadata include information on date and time of airing, title, short description of the show, and various internal identifiers used by RDL.
+How many instances are there in total (of each type, if appropriate)?
+DaRadio consists of a total of 215,582 hours of unprocessed Danish speech radio shows across two stations, DR P1 and Radio24syv. The table below shows the distribution over the stations with and without heuristic rerun removal.
Source | Duration (hours) | Reruns removed |
---|---|---|
P1 | 145,160 | False |
P1 | 97,401 | True |
Radio24syv | 70,422 | False |
Radio24syv | 44,569 | True |
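The totals quoted for DaRadio can be checked against the table with a few lines of arithmetic. The sketch below assumes the durations are whole hours, with the separator in the table read as a thousands separator; the variable names are illustrative only.

```python
# Durations in hours, keyed by (station, reruns_removed).
hours = {
    ("P1", False): 145_160,
    ("P1", True): 97_401,
    ("Radio24syv", False): 70_422,
    ("Radio24syv", True): 44_569,
}

# Sum each condition over both stations.
total_raw = sum(v for (_, removed), v in hours.items() if not removed)
total_dedup = sum(v for (_, removed), v in hours.items() if removed)
share_removed = 1 - total_dedup / total_raw

print(f"unprocessed: {total_raw} hours")
print(f"after rerun removal: {total_dedup} hours")
print(f"share removed as reruns: {share_removed:.1%}")
```

The unprocessed sum matches the stated total of 215,582 hours, and the sum after rerun removal (141,970 hours) is consistent with the "approximately 140,000 hours" given in the dataset summary.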
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
+The dataset contains all shows from the two stations in the time period (2005-2021 for DR P1 and 2011-2019 for Radio24syv).
+If the dataset is a sample from a larger set, what was the sampling strategy?
+The dataset is a subset of all Danish radio. The two stations were chosen for the dataset as they are talk-radio only.
+Who was involved in the data collection process?
+The RDL collects Danish radio shows and constructed DaRadio for delivery to researchers at CHC.
+Over what timeframe was the data collected?
+The dataset includes radio shows from the period 2005 to 2021.
+Were any ethical review processes conducted?
+The RDL collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
+DaRadio has been deduplicated using a series of heuristic filters and all files have been converted to 16 kHz .wav files.
+Reruns/duplicates were identified by the following rules:
+The deduplication was coded and conducted by researchers at CHC.
+Is the software used to preprocess/clean/label the instances available?
+The scripts are available at the following GitHub repository: link.
+Has the dataset been used for any tasks already?
+Yes, the dataset has been used to pre-train a Danish wav2vec2.0 model.
+Is there a repository that links to any or all papers or systems that use the dataset?
+No, but as of 23/10/16 no others have used the dataset.
+What (other) tasks could the dataset be used for?
+As the dataset only contains unlabelled data, i.e. no transcriptions, it is mainly designed for pre-training language models. However, given the metadata and recurring hosts, further processing might make it possible to train e.g. text-to-speech systems.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
+This dataset is static and does not evolve over time with the language, thus will become increasingly outdated over time.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
+Data will only be available at the entity during the project. An equivalent or updated dataset can be requested at the Royal Danish Library.
+If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
+Version: 1.0.0
+Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
+License: Not publicly available.
+HopeTwitter consists of tweets collected from the Twitter API using a stopword list +and consists of 32.5 million tweets across 538,398 unique users. HopeTwitter includes +tweets from 2019-01-01 to 2021-04-30.
+HopeTwitter has been filtered to only include Danish tweets, based on the language tag from the Twitter API. Similarly, HopeTwitter has had low-quality tweets removed and has then been deduplicated to remove exact and near-duplicates. For more on data cleaning, see the section "Preprocessing/cleaning/labeling".
+HopeTwitter includes a total of 0.97 billion tokens before filtering and includes 0.48 +billion (50%) after.
+Following the recommendation and framework of [3] we add the following datasheet.
+For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
+HopeTwitter was initially collected as a part of the +HOPE project, examining societal behaviour during the +COVID-19 pandemic. Next, HopeTwitter was cleaned in preparation for pre-training Danish language +models by a team of researchers at Center for Humanities Computing Aarhus +(CHCAA), using +a codebase jointly developed with partners from academia and industry, including KMD, +Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this +project see the +GitHub repository.
+Any other comments?
+No.
+What do the instances that comprise the dataset represent (e.g., documents, photos, +people, countries)?
+HopeTwitter consists of tweets containing at least one of a series of stopwords, +collected through the Twitter API. See "If the dataset is a sample from a larger set, +what was the sampling strategy?" for the stopword list.
+How many instances are there in total (of each type, if appropriate)?
+The dataset consists of 32,499,019 documents, of which 14,399,284 (44%) were considered duplicates.
+Does the dataset contain all possible instances or is it a sample (not necessarily +random) of instances from a larger set?
+No. It does not contain all instances of Danish Twitter, as there are likely some Danish tweets which do not include a stopword.
+Is there a label or target associated with each instance? If so, please provide a +description.
+No.
+Are there recommended data splits (e.g., training, development/validation, testing)? +If so, please provide a description of these splits, explaining the rationale behind +them.
+No splits are performed on this dataset.
+If the dataset is a sample from a larger set, what was the sampling strategy?
+Tweets are streamed continuously by querying a set of the highest-frequency Scandinavian-specific keywords from Danish, Norwegian (Bokmål) and Swedish, resulting in the following list:
aften, aldrig, alltid, altid, andet, arbejde, bedste, behöver, behøver, beklager,
+berätta, betyr, blev, blevet, blir, blitt, blive, bliver, bruge, burde, bättre, båe,
+bør, deim, deires, ditt, drar, drepe, dykk, dykkar, där, död, döda, død, døde, efter,
+elsker, endnu, faen, fandt, feil, fikk, finner, flere, forstår, fortelle, fortfarande,
+fortsatt, fortælle, från, få, fået, får, fått, förlåt, första, försöker, før, først,
+første, gick, gikk, gillar, gjennom, gjerne, gjorde, gjort, gjør, gjøre, godt, gå, gång,
+går, göra, gør, gøre, hadde, hallå, havde, hedder, helt, helvete, hende, hendes, hennes,
+herregud, hjelp, hjelpe, hjem, hjälp, hjå, hjælp, hjælpe, honom, hossen, hvem, hvis,
+hvordan, hvorfor, händer, här, håll, håller, hør, høre, hører, igjen, ikkje, ingenting,
+inkje, inte, intet, jeres, jävla, kanske, kanskje, kender, kjenner, korleis, kvarhelst,
+kveld, kven, kvifor, känner, ledsen, lenger, lidt, livet, längre, låt, låter, længe,
+meget, menar, mycket, mykje, må, måde, många, mår, måske, måste, måtte, navn, nogen,
+noget, nogle, noko, nokon, nokor, nokre, någon, något, några, nån, når, nåt, nødt,
+också, også, pengar, penger, pratar, prøver, på, redan, rundt, rätt, sagde, saker,
+samma, sammen, selv, selvfølgelig, sidan, sidste, siger, sikker, sikkert, själv, skete,
+skjedde, skjer, skulle, sluta, slutt, snakke, snakker, snill, snälla, somt, stadig,
+stanna, sted, står, synes, säger, sätt, så, sådan, såg, sånn, tager, tiden, tilbage,
+tilbake, tillbaka, titta, trenger, trodde, troede, tror, två, tycker, tänker, uden,
+undskyld, unnskyld, ursäkta, uten, varför, varit, varte, veldig, venner, verkligen,
+vidste, vilken, virkelig, visste, väg, väl, väldigt, vän, vår, våra, våre, væk, vær,
+være, været, älskar, åh, år, åt, över
+
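The sampling criterion above (a tweet is collected if it contains at least one keyword) can be sketched locally as follows. The matching was done by the Twitter API, whose exact semantics are not described in the source, so the tokenisation and case folding here are assumptions; `KEYWORDS` holds only a small subset of the list and `matches_keyword` is a hypothetical helper.

```python
import re

# Small illustrative subset of the keyword list above.
KEYWORDS = {"aften", "aldrig", "altid", "også", "på", "inte", "ikkje"}

# \w matches Unicode letters in Python 3, so æ, ø, å, ä and ö are kept inside tokens.
_TOKEN = re.compile(r"\w+")


def matches_keyword(tweet: str, keywords: set = KEYWORDS) -> bool:
    """Return True if at least one lower-cased token of the tweet is a keyword."""
    tokens = {token.lower() for token in _TOKEN.findall(tweet)}
    return not tokens.isdisjoint(keywords)


print(matches_keyword("Jeg kommer også i aften"))
print(matches_keyword("Hello world"))
```

Because the list mixes Danish, Norwegian and Swedish keywords, a match does not imply that a tweet is Danish, which is why the language-tag filter is applied afterwards.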
Who was involved in the data collection process?
+A team of researchers at the Center for Humanities Computing Aarhus (CHCAA), including Kristoffer Nielbo and Peter Bjerregaard Vahlstrup, in collaboration with Rebekah Baglini, at the School of Communication and Culture at Aarhus University.
+Over what timeframe was the data collected?
+The dataset includes tweets from the period 2019-01-01 to 2021-04-30.
+Were any ethical review processes conducted?
+No
+Was any preprocessing/Cleaning/Labeling of the data done +(e.g., discretization or bucketing, tokenization, part-of-speech tagging, +SIFT feature extraction, removal of instances, processing of missing values)?
+Firstly, HopeTwitter had non-Danish tweets removed, after which a series of +heuristic filters were applied, including the removal of repetitious texts. Following the filtering, +HopeTwitter was deduplicated, removing both exact duplicates and near-duplicates.
+Of all documents, 3,023,427 (9%) were filtered due to low quality and 14,399,284 (44%) because they were near-duplicates.
+For the quality filtering, HopeTwitter applies a filter akin to [2], which keeps texts that:
+Have less than 60% of words containing an alphabetic character.
+Have a low degree of repetitious text:
+The deduplication removed all documents with a 10-gram Jaccard similarity higher than 80% +following the MinHash algorithm [1] using 128 permutations. The MinHash algorithm is a +probabilistic data structure for approximating the Jaccard similarity between two sets.
+Is the software used to preprocess/clean/label the instances available?
+Yes, the scripts are available +here. +The scripts use version 0.0.2 of the +dfm package.
+Has the dataset been used for any tasks already?
+Yes, the dataset has been used to pre-train Danish language models. +Parts of the dataset have also been used in HOPE project reports +and in [4].
+Is there a repository that links to any or all papers or systems that use the dataset?
+There is a website for the HOPE project for which the dataset was initially collected. This website contains reports and articles regarding the dataset.
+What (other) tasks could the dataset be used for?
+The scale of the dataset makes it suitable for NLP tasks such as language modelling. Similarly, the conversation structure could be used to train conversational chatbots.
+Is there anything about the composition of the dataset or the way it was collected and +preprocessed/cleaned/labeled that might impact future uses?
+This dataset is static and thus does not evolve over time with the language. A consequence of this is that it will become increasingly outdated over time. However, it is possible to extend the dataset by a continual collection of tweets.
+Are there tasks for which the dataset should not be used?
+HopeTwitter contains Danish tweets and thus should not be used for non-Danish language tasks.
+As the writers of the content are predominantly journalists, politicians, influencers, and academics, it reflects a certain social group which is unlikely to reflect the Danish population as a whole.
+Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
+Data will only be available at the entity during the project. After the project the data will be archived for a period of five years to comply with the university [policy] for research integrity. After the five years, the data will be registered at the national archives as required by executive order 514 for potential long-term deposit.
+If you wish to cite this work please see our GitHub page for an up to date citation: +https://github.com/centre-for-humanities-computing/danish-foundation-models
+The DCC is a composite corpus consisting of the following subcorpora. For more information about the specific subcorpora, feel free to check out the individual datasheets.
Name | Description | Size | Open Access | Novel Corpus |
---|---|---|---|---|
Text | | | | |
DAGW | Danish Gigaword | 1B tokens | ✓ | ✗ |
reddit-da | Danish Reddit | <0.1B tokens | ✓ | ✗ |
HopeTwitter | Danish Tweets | 0.48B tokens | ✗ | ✓ |
DaNews | Danish newspapers | 0.5B tokens | ✗ | ✓ |
Netarkivet Text | Danish internet | >100B tokens | ✗ | ✓ |
Speech | | | | |
DaRadio | Danish talk radio | 140,000 hours | ✗ | ✓ |
DaTV | Danish subtitled TV | 900 hours | ✗ | ✓ |
Data are provided in agreement with the data owners and data collaborators. The data is generally accessible to the research collaborators, though each data agreement has its own access restrictions and might not cover all research collaborators. Access restrictions are specified on the server hosting the data in accordance with the data agreements.
+Welcome to the Danish Foundation Models (DFM) project, a pioneering initiative in the field of machine learning and natural language processing (NLP) dedicated to the Danish language. Our mission is to develop, maintain, and provide open access to high-quality foundation models tailored for Danish, promoting innovation and inclusivity in language technologies.
+Read the paper
+You can read more about the argument for Danish Language models in our publication.
+As many of the datasets we use either contain personally sensitive information or fall under copyright, they can't be shared publicly. However, we want to share as much as possible from the project while protecting privacy and adhering to copyright law. Thus we organize the project such that all parts that can be shared are shared, and those that can't are well documented using datasheets and training logs. Furthermore, during data processing and training, the data is stored on UCloud, which follows the highest standards of information security management with a formal ISO27001 certification.
+The Danish Foundation Models project collaborates with the Danish Data Science Community, Centre for Humanities Computing Aarhus, and The Alexandra Institute to promote the development of Danish language tools. We continually gather information about how to improve Danish language technologies and how best to support the community. To this end, we have created a list of missing pieces for Danish NLP, and we invite anyone to 1) add to the list, 2) solve one of the problems, or 3) upvote the problems they think are most important.
+We invite collaboration and contributions from industry professionals, researchers, and the open-source community. Together, we can advance the field of Danish NLP and create a more inclusive digital future. You can reach out to us using the following channels:
 | |
---|---|
DDSC Slack | Join the discussion in the "danish-foundation-models-text"-channel |
GitHub Discussion | Ask questions or start a discussion |
GitHub Issues | Noticed a bug in the code? Please create an issue |
Using the model? | If you use the model, let us know; it makes it easier for us to apply for funding and justify the development of the project. |
Each user tagged 100 documents unless otherwise specified. Documents were split by newlines into text-blocks, and each block was rated. Text-blocks longer than 1000 characters were split into multiple blocks of 1000 characters or less.
+This tagging scheme is similar to +(Kreutzer et al., 2022).
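The block splitting used for tagging can be sketched as below. The source does not say where long blocks were cut or how empty lines were handled, so cutting at fixed character offsets and skipping blank lines are both assumptions; `split_into_blocks` is a hypothetical name.

```python
def split_into_blocks(document: str, max_chars: int = 1000) -> list:
    """Split a document by newlines, then cut long lines into chunks of at most max_chars."""
    blocks = []
    for line in document.split("\n"):
        if not line.strip():
            continue  # Assumption: blank lines do not form blocks of their own.
        # Assumption: long blocks are cut at fixed character offsets.
        for start in range(0, len(line), max_chars):
            blocks.append(line[start:start + max_chars])
    return blocks


doc = "Kort afsnit.\n\n" + "x" * 2500
print([len(block) for block in split_into_blocks(doc)])
```

Cutting at fixed offsets can split a word in two; a variant that cuts at the last whitespace before the limit would avoid that at the cost of uneven block lengths.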
+Each block was put into one of the following categories:
- wrong_language: Not Danish
- skipped: Unsure of category
- correct_language: Danish text where at least 80% of the text is reasonable.
- not_language: Text where less than 80% of the text is reasonable. Takes priority over wrong_language.

Additionally, each block was tagged for pornography (yes/no) and offensiveness (yes/no).
+Kenneth (Session: test)
+Proportions:
+correct_language
not_language
skipped
wrong_language
Kenneth (Session: 1)
+Proportions:
+correct_language
not_language
skipped
wrong_language
Lasse (Session: 1)
+Proportions:
+correct_language
not_language
wrong_language
Kenneth (Session: test) vs Kenneth - (Session: 1)
+Cohen's Kappa (all categories): 0.8242 (Overlap in sentences: 98)
+Cohen's Kappa (correct_language vs not correct_language): 0.9075 (Overlap in sentences: 98)
+Kenneth (Session: test) vs Lasse - (Session: 1)
+Cohen's Kappa (all categories): 0.8140 (Overlap in sentences: 95)
+Cohen's Kappa (correct_language vs not correct_language): 0.8389 (Overlap in sentences: 95)
+Kenneth (Session: 1) vs Lasse - (Session: 1)
+Cohen's Kappa (all categories): 0.6767 (Overlap in sentences: 245)
+Cohen's Kappa (correct_language vs not correct_language): 0.7259 (Overlap in sentences: 245)
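For readers unfamiliar with the statistic reported above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below is a plain-Python illustration on made-up labels; it is not the script used to produce the reported values, and the helper names are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal label proportions.
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


def collapse(labels: list) -> list:
    """Binary view: correct_language versus everything else."""
    return [label == "correct_language" for label in labels]


# Made-up example labels for four text blocks.
rater_1 = ["correct_language", "correct_language", "not_language", "wrong_language"]
rater_2 = ["correct_language", "correct_language", "not_language", "correct_language"]

kappa_all = cohens_kappa(rater_1, rater_2)
kappa_binary = cohens_kappa(collapse(rater_1), collapse(rater_2))
```

Collapsing the labels to a binary view corresponds to the "correct_language vs not correct_language" rows, which ignore disagreements among the non-correct categories.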
+Comparison with mC4
+Note: mC4 had a high degree of repetitious texts. Similarly, when text blocks were not language, they were often something like:
+2lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22nowiki%22%7D"" class=""placeholder placeholder-ext"" contenteditable=""false"">]​</span></a></sup>​</span>, at en lurifaks som Jimmy page, bruger MIT navn til opfindelsen! SV<span data-rte-instance=""1524-12953202845f3523698f3f1"" data-rte-meta=""%7B%22type%22%3A%22ext%22%2C%22wikitext%22%3A%22%3Cref%3ESVIN%3C%5C%2Fref%3E%22%2C%22lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22ref%22%7D"" class=""placeholder placeholder-ext"" contenteditable=""false""><sup data-rte-washtml=""1"" id=""cite_ref-2"" class=""reference"" data-rte-attribs=""
+
In contrast, non-language texts in NAT were often menu bars, contact information, or navigation.
+Kenneth (Session: 1)
+Proportions:
+correct_language
not_language
skipped
wrong_language
This section gives an overview of the models available through the DFM project. The models are available through the Huggingface model hub. To avoid duplicating information, details about each model are available in the model's model sheet.
Model | Model type | Size (parameters) |
---|---|---|
dfm-encoder-large-v1 | Encoder | large (355M) |
dfm-encoder-medium-v1 | Encoder | medium (110M) |
dfm-encoder-small-v1 | Encoder | small (22M) |

Model | Model type |
---|---|
xls-r-300m-danish | Pretrained wav2vec2.0 model |
xls-r-300m-danish-nst-cv9 | Automatic speech recognition |
chcaa/xls-r-300m-nst-cv9-da | Automatic speech recognition |
Welcome to the Danish Foundation Models (DFM) project, a pioneering initiative in the field of machine learning and natural language processing (NLP) dedicated to the Danish language. Our mission is to develop, maintain, and provide open access to high-quality foundation models tailored for Danish, promoting innovation and inclusivity in language technologies.
Read the paper
You can read more about the argument for Danish Language models in our publication.
"},{"location":"#why-danish-foundation-models","title":"Why Danish Foundation Models?","text":""},{"location":"#bridging-the-digital-language-divide","title":"Bridging the Digital Language Divide","text":"As many of the datasets we use either contain personally sensitive information or fall under copyright they can't be shared publicly. However, we want to share as much as possible from the project, while protecting privacy and adhering to copyright law. Thus we organize it such that all parts of the project that can be shared and those which can't are well-documented using datasheets and training logs. Furthermore, during data processing and training, the data is stored on UCloud which follows the highest standards of information security management with a formal ISO27001 certification.
"},{"location":"#improving-the-danish-language-technology-landscape","title":"Improving the Danish Language Technology Landscape","text":"The Danish Foundations models collaborate with the Danish Data Science Community, Centre for Humanities Computing Aarhus, The Alexandra Institute to promote the development of Danish language tools. We continually gather information about how to improve the Danish language technologies and how to best support the community. To this end we have created a list of missing pieces for Danish NLP and we invite any 1) to add to the list, 2) solve one of the problems or 3) upvote the problems you think are most important.
"},{"location":"#join-us","title":"Join Us","text":"We invite collaboration and contributions from industry professionals, researchers, and the open-source community. Together, we can advance the field of Danish NLP and create a more inclusive digital future. You can reach out to us using the following channels:
- DDSC Slack Join the discussion in the \"danish-foundation-models-text\"-channel - GitHub Discussion Ask questions or start a discussion - GitHub Issues Noticed a bug in the code? Please create an issue - Using the model? If you use the model, let us know it makes it easier for us to apply for funding and justify the devopment of the project.Contact us
"},{"location":"dcc/","title":"DCC v1","text":"The DCC is a composite corpus consisting of the following subcorpora. For more information about the specific subcorpora, feel free to check out the individual datasheets.
Name Description Size Open Access Novel Corpus Text DAGW Danish Gigaword 1B tokens \u2713 \u2717 reddit-da Danish Reddit <.1B tokens \u2713 \u2717 HopeTwitter Danish Tweets 0.48B tokens \u2717 \u2713 DaNews Danish newspapers 0.5B tokens \u2717 \u2713 Netarkivet Text Danish internet >100B tokens \u2717 \u2713 Speech DaRadio Danish talk radio 140,000 hours \u2717 \u2713 DaTV Danish subtitled TV 900 hours \u2717 \u2713"},{"location":"dcc/#collaborators-and-data-owners","title":"Collaborators and Data Owners","text":"Data are provided in agreement with the data owners and data collaborators. The data is generally accessible to the research collaborators, though each data agreement has its own access restrictions and might not cover all research collaborators. Access restrictions are specified on the server hosting the data in accordance with the data agreements.
Each user tagged 100 documents unless otherwise specified. Documents were split by newlines into text-blocks, and each block was rated. Text-blocks longer than 1000 characters were split into multiple blocks of 1000 characters or less.
This tagging scheme is similar to (Kreutzer et al., 2022).
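The block splitting described above can be sketched roughly as follows. This is a minimal illustration and not the project's tagging code: the function name `split_into_blocks`, the choice to drop empty lines, and the fixed-width slicing of long lines are all assumptions.

```python
MAX_CHARS = 1000  # maximum block length stated in the text


def split_into_blocks(document, max_chars=MAX_CHARS):
    # Split a document into text-blocks, one per line (hypothetical helper).
    blocks = []
    for line in document.split('\n'):
        if not line.strip():
            continue  # assumption: empty lines are not rated
        # Assumption: long lines are cut into fixed-width slices.
        for start in range(0, len(line), max_chars):
            blocks.append(line[start:start + max_chars])
    return blocks
```

Each returned block would then be shown to a rater and assigned one of the categories.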
Each block was put into one of the following categories:
wrong_language: Not Danish
skipped: Unsure of category
correct_language: Danish text where at least 80% of the text is reasonable.
not_language: Text where less than 80% of the text is reasonable. Takes priority over wrong_language.
Additionally, each block was tagged for pornography (yes/no) and offensiveness (yes/no).
"},{"location":"intercoder_reliability/#text-proportions","title":"Text proportions","text":"Kenneth (Session: test)
Proportions:
correct_language
not_language
skipped
wrong_language
Kenneth (Session: 1)
Proportions:
correct_language
not_language
skipped
wrong_language
Lasse (Session: 1)
Proportions:
correct_language
not_language
wrong_language
Kenneth (Session: test) vs Kenneth - (Session: 1)
Cohen's Kappa (all categories): 0.8242 (Overlap in sentences: 98)
Cohen's Kappa (correct_language vs not correct_language): 0.9075 (Overlap in sentences: 98)
Kenneth (Session: test) vs Lasse - (Session: 1)
Cohen's Kappa (all categories): 0.8140 (Overlap in sentences: 95)
Cohen's Kappa (correct_language vs not correct_language): 0.8389 (Overlap in sentences: 95)
Kenneth (Session: 1) vs Lasse - (Session: 1)
Cohen's Kappa (all categories): 0.6767 (Overlap in sentences: 245)
Cohen's Kappa (correct_language vs not correct_language): 0.7259 (Overlap in sentences: 245)
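For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below shows the standard formula applied to two label lists; the function names and the toy labels are illustrative assumptions, not the scripts used to produce the numbers above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa for two raters who labelled the same items.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError('need two equally long, non-empty label lists')
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


def binarise(labels, positive='correct_language'):
    # Collapse the categories into correct_language vs everything else.
    return [label == positive for label in labels]


# Toy labels, not taken from the actual tagging sessions.
rater_a = ['correct_language', 'correct_language', 'not_language', 'wrong_language']
rater_b = ['correct_language', 'not_language', 'not_language', 'wrong_language']
kappa_all = cohens_kappa(rater_a, rater_b)
kappa_binary = cohens_kappa(binarise(rater_a), binarise(rater_b))
```

The binarised variant corresponds to the "correct_language vs not correct_language" rows reported above.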
Comparison with mC4
Note: mC4 had a high degree of repetitious text. Similarly, when its text blocks were not language, they often looked something like:
2lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22nowiki%22%7D\"\" class=\"\"placeholder placeholder-ext\"\" contenteditable=\"\"false\"\">]​</span></a></sup>​</span>, at en lurifaks som Jimmy page, bruger MIT navn til opfindelsen! SV<span data-rte-instance=\"\"1524-12953202845f3523698f3f1\"\" data-rte-meta=\"\"%7B%22type%22%3A%22ext%22%2C%22wikitext%22%3A%22%3Cref%3ESVIN%3C%5C%2Fref%3E%22%2C%22lineStart%22%3A%22%22%2C%22placeholder%22%3A1%2C%22extName%22%3A%22ref%22%7D\"\" class=\"\"placeholder placeholder-ext\"\" contenteditable=\"\"false\"\"><sup data-rte-washtml=\"\"1\"\" id=\"\"cite_ref-2\"\" class=\"\"reference\"\" data-rte-attribs=\"\"\n
In contrast, non-language texts in NAT were often menu bars, contact information, or navigation.
Kenneth (Session: 1)
Proportions:
correct_language
not_language
skipped
wrong_language
This section gives an overview of the models available through the DFM project. The models are available through the Huggingface model hub. To avoid duplicating information, details about each model are available in that model's model sheet.
"},{"location":"models/#text-models","title":"Text Models","text":"Model Model type Size (parameters) dfm-encoder-large-v1 Encoder large (355M) dfm-encoder-medium-v1 Encoder medium (110M) dfm-encoder-small-v1 Encoder small (22M)"},{"location":"models/#speech-models","title":"Speech Models","text":"Model Model type xls-r-300m-danish Pretrained wav2vec2.0 model xls-r-300m-danish-nst-cv9 Automatic speech recognition chcaa/xls-r-300m-nst-cv9-da Automatic speech recognition"},{"location":"datasheets/danews/","title":"DaNews","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
license: Not publicly available.
DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to 30 April 2021. The articles stem from multiple news sources, including both online and physical newspapers.
DaNews consists of 403 million tokens, of which 93% were left after quality filtering and deduplication.
"},{"location":"datasheets/danews/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [5] we add the following datasheet.
"},{"location":"datasheets/danews/#motivation","title":"Motivation","text":"For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
DaNews was collected as part of the HOPE project, which examined news coverage during the COVID-19 pandemic. The purpose was to train a model to understand how the novelty and resonance imprint of COVID-19, as a case of crisis, compared to non-crisis news imprints.
Any other comments?
No.
"},{"location":"datasheets/danews/#composition","title":"Composition","text":"How many instances are there in total (of each type, if appropriate)?
The unfiltered dataset consists of 713 429 documents including a total of 403 089 625 tokens.
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset are Danish articles derived from Danish tabloids or news media.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
Prior to filtering, the DaNews dataset contains all digitized news articles from the given period across the sources.
What data does each instance consist of? \u201cRaw\u201d data (e.g., unprocessed text or images) or features? In either case, please provide a description.
Each instance consists of the following columns
'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate', 'BodyText', \n'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId', 'PageIds', 'Section', 'text'\n
where we constructed the text column by joining the Heading and SubHeading using a newline. If a field is empty, it is ignored and no newline is added. We then join the resulting string with the BodyText using two newlines.
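The construction of the text column described above can be sketched as follows. This is an illustrative reading of the description rather than the project's actual preprocessing code; the function name `build_text` is hypothetical, and skipping an empty header or body in the final join is an assumption.

```python
def build_text(heading, sub_heading, body_text):
    # Join the non-empty header fields with a single newline.
    header = '\n'.join(field for field in (heading, sub_heading) if field)
    # Join header and body with two newlines. Skipping an empty header or
    # body here is an assumption, not something the datasheet states.
    return '\n\n'.join(part for part in (header, body_text) if part)
```

For an article with a heading, no subheading, and a body, this yields the heading, a blank line, and then the body.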
During the quality filtering, we add the following indicator columns:
'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length', \n'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word', \n'filtered_by_symbol_2_word_hashtag', 'filtered_by_symbol_2_word_ellipsis',\n'filtered_by_line_bullets_or_ellipsis', 'filtered_by_duplicate_lines_chr_fraction',\n'filtered_by_duplicate_paragraph_chr_fraction', 'filtered_by_top_ngram_chr_fraction',\n'filtered_by_duplicate_ngram_chr_fraction', 'is_duplicate'\n
Is there a label or target associated with each instance? If so, please provide a description.
No.
Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information but might include, e.g., redacted text.
The team of researchers at the Center for Humanities Computing Aarhus (CHCAA) has not removed any information from the instances.
Are relationships between individual instances made explicit (e.g., users\u2019 movie ratings, and social network links)? If so, please describe how these relationships are made explicit.
The metadata columns denote the relationship between articles including the date of publication, sections, and authors.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
No splits are performed on this dataset.
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
News sources can publish their content in both an online and a printed format, which leads to similar instances in the dataset. We alleviate this redundancy by removing near-duplicates (see Preprocessing/cleaning/labeling).
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
Articles are intended to tell a self-contained story but can include external references such as tweets or website URLs.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
Articles often describe content that is considered offensive, insulting, or threatening.
"},{"location":"datasheets/danews/#collection-process","title":"Collection Process","text":"What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)?
A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this dataset using a third-party API as well as a manual transfer from one of the parties. The API was limited to only a subset of articles agreed upon within the agreements.
If the dataset is a sample from a larger set, what was the sampling strategy?
The dataset is not a sample, but is a filtered version of the full dataset, see Preprocessing/cleaning/labeling for more on this.
Who was involved in the data collection process?
A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this dataset using a third-party API as well as a manual transfer from some of the parties and would like to thank the dataset owners for access to their articles.
Over what timeframe was the data collected?
The dataset includes articles from 1 December 2019 to 30 April 2021.
Were any ethical review processes conducted?
No.
"},{"location":"datasheets/danews/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
DaNews has been filtered using a series of heuristic filters as well as removing repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and near-duplicates.
Of all documents, 9% were filtered out due to low quality and 4% because they were near-duplicates.
For quality filtering, DaNews applies a filter akin to [2] which contains text that:
Have less than 30% of lines ending with an ellipsis.
Have a low degree of repetitious text:
The deduplication removed all documents with a 13-gram Jaccard similarity higher than 80% following the MinHash algorithm [1] using 128 permutations. MinHash is a probabilistic technique for approximating the Jaccard similarity between two sets.
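To make the deduplication step more concrete, here is a minimal, self-contained MinHash sketch. It is not the dfm package's implementation: word-level 13-gram shingles, salted BLAKE2b hashes standing in for the 128 permutations, and all function names are assumptions made for illustration.

```python
import hashlib

NUM_PERM = 128   # number of hash functions, matching the 128 permutations
NGRAM = 13       # shingle size; word-level shingles are an assumption
THRESHOLD = 0.8  # similarity above which a document counts as a duplicate


def shingles(text, n=NGRAM):
    # Return the set of word n-grams in the text.
    words = text.split()
    if len(words) < n:
        return {' '.join(words)} if words else set()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(shingle_set, num_perm=NUM_PERM):
    # One minimum per salted hash function forms the signature.
    if not shingle_set:
        raise ValueError('cannot compute a signature for an empty set')
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, 'little')
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode('utf-8'), digest_size=8, salt=salt).digest(),
                'little',
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    # The share of matching signature positions estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=THRESHOLD):
    return estimated_jaccard(minhash(shingles(text_a)), minhash(shingles(text_b))) > threshold
```

Comparing every pair of signatures is quadratic in the number of documents, so production pipelines typically add locality-sensitive hashing on top of the signatures; that part is omitted here.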
Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
"},{"location":"datasheets/danews/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train Danish language models. Parts of the dataset have also been used in [3] and [4].
Is there a repository that links to any or all papers or systems that use the dataset?
No.
What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modeling. Similarly, the structure of the articles makes it a suitable dataset for training text summarisation models.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and thus does not evolve over time with the language. A consequence of this is that it will become increasingly outdated over time.
Are there tasks for which the dataset should not be used?
This dataset contains Danish articles and thus should not be used for non-Danish language tasks.
As the writers of the content are predominantly journalists, it reflects a certain writing style which is unlikely to reflect the Danish language as a whole.
"},{"location":"datasheets/danews/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. If you wish to access the dataset, you will have to come to an agreement with the individual Danish newspapers.
"},{"location":"datasheets/danews/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up-to-date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/danews/#references","title":"References:","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
License: Not publicly available.
DaRadio consists of radio broadcasts from the Danish radio stations DR P1 and Radio24Syv, and contains approximately 140,000 hours of speech. DaRadio includes all shows aired on DR P1 from 2005 to 2021, and all shows aired on Radio24Syv from 2011 to 2019.
DaRadio has been deduplicated using a series of heuristics based on metadata. For more on deduplication, see the data cleaning section further below.
"},{"location":"datasheets/daradio/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [1], we add the following datasheet.
"},{"location":"datasheets/daradio/#motivation","title":"Motivation:","text":"For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?
Data included in DaRadio was collected following the Danish Legal Deposit Act by the Royal Danish Library (RDL). From this, a dataset of Danish speech-only radio was derived by RDL. The dataset was created for research purposes, including training a Danish wav2vec2.0 model.
The dataset was preprocessed to remove duplicates by a team of researchers at the Center for Humanities Computing, Aarhus University (CHC) with collaborators from the Danish speech-processing company Alvenir.
"},{"location":"datasheets/daradio/#composition","title":"Composition","text":"What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
Instances of the dataset include an mp3 file for each show aired on the two stations within the period. Further metadata include information on date and time of airing, title, short description of the show, and various internal identifiers used by RDL.
How many instances are there in total (of each type, if appropriate)?
DaRadio consists of a total of 215,582 hours of unprocessed Danish speech radio shows across two stations, DR P1 and Radio24syv. The table below shows the distribution over the stations with and without heuristic rerun removal.
Source Duration (hours) Reruns removed
P1 145,160 False
P1 97,401 True
Radio24syv 70,422 False
Radio24syv 44,569 True
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
The dataset contains all shows from the two stations in the time period (2005-2021 for DR P1 and 2011-2019 for Radio24syv).
If the dataset is a sample from a larger set, what was the sampling strategy?
The dataset is a subset of all Danish radio. The two stations were chosen for the dataset as they are talk-radio only.
Who was involved in the data collection process?
The RDL collects Danish radio shows and constructed DaRadio for handover to researchers at CHC.
Over what timeframe was the data collected?
The dataset includes radio shows from the period 2005 to 2021.
Were any ethical review processes conducted?
The RDL collects radio shows in adherence to Danish Archival laws. DaRadio was constructed for a research project, for which a project proposal was accepted by RDL. No other ethical review processes were conducted.
"},{"location":"datasheets/daradio/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
DaRadio has been deduplicated using a series of heuristic filters and all files have been converted to 16 kHz .wav files.
Reruns/duplicates were identified by the following rules:
The deduplication was coded and conducted by researchers at CHC.
Is the software used to preprocess/clean/label the instances available?
The scripts are available at the following GitHub repository: link.
"},{"location":"datasheets/daradio/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train a Danish wav2vec2.0 model.
Is there a repository that links to any or all papers or systems that use the dataset?
No, but as of 23/10/16 no others have used the dataset.
What (other) tasks could the dataset be used for?
As the dataset only contains unlabelled data, i.e. no transcriptions, it is mainly designed for pre-training language models. However, given the metadata and recurring hosts, further processing might make it possible to train e.g. text-to-speech systems.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and does not evolve over time with the language, and thus will become increasingly outdated.
"},{"location":"datasheets/daradio/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. An equivalent or updated dataset can be requested at the Royal Danish Library.
"},{"location":"datasheets/daradio/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/daradio/#references","title":"References:","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
license: Not publicly available.
HopeTwitter consists of tweets collected from the Twitter API using a stopword list and consists of 32.5 million tweets across 538,398 unique users. HopeTwitter includes tweets from 2019-01-01 to 2021-04-30.
HopeTwitter has been filtered to only include Danish tweets, based on the language tag from the Twitter API. Similarly, HopeTwitter has had low-quality tweets removed and was then deduplicated to remove exact and near-duplicates. For more on data cleaning, see the section \"Preprocessing/cleaning/labeling\".
HopeTwitter includes a total of 0.97 billion tokens before filtering and includes 0.48 billion (50%) after.
"},{"location":"datasheets/hopetwitter/#datasheet","title":"Datasheet","text":"Following the recommendation and framework of [3] we add the following datasheet.
"},{"location":"datasheets/hopetwitter/#motivation","title":"Motivation","text":"**For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset? **
HopeTwitter was initially collected as a part of the HOPE project, examining societal behaviour during the COVID-19 pandemic. Next, HopeTwitter was cleaned in preparation for pre-training Danish language models by a team of researchers at Center for Humanities Computing Aarhus (CHCAA), using a codebase jointly developed with partners from academia and industry, including KMD, Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this project see the GitHub repository.
Any other comments?
No.
"},{"location":"datasheets/hopetwitter/#composition","title":"Composition","text":"What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
HopeTwitter consists of tweets containing at least one of a series of stopwords, collected through the Twitter API. See \"If the dataset is a sample from a larger set, what was the sampling strategy?\" for the stopword list.
How many instances are there in total (of each type, if appropriate)?
The dataset consists of 32,499,019 documents, of which 14,399,284 (44%) were considered duplicates.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
No. It does not contain all instances of Danish Twitter, as there are likely some Danish tweets which do not include a stopword.
Is there a label or target associated with each instance? If so, please provide a description.
No.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
No splits are performed on this dataset.
If the dataset is a sample from a larger set, what was the sampling strategy?
Tweets are streamed continuously by querying a set of the highest-frequency Scandinavian-specific keywords from Danish, Norwegian (Bokm\u00e5l) and Swedish, resulting in the following list:
aften, aldrig, alltid, altid, andet, arbejde, bedste, beh\u00f6ver, beh\u00f8ver, beklager,\nber\u00e4tta, betyr, blev, blevet, blir, blitt, blive, bliver, bruge, burde, b\u00e4ttre, b\u00e5e\nb\u00f8r, deim, deires, ditt, drar, drepe, dykk, dykkar, d\u00e4r, d\u00f6d, d\u00f6da, d\u00f8d, d\u00f8de, efter,\nelsker, endnu, faen, fandt, feil, fikk, finner, flere, forst\u00e5r, fortelle, fortfarande,\nfortsatt, fort\u00e6lle, fr\u00e5n, f\u00e5, f\u00e5et, f\u00e5r, f\u00e5tt, f\u00f6rl\u00e5t, f\u00f6rsta, f\u00f6rs\u00f6ker, f\u00f8r, f\u00f8rst,\nf\u00f8rste, gick, gikk, gillar, gjennom, gjerne, gjorde, gjort, gj\u00f8r, gj\u00f8re, godt, g\u00e5, g\u00e5ng,\ng\u00e5r, g\u00f6ra, g\u00f8r, g\u00f8re, hadde, hall\u00e5, havde, hedder, helt, helvete, hende, hendes, hennes,\nherregud, hjelp, hjelpe, hjem, hj\u00e4lp, hj\u00e5, hj\u00e6lp, hj\u00e6lpe, honom, hossen, hvem, hvis,\nhvordan, hvorfor, h\u00e4nder, h\u00e4r, h\u00e5ll, h\u00e5ller, h\u00f8r, h\u00f8re, h\u00f8rer, igjen, ikkje, ingenting,\ninkje, inte, intet, jeres, j\u00e4vla, kanske, kanskje, kender, kjenner, korleis, kvarhelst,\nkveld, kven, kvifor, k\u00e4nner, ledsen, lenger, lidt, livet, l\u00e4ngre, l\u00e5t, l\u00e5ter, l\u00e6nge,\nmeget, menar, mycket, mykje, m\u00e5, m\u00e5de, m\u00e5nga, m\u00e5r, m\u00e5ske, m\u00e5ste, m\u00e5tte, navn, nogen,\nnoget, nogle, noko, nokon, nokor, nokre, n\u00e5gon, n\u00e5got, n\u00e5gra, n\u00e5n, n\u00e5r, n\u00e5t, n\u00f8dt,\nocks\u00e5, ogs\u00e5, pengar, penger, pratar, pr\u00f8ver, p\u00e5, redan, rundt, r\u00e4tt, sagde, saker,\nsamma, sammen, selv, selvf\u00f8lgelig, sidan, sidste, siger, sikker, sikkert, sj\u00e4lv, skete,\nskjedde, skjer, skulle, sluta, slutt, snakke, snakker, snill, sn\u00e4lla, somt, stadig,\nstanna, sted, st\u00e5r, synes, s\u00e4ger, s\u00e4tt, s\u00e5, s\u00e5dan, s\u00e5g, s\u00e5nn, tager, tiden, tilbage,\ntilbake, tillbaka, titta, trenger, trodde, troede, tror, tv\u00e5, tycker, t\u00e4nker, uden,\nundskyld, unnskyld, 
urs\u00e4kta, uten, varf\u00f6r, varit, varte, veldig, venner, verkligen,\nvidste, vilken, virkelig, visste, v\u00e4g, v\u00e4l, v\u00e4ldigt, v\u00e4n, v\u00e5r, v\u00e5ra, v\u00e5re, v\u00e6k, v\u00e6r, \nv\u00e6re, v\u00e6ret, \u00e4lskar, \u00e5h, \u00e5r, \u00e5t, \u00f6ver\n
Who was involved in the data collection process?
A team of researchers at the Center for Humanities Computing Aarhus (CHCAA), including Kristoffer Nielbo and Peter Bjerregaard Vahlstrup, in collaboration with Rebekah Baglini at the School of Communication and Culture at Aarhus University.
Over what timeframe was the data collected?
The dataset includes tweets from the period 2019-01-01 to 2021-04-30.
Were any ethical review processes conducted?
No
"},{"location":"datasheets/hopetwitter/#preprocessingcleaninglabeling","title":"Preprocessing/cleaning/labeling","text":"Was any preprocessing/Cleaning/Labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
Firstly, HopeTwitter had non-Danish tweets removed, after which a series of heuristic filters were applied, including the removal of repetitious texts. Following the filtering, HopeTwitter was deduplicated, removing both exact duplicates and near-duplicates.
Of all documents, 3,023,427 (9%) were filtered out due to low quality and 14,399,284 (33%) because they were near-duplicates.
For the quality filtering, HopeTwitter applies a filter akin to [2] which contains text that:
Have less than 60% of words containing an alphabetic character.
Have a low degree of repetitious text:
The deduplication removed all documents with a 10-gram Jaccard similarity higher than 80% following the MinHash algorithm [1] using 128 permutations. MinHash is a probabilistic technique for approximating the Jaccard similarity between two sets.
Is the software used to preprocess/clean/label the instances available?
Yes, the scripts are available here. The scripts use version 0.0.2 of the dfm package.
"},{"location":"datasheets/hopetwitter/#uses","title":"Uses","text":"Has the dataset been used for any tasks already?
Yes, the dataset has been used to pre-train Danish language models. Parts of the dataset have also been used in HOPE project reports and in [4].
Is there a repository that links to any or all papers or systems that use the dataset?
There is a website for the HOPE project, for which the dataset was initially collected. This website contains reports and articles regarding the dataset.
What (other) tasks could the dataset be used for?
The scale of the dataset makes it suitable for NLP tasks such as language modelling. Similarly, the conversation structure could be used to train conversational chatbots.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is static and thus does not evolve over time with the language. A consequence of this is that it will become increasingly outdated over time. However, it is possible to extend the dataset by a continual collection of tweets.
Are there tasks for which the dataset should not be used?
HopeTwitter contains Danish tweets and thus should not be used for non-Danish language tasks.
As the writers of the content are predominantly journalists, politicians, influencers, and academics, it reflects a certain social group which is unlikely to reflect the Danish population as a whole.
"},{"location":"datasheets/hopetwitter/#distribution","title":"Distribution","text":"Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
Data will only be available at the entity during the project. After the project the data will be archived for a period of five years to comply with the university [policy] for research integrity. After the five years, the data will be registered at the national archives as required by executive order 514 for potential long-term deposit.
"},{"location":"datasheets/hopetwitter/#citation","title":"Citation","text":"If you wish to cite this work please see our GitHub page for an up to date citation: https://github.com/centre-for-humanities-computing/danish-foundation-models
"},{"location":"datasheets/hopetwitter/#references","title":"References:","text":"Version: 1.0.0
Homepage: https://github.com/centre-for-humanities-computing/danish-foundation-models
license: Not publicly available.
This datasheet is currently being revised \ud83d\udee0\ufe0f
"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 00000000..0f8724ef --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,3 @@ + +