diff --git a/docs/datasheets/danews.md b/docs/datasheets/danews.md
index 19a86125..e9186aec 100644
--- a/docs/datasheets/danews.md
+++ b/docs/datasheets/danews.md
@@ -9,10 +9,11 @@
 
 ---
 
-DaNews consist of articles from Danish news and tabloid media from 1 December 2000 to
-30 April 2021. It consists of articles derived from infomedia API through the [HOPE project](https://hope-project.dk/#/). The articles stems from multiple news sources such as Politiken, including both online of physical news papers.
-DaNews consists of 9.29 billion tokens of which 8.67 Billion (93%) were left after
-quality filtering and deduplication.
+DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to
+30 April 2021. The articles stem from multiple news sources, including both online and physical newspapers.
+
+DaNews consists of 403 million tokens, of which 93% were left after
+quality filtering and deduplication.
 
 ## Datasheet
 
@@ -23,14 +24,7 @@ Following the recommendation and framework of [5] we add the following datasheet
 
 **For what purpose was the dataset created? Who created the dataset? Who funded the creation of the dataset?**
 
-The preprocessed dataset was created with the purpose of pre-training Danish language models. It was
-created by a team of researchers at the Center for Humanities Computing Aarhus ([CHCAA](https://chcaa.io/#/)) using
-a codebase jointly developed with partners from academia and industry, including KMD,
-Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this
-project see the
-[GitHub repository](https://github.com/centre-for-humanities-computing/danish-foundation-models
-).
-
+DaNews was collected as part of the HOPE project, examining news coverage during the COVID-19 pandemic. The purpose was to train a model to examine how the novelty and resonance imprint of COVID-19, as a case of crisis news, compares to the imprint of non-crisis news.
 
 **Any other comments?**
 
@@ -38,15 +32,15 @@ No.
 
 ## Composition
 
+**How many instances are there in total (of each type, if appropriate)?**
+
+The unfiltered dataset consists of 713,429 documents, including a total of 403,089,625 tokens.
+
 **What do the instances that comprise the dataset represent (e.g., documents, photos,
 people, countries)?**
 
 Instances of the dataset are Danish articles derived from Danish tabloids or news media.
 
-**How many instances are there in total (of each type, if appropriate)?**
-
-There are 25,874,862 documents in the unfiltered dataset, with 24,826,047 (96%) remaining
-after filtering.
 
 **Does the dataset contain all possible instances or is it a sample (not necessarily
 random) of instances from a larger set?**
 
@@ -57,17 +51,17 @@ period across the sources.
 
 **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
-Each instance constist of the following columns
+Each instance consists of the following columns:
 ```
 'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate',
 'BodyText', 'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId',
 'PageIds', 'Section', 'text'
 ```
 
 Where we constructed the columns `text` column by joining the `Heading`, `SubHeading`
-using newline. If the textfield is empty it is ignored and no newline is added. The we
+using a newline. If the text field is empty, it is ignored and no newline is added. Then we
 join the resulting string with the `BodyText` using two newlines.
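+
+To illustrate, the joining logic described above corresponds roughly to the following
+sketch (Python; the function name and the exact handling of empty fields are assumptions,
+not the project's actual code):
+
+```
+def build_text(heading: str, subheading: str, bodytext: str) -> str:
+    # Join Heading and SubHeading with a single newline, skipping empty fields
+    # so that no stray newline is added.
+    head = "\n".join(part for part in (heading, subheading) if part)
+    # Join the result with BodyText using two newlines, again skipping empty fields.
+    return "\n\n".join(part for part in (head, bodytext) if part)
+
+# e.g. build_text("Headline", "", "Body ...") == "Headline\n\nBody ..."
+```
+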
-During the quality filtering we add the following indicator columns:
+During the quality filtering, we add the following indicator columns:
 ```
 'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length',
 'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word',
@@ -84,17 +78,17 @@ No.
 
 **Is any information missing from individual instances? If so, please provide a
 description, explaining why this information is missing (e.g., because it was
-unavailable). This does not include intentionally removed information, but might
+unavailable). This does not include intentionally removed information but might
 include, e.g., redacted text.**
 
 The team of researchers at the Humanities Computing Aarhus (CHCAA) have not removed any
 information from the instances.
 
 **Are relationships between individual instances made explicit (e.g., users’ movie
-ratings, social network links)? If so, please describe how these relationships are made
+ratings, and social network links)? If so, please describe how these relationships are made
 explicit.**
 
-The metadata columns, denote the relationship between articles including date of
+The metadata columns denote the relationship between articles, including the date of
 publication, sections, and authors.
 
@@ -102,7 +96,7 @@ publication, sections, and authors.
 If so, please provide a description of these splits, explaining the rationale behind
 them.**
 
-There is not splits performed on this dataset.
+There are no splits performed on this dataset.
 
 **Are there any errors, sources of noise, or redundancies in the dataset? If so, please
 provide a description.**
 
@@ -114,37 +108,39 @@ near-duplicates (see Preprocessing/cleaning/labeling).
 
 **Is the dataset self-contained, or does it link to or otherwise rely on external
 resources (e.g., websites, tweets, other datasets)?**
 
-Articles are intended to tell a self-contained story, but can include external
-references such as tweets or website urls.
+Articles are intended to tell a self-contained story but can include external
+references such as tweets or website URLs.
 
 **Does the dataset contain data that, if viewed directly, might be offensive, insulting,
 threatening, or might otherwise cause anxiety?**
 
-Articles often describe content which is considered offensive, insulting or threatening.
+Articles often describe content that is considered offensive, insulting, or threatening.
 
 ## Collection Process
 
 **What mechanisms or procedures were used to collect the data (e.g., hardware
-apparatuses or sensors, manual human curation, software programs, software APIs)?**
+apparatuses or sensors, manual human curation, software programs, software APIs)?**
 
-A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
-dataset using the Infomedia API.
+A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
+dataset using a third-party API as well as a manual transfer from one of the parties. The API
+was limited to the subset of articles agreed upon in the agreements.
 
 **If the dataset is a sample from a larger set, what was the sampling strategy?**
 
 The dataset is not a sample, but _is_ a filtered version of the full dataset, see
 Preprocessing/cleaning/labeling for more on this.
 
-**Who was involved in the data collection process?**
+**Who was involved in the data collection process?**
 
 A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
-dataset using the Infomedia API and would like to thank the dataset owners for
-access to their articles.
+dataset using a third-party API as well as a manual transfer from some of the parties, and
+would like to thank the dataset owners for access to their articles.
 
 **Over what timeframe was the data collected?**
 
-The dataset includes articles from 1 December 2000 to
+The dataset includes articles from 1 December 2019 to
 30 April 2021.
 
 **Were any ethical review processes conducted?**
 
@@ -162,10 +158,9 @@ DaNews has been filtered using a series of heuristic filters as well as removing
 repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and
 near-duplicates.
 
-Of all documents, 2,338,728 (9%) were filtered based due to low-quality and 1,048,815
-(4%) because they were near-duplicates.
+Of all documents, 9% were filtered out due to low quality and 4% because they were near-duplicates.
 
-For the quality filtering, DaNews applies a filter akin to [2] which contains text
+For quality filtering, DaNews applies a filter akin to [2], which keeps only text
 that:
 
 - Contain at least 2 Danish stopwords. For the stopword list we use the one used in
@@ -174,11 +169,11 @@ SpaCy v.3.1.4.
 - Have a token length between 50 and 100,000.
 - Have less than 5,000,000 characters.
 - Have less than 60% of words containing an alphabetic character.
-- Have a symbol to word ratio lower than 10% for hashtags and ellipsis.
+- Have a symbol-to-word ratio lower than 10% for hashtags and ellipsis.
 - Have less than 90% of lines starting with a bullet point.
 - have less than 30% of lines ending with an ellipsis.
 
-- Have low high degree of repetitious text:
+- Have a low degree of repetitious text:
   - Have less than 20% of characters contained within duplicate lines.
   - Have less than 20% of characters contained within duplicate paragraphs.
   - Where the top 2-4 grams constitute less than 20%, 18%, 16%, respectively, of the text.
@@ -209,7 +204,7 @@ No.
 
 **What (other) tasks could the dataset be used for?**
 
-The scale of the dataset makes it suitable for NLP tasks such as language modelling.
+The scale of the dataset makes it suitable for NLP tasks such as language modeling.
 Similarly, the structure of the articles makes it a suitable dataset for training text
 summarisation models.
 
@@ -233,11 +228,11 @@ writing style which is unlikely to reflect the Danish language as a whole.
 
 **Will the dataset be distributed to third parties outside of the entity (e.g., company,
 institution, organization) on behalf of which the dataset was created?**
 
 Data will only be available at the entity during the project. If you wish access to
 the dataset you will have to come to an agreement with the individuals
-Danish newspapers potentially through Infomedia.
+Danish newspapers.
 
 ### Citation
 
-If you wish to cite this work please see our GitHub page for an up to date citation:
+If you wish to cite this work, please see our GitHub page for an up-to-date citation:
 https://github.com/centre-for-humanities-computing/danish-foundation-models
 
 # References:
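
As an illustration of the heuristic quality filter listed in the Preprocessing/cleaning/labeling
section above, a rough sketch is given below (Python; it covers only a few of the listed checks,
the function name and the symbol and bullet definitions are our own assumptions, and spaCy is
assumed to be installed for its Danish stopword list; this is not the project's actual filtering code):

```
from spacy.lang.da.stop_words import STOP_WORDS  # spaCy's Danish stopword list

def passes_quality_filter(text: str) -> bool:
    tokens = text.split()
    lines = text.splitlines()
    n_tokens = max(len(tokens), 1)
    n_lines = max(len(lines), 1)
    return (
        # At least 2 Danish stopwords.
        sum(tok.lower() in STOP_WORDS for tok in tokens) >= 2
        # Token length between 50 and 100,000, and fewer than 5,000,000 characters.
        and 50 <= len(tokens) <= 100_000
        and len(text) < 5_000_000
        # Symbol-to-word ratio below 10% for hashtags and ellipses.
        and (text.count("#") + text.count("…")) / n_tokens < 0.10
        # Less than 90% of lines starting with a bullet point.
        and sum(l.lstrip().startswith(("-", "*", "•")) for l in lines) / n_lines < 0.90
        # Less than 30% of lines ending with an ellipsis.
        and sum(l.rstrip().endswith(("…", "...")) for l in lines) / n_lines < 0.30
    )
```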