
---

DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to
30 April 2021. The articles stem from multiple news sources, including both online and physical newspapers.

DaNews consists of 403 million tokens, of which 93% were left after quality filtering and deduplication.

## Datasheet

Following the recommendation and framework of [5], we add the following datasheet.
**For what purpose was the dataset created? Who created the dataset? Who funded the
creation of the dataset?**

The preprocessed dataset was created with the purpose of pre-training Danish language models. It was
created by a team of researchers at the Center for Humanities Computing Aarhus ([CHCAA](https://chcaa.io/#/)) using
a codebase jointly developed with partners from academia and industry, including KMD,
Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this
project see the
[GitHub repository](https://github.com/centre-for-humanities-computing/danish-foundation-models).

DaNews was collected as part of the HOPE project, which examined news coverage during the COVID-19 pandemic. The purpose was to train a model to understand how the novelty and resonance imprint of COVID-19, as a case of crisis, compared to the imprint of non-crisis news.

**Any other comments?**

No.

## Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos,
people, countries)?**

Instances of the dataset are articles derived from Danish tabloids or news media.

**How many instances are there in total (of each type, if appropriate)?**

There are 25,874,862 documents in the unfiltered dataset, with 24,826,047 (96%) remaining
after filtering.

**Does the dataset contain all possible instances or is it a sample (not necessarily
random) of instances from a larger set?**
The dataset contains the articles available from the given time period across the sources.
**What data does each instance consist of? “Raw” data (e.g., unprocessed text or images)
or features? In either case, please provide a description.**

Each instance consists of the following columns:
```
'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate', 'BodyText',
'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId', 'PageIds', 'Section', 'text'
```

The `text` column is constructed by joining the `Heading` and `SubHeading` using a newline. If a field is empty, it is ignored and no newline is added. The resulting string is then joined with the `BodyText` using two newlines.
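
A minimal sketch of this construction (the function name and the dict-based article representation are ours for illustration, not part of the dataset itself):

```python
def build_text(article: dict) -> str:
    """Join Heading and SubHeading with a newline (skipping empty fields),
    then join the result with BodyText using two newlines."""
    header = "\n".join(
        part for part in (article.get("Heading", ""), article.get("SubHeading", "")) if part
    )
    return "\n\n".join(part for part in (header, article.get("BodyText", "")) if part)


# Example: an empty SubHeading adds no extra newline.
article = {"Heading": "Overskrift", "SubHeading": "", "BodyText": "Brødtekst ..."}
assert build_text(article) == "Overskrift\n\nBrødtekst ..."
```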

During the quality filtering, we add the following indicator columns:
```
'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length',
'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word',
```
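
As a usage illustration only (the file name and Parquet format below are assumptions, not stated in the datasheet), these indicator columns make it straightforward to recover the filtered subset:

```python
import pandas as pd

# Hypothetical example: load the processed articles and keep only the
# documents that passed the quality filter.
df = pd.read_parquet("danews_processed.parquet")  # assumed file name and format
kept = df[df["passed_quality_filter"] == True]  # column assumed to be boolean-like
print(f"{len(kept)} of {len(df)} documents passed the quality filter")
```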

**Is any information missing from individual instances? If so, please provide a
description, explaining why this information is missing (e.g., because it was
unavailable). This does not include intentionally removed information but might
include, e.g., redacted text.**

The team of researchers at the Center for Humanities Computing Aarhus (CHCAA) has not
removed any information from the instances.

**Are relationships between individual instances made explicit (e.g., users’ movie
ratings, and social network links)? If so, please describe how these relationships are made
explicit.**

The metadata columns denote the relationship between articles, including the date of
publication, sections, and authors.


**Are there recommended data splits (e.g., training, development/validation, testing)?
If so, please provide a description of these splits, explaining the rationale behind
them.**

No splits have been performed on this dataset.

**Are there any errors, sources of noise, or redundancies in the dataset? If so, please
provide a description.**
The dataset has been filtered and deduplicated to remove redundancies in the form of exact and
near-duplicates (see Preprocessing/cleaning/labeling).
**Is the dataset self-contained, or does it link to or otherwise rely on external
resources (e.g., websites, tweets, other datasets)?**

Articles are intended to tell a self-contained story but can include external
references such as tweets or website URLs.


**Does the dataset contain data that, if viewed directly, might be offensive, insulting,
threatening, or might otherwise cause anxiety?**

Articles often describe content that is considered offensive, insulting, or threatening.

## Collection Process

**What mechanisms or procedures were used to collect the data (e.g., hardware
apparatuses or sensors, manual human curation, software programs, software APIs)?**

A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
dataset using a third-party API as well as a manual transfer from one of the parties. The API was limited
to a subset of the articles agreed upon within the agreements.

**If the dataset is a sample from a larger set, what was the sampling strategy?**

The dataset is not a sample, but _is_ a filtered version of the full dataset; see
Preprocessing/cleaning/labeling for more on this.

**Who was involved in the data collection process?**

A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
dataset using a third-party API as well as a manual transfer from some of the parties, and would like to thank the dataset owners for
access to their articles.


**Over what timeframe was the data collected?**

The dataset includes articles from 1 December 2019 to
30 April 2021.

**Were any ethical review processes conducted?**

## Preprocessing/cleaning/labeling

DaNews has been filtered using a series of heuristic filters as well as removing
repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and
near-duplicates.
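
Purely as a generic illustration of how near-duplicate detection can be done, here is a MinHash LSH sketch using the third-party `datasketch` library (our choice for the example; the datasheet does not state that this is the tool used for DaNews):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# Index documents and query for near-duplicates above a Jaccard threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {
    "a": "regeringen fremlægger i dag en ny plan for gradvis genåbning",
    "b": "regeringen fremlægger i dag en ny plan for gradvis genåbning af landet",
    "c": "helt andet indhold om sport og fodbold",
}
for key, text in docs.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(docs["a"])))  # expected to include both "a" and "b"
```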

Of all documents, 9% were filtered out due to low quality and 4% because they were near-duplicates.

For quality filtering, DaNews applies a filter akin to [2] (a rough code sketch of a few of these checks is given after the list), keeping only text that:

- Contain at least 2 Danish stopwords. For the stopword list we use the one used in
SpaCy v.3.1.4.
- Have a token length between 50 and 100,000.
- Have less than 5,000,000 characters.
- Have more than 60% of words containing an alphabetic character.
- Have a symbol-to-word ratio lower than 10% for hashtags and ellipsis.
- Have less than 90% of lines starting with a bullet point.
- Have less than 30% of lines ending with an ellipsis.

- Have a low degree of repetitious text:
  - Have less than 20% of characters contained within duplicate lines.
  - Have less than 20% of characters contained within duplicate paragraphs.
  - Where the top 2-4 grams constitute less than 20%, 18%, and 16%, respectively, of the text.
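
A rough sketch of a few of these heuristics (the stopword set and tokenisation below are small stand-ins, not the SpaCy v.3.1.4 list or the actual implementation, and only a subset of the listed checks is shown):

```python
DANISH_STOPWORDS = {"og", "i", "jeg", "det", "at", "en", "den", "til", "er", "som"}

def passes_quality_filter(text: str) -> bool:
    """Illustrative subset of the quality heuristics listed above."""
    words = text.split()
    if not words:
        return False

    # Contain at least 2 Danish stopwords.
    if sum(w.lower() in DANISH_STOPWORDS for w in words) < 2:
        return False

    # Token length between 50 and 100,000; fewer than 5,000,000 characters.
    if not (50 <= len(words) <= 100_000) or len(text) >= 5_000_000:
        return False

    # Less than 30% of lines ending with an ellipsis.
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and sum(line.rstrip().endswith(("...", "…")) for line in lines) / len(lines) >= 0.3:
        return False

    return True
```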

**What (other) tasks could the dataset be used for?**

The scale of the dataset makes it suitable for NLP tasks such as language modeling.
Similarly, the structure of the articles makes it a suitable dataset for training text
summarisation models.

However, news articles are written in a distinct writing style, which is unlikely to reflect the Danish language as a whole.
**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**

The data will only be available at the entity for the duration of the project. If you wish to access the dataset, you will have to come to an agreement with the individual
Danish newspapers.

### Citation

If you wish to cite this work, please see our GitHub page for an up-to-date citation:
https://github.com/centre-for-humanities-computing/danish-foundation-models

# References:
