
---

DaNews consists of articles from Danish news and tabloid media from 1 December 2019 to
30 April 2021. The articles stem from multiple news sources, including both online and physical newspapers.

DaNews consists of 403 million tokens, of which 93% were left after quality filtering and deduplication.

## Datasheet

Following the recommendation and framework of [5], we add the following datasheet.
**For what purpose was the dataset created? Who created the dataset? Who funded the
creation of the dataset?**

The preprocessed dataset was created with the purpose of pre-training Danish language models. It was
created by a team of researchers at the Center for Humanities Computing Aarhus ([CHCAA](https://chcaa.io/#/)) using
a codebase jointly developed with partners from academia and industry, including KMD,
Ekstra Bladet, Bristol University and Deepdivr. For more on collaborators on this
project see the
[GitHub repository](https://github.com/centre-for-humanities-computing/danish-foundation-models).

DaNews was collected as part of the HOPE project, which examined news coverage during the COVID-19 pandemic. The purpose was to train a model to understand how the novelty and resonance imprint of COVID-19, as a case of crisis, compared to the imprint of non-crisis news.

**Any other comments?**

No.

## Composition

**What do the instances that comprise the dataset represent (e.g., documents, photos,
people, countries)?**

Instances of the dataset are articles derived from Danish tabloids or news media.

**How many instances are there in total (of each type, if appropriate)?**

There are 25,874,862 documents in the unfiltered dataset, with 24,826,047 (96%) remaining
after filtering.

**Does the dataset contain all possible instances or is it a sample (not necessarily
random) of instances from a larger set?**
The dataset contains the articles available from the given time period across the sources.
**What data does each instance consist of? “Raw” data (e.g., unprocessed text or images)
or features? In either case, please provide a description.**

Each instance consists of the following columns:
```
'ArticleUrl', 'Heading', 'SubHeading', 'Lead', 'Paragraph', 'PublishDate', 'BodyText',
'Captions', 'Authors', 'Source', 'WordCount', 'ArticleId', 'PageIds', 'Section', 'text'
```

The `text` column is constructed by joining the `Heading` and `SubHeading` using a newline. If a field is empty, it is ignored and no newline is added. The resulting string is then joined with the `BodyText` using two newlines.
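
A minimal sketch of this construction (the function name and the dict-based article representation are ours for illustration, not part of the dataset itself):

```python
def build_text(article: dict) -> str:
    """Join Heading and SubHeading with a newline (skipping empty fields),
    then join the result with BodyText using two newlines."""
    header = "\n".join(
        part for part in (article.get("Heading", ""), article.get("SubHeading", "")) if part
    )
    return "\n\n".join(part for part in (header, article.get("BodyText", "")) if part)


# Example: an empty SubHeading adds no extra newline.
article = {"Heading": "Overskrift", "SubHeading": "", "BodyText": "Brødtekst ..."}
assert build_text(article) == "Overskrift\n\nBrødtekst ..."
```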

During the quality filtering, we add the following indicator columns:
```
'passed_quality_filter', 'filtered_by_max_chr_length', 'filtered_by_doc_length',
'filtered_by_mean_word_length', 'filtered_by_alpha_ratio', 'filtered_by_stop_word',
```
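
As a usage illustration only (the file name and Parquet format below are assumptions, not stated in the datasheet), these indicator columns make it straightforward to recover the filtered subset:

```python
import pandas as pd

# Hypothetical example: load the processed articles and keep only the
# documents that passed the quality filter.
df = pd.read_parquet("danews_processed.parquet")  # assumed file name and format
kept = df[df["passed_quality_filter"] == True]  # column assumed to be boolean-like
print(f"{len(kept)} of {len(df)} documents passed the quality filter")
```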

**Is any information missing from individual instances? If so, please provide a
description, explaining why this information is missing (e.g., because it was
unavailable). This does not include intentionally removed information but might
include, e.g., redacted text.**

The team of researchers at the Center for Humanities Computing Aarhus (CHCAA) has not
removed any information from the instances.

**Are relationships between individual instances made explicit (e.g., users’ movie
ratings, and social network links)? If so, please describe how these relationships are made
explicit.**

The metadata columns denote the relationship between articles, including the date of
publication, sections, and authors.


**Are there recommended data splits (e.g., training, development/validation, testing)?
If so, please provide a description of these splits, explaining the rationale behind
them.**

No splits have been performed on this dataset.

**Are there any errors, sources of noise, or redundancies in the dataset? If so, please
provide a description.**
The dataset has been filtered and deduplicated to remove redundancies in the form of exact and
near-duplicates (see Preprocessing/cleaning/labeling).
**Is the dataset self-contained, or does it link to or otherwise rely on external
resources (e.g., websites, tweets, other datasets)?**

Articles are intended to tell a self-contained story but can include external
references such as tweets or website URLs.


**Does the dataset contain data that, if viewed directly, might be offensive, insulting,
threatening, or might otherwise cause anxiety?**

Articles often describe content that is considered offensive, insulting, or threatening.

## Collection Process

**What mechanisms or procedures were used to collect the data (e.g., hardware
apparatuses or sensors, manual human curation, software programs, software APIs)?**

A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
dataset using a third-party API as well as a manual transfer from one of the parties. The API was limited
to a subset of the articles agreed upon within the agreements.

**If the dataset is a sample from a larger set, what was the sampling strategy?**

The dataset is not a sample, but _is_ a filtered version of the full dataset; see
Preprocessing/cleaning/labeling for more on this.

**Who was involved in the data collection process?**

A team of researchers at the Center for Humanities Computing Aarhus (CHCAA) obtained this
dataset using a third-party API as well as a manual transfer from some of the parties, and would like to thank the dataset owners for
access to their articles.


**Over what timeframe was the data collected?**

The dataset includes articles from 1 December 2019 to
30 April 2021.

**Were any ethical review processes conducted?**

## Preprocessing/cleaning/labeling

DaNews has been filtered using a series of heuristic filters as well as removing
repetitious texts. Following the filtering, DaNews is deduplicated to remove exact and
near-duplicates.
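
Purely as a generic illustration of how near-duplicate detection can be done, here is a MinHash LSH sketch using the third-party `datasketch` library (our choice for the example; the datasheet does not state that this is the tool used for DaNews):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# Index documents and query for near-duplicates above a Jaccard threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {
    "a": "regeringen fremlægger i dag en ny plan for gradvis genåbning",
    "b": "regeringen fremlægger i dag en ny plan for gradvis genåbning af landet",
    "c": "helt andet indhold om sport og fodbold",
}
for key, text in docs.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(docs["a"])))  # expected to include both "a" and "b"
```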

Of all documents, 9% were filtered out due to low quality and 4% because they were near-duplicates.

For quality filtering, DaNews applies a filter akin to [2] (a rough code sketch of a few of these checks is given after the list), keeping only text that:

- Contain at least 2 Danish stopwords. For the stopword list we use the one used in
SpaCy v.3.1.4.
- Have a token length between 50 and 100,000.
- Have less than 5,000,000 characters.
- Have more than 60% of words containing an alphabetic character.
- Have a symbol-to-word ratio lower than 10% for hashtags and ellipsis.
- Have less than 90% of lines starting with a bullet point.
- Have less than 30% of lines ending with an ellipsis.

- Have a low degree of repetitious text:
  - Have less than 20% of characters contained within duplicate lines.
  - Have less than 20% of characters contained within duplicate paragraphs.
  - Where the top 2-4 grams constitute less than 20%, 18%, and 16%, respectively, of the text.
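
A rough sketch of a few of these heuristics (the stopword set and tokenisation below are small stand-ins, not the SpaCy v.3.1.4 list or the actual implementation, and only a subset of the listed checks is shown):

```python
DANISH_STOPWORDS = {"og", "i", "jeg", "det", "at", "en", "den", "til", "er", "som"}

def passes_quality_filter(text: str) -> bool:
    """Illustrative subset of the quality heuristics listed above."""
    words = text.split()
    if not words:
        return False

    # Contain at least 2 Danish stopwords.
    if sum(w.lower() in DANISH_STOPWORDS for w in words) < 2:
        return False

    # Token length between 50 and 100,000; fewer than 5,000,000 characters.
    if not (50 <= len(words) <= 100_000) or len(text) >= 5_000_000:
        return False

    # Less than 30% of lines ending with an ellipsis.
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and sum(line.rstrip().endswith(("...", "…")) for line in lines) / len(lines) >= 0.3:
        return False

    return True
```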

**What (other) tasks could the dataset be used for?**

The scale of the dataset makes it suitable for NLP tasks such as language modeling.
Similarly, the structure of the articles makes it a suitable dataset for training text
summarisation models.

However, news articles are written in a distinct writing style, which is unlikely to reflect the Danish language as a whole.
**Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?**

The data will only be available at the entity for the duration of the project. If you wish to access the dataset, you will have to come to an agreement with the individual
Danish newspapers.

### Citation

If you wish to cite this work, please see our GitHub page for an up-to-date citation:
https://github.com/centre-for-humanities-computing/danish-foundation-models

# References:
