Retrieving Wikipedia mentions of patient-centered outcomes research
This project retrieves articles from Pubmed and collects available altmetrics from Altmetric.com and Paperbuzz, which ingests data from Crossref Event Data (CED).
For each specified Pubmed query, the pipeline produces a single spreadsheet containing metadata from Pubmed and the specified metrics from Altmetric and Paperbuzz.
The processing pipeline:
- Collect papers from Pubmed based on the query. This step also creates a new timestamped folder for the results. All following steps are always applied to the newest data collection (i.e. the newest folder) available, so make sure the data folder contains the correct folders and files.
- Retrieve available metrics from Altmetric based on PMID
- Retrieve available metrics from Paperbuzz based on DOI
- Merge results into one single spreadsheet
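Because the later steps always operate on the newest data collection, it can help to check which folder that is before running them. The following is a minimal sketch, not the project's actual lookup logic. It assumes the timestamped folder names sort chronologically as plain strings (for example 2019-03-01_120000); the function name newest_collection and the folder name format are hypothetical.

```python
from pathlib import Path
from typing import Optional


def newest_collection(data_dir: Path) -> Optional[Path]:
    """Return the most recent sub-folder of data_dir, or None if there is none.

    Assumption: folder names are timestamps that sort chronologically when
    compared as strings. Plain files in data_dir are ignored.
    """
    folders = [p for p in data_dir.iterdir() if p.is_dir()]
    return max(folders, key=lambda p: p.name, default=None)
```

If the folder names do not sort chronologically, comparing on p.stat().st_mtime instead of p.name would be one alternative.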
To run the code you need Python 3 and R installed on your system. Python requirements are specified in requirements.txt. Make sure to install the following four packages for the R script: yaml, jsonlite, rentrez, XML.
Create a copy of example_config.yml and rename it to config.yml.
- Define your Pubmed queries.
- Define the queries that need to be merged together. The Entrez API has a character limit for URLs, so some longer Pubmed queries need to be broken down into separate sub-queries, even if the long query works in the advanced search interface.
- Specify which metrics should be exported from the Altmetric/Paperbuzz results
- Altmetric: The names of (most) available fields can be found in this spreadsheet. The key in the YAML config corresponds to the desired name of the metric in the output CSV. You can choose the name of the key! The value on the other hand corresponds to the name of the field within Altmetric's database (you can find them in the link above)
- Paperbuzz/CED: Once again, you can name the key as you wish. In this case, the value should be one of the available source_ids defined in the Crossref Event Data docs. For example, the CED docs page for Twitter shows that the source_id for Twitter is twitter.
- Provide your Altmetric API key to access their API
- Provide contact details for the Paperbuzz API (which is free)
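The key/value convention for exported metrics can be illustrated with a short sketch. It is not the project's export code: the dictionary below stands in for the metric section of config.yml, the function name extract_metrics is hypothetical, and the two Altmetric field names are only examples of what a value might look like. Each key becomes a column name in the output, and each value names the field to read from the stored JSON response.

```python
import json
from typing import Any, Dict

# Hypothetical stand-in for the Altmetric metric section of config.yml:
# key = column name in the output CSV, value = field name in the API response.
ALTMETRIC_FIELDS: Dict[str, str] = {
    "tweets": "cited_by_tweeters_count",
    "wikipedia": "cited_by_wikipedia_count",
}


def extract_metrics(response_json: str, fields: Dict[str, str]) -> Dict[str, Any]:
    """Pick the configured fields out of one JSON response dump.

    Fields that are missing from the response are returned as None, so every
    article ends up with the same set of columns.
    """
    record = json.loads(response_json)
    return {column: record.get(field) for column, field in fields.items()}
```

Renaming a key only changes the column header in the output; changing a value changes which field is read.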
Once the configuration is done, simply execute scripts in the following order:
- collect_pubmed.R to collect all raw results from Pubmed. This script will create temporary JSON files in the respective folders (running Rscript collect_pubmed.R from the console and running the script from RStudio should both work)
- process_pubmed.py to process these temporary JSON files. This script creates a CSV with basic metadata for all articles
- collect_altmetric.py to retrieve altmetrics from Altmetric.com. This step creates a CSV with a dump of the JSON responses from the API
- collect_paperbuzz.py creates a CSV with the JSON dumps from the Paperbuzz API
- export_data.py finally extracts the specified metrics from the two spreadsheets and creates a final CSV containing metadata + metrics
- merge_results.py is a final step required for this project to merge two of the queries, as the initial Pubmed query was too long
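Merging the results of sub-queries that were split from one long Pubmed query can be sketched as follows. This is an illustration of the idea rather than the contents of merge_results.py: it assumes each result row is a dictionary with a pmid column, and the function name merge_by_pmid is hypothetical. Articles returned by more than one sub-query are kept only once.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def merge_by_pmid(*result_sets: Iterable[Row]) -> List[Row]:
    """Combine rows from several sub-queries, keeping the first row per PMID.

    Assumption: every row has a "pmid" key. The order of first appearance
    is preserved.
    """
    merged: Dict[str, Row] = {}
    for rows in result_sets:
        for row in rows:
            # setdefault keeps the row already stored for a PMID, if any.
            merged.setdefault(row["pmid"], row)
    return list(merged.values())
```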
While we call both types of returned results "metrics", it is important to note that Altmetric and CED represent fundamentally different types of observations. While the exact definitions differ for various types of engagement, Altmetric tries to aggregate events that relate to a singular source. Crossref, on the other hand, sticks to observations of activities related to DOIs.
- More readings on Altmetric here and the previously mentioned link also contains some useful descriptives of each metric.
- CED has an amazing user guide that is quite comprehensive and useful.
The project has been created by Asura Enkhbayar and is shared under the MIT license. It also relies on the rOpenSci package rentrez.