5 datasets found

A meta analysis of Wikipedia's coronavirus sources during the COVID-19...
live.european-language-grid.eu
zenodo.org
txt
Updated Sep 8, 2022
Cite
(2022). A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7806
Available download formats: txt
Dataset updated
Sep 8, 2022
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke a record for most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read, and in some cases also contribute, knowledge, information and data about the virus in an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how the scientific research in this field was represented on Wikipedia. Using citations as a readout, we try to map how COVID-19-related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the COVID-19 content, is key to understanding the digital knowledge echosphere during the pandemic.

To delimit the corpus of Wikipedia articles containing Digital Object Identifiers (DOIs), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. Second, we searched EuroPMC for Covid-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints), selected the scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). Together, these strategies yielded 231 Wikipedia articles that either contain at least one citation from the EuroPMC search or belong to the Wikipedia COVID-19 project pages containing DOIs. Next, from this corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method produced tables of annotated citations and of the information extracted from each Wikipedia article, such as books, websites and newspapers.

Files used as input and extracted information on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code and other bash/Python script utilities related to this project.
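
The regular-expression extraction step described above can be sketched roughly as follows. This is a minimal illustration only: the patterns, the function name `extract_identifiers` and the sample wikitext are assumptions made for this example and are not the expressions used in the WikiCitationHistoRy code.

```python
import re

# Illustrative patterns only (assumptions); the project's own expressions may differ.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|}\]<>"]+', re.IGNORECASE)
PMID_RE = re.compile(r'\bpmid\s*=\s*(\d+)', re.IGNORECASE)
ISBN_RE = re.compile(r'\bisbn\s*=\s*([0-9][0-9Xx -]{8,15}[0-9Xx])', re.IGNORECASE)
URL_RE = re.compile(r'https?://[^\s|}\]<>"]+')


def _unique(values):
    """Strip trailing punctuation and return sorted unique values."""
    return sorted({v.rstrip(".,;") for v in values})


def extract_identifiers(wikitext):
    """Collect DOIs, PMIDs, ISBNs and URLs found in a wikitext string."""
    return {
        "doi": _unique(DOI_RE.findall(wikitext)),
        "pmid": _unique(PMID_RE.findall(wikitext)),
        "isbn": _unique(ISBN_RE.findall(wikitext)),
        "url": _unique(URL_RE.findall(wikitext)),
    }


# Hypothetical wikitext fragment, made up for this example.
sample = (
    "{{cite journal |title=Example |doi=10.1000/xyz123 "
    "|pmid=12345678 |url=https://example.org/paper}} "
    "{{cite book |isbn=978-3-16-148410-0}}"
)
identifiers = extract_identifiers(sample)
for kind, values in identifiers.items():
    print(kind, values)
```

Real citation templates vary widely, so patterns like these would likely need tuning against actual article text before being relied on.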
Yearly pageviews of English Wikipedia articles with potential links to green...
zenodo.org
data.niaid.nih.gov
csv, text/x-python
Updated Nov 16, 2020
Cite
Federico Leva (2020). Yearly pageviews of English Wikipedia articles with potential links to green open access scholarly articles [Dataset]. http://doi.org/10.5281/zenodo.3783468
Available download formats: csv, text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.3783468
Dataset updated
Nov 16, 2020
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Federico Leva
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Number of visits in 2019 for a sample of 23462 English Wikipedia articles that contain references to academic sources for which a green open access copy is available but not yet used. The consultation statistics were retrieved from the Wikimedia pageviews API using the Python client (script also included). The sample was selected among articles which in April 2020 had at least one citation of an academic paper (using the "cite journal" template) for which OAbot (through Unpaywall data) had found a green open access URL to add (gratis open access, not necessarily libre open access). The data show that the top 1 % most visited articles received 30 % of the visits: over 500 million in the year, corresponding to 1 million potential citation link clicks to distribute across all references, assuming a 0.2 % click-through rate per Piccardi et al. (2020).
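
The arithmetic behind the click estimate above can be reproduced with a short sketch. The function names and the toy pageview list are assumptions made for this example; only the 500 million visits and the 0.2 % click-through rate come from the description.

```python
def estimated_clicks(pageviews, click_through_rate):
    """Estimate citation-link clicks as pageviews times click-through rate."""
    return round(pageviews * click_through_rate)


def top_share(views, fraction):
    """Share of all visits received by the most visited `fraction` of articles."""
    ranked = sorted(views, reverse=True)
    top_count = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:top_count]) / total if total else 0.0


# Figures from the description: 500 million visits, 0.2 % click-through rate.
clicks = estimated_clicks(500_000_000, 0.002)
print(clicks)

# Toy data (made up): ten articles, the top one holding 30 of 100 visits.
toy_views = [30, 10, 10, 10, 10, 10, 10, 10, 0, 0]
print(top_share(toy_views, 0.1))
```

With the figures from the description, 500,000,000 × 0.002 gives the 1 million potential clicks quoted above.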
Data for: Wikipedia as a gateway to biomedical research
zenodo.org
data.niaid.nih.gov
application/gzip, txt
Updated Sep 24, 2020
Cite
Joe Wass; Ryan Steinberg; Lauren Maggio (2020). Data for: Wikipedia as a gateway to biomedical research [Dataset]. http://doi.org/10.5281/zenodo.825222
Available download formats: application/gzip, txt
Unique identifier
https://doi.org/10.5281/zenodo.825222
Dataset updated
Sep 24, 2020
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Joe Wass; Ryan Steinberg; Lauren Maggio
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by presence of a Digital Object Identifier (DOI), from Wikipedia with a focus on medical citations.

This dataset includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not each DOI is biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813

This dataset also includes a listing from Crossref of URL-decoded DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636
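
As a small illustration of what URL decoding a DOI means, the sketch below uses Python's standard library. The helper name `decode_doi` and the sample DOI are made up for this example and are not taken from the dataset or from the CrossRef/logppj code.

```python
from urllib.parse import unquote


def decode_doi(encoded):
    """Return the DOI with percent-escapes such as %2F decoded."""
    return unquote(encoded)


# Hypothetical percent-encoded DOI as it might appear in a referral URL.
decoded = decode_doi("10.1000%2Fexample%28123%29")
print(decoded)
```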
English Wikipedia citations with possible SemanticScholar URLs as found by...
data.niaid.nih.gov
zenodo.org
Updated Nov 12, 2021
Cite
Leva, Federico (2021). English Wikipedia citations with possible SemanticScholar URLs as found by Unpaywall and OAbot [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4725575
Dataset updated
Nov 12, 2021
Dataset authored and provided by
Leva, Federico
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Dump of 26025 JSON files, each containing one or more URL suggestions for the citations of one English Wikipedia article, at least one of which is a SemanticScholar URL for a green open access copy of an academic work. Compiled between December 2020 and April 2021 with Dissemin's OAbot, using various API sources but mostly Unpaywall data.
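
A dump like this could be scanned for SemanticScholar links with a sketch along the following lines. The description does not document the JSON schema, so the code makes no assumption about field names and simply walks every string value; the function names, the `*.json` file pattern and the sample record are all hypothetical.

```python
import json
from pathlib import Path


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)


def semantic_scholar_urls(node):
    """Return all string values that mention semanticscholar.org."""
    return [s for s in iter_strings(node) if "semanticscholar.org" in s.lower()]


def scan_dump(directory):
    """Map each JSON file name in `directory` to the matching URLs it contains."""
    results = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            urls = semantic_scholar_urls(json.load(handle))
        if urls:
            results[path.name] = urls
    return results


# Hypothetical record; the real files may be structured differently.
sample_record = {
    "page": "Example article",
    "suggestions": [
        {"url": "https://pdfs.semanticscholar.org/abcd/1234.pdf"},
        {"url": "https://example.org/paper.pdf"},
    ],
}
print(semantic_scholar_urls(sample_record))
```

Walking every string keeps the sketch independent of the actual layout, at the cost of also matching URLs that appear outside the suggestion fields.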
OdiEnCorp 1.0
live.european-language-grid.eu
lindat.mff.cuni.cz
binary format
Updated Nov 25, 2018
Cite
(2018). OdiEnCorp 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1252
Available download formats: binary format
Dataset updated
Nov 25, 2018
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Data
----
We have collected English-Odia parallel and monolingual data from the
available public websites for NLP research in Odia.

The parallel corpus is drawn from the English-Odia parallel Bible, the
Odia digital library, and Odisha Government websites. It covers the
Bible, literature, and the Government of Odisha and its policies. We
have processed the raw data collected from the websites, performed
alignments (a mix of manual and automatic alignments) and released the
corpus in a form ready for various NLP tasks.

The Odia monolingual data consists of Odia-Wikipedia and Odia e-magazine
websites. Because the major portion of data is extracted from
Odia-Wikipedia, it covers all kinds of domains. The e-magazines data
mostly cover the literature domain. We have preprocessed the monolingual
data including de-duplication, text normalization, and sentence
segmentation to make it ready for various NLP tasks.

Corpus Formats
--------------
Both corpora are in simple tab-delimited plain text files.

The parallel corpus files have three columns:
- the original book/source of the sentence pair
- the English sentence
- the corresponding Odia sentence

The monolingual corpus has a varying number of columns:
- each line corresponds to one paragraph (or related unit) of the
original source
- each tab-delimited unit corresponds to one sentence in the paragraph
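
The two layouts described above could be read with a sketch like the one
below. The function names, the SentencePair type and the sample lines are
assumptions made for this example; the corpus itself does not ship this code.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SentencePair(NamedTuple):
    source: str   # original book/source of the sentence pair
    english: str  # the English sentence
    odia: str     # the corresponding Odia sentence


def read_parallel(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Parse three-column, tab-delimited lines of the parallel corpus."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 columns, got {len(fields)}")
        yield SentencePair(*fields)


def read_monolingual(lines: Iterable[str]) -> Iterator[List[str]]:
    """Parse monolingual lines: one paragraph per line, one sentence per tab unit."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield [sentence for sentence in line.split("\t") if sentence]


# Made-up sample lines for illustration only.
pairs = list(read_parallel(["Bible\tIn the beginning\tODIA SENTENCE\n"]))
paragraphs = list(read_monolingual(["s1\ts2\ts3\n", "\n", "s4\n"]))
print(pairs[0].english, [len(p) for p in paragraphs])
```

Both readers accept any iterable of lines, so an open file handle (opened
with encoding="utf-8") can be passed in directly.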

Data Statistics
----------------
The statistics of the current release are given below.

Parallel Corpus Statistics
---------------------------

Dataset  Sentences  #English tokens  #Odia tokens
-------  ---------  ---------------  ------------
Train        27136           706567        604147
Dev            948            21912         19513
Test          1262            28488         24365
-------  ---------  ---------------  ------------
Total        29346           756967        648025

Domain Level Statistics
------------------------

Domain               Sentences  #English tokens  #Odia tokens
-------------------  ---------  ---------------  ------------
Bible                    29069           756861        640157
Literature                 424             7977          6611
Government policies        204             1411          1257
-------------------  ---------  ---------------  ------------
Total                    29697           766249        648025

Monolingual Corpus Statistics
-----------------------------

Paragraphs  Sentences  #Odia tokens
----------  ---------  ------------
     71698     221546       2641308

Domain Level Statistics
-----------------------

Domain          Paragraphs      Sentences  #Odia tokens
--------------  --------------  ---------  ------------
General (wiki)  30468 (42.49%)     102085       1320367
Literature      41230 (57.50%)     119461       1320941
--------------  --------------  ---------  ------------
Total           71698              221546       2641308

Citation
--------

If you use this corpus, please cite it directly (see above), and please also cite the following paper:

Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
Author: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
Series: Smart Innovation, Systems and Technologies (SIST)
Publisher: Springer Singapore