5 datasets found
  1. A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic

    • live.european-language-grid.eu
    • zenodo.org
    txt
    Updated Sep 8, 2022
    Cite: (2022). A meta analysis of Wikipedia's coronavirus sources during the COVID-19 pandemic [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7806
    Explore at: txt (available download formats)
    Dataset updated: Sep 8, 2022
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    At the height of the coronavirus pandemic, on the last day of March 2020, Wikipedia in all languages broke its record for the most traffic in a single day. Since the outbreak of the COVID-19 pandemic at the start of January, tens if not hundreds of millions of people have come to Wikipedia to read, and in some cases also contribute, knowledge, information and data about the virus in an ever-growing pool of articles. Our study focuses on the scientific backbone behind the content people across the world read: which sources informed Wikipedia's coronavirus content, and how the scientific research in this field was represented on Wikipedia. Using citations as a readout, we map how COVID-19 related research was used in Wikipedia and analyse what happened to it before and during the pandemic. Understanding how scientific and medical information was integrated into Wikipedia, and which sources informed the COVID-19 content, is key to understanding the digital knowledge ecosphere during the pandemic.

    To delimit the corpus of Wikipedia articles containing a Digital Object Identifier (DOI), we applied two different strategies. First, we scraped every Wikipedia page from the COVID-19 Wikipedia project (about 3,000 pages) and filtered them to keep only pages containing DOI citations. For our second strategy, we ran a EuroPMC search on Covid-19, SARS-CoV2 and SARS-nCoV19 (30,000 scientific papers, reviews and preprints), selected scientific papers from 2019 onwards, and compared them to the citations extracted from the English Wikipedia dump of May 2020 (2,000,000 DOIs). This search led to 231 Wikipedia articles containing at least one citation from the EuroPMC search or belonging to the Wikipedia COVID-19 project pages containing DOIs. Next, from our corpus of 231 Wikipedia articles we extracted DOIs, PMIDs, ISBNs, websites and URLs using a set of regular expressions. Subsequently, we computed several statistics for each Wikipedia article and retrieved Altmetric, CrossRef and EuroPMC information for each DOI. Finally, our method produced tables of annotated citations and of the information extracted from each Wikipedia article, such as books, websites and newspapers.

    Files used as input and the information extracted on Wikipedia's COVID-19 sources are presented in this archive. See the WikiCitationHistoRy GitHub repository for the R code and the other bash/python script utilities related to this project.
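    The identifier extraction step described above boils down to running a set of regular expressions over each article's wikitext. A minimal sketch in Python is shown below; the patterns and function are illustrative assumptions, not the project's actual WikiCitationHistoRy code.

        import re

        # Illustrative patterns only; the real pipeline has its own R/bash/python tooling.
        DOI_RE = re.compile(r'10\.\d{4,9}/[^\s|}\]"<>]+')
        PMID_RE = re.compile(r'pmid\s*=\s*(\d+)', re.IGNORECASE)
        ISBN_RE = re.compile(r'isbn\s*=\s*([\dXx -]+)', re.IGNORECASE)

        def extract_identifiers(wikitext: str) -> dict:
            """Pull DOI, PMID and ISBN strings out of raw wikitext."""
            return {
                "dois": sorted(set(DOI_RE.findall(wikitext))),
                "pmids": sorted(set(PMID_RE.findall(wikitext))),
                "isbns": sorted(i.strip() for i in set(ISBN_RE.findall(wikitext))),
            }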

  2. Yearly pageviews of English Wikipedia articles with potential links to green open access scholarly articles

    • zenodo.org
    • data.niaid.nih.gov
    csv, text/x-python
    Updated Nov 16, 2020
    Cite: Federico Leva (2020). Yearly pageviews of English Wikipedia articles with potential links to green open access scholarly articles [Dataset]. http://doi.org/10.5281/zenodo.3783468
    Explore at: csv, text/x-python (available download formats)
    Dataset updated: Nov 16, 2020
    Dataset provided by: Zenodo, http://zenodo.org/
    Authors: Federico Leva
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Number of visits in 2019 for a sample of 23,462 English Wikipedia articles that contain references to academic sources for which a green open access copy is available but not yet used. The consultation statistics were retrieved from the Wikimedia pageviews API using the Python client (script also included). The sample was selected among articles which, in April 2020, had at least one citation of an academic paper (using the "cite journal" template) for which OAbot (through Unpaywall data) had found a green open access URL to add (gratis open access, not necessarily libre open access). The data show that the top 1% most-visited articles received 30% of the visits: over 500 million in the year, corresponding to about 1 million potential citation-link clicks to distribute across all references, assuming a 0.2% click-through rate per Piccardi et al. (2020).
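    For reference, pageview counts like those in this sample can be retrieved from the public Wikimedia Pageviews REST API; the sketch below illustrates that data source rather than the dataset's bundled script, and the article title and User-Agent string are placeholder values.

        import requests

        # Monthly per-article pageviews for 2019 from the Wikimedia Pageviews REST API.
        API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
               "en.wikipedia/all-access/all-agents/{title}/monthly/20190101/20191231")

        def yearly_views(title: str) -> int:
            resp = requests.get(API.format(title=title.replace(" ", "_")),
                                headers={"User-Agent": "pageviews-example/0.1"})
            resp.raise_for_status()
            return sum(item["views"] for item in resp.json()["items"])

        views_2019 = yearly_views("Open access")  # placeholder article title
        # The description's headline figure: 500 million visits at a 0.2% click-through
        # rate corresponds to roughly 1 million potential citation link clicks.
        print(500_000_000 * 0.002)  # 1000000.0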

  3. Data for: Wikipedia as a gateway to biomedical research

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, txt
    Updated Sep 24, 2020
    Cite: Joe Wass; Ryan Steinberg; Lauren Maggio (2020). Data for: Wikipedia as a gateway to biomedical research [Dataset]. http://doi.org/10.5281/zenodo.825222
    Explore at: application/gzip, txt (available download formats)
    Dataset updated: Sep 24, 2020
    Dataset provided by: Zenodo, http://zenodo.org/
    Authors: Joe Wass; Ryan Steinberg; Lauren Maggio
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Wikipedia has been described as a gateway to knowledge. However, the extent to which this gateway ends at Wikipedia or continues via its supporting citations is unknown. This dataset was used to establish benchmarks for the relative distribution and referral (click) rate of citations, as indicated by the presence of a Digital Object Identifier (DOI), from Wikipedia, with a focus on medical citations.

    This dataset includes, for each day in August 2016, a listing of all DOIs present in the English-language version of Wikipedia and whether or not each DOI is biomedical in nature. Source code for these data is available at: Ryan Steinberg. (2017, July 9). Lane-Library/wiki-extract: initial Zenodo/DOI release. Zenodo. http://doi.org/10.5281/zenodo.824813

    This dataset also includes a listing from Crossref of URL-decoded DOIs that were referred from Wikipedia in August 2016 (Wikipedia_referred_DOI). Source code for these data is available at: Joe Wass. (2017, July 4). CrossRef/logppj: Initial DOI registered release. Zenodo. http://doi.org/10.5281/zenodo.822636
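    "URL-decoded" here simply means that the percent-encoding found in referral logs has been reversed before the DOIs were listed; a one-line illustration (the DOI string is made up):

        from urllib.parse import unquote

        # Percent-encoded DOI as it might appear in a referral log (made-up example).
        encoded = "10.1234%2Fexample.doi.2016"
        print(unquote(encoded))  # -> 10.1234/example.doi.2016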

  4. English Wikipedia citations with possible SemanticScholar URLs as found by Unpaywall and OAbot

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 12, 2021
    Cite: Leva, Federico (2021). English Wikipedia citations with possible SemanticScholar URLs as found by Unpaywall and OAbot [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4725575
    Dataset updated: Nov 12, 2021
    Dataset authored and provided by: Leva, Federico
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dump of 26,025 JSON files, each containing one or more URL suggestions for the citations of one English Wikipedia article, at least one of which is a SemanticScholar URL for a green open access copy of an academic work. Compiled between December 2020 and April 2021 with Dissemin's OAbot, using various API sources but mostly Unpaywall data.
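    The per-article JSON schema is not spelled out here, so a schema-agnostic scan such as the sketch below could be used to surface the SemanticScholar URLs in each file; the dump directory name is a placeholder assumption.

        import json
        from pathlib import Path

        def iter_strings(node):
            """Yield every string value nested anywhere in a parsed JSON document."""
            if isinstance(node, str):
                yield node
            elif isinstance(node, dict):
                for value in node.values():
                    yield from iter_strings(value)
            elif isinstance(node, list):
                for value in node:
                    yield from iter_strings(value)

        # "oabot_dump" is a placeholder for wherever the archive is unpacked.
        for path in Path("oabot_dump").glob("*.json"):
            data = json.loads(path.read_text(encoding="utf-8"))
            s2_urls = {s for s in iter_strings(data) if "semanticscholar.org" in s}
            if s2_urls:
                print(path.name, len(s2_urls))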

  5. OdiEnCorp 1.0

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Nov 25, 2018
    Cite: (2018). OdiEnCorp 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1252
    Explore at: binary format (available download formats)
    Dataset updated: Nov 25, 2018
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Data
    ----

    We have collected English-Odia parallel and monolingual data from publicly available websites for NLP research in Odia.

    The parallel corpus consists of the English-Odia parallel Bible, the Odia digital library, and Odisha Government websites. It covers the Bible, literature, and the Government of Odisha and its policies. We have processed the raw data collected from the websites, performed alignments (a mix of manual and automatic alignments), and released the corpus in a form ready for various NLP tasks.

    The Odia monolingual data consists of Odia-Wikipedia and Odia e-magazine websites. Because the major portion of the data is extracted from Odia-Wikipedia, it covers all kinds of domains. The e-magazine data mostly covers the literature domain. We have preprocessed the monolingual data, including de-duplication, text normalization, and sentence segmentation, to make it ready for various NLP tasks.
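    As an illustration of the kind of monolingual preprocessing mentioned above (de-duplication, text normalization, sentence segmentation), a minimal Python sketch follows; the whitespace normalization and the naive split on the Odia danda are simplifying assumptions, not the corpus's actual pipeline.

        import re

        def normalize(text: str) -> str:
            """Collapse whitespace; a stand-in for the corpus's real text normalization."""
            return re.sub(r"\s+", " ", text).strip()

        def split_sentences(paragraph: str):
            """Naive split on Odia danda (।), '.', '!' and '?'; illustrative only."""
            parts = re.split(r"(?<=[।.!?])\s+", paragraph)
            return [p for p in parts if p]

        def deduplicate(paragraphs):
            """Drop exact duplicate paragraphs while preserving order."""
            seen = set()
            for para in map(normalize, paragraphs):
                if para and para not in seen:
                    seen.add(para)
                    yield para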

    Corpus Formats
    --------------

    Both corpora are in simple tab-delimited plain text files.

    The parallel corpus files have three columns (a minimal reader sketch follows this section):
    - the original book/source of the sentence pair
    - the English sentence
    - the corresponding Odia sentence

    The monolingual corpus has a varying number of columns:
    - each line corresponds to one paragraph (or related unit) of the original source
    - each tab-delimited unit corresponds to one sentence in the paragraph
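    The sketch below reads the three-column parallel files described above; the file name is a placeholder, as the released archive defines its own file names.

        def read_parallel(path: str):
            """Yield (source, English, Odia) triples from a tab-delimited corpus file."""
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    source, english, odia = line.rstrip("\n").split("\t")
                    yield source, english, odia

        for source, en, od in read_parallel("odiencorp.train.tsv"):  # placeholder name
            print(source, en, od, sep=" | ")
            break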

    Data Statistics
    ---------------

    The statistics of the current release are given below.

    Parallel Corpus Statistics
    --------------------------

    Dataset   Sentences   #English tokens   #Odia tokens
    -------   ---------   ---------------   ------------
    Train         27136            706567         604147
    Dev             948             21912          19513
    Test           1262             28488          24365
    -------   ---------   ---------------   ------------
    Total         29346            756967         648025

    Domain Level Statistics
    -----------------------

    Domain                Sentences   #English tokens   #Odia tokens
    -------------------   ---------   ---------------   ------------
    Bible                     29069            756861         640157
    Literature                  424              7977           6611
    Government policies         204              1411           1257
    -------------------   ---------   ---------------   ------------
    Total                     29697            766249         648025

    Monolingual Corpus Statistics
    -----------------------------

    Paragraphs   Sentences   #Odia tokens
    ----------   ---------   ------------
         71698      221546        2641308

    Domain Level Statistics
    -----------------------

    Domain           Paragraphs       Sentences   #Odia tokens
    --------------   --------------   ---------   ------------
    General (wiki)   30468 (42.49%)      102085        1320367
    Literature       41230 (57.50%)      119461        1320941
    --------------   --------------   ---------   ------------
    Total            71698               221546        2641308

    Citation
    --------

    If you use this corpus, please cite it directly (see above), but please also cite the following paper:

    Title: OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation
    Authors: Shantipriya Parida, Ondrej Bojar, and Satya Ranjan Dash
    Published in: Proceedings of the Third International Conference on Smart Computing & Informatics (SCI) 2018
    Series: Smart Innovation, Systems and Technologies (SIST)
    Publisher: Springer Singapore
