100+ datasets found
  1. POCI CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Dec 27, 2022
    + more versions
    Cite
    OpenCitations (2022). POCI CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.21776351.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 27, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. Each line of the CSV file defines one citation and includes the following information:

    [field "oci"] the Open Citation Identifier (OCI) for the citation;
    [field "citing"] the PMID of the citing entity;
    [field "cited"] the PMID of the cited entity;
    [field "creation"] the creation date of the citation (i.e. the publication date of the citing entity);
    [field "timespan"] the time span of the citation (i.e. the interval between the publication dates of the cited and citing entities);
    [field "journal_sc"] whether the citation is a journal self-citation (i.e. the citing and cited entities are published in the same journal);
    [field "author_sc"] whether the citation is an author self-citation (i.e. the citing and cited entities have at least one author in common).

    This version of the dataset contains:

    717,654,703 citations; 26,024,862 bibliographic resources.

    The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available on its official webpage.
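
    As an illustration, the per-line fields described above can be consumed with Python's csv module. A minimal sketch, streaming a single row whose values (OCI, PMIDs, dates) are invented for illustration:

```python
import csv
import io

# Minimal sketch: stream POCI-style rows and tally the self-citation
# flags. The header matches the field list above; the row values are
# invented placeholders, not real identifiers.
sample = io.StringIO(
    "oci,citing,cited,creation,timespan,journal_sc,author_sc\n"
    "oci-value-placeholder,11111111,22222222,2020-03,P1Y6M,no,yes\n"
)
journal_sc = 0
author_sc = 0
for row in csv.DictReader(sample):
    journal_sc += row["journal_sc"] == "yes"
    author_sc += row["author_sc"] == "yes"
```

    On the real 50 GB file, the same loop reads line by line, so the archive never has to fit in memory.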

  2. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • aws-databank-alb.library.illinois.edu
    • databank.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    U.S. National Institutes of Health (NIH)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)
    ----------------------------------------------------------------------
    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

    ## Introduction

    This dataset was created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed as the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle (2nd) authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in NumPy format used for plotting purposes; v4.reviewer contains models for the analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics. However, we provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results comparable to those reported in our analysis. Furthermore, we have freely shared the datasets listed below, which can be combined with the citation data from Clarivate Analytics to re-create the dataset used in our experiments. If you use any of these datasets, please cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. Check here for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
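
    Given the layout described above (a separate header file plus tab-separated data files), loading could look like this sketch. The column names below are hypothetical stand-ins; the real ones live in Training_data_2002_2005_pmc_pair_txt.header.txt and are documented in COLUMNS_DESC.txt:

```python
import csv
import io

# Sketch: join the separate header file with a tab-separated data file.
# Column names here are hypothetical placeholders for the real header.
header_file = io.StringIO("citing_pmid\tcited_pmid\tis_self_citation\n")
data_file = io.StringIO("111\t222\t1\n333\t444\t0\n")

columns = next(csv.reader(header_file, delimiter="\t"))
rows = [dict(zip(columns, rec)) for rec in csv.reader(data_file, delimiter="\t")]
```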

  3. Citations to software and data in Zenodo via open sources

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    csv
    Updated Jan 24, 2020
    Cite
    Stephanie van de Sandt; Alex Ioannidis; Lars Holm Nielsen (2020). Citations to software and data in Zenodo via open sources [Dataset]. http://doi.org/10.5281/zenodo.3482927
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephanie van de Sandt; Alex Ioannidis; Lars Holm Nielsen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In January 2019, the Asclepias Broker harvested citation links to Zenodo objects from three discovery systems: the NASA Astrophysics Datasystem (ADS), Crossref Event Data and Europe PMC. Each row of our dataset represents one unique link between a citing publication and a Zenodo DOI. Both endpoints are described by basic metadata. The second dataset contains usage metrics for every cited Zenodo DOI of our data sample.
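
    Since each row represents one link from a citing publication to a Zenodo DOI, per-DOI citation counts reduce to a simple tally. A sketch with invented DOI values:

```python
from collections import Counter

# Sketch: one row per (citing publication, cited Zenodo DOI) link, as
# in the dataset described above. All DOI values here are invented.
links = [
    ("10.1000/paper-a", "10.5281/zenodo.1"),
    ("10.1000/paper-b", "10.5281/zenodo.1"),
    ("10.1000/paper-c", "10.5281/zenodo.2"),
]
citations_per_doi = Counter(cited for _citing, cited in links)
```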

  4. SemanticCite-Dataset

    • huggingface.co
    Updated Nov 30, 2025
    Cite
    Seb Haan (2025). SemanticCite-Dataset [Dataset]. https://huggingface.co/datasets/sebsigma/SemanticCite-Dataset
    Explore at:
    Dataset updated
    Nov 30, 2025
    Authors
    Seb Haan
    License

    Attribution-NonCommercial 4.0 International (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    SemanticCite Dataset

    The SemanticCite Dataset is a collection of citation-reference pairs with expert annotations for training and evaluating citation verification systems. Each entry contains a citation claim, reference document context, and detailed classification with reasoning.

      Dataset Format
    

    The dataset is provided as a JSON file where each entry contains the following structure:

      Input Fields
    

    claim: The core assertion extracted from the citation text… See the full description on the dataset page: https://huggingface.co/datasets/sebsigma/SemanticCite-Dataset.
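
    Reading the JSON structure described above might look like this sketch. Every field name other than "claim" is an assumption here; consult the dataset page for the actual schema:

```python
import json

# Sketch of the described format: a JSON file of entries, each with a
# citation claim, reference context, and a classification with
# reasoning. Field names other than "claim" are hypothetical.
raw = json.dumps([
    {
        "claim": "Method X improves accuracy over the baseline.",
        "reference_context": "Excerpt from the cited document ...",
        "classification": {"label": "supported", "reasoning": "..."},
    }
])
entries = json.loads(raw)
labels = [e["classification"]["label"] for e in entries]
```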

  5. All Computer Science Papers @ arXiv.org -- A High-Quality Gold Standard for Citation-based Tasks

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Färber, Michael; Thiemann, Alexander; Jatowt, Adam (2020). All Computer Science Papers @ arXiv.org -- A High-Quality Gold Standard for Citation-based Tasks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3535001
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Karlsruhe Institute of Technology
    Authors
    Färber, Michael; Thiemann, Alexander; Jatowt, Adam
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose a newly-created gold standard data set for citation-based tasks. This gold standard is based on all computer science papers in arXiv.org.

    Abstract. Analyzing and recommending citations with their specific citation contexts have recently received much attention due to the growing number of available publications. Although data sets such as CiteSeerX have been created for evaluating approaches for such tasks, those data sets exhibit striking defects. This is understandable if one considers that both information extraction and entity linking as well as entity resolution need to be performed. In this paper, we propose a new evaluation data set for citation-dependent tasks based on arXiv.org publications. Our data set is characterized by the fact that it exhibits almost zero noise in the extracted content and that all citations are linked to their correct publications. Besides the pure content, available on a sentence-basis, cited publications are annotated directly in the text via global identifiers. As far as possible, referenced publications are further linked to DBLP. Our data set consists of over 15M sentences and is freely available for research purposes. It can be used for training and testing citation-based tasks, such as recommending citations, determining the functions or importance of citations, and summarizing documents based on their citations.

    More information can be found in our publication "A High-Quality Gold Standard for Citation-based Tasks" (LREC'18).

    You can cite the data set as follows:

    @inproceedings{DBLP:conf/lrec/0001TJ18,
      author    = {Michael F{\"{a}}rber and Alexander Thiemann and Adam Jatowt},
      title     = "{A High-Quality Gold Standard for Citation-based Tasks}",
      booktitle = "{Proceedings of the Eleventh International Conference on Language Resources and Evaluation}",
      series    = "{LREC'18}",
      location  = "{Miyazaki, Japan}",
      year      = {2018},
      url       = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/283.html}
    }

  6. Data from: Data reuse and the open data citation advantage

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    zip
    Updated Oct 1, 2013
    Cite
    Heather A. Piwowar; Todd J. Vision (2013). Data reuse and the open data citation advantage [Dataset]. http://doi.org/10.5061/dryad.781pv
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2013
    Dataset provided by
    National Evolutionary Synthesis Center
    Authors
    Heather A. Piwowar; Todd J. Vision
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

    Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

    Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.

  7. GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 9, 2019
    Cite
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel (2019). GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data] [Dataset]. http://doi.org/10.7910/DVN/LXQXAO
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel
    License

    Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LXQXAO

    Description

    Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown the high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data – which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT, a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the Citation Style Language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.

  8. Bibliometric dataset: list of highly cited papers in bibliometric

    • zenodo.org
    • data.niaid.nih.gov
    bin, png, txt
    Updated Jul 25, 2024
    + more versions
    Cite
    Dasapta Erwin Irawan; Dini Sofiani Permatasari; Lusia Marliana Nurani (2024). Bibliometric dataset: list of highly cited papers in bibliometric [Dataset]. http://doi.org/10.5281/zenodo.2544533
    Explore at:
    Available download formats: png, bin, txt
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dasapta Erwin Irawan; Dini Sofiani Permatasari; Lusia Marliana Nurani
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation

    My motivation in providing this dataset is to invite more interest from Indonesian librarians in understanding their diverse fields of study.

    Method

    This dataset was harvested on 19 January 2019 from the Scopus database provided by The University of Sydney. I used the keyword "bibliometric" in titles, sorted the search results by total citations, then downloaded the first 2000 papers as an RIS file. This file can be converted to other formats, such as BibTeX or CSV, using a reference manager such as Zotero.

    Visualisations

    I did two small visualisations using the following options:

    1. "create a map based on bibliographic data"
    2. "create a map based on text data"

    Both mappings were done using the VOSviewer open-source app from CWTS, Leiden University.

  9. Citation Graph

    • kaggle.com
    zip
    Updated Jun 30, 2020
    Cite
    Caselaw Access Project (2020). Citation Graph [Dataset]. https://www.kaggle.com/datasets/harvardlil/citation-graph
    Explore at:
    Available download formats: zip (306688738 bytes)
    Dataset updated
    Jun 30, 2020
    Authors
    Caselaw Access Project
    Description

    Context

    The Caselaw Access Project makes 40 million pages of U.S. caselaw freely available online from the collections of Harvard Law School Library.

    The CAP citation graph shows the connections between cases in the Caselaw Access Project dataset. You can use the citation graph to answer questions like "what is the most influential case?" and "what jurisdictions cite most often to this jurisdiction?".

    Learn More: https://case.law/download/citation_graph/

    Access Limits: https://case.law/api/#limits

    Content

    This dataset includes citations and metadata for the CAP citation graph in CSV format.
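
    The "most influential case" question above can be approximated as an in-degree count over the citation edges. A sketch, with hypothetical column names and case IDs (see the CAP download page for the real schema):

```python
import csv
import io
from collections import Counter

# Sketch: treat each CSV row as one citing -> cited case pair and rank
# cases by in-degree. Column names and case IDs are hypothetical.
sample = io.StringIO(
    "citing_case_id,cited_case_id\n"
    "1,3\n"
    "2,3\n"
    "4,5\n"
)
in_degree = Counter(row["cited_case_id"] for row in csv.DictReader(sample))
most_cited, count = in_degree.most_common(1)[0]
```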

    Acknowledgements

    The Caselaw Access Project is by the Library Innovation Lab at Harvard Law School Library.

    Inspiration

    People are using CAP data to create research, applications, and more. We're sharing examples in our gallery.

    Cite Grid is the first visualization we've created based on data from our citation graph.

    Have something to share? We're excited to hear about it.

  10. Reference count CSV dataset of all bibliographic resources in OpenCitations Index

    • figshare.com
    zip
    Updated Dec 11, 2023
    Cite
    OpenCitations (2023). Reference count CSV dataset of all bibliographic resources in OpenCitations Index [Dataset]. http://doi.org/10.6084/m9.figshare.24747498.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the latest release of the OpenCitations Index (https://opencitations.net/download) – November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) identifies the entity, while the second column (references) indicates its corresponding number of references.
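
    The two-column layout (omid, references) can be consumed directly with Python's csv module. A sketch with invented OMID values:

```python
import csv
import io

# Sketch of the two-column layout described above. The OMID values
# are invented for illustration.
sample = io.StringIO(
    "omid,references\n"
    "omid:br/0601,34\n"
    "omid:br/0602,0\n"
)
ref_counts = {row["omid"]: int(row["references"]) for row in csv.DictReader(sample)}
```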

  11. Stanford Drone Dataset

    • academictorrents.com
    • opendatalab.com
    bittorrent
    Updated Apr 27, 2019
    + more versions
    Cite
    A. Robicquet and A. Sadeghian and A. Alahi and S. Savarese (2019). Stanford Drone Dataset [Dataset]. https://academictorrents.com/details/01f95ea32e160e6c251ea55a87bd5a24b23cb03d
    Explore at:
    Available download formats: bittorrent (71002113639 bytes)
    Dataset updated
    Apr 27, 2019
    Dataset authored and provided by
    A. Robicquet and A. Sadeghian and A. Alahi and S. Savarese
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    When humans navigate a crowded space such as a university campus or the sidewalks of a busy street, they follow common-sense rules based on social etiquette. In order to enable the design of new algorithms that can fully take advantage of these rules to better solve tasks such as target tracking or trajectory forecasting, we need access to better data. To that end, we contribute the very first large-scale dataset (to the best of our knowledge) that collects images and videos of various types of agents (not just pedestrians, but also bicyclists, skateboarders, cars, buses, and golf carts) navigating a real-world outdoor environment such as a university campus. In the above images, pedestrians are labeled in pink, bicyclists in red, skateboarders in orange, and cars in green.

    ### CITATION

    If you find this dataset useful, please cite this paper (and

  12. Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study"

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 8, 2022
    Cite
    Ivan Heibi; Silvio Peroni (2022). Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study" [Dataset]. http://doi.org/10.5281/zenodo.4323221
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Heibi; Silvio Peroni
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This document contains the datasets and visualizations generated by applying the methodology defined in our work "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data in this repository are based on the first two steps of the methodology. The first step ("Data gathering") builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step ("Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.

    Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.

    Data gathering

    The data generated by this step are stored in "data/":

    1. "cits_features.csv": a dataset containing all the entities (rows in the CSV) which have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), the title ("title"), the venue identifier ("source_id"), the title of the venue ("source_title"), yes/no value in case the entity is retracted as well ("retracted"), the subject area ("area"), the subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value to denote whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention").
      Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).
    2. "cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citations context ("intext_citation.context") for each citing entity identified using the DOI value ("doi").
      Note: the data keep their original license (the one provided by their publisher). This dataset is provided in order to favor the reproducibility of the results obtained in our work.

    Topic modeling
    We run a topic modeling analysis on the textual features gathered (i.e. abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling has been done using MITAO, a tool for mashing up automatic text analysis tools, and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two different folders, "abstracts/" for the abstracts, and "intext_cit/" for the in-text citation contexts. Both the directories contain the following directories/files:

    1. "mitao_workflows/": the workflows of MITAO. These are JSON files that could be reloaded in MITAO to reproduce the results following the same workflows.

    2. "corpus_and_dictionary/": it contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

    3. "coherence/coherence.csv": the coherence scores of several topic models trained with the number of topics ranging from 1 to 40.

    4. "datasets_and_views/": the datasets and visualizations generated using MITAO.
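
    A common use of the coherence scores above is picking the topic count that maximizes coherence. A sketch assuming hypothetical column names ("num_topics", "coherence") in coherence/coherence.csv:

```python
import csv
import io

# Sketch: choose the number of topics with the highest coherence
# score. The column names are hypothetical; inspect the real
# coherence/coherence.csv for its actual header.
sample = io.StringIO(
    "num_topics,coherence\n"
    "1,0.31\n"
    "2,0.42\n"
    "3,0.39\n"
)
best = max(csv.DictReader(sample), key=lambda r: float(r["coherence"]))
```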

    References

    1. Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0
    2. Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

    3. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3

  13. UIEB Dataset-reference

    • kaggle.com
    zip
    Updated Aug 3, 2023
    Cite
    kaggle6 (2023). UIEB Dataset-reference [Dataset]. https://www.kaggle.com/datasets/larjeck/uieb-dataset-reference
    Explore at:
    Available download formats: zip (823157904 bytes)
    Dataset updated
    Aug 3, 2023
    Authors
    kaggle6
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    950 algorithmically restored underwater images, used for image enhancement, image generation, etc.

  14. News Category Dataset

    • kaggle.com
    zip
    Updated Sep 24, 2022
    Cite
    Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/
    Explore at:
    Available download formats: zip (27829769 bytes)
    Dataset updated
    Sep 24, 2022
    Authors
    Rishabh Misra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Please cite the dataset using the BibTeX provided in one of the following sections if you are using it in your research. Thank you!

    This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

    Content

    Each record in the dataset consists of the following attributes:

    • category: category in which the article was published.

    • headline: the headline of the news article.

    • authors: list of authors who contributed to the article.

    • link: link to the original news article.

    • short_description: abstract of the news article.

    • date: publication date of the article.

    There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

    • POLITICS: 35602

    • WELLNESS: 17945

    • ENTERTAINMENT: 17362

    • TRAVEL: 9900

    • STYLE & BEAUTY: 9814

    • PARENTING: 8791

    • HEALTHY LIVING: 6694

    • QUEER VOICES: 6347

    • FOOD & DRINK: 6340

    • BUSINESS: 5992

    • COMEDY: 5400

    • SPORTS: 5077

    • BLACK VOICES: 4583

    • HOME & LIVING: 4320

    • PARENTS: 3955
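
    Per-category counts like those above can be reproduced with the standard library. A minimal sketch, assuming the JSON Lines layout (one record per line) used by the dataset; the records here are invented samples following the attribute list above:

```python
import json
from collections import Counter

# Invented sample records; the real file has one JSON object per line.
sample_lines = [
    '{"category": "POLITICS", "headline": "A", "authors": "X",'
    ' "link": "", "short_description": "", "date": "2022-01-01"}',
    '{"category": "POLITICS", "headline": "B", "authors": "Y",'
    ' "link": "", "short_description": "", "date": "2022-01-02"}',
    '{"category": "COMEDY", "headline": "C", "authors": "Z",'
    ' "link": "", "short_description": "", "date": "2022-01-03"}',
]

# Count how many articles fall into each category.
counts = Counter(json.loads(line)["category"] for line in sample_lines)
print(counts.most_common())
```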

    Citation

    If you're using this dataset for your work, please cite the following articles:

    Citation in text format:

    1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).

    2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

    Citation in BibTeX format:

    @article{misra2022news,
      title = {News Category Dataset},
      author = {Misra, Rishabh},
      journal = {arXiv preprint arXiv:2209.11429},
      year = {2022}
    }

    @book{misra2021sculpting,
      author = {Misra, Rishabh and Grover, Jigyasa},
      year = {2021},
      month = {01},
      pages = {},
      title = {Sculpting Data for ML: The first act of Machine Learning},
      isbn = {9798585463570}
    }

    Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

    Acknowledgements

    This dataset was collected from HuffPost.

    Inspiration

    • Can you categorize news articles based on their headlines and short descriptions?

    • Do news articles from different categories have different writing styles?

    • A classifier trained on this dataset could be used on free text to identify the type of language being used.

    Want to contribute your own datasets?

    If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

    Other datasets

    Please also check out the following datasets collected by me:

  15. Cline Center Coup d’État Project Dataset

    • databank.illinois.edu
    Updated May 11, 2025
    + more versions
    Cite
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto (2025). Cline Center Coup d’État Project Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9651987_V7
    Explore at:
    Dataset updated
    May 11, 2025
    Authors
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coups d’état are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. There are only a limited number of datasets available to study these events (Powell and Thyne 2011, Marshall and Marshall 2019). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d’État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.

    Version 2.1.3 adds 19 additional coup events to the data set, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 as a conspiracy. Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022. Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of “dissident coup” had been dropped in error for coup_id: 00201062021. Version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup. Version 2.1.0 added 36 cases to the data set and removed two cases from the v2.0.0 data. This update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event.
    Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:

    • Reconciling missing event data

    • Removing events with irreconcilable event dates

    • Removing events with insufficient sourcing (each event needs at least two sources)

    • Removing events that were inaccurately coded as coup events

    • Removing variables that fell below the threshold of inter-coder reliability required by the project

    • Removing the spreadsheet ‘CoupInventory.xls’ because of inadequate attribution and citations in the event summaries

    • Extending the period covered from 1945-2005 to 1945-2019

    • Adding events from Powell and Thyne’s Coup Data (Powell and Thyne, 2011)
    Items in this Dataset

    1. Cline Center Coup d'État Codebook v.2.1.3 Codebook.pdf: This 15-page document describes the Cline Center Coup d’État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d’état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. Revised February 2024.

    2. Coup Data v2.1.3.csv: This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d’État Project. It contains 29 variables and 1000 observations. Revised February 2024.

    3. Source Document v2.1.3.pdf: This 325-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. Revised February 2024.

    4. README.md: This file contains useful information for the user about the dataset. It is a text file written in markdown language. Revised February 2024.
    Citation Guidelines

    1. To cite the codebook (or any other documentation associated with the Cline Center Coup d’État Project Dataset) please use the following citation: Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2024. “Cline Center Coup d’État Project Dataset Codebook”. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7

    2. To cite data from the Cline Center Coup d’État Project Dataset please use the following citation (filling in the correct date of access): Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Emilio Soto. 2024. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
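
    Once downloaded, the event CSV can be summarized with the Python standard library alone. A minimal sketch with invented column names and rows; the real 29 variables (including however the realized/unrealized/conspiracy outcome is encoded) are defined in the codebook:

```python
import csv
import io
from collections import Counter

# Invented excerpt standing in for "Coup Data v2.1.3.csv"; column
# names and values are assumptions for illustration only.
sample = """coup_id,country,year,outcome
0001,Example A,1999,realized
0002,Example B,2021,conspiracy
0003,Example C,2020,unrealized
0004,Example D,2019,realized
"""

# Tally events by outcome category.
outcomes = Counter(r["outcome"] for r in csv.DictReader(io.StringIO(sample)))
print(outcomes["realized"])
```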

  16. A dataset from a survey investigating disciplinary differences in data citation

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1 more
    Updated Jul 12, 2024
    Cite
    Ninkov, Anton Boudreau; Ripp, Chantal; Gregory, Kathleen; Peters, Isabella; Haustein, Stefanie (2024). A dataset from a survey investigating disciplinary differences in data citation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7555362
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    University of Ottawa
    ZBW Leibniz Information Center for Economics
    Université de Montréal
    Authors
    Ninkov, Anton Boudreau; Ripp, Chantal; Gregory, Kathleen; Peters, Isabella; Haustein, Stefanie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GENERAL INFORMATION

    Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation

    Date of data collection: January to March 2022

    Collection instrument: SurveyMonkey

    Funding: Alfred P. Sloan Foundation

    SHARING/ACCESS INFORMATION

    Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license

    Links to publications that cite or use the data:

    Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437

    Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266

    DATA & FILE OVERVIEW

    File List

    Filename: MDCDatacitationReuse2021Codebookv2.pdf (codebook)

    Filename: MDCDataCitationReuse2021surveydatav2.csv (dataset in CSV format)

    Filename: MDCDataCitationReuse2021surveydatav2.sav (dataset in SPSS format)

    Filename: MDCDataCitationReuseSurvey2021QNR.pdf (questionnaire)

    Additional related data collected that was not included in the current data package: Open ended questions asked to respondents

    METHODOLOGICAL INFORMATION

    Description of methods used for collection/generation of data:

    The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.

    The survey received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).

    Methods for processing the data:

    Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.

    Instrument- or software-specific information needed to interpret the data:

    The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format in CSV. The codebook is required to interpret the values.

    DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata

    Number of variables: 95

    Number of cases/rows: 2,492

    Missing data codes: 999 = Not asked

    Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
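
    When working with the CSV version, the documented missing-data code (999 = "Not asked") should be recoded before analysis. A minimal sketch with invented column names; the real variable names are in MDCDatacitationReuse2021Codebookv2.pdf:

```python
import csv
import io

# Invented two-column excerpt; 999 is the documented missing-data code.
sample = """q1,q2
1,999
999,3
"""

# Recode "999" to None so it is not mistaken for a real answer.
rows = [
    {k: (None if v == "999" else int(v)) for k, v in row.items()}
    for row in csv.DictReader(io.StringIO(sample))
]
print(rows)
```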

  17. NIST SAMATE Software Assurance Reference Dataset

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Sep 30, 2025
    Cite
    National Institute of Standards and Technology (2025). NIST SAMATE Software Assurance Reference Dataset [Dataset]. https://catalog.data.gov/dataset/nist-samate-software-assurance-reference-dataset
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset provides the NIST Software Assurance Metrics And Tool Evaluation (SAMATE) Software Assurance Reference Dataset (SARD), a set of programs with known security flaws. It allows end users to evaluate tools, and tool developers to test their methods.

  18. Data from: CRAWDAD wireless network data citation bibliography

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Tristan Henderson; David Kotz (2016). CRAWDAD wireless network data citation bibliography [Dataset]. http://doi.org/10.6084/m9.figshare.1203646.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tristan Henderson; David Kotz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This BibTeX file contains the corpus of papers that cite CRAWDAD wireless network datasets, as used in the paper: Tristan Henderson and David Kotz. Data citation practices in the CRAWDAD wireless network data archive. Proceedings of the Second Workshop on Linking and Contextualizing Publications and Datasets, London, UK, September 2014. Most of the fields are standard BibTeX fields. Two require further explanation:

    • "citations": the citations for a paper as counted by Google Scholar as of 24 September 2014.

    • "keywords": a set of tags indicating data citation practice. These are as follows:

    • "uses_crawdad_data": this paper uses a CRAWDAD dataset

    • "cites_insufficiently": this paper does not meet our sufficiency criteria

    • "cites_by_description": this paper cites a dataset by description rather than dataset identifier

    • "cites_canonical_paper": this paper cites the original ("canonical") paper that collected a dataset, rather than pointing to the dataset

    • "cites_by_name": this paper cites a dataset by a colloquial name rather than dataset identifier

    • "cites_crawdad_url": this paper cites the main CRAWDAD URL rather than a particular dataset

    • "cites_without_url": this paper does not provide a URL for dataset access

    • "cites_wrong_attribution": this paper attributes a dataset to CRAWDAD, Dartmouth, etc. rather than the dataset authors

    • "cites_vaguely": this paper cites the used datasets (if any) too vaguely to be sufficient

    If you have any questions about the data, please contact us at crawdad@crawdad.org.
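
    The citation-practice tags can be pulled out of the "keywords" field with a few lines of Python. A minimal sketch; the entry below is invented for illustration, and real entries from the .bib file will carry more fields:

```python
import re

# Invented BibTeX entry mimicking the corpus format described above.
entry = """@inproceedings{example2014,
  title = {An example paper},
  citations = {12},
  keywords = {uses_crawdad_data, cites_by_name, cites_without_url},
}"""

# Extract the comma-separated tag list from the "keywords" field.
match = re.search(r"keywords\s*=\s*\{([^}]*)\}", entry)
tags = [t.strip() for t in match.group(1).split(",")]
print(tags)
```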

  19. Data from: Global hydrological dataset of daily streamflow data from the Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN), 1863 - 2022

    • catalogue.ceh.ac.uk
    • hosted-metadata.bgs.ac.uk
    • +3 more
    zip
    Updated May 28, 2024
    + more versions
    Cite
    S. Turner; J. Hannaford; L.J. Barker; G. Suman; R. Armitage; A. Killeen; A. Griffin; H. Davies; A. Kumar; H. Dixon; M.T.D. Albuquerque; N. Almeida Ribeiro; C. Alvarez-Garreton; E. Amoussou; B. Arheimer; Y. Asano; T. Berezowski; A. Bodian; H. Boutaghane; R. Capell; H. Dakhaoui; J. Daňhelka; H.X. Do; C. Ekkawatpanit; E.M. El Khalki; A.K. Fleig; R. Fonseca; J.D. Giraldo-Osorio; A.B.T. Goula; M. Hanel; G Hodgkins; S. Horton; C. Kan; D.G. Kingston; G. Laaha; R. Laugesen; W. Lopes; S. Mager; Y. Markonis; L. Mediero; G. Midgley; C. Murphy; P. O'Connor; A.I. Pedersen; H.T. Pham; M. Piniewski; M. Rachdane; B. Renard; M.E. Saidi; P. Schmocker-Facker; K. Stahl; M. Thyler; M. Toucher; Y. Tramblay; J. Uusikivi; N. Venegas-Cordero; S. Vissesri; A. Watson; S. Westra; P.H. Whitfield (2024). Global hydrological dataset of daily streamflow data from the Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN), 1863 - 2022 [Dataset]. http://doi.org/10.5285/3b077711-f183-42f1-bac6-c892922c81f4
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 28, 2024
    Dataset provided by
    NERC EDS Environmental Information Data Centre
    Authors
    S. Turner; J. Hannaford; L.J. Barker; G. Suman; R. Armitage; A. Killeen; A. Griffin; H. Davies; A. Kumar; H. Dixon; M.T.D. Albuquerque; N. Almeida Ribeiro; C. Alvarez-Garreton; E. Amoussou; B. Arheimer; Y. Asano; T. Berezowski; A. Bodian; H. Boutaghane; R. Capell; H. Dakhaoui; J. Daňhelka; H.X. Do; C. Ekkawatpanit; E.M. El Khalki; A.K. Fleig; R. Fonseca; J.D. Giraldo-Osorio; A.B.T. Goula; M. Hanel; G Hodgkins; S. Horton; C. Kan; D.G. Kingston; G. Laaha; R. Laugesen; W. Lopes; S. Mager; Y. Markonis; L. Mediero; G. Midgley; C. Murphy; P. O'Connor; A.I. Pedersen; H.T. Pham; M. Piniewski; M. Rachdane; B. Renard; M.E. Saidi; P. Schmocker-Facker; K. Stahl; M. Thyler; M. Toucher; Y. Tramblay; J. Uusikivi; N. Venegas-Cordero; S. Vissesri; A. Watson; S. Westra; P.H. Whitfield
    License

    https://eidc.ac.uk/licences/ogl/plain

    Time period covered
    Jan 1, 1863 - Dec 31, 2022
    Area covered
    Earth
    Dataset funded by
    Natural Environment Research Council (https://www.ukri.org/councils/nerc)
    Description

    The Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN) dataset is a global hydrological dataset containing publicly available daily flow data for 2,386 gauging stations across the globe which have natural or near-natural catchments. Metadata is also provided alongside these stations for the Full ROBIN Dataset, consisting of 3,060 gauging stations. Data were quality controlled by the central ROBIN team before being added to the dataset, and two levels of data quality are applied to guide users towards appropriate data usage. Most records span at least 40 years with minimal missing data, with some sites' records starting in the late 19th century and running through to 2022. ROBIN represents a significant advance in global-scale, accessible streamflow data. The project was funded by the UK Natural Environment Research Council Global Partnership Seedcorn Fund (NE/W004038/1) and the NC-International programme (NE/X006247/1) delivering National Capability.

  20. Animal -5 Mammal

    • kaggle.com
    zip
    Updated Jan 4, 2022
    Cite
    Shiv28 (2022). Animal -5 Mammal [Dataset]. https://www.kaggle.com/datasets/shiv28/animal-5-mammal
    Explore at:
    Available download formats: zip (935548424 bytes)
    Dataset updated
    Jan 4, 2022
    Authors
    Shiv28
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Hello everyone!

    About This Data

    This is the dataset I used for my matriculation thesis. It contains about 15K medium-quality animal images belonging to 5 categories: dog, cat, horse, elephant, lion. All the images were collected from Google Images and have been checked by a human. There is some erroneous data to simulate real conditions (e.g. images taken by users of your app).

    How to Cite this Dataset

    If you use this dataset in your research, please credit the authors.
