100+ datasets found
  1. POCI CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Dec 27, 2022
    + more versions
    Cite
    OpenCitations ​ (2022). POCI CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.21776351.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 27, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. In particular, each line of the CSV file defines a citation, and includes the following information:

    [field "oci"] the Open Citation Identifier (OCI) for the citation; [field "citing"] the PMID of the citing entity; [field "cited"] the PMID of the cited entity; [field "creation"] the creation date of the citation (i.e. the publication date of the citing entity); [field "timespan"] the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity); [field "journal_sc"] it records whether the citation is a journal self-citations (i.e. the citing and the cited entities are published in the same journal); [field "author_sc"] it records whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).

    This version of the dataset contains:

    717,654,703 citations; 26,024,862 bibliographic resources.

    The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available on its official webpage.
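
    As a rough illustration (not part of the dataset documentation), a minimal Python sketch for streaming one file from this dump; the file name is illustrative and the "yes"/"no" coding of the self-citation fields is an assumption based on the field list above.

    ```python
    # Minimal sketch (assumptions: illustrative file name; journal_sc/author_sc
    # coded as "yes"/"no"). Streams one CSV from the POCI dump and tallies
    # self-citations without loading the 50 GB file into memory.
    import csv

    total = journal_sc = author_sc = 0
    with open("poci_citations_part_1.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            journal_sc += row["journal_sc"] == "yes"
            author_sc += row["author_sc"] == "yes"

    print(total, "citations,", journal_sc, "journal self-citations,",
          author_sc, "author self-citations")
    ```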

  2. Citations to software and data in Zenodo via open sources

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    csv
    Updated Jan 24, 2020
    Cite
    Stephanie van de Sandt; Stephanie van de Sandt; Alex Ioannidis; Alex Ioannidis; Lars Holm Nielsen; Lars Holm Nielsen (2020). Citations to software and data in Zenodo via open sources [Dataset]. http://doi.org/10.5281/zenodo.3482927
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephanie van de Sandt; Stephanie van de Sandt; Alex Ioannidis; Alex Ioannidis; Lars Holm Nielsen; Lars Holm Nielsen
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    In January 2019, the Asclepias Broker harvested citation links to Zenodo objects from three discovery systems: the NASA Astrophysics Data System (ADS), Crossref Event Data, and Europe PMC. Each row of our dataset represents one unique link between a citing publication and a Zenodo DOI. Both endpoints are described by basic metadata. The second dataset contains usage metrics for every cited Zenodo DOI of our data sample.

  3. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)
    ----------------------------------------------------------------------
    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018.

    ## Introduction

    This dataset was created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes; v4.reviewer contains models for analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics. However, we provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. Check here for information on getting PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
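
    As a rough illustration (not part of the original README), a minimal Python sketch for loading one of the training files; it assumes the header file is a single tab-separated line of column names matching the data files.

    ```python
    # Minimal sketch, assuming the header file holds tab-separated column names.
    # Reads the first-author file in chunks because it is roughly 1.2 GB.
    import pandas as pd

    with open("Training_data_2002_2005_pmc_pair_txt.header.txt", encoding="utf-8") as f:
        columns = f.readline().strip().split("\t")

    reader = pd.read_csv("Training_data_2002_2005_pmc_pair_First.txt",
                         sep="\t", names=columns, chunksize=500_000)
    total_rows = sum(len(chunk) for chunk in reader)
    print("first-author records:", total_rows)
    ```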

  4. August 2024 data-update for "Updated science-wide author databases of...

    • elsevier.digitalcommonsdata.com
    Updated Sep 16, 2024
    + more versions
    Cite
    John P.A. Ioannidis (2024). August 2024 data-update for "Updated science-wide author databases of standardized citation indicators" [Dataset]. http://doi.org/10.17632/btchxktzyw.7
    Explore at:
    Dataset updated
    Sep 16, 2024
    Authors
    John P.A. Ioannidis
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database), as well as citations to/from retracted papers, have been added in the most recent iteration.

    Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023, and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to the end of citation year 2023. This work uses Scopus data; calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work.

    PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases.

    The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a

  5. Methodology data of "A qualitative and quantitative citation analysis toward...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 8, 2022
    Cite
    Ivan Heibi; Ivan Heibi; Silvio Peroni; Silvio Peroni (2022). Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study" [Dataset]. http://doi.org/10.5281/zenodo.4323221
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Heibi; Ivan Heibi; Silvio Peroni; Silvio Peroni
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This document contains the datasets and visualizations generated after the application of the methodology defined in our work: "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step of the methodology (i.e. "Data gathering") builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step (i.e. "Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.

    Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.

    Data gathering

    The data generated by this step are stored in "data/":

    1. "cits_features.csv": a dataset containing all the entities (rows in the CSV) which have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), the title ("title"), the venue identifier ("source_id"), the title of the venue ("source_title"), yes/no value in case the entity is retracted as well ("retracted"), the subject area ("area"), the subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value to denote whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention").
      Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).
    2. "cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citations context ("intext_citation.context") for each citing entity identified using the DOI value ("doi").
      Note: the data keep their original license (the one provided by their publisher). This dataset is provided in order to favor the reproducibility of the results obtained in our work.
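
    As a rough illustration (not part of the original documentation), a minimal Python sketch that loads "data/cits_features.csv" and tabulates the in-text citation sentiment per publication year, using the column names listed above.

    ```python
    # Minimal sketch: cross-tabulate in-text citation sentiment by year from
    # "data/cits_features.csv" (column names as documented above).
    import pandas as pd

    cits = pd.read_csv("data/cits_features.csv")
    table = pd.crosstab(cits["year"], cits["intext_citation.sentiment"])
    print(table)
    ```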

    Topic modeling
    We ran a topic modeling analysis on the textual features gathered (i.e. abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling has been done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two different folders: "abstracts/" for the abstracts, and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files:

    1. "mitao_workflows/": the workflows of MITAO. These are JSON files that could be reloaded in MITAO to reproduce the results following the same workflows.

    2. "corpus_and_dictionary/": it contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

    3. "coherence/coherence.csv": the coherence score of several topic models trained on a number of topics from 1 - 40.

    4. "datasets_and_views/": the datasets and visualizations generated using MITAO.

    References

    1. Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0
    2. Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

    3. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3

  6. Data from: CRAWDAD wireless network data citation bibliography

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Tristan Henderson; David Kotz (2016). CRAWDAD wireless network data citation bibliography [Dataset]. http://doi.org/10.6084/m9.figshare.1203646.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tristan Henderson; David Kotz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This BibTeX file contains the corpus of papers that cite CRAWDAD wireless network datasets, as used in the paper: Tristan Henderson and David Kotz. Data citation practices in the CRAWDAD wireless network data archive. Proceedings of the Second Workshop on Linking and Contextualizing Publications and Datasets, London, UK, September 2014.

    Most of the fields are standard BibTeX fields. Two require further explanation:

    - "citations" - this field contains the citations for a paper as counted by Google Scholar as of 24 September 2014.
    - "keywords" - this field contains a set of tags indicating data citation practice. These are as follows:
      - "uses_crawdad_data" - this paper uses a CRAWDAD dataset
      - "cites_insufficiently" - this paper does not meet our sufficiency criteria
      - "cites_by_description" - this paper cites a dataset by description rather than dataset identifier
      - "cites_canonical_paper" - this paper cites the original ("canonical") paper that collected a dataset, rather than pointing to the dataset
      - "cites_by_name" - this paper cites a dataset by a colloquial name rather than dataset identifier
      - "cites_crawdad_url" - this paper cites the main CRAWDAD URL rather than a particular dataset
      - "cites_without_url" - this paper does not provide a URL for dataset access
      - "cites_wrong_attribution" - this paper attributes a dataset to CRAWDAD, Dartmouth, etc. rather than the dataset authors
      - "cites_vaguely" - this paper cites the used datasets (if any) too vaguely to be sufficient

    If you have any questions about the data, please contact us at crawdad@crawdad.org.
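
    As a rough illustration (the .bib file name is an assumption), a minimal Python sketch counting how often each of the tags above appears in the corpus.

    ```python
    # Minimal sketch (illustrative file name): count occurrences of the
    # data-citation-practice tags described above in the BibTeX corpus.
    import re
    from collections import Counter

    TAGS = [
        "uses_crawdad_data", "cites_insufficiently", "cites_by_description",
        "cites_canonical_paper", "cites_by_name", "cites_crawdad_url",
        "cites_without_url", "cites_wrong_attribution", "cites_vaguely",
    ]

    with open("crawdad_citations.bib", encoding="utf-8") as f:
        bib = f.read()

    counts = Counter({tag: len(re.findall(tag, bib)) for tag in TAGS})
    for tag, n in counts.most_common():
        print(tag, n)
    ```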

  7. SemanticCite-Dataset

    • huggingface.co
    Updated Nov 30, 2025
    Cite
    Seb Haan (2025). SemanticCite-Dataset [Dataset]. https://huggingface.co/datasets/sebsigma/SemanticCite-Dataset
    Explore at:
    Dataset updated
    Nov 30, 2025
    Authors
    Seb Haan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    SemanticCite Dataset

    The SemanticCite Dataset is a collection of citation-reference pairs with expert annotations for training and evaluating citation verification systems. Each entry contains a citation claim, reference document context, and detailed classification with reasoning.

      Dataset Format
    

    The dataset is provided as a JSON file where each entry contains the following structure:

      Input Fields
    

    claim: The core assertion extracted from the citation text… See the full description on the dataset page: https://huggingface.co/datasets/sebsigma/SemanticCite-Dataset.
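
    As a rough illustration, a minimal Python sketch for loading the dataset from the Hugging Face Hub; the "train" split name is an assumption, and only the "claim" field is documented above, so the remaining keys are printed rather than assumed.

    ```python
    # Minimal sketch, assuming a default "train" split exists on the Hub.
    from datasets import load_dataset

    ds = load_dataset("sebsigma/SemanticCite-Dataset", split="train")
    example = ds[0]
    print(sorted(example.keys()))   # inspect the actual field names
    print(example["claim"])         # the documented "claim" field
    ```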

  8. Enriched Citation API (Version 2)

    • catalog.data.gov
    • gimi9.com
    • +1more
    Updated Sep 30, 2025
    + more versions
    Cite
    Open Data Portal Team (2025). Enriched Citation API (Version 2) [Dataset]. https://catalog.data.gov/dataset/enriched-citation-api-version-2
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Open Data Portal Team
    Description

    The Enriched Citation API provides the Intellectual Property 5 offices (IP5: EPO, JPO, KIPO, CNIPA, and USPTO) and the public with greater insight into the patent evaluation process. It allows users to quickly view information about which references, or prior art, were cited in specific patent application Office Actions, including: bibliographic information of the reference, the claims that the prior art was cited against, and the relevant sections that the examiner relied upon. The API allows for daily refresh and retrieval of enriched citation data from Office Actions mailed from October 1, 2017 to 30 days prior to the current date.

  9. Louisville Metro KY - Uniform Citation Data 2020

    • catalog.data.gov
    • s.cnmilf.com
    • +5more
    Updated Jul 30, 2025
    Cite
    Louisville/Jefferson County Information Consortium (2025). Louisville Metro KY - Uniform Citation Data 2020 [Dataset]. https://catalog.data.gov/dataset/louisville-metro-ky-uniform-citation-data-2020
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Louisville/Jefferson County Information Consortium
    Area covered
    Kentucky, Louisville
    Description

    A list of all uniform citations from the Louisville Metro Police Department. The CSV file is updated daily and includes case number, date, location, division, beat, offender demographics, statutes and charges, and UCR codes; it can be found in this Link.

    INCIDENT_NUMBER or CASE_NUMBER links these data sets together: Crime Data, Uniform Citation Data, Firearm Intake, LMPD Hate Crimes, Assaulted Officers. CITATION_CONTROL_NUMBER links these data sets together: Uniform Citation Data, LMPD Stops Data.

    Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.

    - AGENCY_DESC - the name of the department that issued the citation
    - CASE_NUMBER - the number associated with either the incident or used as a reference to store the items in our evidence rooms; it can be used to connect this dataset, via INCIDENT_NUMBER, to the following other datasets: 1. Crime Data, 2. Firearms Intake, 3. LMPD Hate Crimes, 4. Assaulted Officers. NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example, in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
    - CITATION_YEAR - the year the citation was issued
    - CITATION_CONTROL_NUMBER - links to the LMPD Stops Data
    - CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
    - CITATION_DATE - the date the citation was issued
    - CITATION_LOCATION - the location where the citation was issued
    - DIVISION - the LMPD division in which the citation was issued
    - BEAT - the LMPD beat in which the citation was issued
    - PERSONS_SEX - the gender of the person who received the citation
    - PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
    - PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
    - PERSONS_AGE - the age of the person who received the citation
    - PERSONS_HOME_CITY - the city in which the person who received the citation lives
    - PERSONS_HOME_STATE - the state in which the person who received the citation lives
    - PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
    - VIOLATION_CODE - alphanumeric code(s) assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
    - ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
    - STATUTE - alphanumeric code(s) representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
    - CHARGE_DESC - the description of the type of charge for the citation
    - UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/
    - UCR_DESC - the description of the UCR_CODE. For more details visit https://ucr.fbi.gov/
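
    As a rough illustration (file names are assumptions), a minimal Python sketch of the CASE_NUMBER-to-INCIDENT_NUMBER reformatting described above, used to join citations to the Crime Data extract.

    ```python
    # Minimal sketch: add dashes to CASE_NUMBER (8018013155 -> 80-18-013155)
    # so citations can be joined to Crime Data on INCIDENT_NUMBER.
    # File names are illustrative; column names follow the description above.
    import pandas as pd

    citations = pd.read_csv("uniform_citation_data_2020.csv", dtype=str)
    crime = pd.read_csv("crime_data_2020.csv", dtype=str)

    def to_incident_number(case_number: str) -> str:
        return f"{case_number[:2]}-{case_number[2:4]}-{case_number[4:]}"

    citations = citations.dropna(subset=["CASE_NUMBER"])
    citations["INCIDENT_NUMBER"] = citations["CASE_NUMBER"].map(to_incident_number)
    merged = citations.merge(crime, on="INCIDENT_NUMBER", how="left")
    print(len(merged), "citations matched against crime records")
    ```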

  10. Data from: Dataset for 'A Matter of Culture? Conceptualising and...

    • zenodo.org
    csv
    Updated Oct 22, 2024
    Cite
    Rhodri Leng; Rhodri Leng; Justyna Bandola-Gill; Justyna Bandola-Gill; Katherine Smith; Katherine Smith; Valerie Pattyn; Valerie Pattyn; Niklas Andersen; Niklas Andersen (2024). Dataset for 'A Matter of Culture? Conceptualising and Investigating 'Evidence Cultures' within Research on Evidence-Informed Policymaking' [Dataset]. http://doi.org/10.5281/zenodo.13972074
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rhodri Leng; Rhodri Leng; Justyna Bandola-Gill; Justyna Bandola-Gill; Katherine Smith; Katherine Smith; Valerie Pattyn; Valerie Pattyn; Niklas Andersen; Niklas Andersen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 22, 2024
    Description

    Introduction
    This document describes the data collection and datasets used in the manuscript "A Matter of Culture? Conceptualising and Investigating ‘Evidence Cultures’ within Research on Evidence-Informed Policymaking" [1].

    Data Collection

    To construct the citation network analysed in the manuscript, we first designed a series of queries to capture a large sample of literature exploring the relationship between evidence, policy, and culture from various perspectives. Our team of domain experts developed the following queries based on terms common in the literature. These queries search for the terms included in the titles, abstracts, and associated keywords of WoS-indexed records (i.e. ‘TS=’). While these are separated below for ease of reading, they were combined into a single query via the OR operator in our search. Our search was conducted on the Web of Science’s (WoS) Core Collection through the University of Edinburgh Library subscription on 29/11/2023, returning a total of 2,089 records.

    TS = ((“cultures of evidence” OR “culture of evidence” OR “culture of knowledge” OR “cultures of knowledge” OR “research culture” OR “research cultures” OR “culture of research” OR “cultures of research” OR “epistemic culture” OR “epistemic cultures” OR “epistemic community” OR “epistemic communities” OR “epistemic infrastructure” OR “evaluation culture” OR “evaluation cultures” OR “culture of evaluation” OR “cultures of evaluation” OR “thought style” OR “thought styles” OR “thought collective” OR “thought collectives” OR “knowledge regime” OR “knowledge regimes” OR “knowledge system” OR “knowledge systems” OR “civic epistemology” OR “civic epistemologies”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “policy decision” OR “policy decisions” OR “political decision” OR “political decisions” OR “political decision making”))

    OR

    TS = ((“culture” OR “cultures”) AND ((“evidence-based” OR “evidence-informed” OR “evidence-led” OR “science-based” OR “science-informed” OR “science-led” OR “research-based” OR “research-informed” OR “evidence use” OR “evidence user” OR “evidence utilisation” OR “evidence utilization” OR “research use” OR “researcher user” OR “research utilisation” OR “research utilization” OR “research in” OR “evidence in” OR “science in”) NEAR/1 (“policymaking” OR “policy making” OR “policy maker” OR “policy makers”)))

    OR

    TS = ((“culture” OR “cultures”) AND (“scientific advice” OR “technical advice” OR “scientific expertise” OR “technical expertise” OR “expert advice”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “political decision” OR “political decisions” OR “political decision making”))

    OR

    TS = ((“culture” OR “cultures”) AND (“post-normal science” OR “trans-science” OR “transdisciplinary” OR “transdisiplinarity” OR “science-policy interface” OR “policy sciences” OR “sociology of knowledge” OR “sociology of science” OR “knowledge transfer” OR “knowledge translation” OR “knowledge broker” OR “implementation science” OR “risk society”) AND (“policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers”))

    Citation Network Construction

    All bibliographic metadata on these 2,089 records were downloaded in five batches in plain text and then merged in R. We then parsed these data into network readable files. All unique reference strings are given unique node IDs. A node-attribute-list (‘CE_Node’) links identifying information of each document with its node ID, including authors, title, year of publication, journal WoS ID, and WoS citations. An edge-list (‘CE_Edge’) records all citations from these documents to their bibliographies – with edges going from a citing document to the cited – using the relevant node IDs. These data were then cleaned by (a) matching DOIs for reference strings that differ but point to the same paper, and (b) manual merging of obvious duplicates caused by referencing errors.

    Our initial dataset consisted of 2,089 retrieved documents and 123,772 unretrieved cited documents (i.e. documents that were cited within the publications we retrieved but which were not one of these 2,089 documents). These documents were connected by 157,229 citation links, but ~87% of the documents in the network were cited just once. To focus on relevant literature, we filtered the network to include only documents with at least three citation or reference links. We further refined the dataset by focusing on the main connected component, resulting in 6,650 nodes and 29,198 edges. It is this dataset that we publish here, and it is this network that underpins Figure 1, Table 1, and the qualitative examination of documents (see manuscript for further details).

    Our final network dataset contains 1,819 of the documents in our original query (~87% of the original retrieved records), and 4,831 documents not retrieved via our Web of Science search but cited by at least three of the retrieved documents. We then clustered this network by modularity maximization via the Leiden algorithm [2], detecting 14 clusters with Q=0.59. Citations to documents within the same cluster constitute ~77% of all citations in the network.

    Citation Network Dataset Description

    We include two network datasets: (i) ‘CE_Node.csv’, which contains 1,819 retrieved documents and 4,831 unretrieved referenced documents, making for a total of 6,650 documents (nodes); (ii) ‘CE_Edge.csv’, which records citations (edges) between the documents (nodes), including a total of 29,198 citation links. These files can be used to construct a network with many different tools, but we have formatted them to be used in Gephi 0.10 [3].

    ‘CE_Node.csv’ is a comma-separated values file that contains two types of nodes:

    i. Retrieved documents – these are documents captured by our query. These include full bibliographic metadata and reference lists.

    ii. Non-retrieved documents – these are documents referenced by our retrieved documents but were not retrieved via our query. These only have data contained within their reference string (i.e. first author, journal or book title, year of publication, and possibly DOI).

    The columns in the .csv refer to:

    - Id, the node ID

    - Label, the reference string of the document

    - DOI, the DOI for the document, if available

    - WOS_ID, WoS accession number

    - Authors, named authors

    - Title, title of document

    - Document_type, variable indicating whether a document is an article, review, etc.

    - Journal_book_title, journal of publication or title of book

    - Publication year, year of publication.

    - WOS_times_cited, total Core Collection citations as of 29/11/2023

    - Indegree, number of within network citations to a given document

    - Cluster, provides the cluster membership number as discussed in the manuscript (Figure 1)

    ‘CE_Edge.csv’ is a comma-separated values file that contains edges (citation links) between nodes (documents) (n=29,198). The columns refer to:

    - Source, node ID of the citing document

    - Target, node ID of the cited document
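
    As a rough illustration (not part of the published files), a minimal Python sketch that rebuilds the directed citation network from the two CSVs; the exact header capitalization is assumed to match the column lists above.

    ```python
    # Minimal sketch: load CE_Node.csv / CE_Edge.csv into a networkx DiGraph
    # and report basic statistics (column names as documented above).
    import pandas as pd
    import networkx as nx

    nodes = pd.read_csv("CE_Node.csv")
    edges = pd.read_csv("CE_Edge.csv")

    G = nx.DiGraph()
    G.add_nodes_from(
        (row.Id, {"label": row.Label, "cluster": row.Cluster})
        for row in nodes.itertuples(index=False)
    )
    G.add_edges_from(edges[["Source", "Target"]].itertuples(index=False, name=None))

    print(G.number_of_nodes(), "nodes;", G.number_of_edges(), "edges")
    ```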

    Cluster Analysis

    We qualitatively analyse a set of publications from seven of the largest clusters in our manuscript.

  11. GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 9, 2019
    Cite
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel (2019). GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data] [Dataset]. http://doi.org/10.7910/DVN/LXQXAO
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LXQXAO

    Description

    Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT. GIANT is a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the citation-style language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
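
    As a rough illustration, a minimal Python sketch for pulling labelled segments out of an XML-annotated reference string; the tag names in the example string are made up, since the real tag set is defined by the dataset itself.

    ```python
    # Minimal sketch: extract tag/text pairs from one XML-labelled reference
    # string. The tags in the example string are illustrative only.
    import xml.etree.ElementTree as ET

    labelled = ("<sequence><author>Doe, J.</author> <year>2019</year> "
                "<title>An example article</title></sequence>")  # illustrative

    root = ET.fromstring(labelled)
    for element in root:
        print(element.tag, "->", (element.text or "").strip())
    ```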

  12. Uncovering the Citation Landscape: Exploring OpenCitations COCI,...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Sep 7, 2023
    + more versions
    Cite
    Olga Pagnotta; Marta Soricetti; Lorenzo Paolini; Sara Vellone (2023). Uncovering the Citation Landscape: Exploring OpenCitations COCI, OpenCitations Meta, and ERIH-PLUS in Social Sciences and Humanities Journals - DATA PRODUCED [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7974815
    Explore at:
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    University of Bologna
    Authors
    Olga Pagnotta; Marta Soricetti; Lorenzo Paolini; Sara Vellone
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These zipped folders contain all the data produced for the research "Uncovering the Citation Landscape: Exploring OpenCitations COCI, OpenCitations Meta, and ERIH-PLUS in Social Sciences and Humanities Journals": the results datasets (dataset_map_disciplines, dataset_no_SSH, dataset_SSH, erih_meta_with_disciplines and erih_meta_without_disciplines).

    dataset_map_disciplines.zip contains CSV files with four columns ("id", "citing", "cited", "disciplines") giving information about publications stored in OpenCitations META (version 3, released in February 2023) that are part of SSH journals according to ERIH PLUS (version downloaded on 2023-04-27), specifying the disciplines associated with them and a boolean value stating whether they cite or are cited, according to the OpenCitations COCI dataset (version 19, released in January 2023).

    dataset_no_SSH.zip and dataset_SSH.zip contain CSV files with the same structure. Each dataset has four columns: "citing", "is_citing_SSH", "cited", and "is_cited_SSH". The "citing" and "cited" columns are filled with DOIs of publications stored in OpenCitations META that, according to OpenCitations COCI, are involved in a citation. The "is_citing_SSH" and "is_cited_SSH" columns contain boolean values: "True" if the corresponding publication is associated with an SSH (Social Sciences and Humanities) discipline, according to ERIH PLUS, and "False" otherwise. The two datasets are built starting from the two different subsets obtained as a result of the union between OpenCitations META and ERIH PLUS: dataset_SSH comes from erih_meta_with_disciplines and dataset_no_SSH from erih_meta_without_disciplines.

    erih_meta_with_disciplines.zip and erih_meta_without_disciplines.zip, as explained before, contain CSV files originating from ERIH PLUS and META. erih_meta_without_disciplines has just one column, "id", and contains the DOIs of all the publications in META that do not have any associated discipline, that is, that have not been published in an SSH journal. erih_meta_with_disciplines derives from all the publications in META that have at least one linked discipline and has two columns, "id" and "erih_disciplines", the latter containing a string with all the disciplines linked to that publication, such as "History, Interdisciplinary research in the Humanities, Interdisciplinary research in the Social Sciences, Sociology".
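
    As a rough illustration (the file name inside the archive is an assumption), a minimal Python sketch that cross-tabulates SSH vs. non-SSH citation flows from one of the dataset_SSH CSV files, using the four columns described above.

    ```python
    # Minimal sketch: cross-tabulate SSH vs. non-SSH citation flows
    # (illustrative file name; columns as documented above).
    import pandas as pd

    df = pd.read_csv("dataset_SSH/dataset_SSH_0.csv")
    print(pd.crosstab(df["is_citing_SSH"], df["is_cited_SSH"]))
    ```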

    Software: https://doi.org/10.5281/zenodo.8326023

    Data preprocessed: https://doi.org/10.5281/zenodo.7973159

    Article: https://zenodo.org/record/8326044

    DMP: https://zenodo.org/record/8324973

    Protocol: https://doi.org/10.17504/protocols.io.n92ldpeenl5b/v5

  13. es-publications-researchareas

    • huggingface.co
    Updated Apr 20, 2018
    Cite
    NASA Goddard Earth Sciences Data and Information Services Center (GES-DISC) (2018). es-publications-researchareas [Dataset]. http://doi.org/10.57967/hf/2914
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 20, 2018
    Dataset provided by
    NASA (http://nasa.gov/)
    Authors
    NASA Goddard Earth Sciences Data and Information Services Center (GES-DISC)
    License

    Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Publications Citing NASA GES-DISC Datasets with Applied Research Areas

      Dataset Description
    
    
    
    
    
      Dataset Summary
    

    This dataset includes a curated collection of scientific publications that cite datasets from NASA's Goddard Earth Sciences Data and Information Services Center (GES-DISC). The dataset is designed to provide insights into the impact and reach of NASA's data products, particularly in supporting Earth science research. Each publication is… See the full description on the dataset page: https://huggingface.co/datasets/nasa-gesdisc/es-publications-researchareas.

  14. Data from: BIP! NDR (NoDoiRefs): a dataset of citations from papers without...

    • data.europa.eu
    unknown
    + more versions
    Cite
    Zenodo, BIP! NDR (NoDoiRefs): a dataset of citations from papers without DOIs in computer science conferences and workshops [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8356790?locale=cs
    Explore at:
    Available download formats: unknown (200611095)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the field of Computer Science, conference and workshop papers serve as important contributions, carrying substantial weight in research assessment processes compared to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI), hence their citations are not reported in widely used citation datasets like OpenCitations and Crossref, raising limitations for citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data. BIP! NDR aims to alleviate this issue and enhance the research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus and, by performing text analysis, extracts citation information directly from their full text. The current version of the dataset contains more than 2.1M citations made by approximately 147K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.

    File structure: The dataset is formatted as a JSON Lines (JSONL) file (one JSON object per line) to facilitate file splitting and streaming. Each JSON object has three main fields:

    - "_id": a unique identifier
    - "citing_paper": the "dblp_id" of the citing paper
    - "cited_papers": an array containing one object per reference found in the text of the citing paper; each object may contain the following fields:
      - "dblp_id": the "dblp_id" of the cited paper (optional; required if a "doi" is not present)
      - "doi": the DOI of the cited paper (optional; required if a "dblp_id" is not present)
      - "bibliographic_reference": the raw citation string as it appears in the citing paper

    Changes from the previous version:

    - Replaced the PDF Downloader module with PublicationsRetriever (https://github.com/LSmyrnaios/PublicationsRetriever) to cover the full range of available URLs.
    - Fixed a bug that affected how the DBLP IDs were allocated to the downloaded PDF files (this bug affected records in the previous versions of the dataset).
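
    As a rough illustration (the file name is an assumption), a minimal Python sketch that streams the JSONL dump and counts cited papers resolved by DOI versus by DBLP ID, using the fields described above.

    ```python
    # Minimal sketch (illustrative file name): count references resolved by
    # DOI vs. DBLP ID from the JSON Lines dump.
    import json

    doi_refs = dblp_refs = 0
    with open("bip_ndr.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            for cited in record.get("cited_papers", []):
                doi_refs += "doi" in cited
                dblp_refs += "dblp_id" in cited

    print(doi_refs, "references resolved by DOI;", dblp_refs, "by DBLP ID")
    ```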

  15. Bibliometric dataset: list of highly cited papers in bibliometric

    • zenodo.org
    • data.niaid.nih.gov
    bin, png, txt
    Updated Jul 25, 2024
    + more versions
    Cite
    Dasapta Erwin Irawan; Dasapta Erwin Irawan; Dini Sofiani Permatasari; Dini Sofiani Permatasari; Lusia Marliana Nurani; Lusia Marliana Nurani (2024). Bibliometric dataset: list of highly cited papers in bibliometric [Dataset]. http://doi.org/10.5281/zenodo.2544533
    Explore at:
    Available download formats: png, bin, txt
    Dataset updated
    Jul 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dasapta Erwin Irawan; Dasapta Erwin Irawan; Dini Sofiani Permatasari; Dini Sofiani Permatasari; Lusia Marliana Nurani; Lusia Marliana Nurani
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation

    My motivation in providing this dataset is to invite more interest from Indonesian librarians in understanding their diverse field of study.

    Method

    This dataset was harvested on 19 January 2019 from the Scopus database provided by The University of Sydney. I used the keyword "bibliometric" in the title, sorted the search results by total citations, then downloaded the first 2000 papers as an RIS file. This file can be converted to other formats, such as BibTeX or CSV, using an available reference manager like Zotero.

    Visualisations

    I did two small visualisations using the following options:

    1. "create a map based on bibliographic data"
    2. "create a map based on text data"

    Both mappings were done using the VOSviewer open-source app from CWTS Leiden University.

  16. Louisville Metro KY - Uniform Citation Data 2022

    • catalog.data.gov
    • data.lojic.org
    • +4more
    Updated Jul 30, 2025
    + more versions
    Cite
    Louisville/Jefferson County Information Consortium (2025). Louisville Metro KY - Uniform Citation Data 2022 [Dataset]. https://catalog.data.gov/dataset/louisville-metro-ky-uniform-citation-data-2022
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    Louisville/Jefferson County Information Consortium
    Area covered
    Kentucky, Louisville
    Description

    Note: Due to a system migration, this data will cease to update on March 14th, 2023. The current projection is to restart the updates on or around July 17th, 2024.

    A list of all uniform citations from the Louisville Metro Police Department. The CSV file is updated daily and includes case number, date, location, division, beat, offender demographics, statutes and charges, and UCR codes; it can be found in this Link.

    INCIDENT_NUMBER or CASE_NUMBER links these data sets together: Crime Data, Uniform Citation Data, Firearm Intake, LMPD Hate Crimes, Assaulted Officers. CITATION_CONTROL_NUMBER links these data sets together: Uniform Citation Data, LMPD Stops Data.

    Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.

    - AGENCY_DESC - the name of the department that issued the citation
    - CASE_NUMBER - the number associated with either the incident or used as a reference to store the items in our evidence rooms; it can be used to connect this dataset, via INCIDENT_NUMBER, to the following other datasets: 1. Crime Data, 2. Firearms Intake, 3. LMPD Hate Crimes, 4. Assaulted Officers. NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example, in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
    - CITATION_YEAR - the year the citation was issued
    - CITATION_CONTROL_NUMBER - links to the LMPD Stops Data
    - CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
    - CITATION_DATE - the date the citation was issued
    - CITATION_LOCATION - the location where the citation was issued
    - DIVISION - the LMPD division in which the citation was issued
    - BEAT - the LMPD beat in which the citation was issued
    - PERSONS_SEX - the gender of the person who received the citation
    - PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
    - PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
    - PERSONS_AGE - the age of the person who received the citation
    - PERSONS_HOME_CITY - the city in which the person who received the citation lives
    - PERSONS_HOME_STATE - the state in which the person who received the citation lives
    - PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
    - VIOLATION_CODE - alphanumeric code(s) assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
    - ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
    - STATUTE - alphanumeric code(s) representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
    - CHARGE_DESC - the description of the type of charge for the citation
    - UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/
    - UCR_DESC - the description of the UCR_CODE. For more details visit https://ucr.fbi.gov/

  17. Cite

    • huggingface.co
    Updated Oct 21, 2025
    Cite
    Tynchtykbek (2025). Cite [Dataset]. https://huggingface.co/datasets/Mir2025/Cite
    Explore at:
    Dataset updated
    Oct 21, 2025
    Authors
    Tynchtykbek
    Description

    Mir2025/Cite is a dataset hosted on Hugging Face and contributed by the HF Datasets community.

  18. News Category Dataset

    • kaggle.com
    zip
    Updated Sep 24, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rishabh Misra (2022). News Category Dataset [Dataset]. https://www.kaggle.com/datasets/rmisra/news-category-dataset/
    Explore at:
    Available download formats: zip (27829769 bytes)
    Dataset updated
    Sep 24, 2022
    Authors
    Rishabh Misra
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    ** Please cite the dataset using the BibTeX provided in one of the following sections if you are using it in your research, thank you! **

    This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.

    Content

    Each record in the dataset consists of the following attributes: - category: category in which the article was published. - headline: the headline of the news article. - authors: list of authors who contributed to the article. - link: link to the original news article. - short_description: Abstract of the news article. - date: publication date of the article.
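
    As a rough illustration (the file name reflects the current Kaggle release and may differ), a minimal Python sketch that reads the newline-delimited JSON file and counts articles per category.

    ```python
    # Minimal sketch: count articles per category from the newline-delimited
    # JSON file (illustrative file name).
    import json
    from collections import Counter

    categories = Counter()
    with open("News_Category_Dataset_v3.json", encoding="utf-8") as f:
        for line in f:
            categories[json.loads(line)["category"]] += 1

    for name, count in categories.most_common(15):
        print(name, count)
    ```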

    There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:

    • POLITICS: 35602

    • WELLNESS: 17945

    • ENTERTAINMENT: 17362

    • TRAVEL: 9900

    • STYLE & BEAUTY: 9814

    • PARENTING: 8791

    • HEALTHY LIVING: 6694

    • QUEER VOICES: 6347

    • FOOD & DRINK: 6340

    • BUSINESS: 5992

    • COMEDY: 5400

    • SPORTS: 5077

    • BLACK VOICES: 4583

    • HOME & LIVING: 4320

    • PARENTS: 3955

    Citation

    If you're using this dataset for your work, please cite the following articles:

    Citation in text format: 1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022). 2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

    Citation in BibTeX format:

    @article{misra2022news, title={News Category Dataset}, author={Misra, Rishabh}, journal={arXiv preprint arXiv:2209.11429}, year={2022} }

    @book{misra2021sculpting, author = {Misra, Rishabh and Grover, Jigyasa}, year = {2021}, month = {01}, pages = {}, title = {Sculpting Data for ML: The first act of Machine Learning}, isbn = {9798585463570} }

    Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

    Acknowledgements

    This dataset was collected from HuffPost.

    Inspiration

    • Can you categorize news articles based on their headlines and short descriptions?

    • Do news articles from different categories have different writing styles?

    • A classifier trained on this dataset could be used on free text to identify the type of language being used.

    Want to contribute your own datasets?

    If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.

    Other datasets

    Please also check out the following datasets collected by me:

  19. Louisville Metro KY - Uniform Citation Data (2016-2019)

    • gimi9.com
    • data.lojic.org
    • +4more
    + more versions
    Cite
    Louisville Metro KY - Uniform Citation Data (2016-2019) [Dataset]. https://gimi9.com/dataset/data-gov_louisville-metro-ky-uniform-citation-data-2016-2019-901a2
    Explore at:
    Area covered
    Kentucky, Louisville
    Description

    INCIDENT_NUMBER or CASE_NUMBER links these data sets together: Crime Data, Uniform Citation Data, Firearm Intake, LMPD Hate Crimes, Assaulted Officers. CITATION_CONTROL_NUMBER links these data sets together: Uniform Citation Data, LMPD Stops Data.

    Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.

    - AGENCY_DESC - the name of the department that issued the citation
    - CASE_NUMBER - the number associated with either the incident or used as a reference to store the items in our evidence rooms; it can be used to connect this dataset, via INCIDENT_NUMBER, to the following other datasets: 1. Crime Data, 2. Firearms Intake, 3. LMPD Hate Crimes, 4. Assaulted Officers. NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example, in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
    - CITATION_YEAR - the year the citation was issued
    - CITATION_CONTROL_NUMBER - links to the LMPD Stops Data
    - CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
    - CITATION_DATE - the date the citation was issued
    - CITATION_LOCATION - the location where the citation was issued
    - DIVISION - the LMPD division in which the citation was issued
    - BEAT - the LMPD beat in which the citation was issued
    - PERSONS_SEX - the gender of the person who received the citation
    - PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
    - PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
    - PERSONS_AGE - the age of the person who received the citation
    - PERSONS_HOME_CITY - the city in which the person who received the citation lives
    - PERSONS_HOME_STATE - the state in which the person who received the citation lives
    - PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
    - VIOLATION_CODE - alphanumeric code(s) assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
    - ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
    - STATUTE - alphanumeric code(s) representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
    - CHARGE_DESC - the description of the type of charge for the citation
    - UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/

  20. Reference count CSV dataset of all bibliographic resources in OpenCitations...

    • figshare.com
    zip
    Updated Dec 11, 2023
    Cite
    OpenCitations ​ (2023). Reference count CSV dataset of all bibliographic resources in OpenCitations Index [Dataset]. http://doi.org/10.6084/m9.figshare.24747498.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download), November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of outgoing references.
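
    As a rough illustration (the file name is an assumption), a minimal Python sketch that lists the ten entities with the largest reference counts from the two-column CSV described above.

    ```python
    # Minimal sketch (illustrative file name): top ten entities by reference
    # count from the (omid, references) CSV.
    import pandas as pd

    counts = pd.read_csv("opencitations_reference_counts.csv")
    print(counts.nlargest(10, "references"))
    ```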
