CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. In particular, each line of the CSV file defines a citation, and includes the following information:
[field "oci"] the Open Citation Identifier (OCI) for the citation; [field "citing"] the PMID of the citing entity; [field "cited"] the PMID of the cited entity; [field "creation"] the creation date of the citation (i.e. the publication date of the citing entity); [field "timespan"] the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity); [field "journal_sc"] it records whether the citation is a journal self-citations (i.e. the citing and the cited entities are published in the same journal); [field "author_sc"] it records whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).
This version of the dataset contains:
717,654,703 citations; 26,024,862 bibliographic resources.
The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available on its official webpage.
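The following minimal Python sketch (standard library only) illustrates how the fields described above can be streamed from one of the CSV files; the file name is an assumption, since the dump is split across several archives.

import csv

# Stream one POCI CSV file and count journal self-citations.
# "poci_citations.csv" is a placeholder name for one file of the unzipped dump.
with open("poci_citations.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # columns: oci, citing, cited, creation, timespan, journal_sc, author_sc
    total = 0
    journal_self = 0
    for row in reader:
        total += 1
        if row["journal_sc"] == "yes":  # journal_sc/author_sc are assumed to hold "yes"/"no"
            journal_self += 1

print(journal_self, "journal self-citations out of", total, "citations")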
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In January 2019, the Asclepias Broker harvested citation links to Zenodo objects from three discovery systems: the NASA Astrophysics Data System (ADS), Crossref Event Data, and Europe PMC. Each row of our dataset represents one unique link between a citing publication and a Zenodo DOI. Both endpoints are described by basic metadata. The second dataset contains usage metrics for every cited Zenodo DOI in our data sample.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction

This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file

## Dataset creation

Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be utilized to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing the dataset.

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine: The Magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine: The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
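As a loading example, the sketch below (assuming pandas is installed) reads the first-author training file, using the separate header file to name the columns; chunked reading is used because the file is large.

import pandas as pd

# Read the column names from the header file (a single tab-separated header line).
header = pd.read_csv("Training_data_2002_2005_pmc_pair_txt.header.txt", sep="\t", nrows=0)

# Stream the ~1.2 GB first-author file in chunks; see COLUMNS_DESC.txt for column meanings.
chunks = pd.read_csv(
    "Training_data_2002_2005_pmc_pair_First.txt",
    sep="\t",
    header=None,
    names=list(header.columns),
    chunksize=1_000_000,
)
for chunk in chunks:
    print(chunk.shape)
    break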
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) (https://creativecommons.org/licenses/by-nc/3.0/)
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Data are shown separately for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database), as well as citations to/from retracted papers, have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023 and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to the end of citation year 2023.

This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list; it does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases.

The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This document contains the datasets and visualizations generated after the application of the methodology defined in our work: "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step of the methodology (i.e. “Data gathering”) builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step (i.e. "Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.
Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.
Data gathering
The data generated by this step are stored in "data/":
Topic modeling
We ran a topic modeling analysis on the textual features gathered (i.e. abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling was done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two different folders: "abstracts/" for the abstracts, and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files:
"mitao_workflows/": the workflows of MITAO. These are JSON files that can be reloaded in MITAO to reproduce the results following the same workflows.
"corpus_and_dictionary/": contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.
"coherence/coherence.csv": the coherence scores of several topic models trained with the number of topics ranging from 1 to 40 (see the sketch after this list).
"datasets_and_views/": the datasets and visualizations generated using MITAO.
References

1. Wakefield, A. J., et al. (1998). Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. (Retracted)

2. Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

3. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This BibTeX file contains the corpus of papers that cite CRAWDAD wireless network datasets, as used in the paper: Tristan Henderson and David Kotz. Data citation practices in the CRAWDAD wireless network data archive. Proceedings of the Second Workshop on Linking and Contextualizing Publications and Datasets, London, UK, September 2014.

Most of the fields are standard BibTeX fields. Two require further explanation:

"citations" - this field contains the citations for a paper as counted by Google Scholar as of 24 September 2014.

"keywords" - this field contains a set of tags indicating data citation practice. These are as follows:
- "uses_crawdad_data" - this paper uses a CRAWDAD dataset
- "cites_insufficiently" - this paper does not meet our sufficiency criteria
- "cites_by_description" - this paper cites a dataset by description rather than dataset identifier
- "cites_canonical_paper" - this paper cites the original ("canonical") paper that collected a dataset, rather than pointing to the dataset
- "cites_by_name" - this paper cites a dataset by a colloquial name rather than dataset identifier
- "cites_crawdad_url" - this paper cites the main CRAWDAD URL rather than a particular dataset
- "cites_without_url" - this paper does not provide a URL for dataset access
- "cites_wrong_attribution" - this paper attributes a dataset to CRAWDAD, Dartmouth etc. rather than the dataset authors
- "cites_vaguely" - this paper cites the used datasets (if any) too vaguely to be sufficient

If you have any questions about the data, please contact us at crawdad@crawdad.org
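As an illustration, the following Python sketch (standard library only) counts how often each of the tags above appears in the "keywords" fields; the BibTeX file name is an assumption.

import re
from collections import Counter

TAGS = [
    "uses_crawdad_data", "cites_insufficiently", "cites_by_description",
    "cites_canonical_paper", "cites_by_name", "cites_crawdad_url",
    "cites_without_url", "cites_wrong_attribution", "cites_vaguely",
]

# Read the whole .bib file and extract every keywords = {...} or keywords = "..." field.
with open("crawdad_citations.bib", encoding="utf-8") as f:
    bibtex = f.read()

counts = Counter()
for field in re.findall(r'keywords\s*=\s*[{"](.+?)["}]', bibtex, flags=re.IGNORECASE | re.DOTALL):
    for tag in TAGS:
        if tag in field:
            counts[tag] += 1

for tag, n in counts.most_common():
    print(tag, n)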
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
SemanticCite Dataset
The SemanticCite Dataset is a collection of citation-reference pairs with expert annotations for training and evaluating citation verification systems. Each entry contains a citation claim, reference document context, and detailed classification with reasoning.
Dataset Format
The dataset is provided as a JSON file where each entry contains the following structure:
Input Fields
claim: The core assertion extracted from the citation text… See the full description on the dataset page: https://huggingface.co/datasets/sebsigma/SemanticCite-Dataset.
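A minimal loading sketch, assuming the Hugging Face datasets library is installed; the split and any fields other than "claim" should be checked against the dataset page.

from datasets import load_dataset

ds = load_dataset("sebsigma/SemanticCite-Dataset")
print(ds)  # shows the available splits and features

split = list(ds.keys())[0]
example = ds[split][0]
print(example.get("claim"))  # the core assertion extracted from the citation text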
The Enriched Citation API provides the Intellectual Property 5 offices (IP5 - EPO, JPO, KIPO, CNIPA, and USPTO) and the public with greater insight into the patent evaluation process. It allows users to quickly view information about which references, or prior art, were cited in specific patent application Office Actions, including: bibliographic information of the reference, the claims that the prior art was cited against, and the relevant sections that the examiner relied upon. The API allows for daily refresh and retrieval of enriched citation data from Office Actions mailed from October 1, 2017 to 30 days prior to the current date.
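As a rough illustration of how such an API is typically queried, the sketch below uses the requests library; the endpoint URL, parameter name, and response shape are placeholders rather than the documented interface, and should be taken from the official API documentation.

import requests

BASE_URL = "<enriched-citation-api-base-url>"  # placeholder: consult the official API docs

response = requests.get(BASE_URL, params={"applicationNumber": "<application-number>"})
response.raise_for_status()
for citation in response.json().get("results", []):  # response structure is an assumption
    print(citation)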
A list of all uniform citations from the Louisville Metro Police Department can be found in this Link. The CSV file is updated daily and includes case number, date, location, division, beat, offender demographics, statutes and charges, and UCR codes.

INCIDENT_NUMBER or CASE_NUMBER links these data sets together:
- Crime Data
- Uniform Citation Data
- Firearm intake
- LMPD hate crimes
- Assaulted Officers

CITATION_CONTROL_NUMBER links these data sets together:
- Uniform Citation Data
- LMPD Stops Data

Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.

AGENCY_DESC - the name of the department that issued the citation
CASE_NUMBER - the number associated with either the incident or used as a reference to store the items in our evidence rooms; it can be used to connect the dataset, via INCIDENT_NUMBER, to the following other datasets: 1. Crime Data, 2. Firearms intake, 3. LMPD hate crimes, 4. Assaulted Officers. NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example: in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
CITATION_YEAR - the year the citation was issued
CITATION_CONTROL_NUMBER - links to the LMPD Stops data
CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
CITATION_DATE - the date the citation was issued
CITATION_LOCATION - the location where the citation was issued
DIVISION - the LMPD division in which the citation was issued
BEAT - the LMPD beat in which the citation was issued
PERSONS_SEX - the gender of the person who received the citation
PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
PERSONS_AGE - the age of the person who received the citation
PERSONS_HOME_CITY - the city in which the person who received the citation lives
PERSONS_HOME_STATE - the state in which the person who received the citation lives
PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
VIOLATION_CODE - alphanumeric code(s) assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
STATUTE - alphanumeric code(s) representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
CHARGE_DESC - the description of the type of charge for the citation
UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/
UCR_DESC - the description of the UCR_CODE. For more details visit https://ucr.fbi.gov/
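The CASE_NUMBER/INCIDENT_NUMBER note above can be expressed as a small helper; this sketch simply follows the formatting example given in the description.

def case_to_incident_number(case_number: str) -> str:
    # "8018013155" (no dashes) corresponds to INCIDENT_NUMBER "80-18-013155" in the other datasets.
    return case_number[:2] + "-" + case_number[2:4] + "-" + case_number[4:]

assert case_to_incident_number("8018013155") == "80-18-013155"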
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Introduction
This document describes the data collection and datasets used in the manuscript "A Matter of Culture? Conceptualising and Investigating ‘Evidence Cultures’ within Research on Evidence-Informed Policymaking" [1].
Data Collection
To construct the citation network analysed in the manuscript, we first designed a series of queries to capture a large sample of literature exploring the relationship between evidence, policy, and culture from various perspectives. Our team of domain experts developed the following queries based on terms common in the literature. These queries search for the terms included in the titles, abstracts, and associated keywords of WoS indexed records (i.e. ‘TS=’). While these are separated below for ease of reading, they were combined into a single query via the OR operator in our search. Our search was conducted on the Web of Science’s (WoS) Core Collection through the University of Edinburgh Library subscription on 29/11/2023, returning a total of 2,089 records.
TS = ((“cultures of evidence” OR “culture of evidence” OR “culture of knowledge” OR “cultures of knowledge” OR “research culture” OR “research cultures” OR “culture of research” OR “cultures of research” OR “epistemic culture” OR “epistemic cultures” OR “epistemic community” OR “epistemic communities” OR “epistemic infrastructure” OR “evaluation culture” OR “evaluation cultures” OR “culture of evaluation” OR “cultures of evaluation” OR “thought style” OR “thought styles” OR “thought collective” OR “thought collectives” OR “knowledge regime” OR “knowledge regimes” OR “knowledge system” OR “knowledge systems” OR “civic epistemology” OR “civic epistemologies”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “policy decision” OR “policy decisions” OR “political decision” OR “political decisions” OR “political decision making”))
OR
TS = ((“culture” OR “cultures”) AND ((“evidence-based” OR “evidence-informed” OR “evidence-led” OR “science-based” OR “science-informed” OR “science-led” OR “research-based” OR “research-informed” OR “evidence use” OR “evidence user” OR “evidence utilisation” OR “evidence utilization” OR “research use” OR “researcher user” OR “research utilisation” OR “research utilization” OR “research in” OR “evidence in” OR “science in”) NEAR/1 (“policymaking” OR “policy making” OR “policy maker” OR “policy makers”)))
OR
TS = ((“culture” OR “cultures”) AND (“scientific advice” OR “technical advice” OR “scientific expertise” OR “technical expertise” OR “expert advice”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “political decision” OR “political decisions” OR “political decision making”))
OR
TS = ((“culture” OR “cultures”) AND (“post-normal science” OR “trans-science” OR “transdisciplinary” OR “transdisiplinarity” OR “science-policy interface” OR “policy sciences” OR “sociology of knowledge” OR “sociology of science” OR “knowledge transfer” OR “knowledge translation” OR “knowledge broker” OR “implementation science” OR “risk society”) AND (“policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers”))
Citation Network Construction
All bibliographic metadata on these 2,089 records were downloaded in five batches in plain text and then merged in R. We then parsed these data into network-readable files. All unique reference strings are given unique node IDs. A node-attribute-list (‘CE_Node’) links identifying information of each document with its node ID, including authors, title, year of publication, journal, WoS ID, and WoS citations. An edge-list (‘CE_Edge’) records all citations from these documents to their bibliographies – with edges going from a citing document to the cited – using the relevant node IDs. These data were then cleaned by (a) matching DOIs for reference strings that differ but point to the same paper, and (b) manual merging of obvious duplicates caused by referencing errors.
Our initial dataset consisted of 2,089 retrieved documents and 123,772 unretrieved cited documents (i.e. documents that were cited within the publications we retrieved but which were not one of these 2,089 documents). These documents were connected by 157,229 citation links, but ~87% of the documents in the network were cited just once. To focus on relevant literature, we filtered the network to include only documents with at least three citation or reference links. We further refined the dataset by focusing on the main connected component, resulting in 6,650 nodes and 29,198 edges. It is this dataset that we publish here, and it is this network that underpins Figure 1, Table 1, and the qualitative examination of documents (see manuscript for further details).
Our final network dataset contains 1,819 of the documents in our original query (~87% of the original retrieved records), and 4,831 documents not retrieved via our Web of Science search but cited by at least three of the retrieved documents. We then clustered this network by modularity maximization via the Leiden algorithm [2], detecting 14 clusters with Q=0.59. Citations to documents within the same cluster constitute ~77% of all citations in the network.
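The filtering steps described above can be sketched as follows (assuming pandas and networkx); note that the published CE_Edge.csv is already the filtered network, so running this on it is purely illustrative of the procedure applied to the full parsed reference data, and the cluster detection itself (Leiden algorithm) is not reproduced here.

import pandas as pd
import networkx as nx

edges = pd.read_csv("CE_Edge.csv")  # columns: Source, Target
G = nx.DiGraph()
G.add_edges_from(edges[["Source", "Target"]].itertuples(index=False, name=None))

# Keep documents with at least three citation or reference links.
G = G.subgraph([n for n, d in G.degree() if d >= 3]).copy()

# Restrict to the main (largest) weakly connected component.
main_component = max(nx.weakly_connected_components(G), key=len)
G = G.subgraph(main_component).copy()

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")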
Citation Network Dataset Description
We include two network datasets: (i) ‘CE_Node.csv’, which contains 1,819 retrieved documents and 4,831 unretrieved referenced documents, for a total of 6,650 documents (nodes); (ii) ‘CE_Edge.csv’, which records citations (edges) between the documents (nodes), including a total of 29,198 citation links. These files can be used to construct a network with many different tools, but we have formatted these to be used in Gephi 0.10 [3].
‘CE_Node.csv’ is a comma-separated values file that contains two types of nodes:
i. Retrieved documents – these are documents captured by our query. These include full bibliographic metadata and reference lists.
ii. Non-retrieved documents – these are documents referenced by our retrieved documents but were not retrieved via our query. These only have data contained within their reference string (i.e. first author, journal or book title, year of publication, and possibly DOI).
The columns in the .csv refer to:
- Id, the node ID
- Label, the reference string of the document
- DOI, the DOI for the document, if available
- WOS_ID, WoS accession number
- Authors, named authors
- Title, title of document
- Document_type, variable indicating whether a document is an article, review, etc.
- Journal_book_title, journal of publication or title of book
- Publication year, year of publication.
- WOS_times_cited, total Core Collection citations as of 29/11/2023
- Indegree, number of within network citations to a given document
- Cluster, provides the cluster membership number as discussed in the manuscript (Figure 1)
‘CE_Edge.csv’ is a comma-separated values file that contains edges (citation links) between nodes (documents) (n=29,198). The columns refer to:
- Source, node ID of the citing document
- Target, node ID of the cited document
Cluster Analysis
We qualitatively analyse a set of publications from seven of the largest clusters in our manuscript.
Custom license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LXQXAO
Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data – which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT. GIANT is a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the Citation Style Language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
These zipped folders contain all the data produced for the research "Uncovering the Citation Landscape: Exploring OpenCitations COCI, OpenCitations Meta, and ERIH-PLUS in Social Sciences and Humanities Journals": the results datasets (dataset_map_disciplines, dataset_no_SSH, dataset_SSH, erih_meta_with_disciplines and erih_meta_without_disciplines).
dataset_map_disciplines.zip contains CSV files with four columns ("id", "citing", "cited", "disciplines") giving information about publications stored in OpenCitations META (version 3, released in February 2023) that are part of SSH journals according to ERIH PLUS (version downloaded on 2023-04-27), specifying the disciplines associated with them and a boolean value stating whether they cite or are cited, according to the OpenCitations COCI dataset (version 19, released in January 2023).
dataset_no_SSH.zip and dataset_SSH.zip contain CSV files with the same structure. Each dataset has four columns: "citing", "is_citing_SSH", "cited", and "is_cited_SSH". The "citing" and "cited" columns are filled with DOIs of publications stored in OpenCitations META that, according to OpenCitations COCI, are involved in a citation. The "is_citing_SSH" and "is_cited_SSH" columns contain boolean values: "True" if the corresponding publication is associated with an SSH (Social Sciences and Humanities) discipline, according to ERIH PLUS, and "False" otherwise. The two datasets are built starting from the two different subsets obtained as a result of the union between OpenCitations META and ERIH PLUS: dataset_SSH comes from erih_meta_with_disciplines and dataset_no_SSH comes from erih_meta_without_disciplines. erih_meta_with_disciplines.zip and erih_meta_without_disciplines.zip, as explained before, contain CSV files originating from ERIH PLUS and META. erih_meta_without_disciplines has just one column, "id", and contains the DOIs of all the publications in META that do not have any discipline associated, that is, that have not been published in an SSH journal, while erih_meta_with_disciplines derives from all the publications in META that have at least one linked discipline and has two columns, "id" and "erih_disciplines", the latter containing a string with all the disciplines linked to that publication, like "History, Interdisciplinary research in the Humanities, Interdisciplinary research in the Social Sciences, Sociology".
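A small sketch (assuming pandas) of how the boolean columns can be used, here to count citations whose citing and cited publications are both in SSH journals; the file name inside the archive is an assumption.

import pandas as pd

df = pd.read_csv("dataset_SSH/part_1.csv")  # columns: citing, is_citing_SSH, cited, is_cited_SSH

# The boolean columns hold "True"/"False"; compare as strings so the check works whether
# or not pandas has already parsed them into booleans.
citing_ssh = df["is_citing_SSH"].astype(str).eq("True")
cited_ssh = df["is_cited_SSH"].astype(str).eq("True")
print((citing_ssh & cited_ssh).sum(), "of", len(df), "citations are SSH-to-SSH")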
Software: https://doi.org/10.5281/zenodo.8326023
Data preprocessed: https://doi.org/10.5281/zenodo.7973159
Article: https://zenodo.org/record/8326044
DMP: https://zenodo.org/record/8324973
Protocol: https://doi.org/10.17504/protocols.io.n92ldpeenl5b/v5
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for Publications Citing NASA GES-DISC Datasets with Applied Research Areas
Dataset Description
Dataset Summary
This dataset includes a curated collection of scientific publications that cite datasets from NASA's Goddard Earth Sciences Data and Information Services Center (GES-DISC). The dataset is designed to provide insights into the impact and reach of NASA's data products, particularly in supporting Earth science research. Each publication is… See the full description on the dataset page: https://huggingface.co/datasets/nasa-gesdisc/es-publications-researchareas.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In the field of Computer Science, conference and workshop papers serve as important contributions, carrying substantial weight in research assessment processes compared to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI), hence their citations are not reported in widely used citation datasets like OpenCitations and Crossref, raising limitations to citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data. BIP! NDR aims to alleviate this issue and enhance the research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus and, by performing text analysis, extracts citation information directly from their full text. The current version of the dataset contains more than 2.1M citations made by approximately 147K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.

File Structure: The dataset is formatted as a JSON Lines (JSONL) file (one JSON object per line) to facilitate file splitting and streaming. Each JSON object has three main fields:
- "_id": a unique identifier
- "citing_paper": the "dblp_id" of the citing paper
- "cited_papers": array containing the objects that correspond to each reference found in the text of the "citing_paper"; each object may contain the following fields:
  - "dblp_id": the "dblp_id" of the cited paper. Optional - this field is required if a "doi" is not present.
  - "doi": the DOI of the cited paper. Optional - this field is required if a "dblp_id" is not present.
  - "bibliographic_reference": the raw citation string as it appears in the citing paper.

Changes from previous version:
- Replaced the PDF Downloader module with PublicationsRetriever (https://github.com/LSmyrnaios/PublicationsRetriever) to cover the full range of available URLs.
- Fixed a bug that affected how the DBLP IDs were allocated to the downloaded PDF files (this bug affected records in the previous versions of the dataset).
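The sketch below (standard library only) streams the JSONL dump described above and counts how each reference was resolved; the file name is an assumption.

import json

dblp_refs = doi_refs = 0
with open("bip_ndr.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # fields: _id, citing_paper, cited_papers
        for cited in record["cited_papers"]:
            if "dblp_id" in cited:
                dblp_refs += 1
            elif "doi" in cited:
                doi_refs += 1

print("resolved via DBLP id:", dblp_refs, "| resolved via DOI:", doi_refs)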
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Motivation
My motivation in providing this dataset is to invite more interest from Indonesia's librarians in understanding their diverse fields of study.
Method
This dataset was harvested on 19 January 2019 from the Scopus database provided by The University of Sydney. I used the keyword "bibliometric" in the title, sorted the search results by total citations, then downloaded the first 2,000 papers as an RIS file. This file can be converted to other formats, such as BibTeX or CSV, using an available reference manager such as Zotero.
Visualisations
I did two small visualisations using the following options:
Both mappings were done using the VOSviewer open-source app from CWTS Leiden University.
Note: Due to a system migration, this data will cease to update on March 14th, 2023. The current projection is to restart the updates on or around July 17th, 2024.

A list of all uniform citations from the Louisville Metro Police Department can be found in this Link. The CSV file is updated daily and includes case number, date, location, division, beat, offender demographics, statutes and charges, and UCR codes.

INCIDENT_NUMBER or CASE_NUMBER links these data sets together:
- Crime Data
- Uniform Citation Data
- Firearm intake
- LMPD hate crimes
- Assaulted Officers

CITATION_CONTROL_NUMBER links these data sets together:
- Uniform Citation Data
- LMPD Stops Data

Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.

AGENCY_DESC - the name of the department that issued the citation
CASE_NUMBER - the number associated with either the incident or used as a reference to store the items in our evidence rooms; it can be used to connect the dataset, via INCIDENT_NUMBER, to the following other datasets: 1. Crime Data, 2. Firearms intake, 3. LMPD hate crimes, 4. Assaulted Officers. NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example: in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
CITATION_YEAR - the year the citation was issued
CITATION_CONTROL_NUMBER - links to the LMPD Stops data
CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
CITATION_DATE - the date the citation was issued
CITATION_LOCATION - the location where the citation was issued
DIVISION - the LMPD division in which the citation was issued
BEAT - the LMPD beat in which the citation was issued
PERSONS_SEX - the gender of the person who received the citation
PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
PERSONS_AGE - the age of the person who received the citation
PERSONS_HOME_CITY - the city in which the person who received the citation lives
PERSONS_HOME_STATE - the state in which the person who received the citation lives
PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
VIOLATION_CODE - alphanumeric code(s) assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
STATUTE - alphanumeric code(s) representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
CHARGE_DESC - the description of the type of charge for the citation
UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/
UCR_DESC - the description of the UCR_CODE. For more details visit https://ucr.fbi.gov/
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
** Please cite the dataset using the BibTex provided in one of the following sections if you are using it in your research, thank you! **
This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. This is one of the biggest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is not possible to collect such a dataset in the present day. Due to changes in the website, there are about 200k headlines between 2012 and May 2018 and 10k headlines between May 2018 and 2022.
Each record in the dataset consists of the following attributes (see the loading sketch after the category counts below):
- category: category in which the article was published.
- headline: the headline of the news article.
- authors: list of authors who contributed to the article.
- link: link to the original news article.
- short_description: abstract of the news article.
- date: publication date of the article.
There are a total of 42 news categories in the dataset. The top-15 categories and corresponding article counts are as follows:
POLITICS: 35602
WELLNESS: 17945
ENTERTAINMENT: 17362
TRAVEL: 9900
STYLE & BEAUTY: 9814
PARENTING: 8791
HEALTHY LIVING: 6694
QUEER VOICES: 6347
FOOD & DRINK: 6340
BUSINESS: 5992
COMEDY: 5400
SPORTS: 5077
BLACK VOICES: 4583
HOME & LIVING: 4320
PARENTS: 3955
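As referenced above, here is a minimal loading sketch (standard library only); it assumes the download is a JSON Lines file with one record per line and uses an illustrative file name, which should be adjusted to the actual download.

import json
from collections import Counter

counts = Counter()
with open("News_Category_Dataset_v3.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # fields: category, headline, authors, link, short_description, date
        counts[record["category"]] += 1

for category, n in counts.most_common(15):
    print(category, n)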
If you're using this dataset for your work, please cite the following articles:
Citation in text format:
1. Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).
Citation in BibTex format:
@article{misra2022news,
title={News Category Dataset},
author={Misra, Rishabh},
journal={arXiv preprint arXiv:2209.11429},
year={2022}
}
@book{misra2021sculpting,
author = {Misra, Rishabh and Grover, Jigyasa},
year = {2021},
month = {01},
pages = {},
title = {Sculpting Data for ML: The first act of Machine Learning},
isbn = {9798585463570}
}
Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!
This dataset was collected from HuffPost.
Can you categorize news articles based on their headlines and short descriptions?
Do news articles from different categories have different writing styles?
A classifier trained on this dataset could be used on free text to identify the type of language being used.
If you are interested in learning how to collect high-quality datasets for various ML tasks and the overall importance of data in the ML ecosystem, consider reading my book Sculpting Data for ML.
Please also check out the following datasets collected by me:
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download) – November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of outgoing citations (i.e. references).