100+ datasets found
  1. Data from: Citeseer Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Mar 4, 2007
    Cite
    C. Lee Giles; Kurt D. Bollacker; Steve Lawrence (2007). Citeseer Dataset [Dataset]. https://paperswithcode.com/dataset/citeseer
    Authors
    C. Lee Giles; Kurt D. Bollacker; Steve Lawrence
    Description

    The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
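    For graph-learning experiments the dataset loads with standard tooling; a minimal sketch, assuming the PyTorch Geometric Planetoid loader (note that its prepackaged CiteSeer split has 3,327 nodes rather than the 3,312 described above):

```python
# Minimal sketch: load CiteSeer for node classification.
# Assumes torch and torch_geometric are installed; Planetoid downloads
# its own copy of the data on first use.
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="data/CiteSeer", name="CiteSeer")
graph = dataset[0]

print(graph.num_nodes)       # publications (3,327 in this split)
print(graph.num_edges)       # citation links
print(graph.x.shape)         # 0/1 word vectors over the 3,703-word dictionary
print(dataset.num_classes)   # the six classes
```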

  2. DBLP Dataset

    • paperswithcode.com
    Updated Apr 13, 2021
    Cite
    Jie Tang; Jing Zhang; Limin Yao; Juanzi Li; Li Zhang; Zhong Su (2021). DBLP Dataset [Dataset]. https://paperswithcode.com/dataset/dblp
    Authors
    Jie Tang; Jing Zhang; Limin Yao; Juanzi Li; Li Zhang; Zhong Su
    Description

    DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title. The dataset can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
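    The citation-network releases in this series are commonly distributed as plain text with one '#'-tagged line per field; a rough parsing sketch, assuming the V1 tag conventions (verify against the README of the version you download; the file name is a placeholder):

```python
# Rough parser for the '#'-tagged plain-text format used by early
# DBLP/ArnetMiner citation-network releases. Tag meanings assumed:
#   #*  title   #@  authors   #index  paper id   #%  id of one cited paper
def parse_dblp(path):
    papers, paper = [], {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line.strip():               # blank line closes a paper block
                if paper:
                    papers.append(paper)
                paper = {}
            elif line.startswith("#*"):
                paper["title"] = line[2:].strip()
            elif line.startswith("#@"):
                paper["authors"] = [a.strip() for a in line[2:].split(",")]
            elif line.startswith("#index"):
                paper["id"] = line[6:].strip()
            elif line.startswith("#%"):
                paper.setdefault("refs", []).append(line[2:].strip())
    if paper:
        papers.append(paper)
    return papers

papers = parse_dblp("outputacm.txt")   # placeholder file name
print(len(papers), sum(len(p.get("refs", [])) for p in papers))
```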

  3. OpenCitations Index CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Mar 27, 2025
    Cite
    OpenCitations ​ (2025). OpenCitations Index CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.24356626.v4
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset contains all the citation data (in CSV format) included in the OpenCitations Index (https://opencitations.net/index), released on March 24, 2025. Each line of the CSV file defines a citation and includes the following fields:

    • "oci": the Open Citation Identifier (OCI) for the citation;
    • "citing": the OMID of the citing entity;
    • "cited": the OMID of the cited entity;
    • "creation": the creation date of the citation (i.e. the publication date of the citing entity);
    • "timespan": the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity);
    • "journal_sc": whether the citation is a journal self-citation (i.e. the citing and the cited entities are published in the same journal);
    • "author_sc": whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).

    Note: the information for each citation is sourced from OpenCitations Meta (https://opencitations.net/meta), a database that stores and delivers bibliographic metadata for all bibliographic resources included in the OpenCitations Index. The data provided in this dump is therefore based on the state of OpenCitations Meta at the time this collection was generated.

    This version of the dataset contains 2,155,497,918 citations. The size of the zipped archive is 34.4 GB, while the size of the unzipped CSV file is 220 GB.
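    At 220 GB uncompressed, the file is best processed as a stream; a minimal sketch with pandas, assuming a local placeholder file name and that the self-citation fields hold "yes"/"no" flags:

```python
# Minimal sketch: stream the citation CSV in chunks instead of loading 220 GB.
# "index.csv" is a placeholder path; "yes"/"no" flag values are assumed.
import pandas as pd

self_cites = total = 0
for chunk in pd.read_csv("index.csv",
                         usecols=["oci", "journal_sc", "author_sc"],
                         chunksize=1_000_000):
    total += len(chunk)
    self_cites += (chunk["journal_sc"] == "yes").sum()

print(f"{self_cites} journal self-citations out of {total} citations")
```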

  4. NIST Statistical Reference Datasets - SRD 140

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). NIST Statistical Reference Datasets - SRD 140 [Dataset]. https://catalog.data.gov/dataset/nist-statistical-reference-datasets-srd-140-df30c
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently, datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations; these include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions; the certification procedure is described in the web pages for each statistical method.

    Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking, the level of difficulty of a dataset depends on the algorithm; these levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets collection is also supported by the Standard Reference Data Program.
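    A typical use: fit one of the reference datasets and compare the estimates, digit by digit, against the certified values on the StRD web pages. A minimal sketch, using the Longley data as shipped with statsmodels:

```python
# Minimal sketch: check an OLS implementation on the Longley data and compare
# the printed coefficients with NIST's certified values for this dataset.
import statsmodels.api as sm

data = sm.datasets.longley.load_pandas()
X = sm.add_constant(data.exog)   # the notoriously ill-conditioned regressors
fit = sm.OLS(data.endog, X).fit()

print(fit.params)   # compare digit by digit with the certified coefficients
```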

  5. Data from: unarXive: A Large Scholarly Data Set with Publications'...

    • zenodo.org
    Updated Apr 17, 2024
    Cite
    Tarek Saier; Michael Färber (2024). unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata [Dataset]. http://doi.org/10.5281/zenodo.3385851
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tarek Saier; Michael Färber
    Description

    unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.

    The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files.

    Typical use cases are

    • Citation recommendation
    • Citation context analysis
    • Bibliographic analyses
    • Reference string parsing

    Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615

    Access


    To download the whole dataset, send an access request and note the following:

    Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org including non-permissively licensed papers. Make sure that your use of the data is compliant with the papers' licensing terms.¹

    ¹ For information on papers' licenses use arXiv's bulk metadata access.

    The code used for generating the data set is publicly available.

    Usage examples for our dataset are provided on GitHub.

    Citing

    This initial version of unarXive is described in the following journal article.

    Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
    [link to an author copy]

    The updated version is described in the following conference paper.

    Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
    [link to an author copy]

  6. Data from: Diversity in citations to a single study: Supplementary data set...

    • data.niaid.nih.gov
    Updated Aug 25, 2021
    Cite
    Leng, Rhodri Ivor (2021). Diversity in citations to a single study: Supplementary data set for citation context network analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5244799
    Dataset authored and provided by
    Leng, Rhodri Ivor
    License

    Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Description

    Introduction

    This document describes the data set used for all analyses in 'Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited' accepted for publication in Quantitative Science Studies [1].

    Data Collection

    The data collection procedure has been fully described [1]. Concisely, the dataset contains bibliometric data collected from the Web of Science Core Collection via the University of Edinburgh’s Library subscription concerning all papers that cited a cohort study, Paul et al. [2], in the period before 1985. This includes a full list of citing papers, and the citations between these papers. Additionally, it includes textual passages (citation contexts) from 343 citing papers, which were manually recovered from the full-text documents accessible via the University of Edinburgh’s Library subscription. These data have been cleaned, converted into network-readable datasets, and coded into particular classifications reflecting content, which are described fully in the supplied code book and within the manuscript [1].

    Data description

    All relevant data can be found in the attached file 'Supplementary_material_Leng_QSS_2021.xlsx', which contains the following five workbooks:

    “Overview” includes a list of the content of the workbooks.

    “Code Book” contains the coding rules and definitions used for the classification of findings and paper titles.

    “Node attribute list” includes a workbook containing all node attributes for the citation network, which includes Paul et al. [2] and its citing papers as of 1984. Highlighted in yellow at the bottom of this workbook are two papers that were discarded due to duplication; remove these if analysing this dataset in a network analysis. The columns refer to:

    Id, the node identifier

    Label, the formal citation of the paper to which data within this row corresponds. Citation is in the following format: last name of first author, year of publication, journal of publication, volume number, start page, and DOI (if available).

    Title, the paper title for the paper in question.

    Publication_year, the year of publication.

    Document_type, the document type (e.g. review, article).

    WoS_ID, the paper’s unique Web of Science accession number.

    Citation_context, a column specifying whether citation context data is available from that paper.

    Explanans, the explanans terms from that paper's title.

    Explanandum, the explanandum terms for that paper.

    Combined_Title_Classification, the combined terms used for fig 2 of the published manuscript.

    Serum_cholesterol_(SC), a column identifying papers that cited the serum cholesterol findings.

    Blood_Pressure_(BP), a column identifying papers that cited the blood pressure findings.

    Coffee_(C), a column identifying papers that cited the coffee findings.

    Diet_(D), a column identifying papers that cited the dietary findings.

    Smoking_(S), a column identifying papers that cited the smoking findings.

    Alcohol_(A), a column identifying papers that cited the alcohol findings.

    Physical_Activity_(PA), a column identifying papers that cited the physical activity findings.

    Body_Fatness (BF), a column identifying papers that cited the body fatness findings.

    Indegree, the number of within network citations to that paper, calculated for the network shown in Fig 4 of the manuscript.

    Outdegree, the number of within network references of that paper as calculated for the network in Fig 4.

    Main_component, a column specifying whether a node is contained in the largest weakly connected component as shown in Fig 4 of the manuscript.

    Cluster, provides the cluster membership number as discussed within the manuscript (Fig 5).

    “Edge list” includes a workbook including the edges for the network. The columns refer to:

    Source, contains the node identifier of the citing paper.

    Target, contains the node identifier of the cited paper.

    “Citation context classification” includes a workbook containing the WoS accession number for the paper analysed, and any finding category discussed in that paper established via context analysis (see the code book for definitions). The columns refer to:

    Id, the node identifier

    Finding_Class, the findings discussed from Paul et al. within the body of the citing paper.

    “Citation context data” includes a workbook containing the WoS accession number for papers in which citation context data was available, the citation context passages, the reference number or format of Paul et al. within the citing paper, and the finding categories discussed in those contexts (see code book for definitions). The columns refer to:

    Id, the node identifier

    Citation_context, the passage copied from the full text of the citing paper containing discussion of the findings of Paul et al.

    Reference_in_citing_article, the reference number or format of Paul et al. within the citing paper.

    Finding_class, the findings discussed from Paul et al. within the body of the citing paper.

    Software recommended for analysis

    For the analyses performed within the manuscript, Gephi version 0.9.2 was used [3], and both the edge and node lists are in a format that is easily read into this software. The Sci2 tool was used to parse data initially [4].
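    Given the sheet names above, a minimal sketch of rebuilding the citation network outside Gephi, assuming pandas, networkx, and an openpyxl backend (drop the two duplicate papers flagged in yellow first):

```python
import pandas as pd
import networkx as nx

book = "Supplementary_material_Leng_QSS_2021.xlsx"
nodes = pd.read_excel(book, sheet_name="Node attribute list")
edges = pd.read_excel(book, sheet_name="Edge list")

g = nx.DiGraph()
g.add_nodes_from(nodes["Id"])
g.add_edges_from(edges[["Source", "Target"]].itertuples(index=False, name=None))

# Indegree/outdegree computed here should match the Indegree/Outdegree columns
# for the Fig 4 network once the two duplicated papers are removed.
print(g.number_of_nodes(), g.number_of_edges())
print(nx.number_weakly_connected_components(g))
```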

    Notes

    [1] Leng, R. I. (Forthcoming). Diversity in citations to a single study: A citation context network analysis of how evidence from a prospective cohort study was cited. Quantitative Science Studies.

    [2] Paul, O., Lepper, M. H., Phelan, W. H., Dupertuis, G. W., Macmillan, A., McKean, H., et al. (1963). A longitudinal study of coronary heart disease. Circulation, 28, 20-31. https://doi.org/10.1161/01.cir.28.1.20

    [3] Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media.

    [4] Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. https://sci2.cns.iu.edu

  7. Cline Center Coup d’État Project Dataset

    • databank.illinois.edu
    Updated May 11, 2025
    Cite
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto (2025). Cline Center Coup d’État Project Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9651987_V7
    Authors
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto
    License

    Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Description

    Coups d'État are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. There are only a limited number of datasets available to study these events (Powell and Thyne 2011, Marshall and Marshall 2019). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d’État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.

    Version 2.1.3 adds 19 additional coup events to the dataset, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 as a conspiracy. Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022. Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of “dissident coup” had been dropped in error for coup_id: 00201062021; version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup. Version 2.1.0 added 36 cases to the dataset and removed two cases from the v2.0.0 data; this update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event.

    Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:
    • Reconciling missing event data
    • Removing events with irreconcilable event dates
    • Removing events with insufficient sourcing (each event needs at least two sources)
    • Removing events that were inaccurately coded as coup events
    • Removing variables that fell below the threshold of inter-coder reliability required by the project
    • Removing the spreadsheet ‘CoupInventory.xls’ because of inadequate attribution and citations in the event summaries
    • Extending the period covered from 1945-2005 to 1945-2019
    • Adding events from Powell and Thyne’s Coup Data (Powell and Thyne, 2011)
    Items in this Dataset
    1. Cline Center Coup d'État Codebook v.2.1.3 Codebook.pdf - This 15-page document describes the Cline Center Coup d’État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d’état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. Revised February 2024.
    2. Coup Data v2.1.3.csv - This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d’État Project. It contains 29 variables and 1,000 observations. Revised February 2024.
    3. Source Document v2.1.3.pdf - This 325-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. Revised February 2024.
    4. README.md - This file contains useful information for the user about the dataset. It is a text file written in Markdown. Revised February 2024.
    Citation Guidelines
    1. To cite the codebook (or any other documentation associated with the Cline Center Coup d’État Project Dataset), please use the following citation: Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2024. “Cline Center Coup d’État Project Dataset Codebook”. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
    2. To cite data from the Cline Center Coup d’État Project Dataset, please use the following citation (filling in the correct date of access): Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Emilio Soto. 2024. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
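    A minimal loading sketch for the CSV; only coup_id is named in the description above, so the outcome column in the commented line is a hypothetical stand-in for whatever name the codebook specifies:

```python
import pandas as pd

coups = pd.read_csv("Coup Data v2.1.3.csv")
print(coups.shape)                 # expected: (1000, 29)
print(coups["coup_id"].is_unique)  # coup_id keys the Source Document entries

# Hypothetical: tabulate realized / unrealized / conspiracy outcomes under
# whatever name the codebook gives that variable.
# print(coups["event_type"].value_counts())
```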

  8. Computer Forensic Reference Data Set Portal

    • catalog.data.gov
    Updated Feb 23, 2023
    Cite
    National Institute of Standards and Technology (2023). Computer Forensic Reference Data Set Portal [Dataset]. https://catalog.data.gov/dataset/computer-forensic-reference-data-set-portal-ce60e
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This portal is a gateway to documented digital forensic image datasets. These datasets can assist in a variety of tasks, including tool testing, developing familiarity with tool behavior for given tasks, general practitioner training, and other unforeseen uses that the user of the datasets can devise. Most datasets have a description of the type and locations of significant artifacts present in the dataset. There are descriptions and finding aids to help you locate datasets by the year produced, by author, or by attributes of the dataset.

  9. unarXive Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Tarek Saier; Michael Färber, unarXive Dataset [Dataset]. https://paperswithcode.com/dataset/unarxive
    Authors
    Tarek Saier; Michael Färber
    Description

    A scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata.

    The unarXive data set contains

    • One million papers in plain text
    • 63 million citation contexts
    • 39 million reference strings
    • A citation network of 16 million connections

    The data is generated from all LaTeX sources on arXiv from 1991–2020/07 and is therefore of higher quality than data generated from PDF files. Furthermore, as all citing papers are available in full text, citation contexts of arbitrary size can be extracted.

    Typical uses of the data set are approaches in

    • Citation recommendation
    • Citation context analysis
    • Reference string parsing

    The code for generating the data set is publicly available.

  10. Supplementary material to manuscript: Analyzing data citation practices to...

    • figshare.com
    xlsx
    Updated Jan 19, 2016
    Cite
    Nicolas Robinson-garcia; Evaristo Jiménez Contreras; Daniel Torres-Salinas (2016). Supplementary material to manuscript: Analyzing data citation practices to the Data Citation Index [Dataset]. http://doi.org/10.6084/m9.figshare.1250031.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nicolas Robinson-garcia; Evaristo Jiménez Contreras; Daniel Torres-Salinas
    License

    Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Description

    Supplementary material to an analysis of data citation practices based on the Data Citation Index (DCI) from Thomson Reuters. This database, launched in 2012, aims to link datasets and data studies with citations received from the rest of Thomson Reuters' citation indexes. Funding bodies and research organizations increasingly demand that researchers make their scientific data available in a reusable and reproducible manner, aiming to maximize the allocation of funding while providing transparency on the scientific process. The DCI harvests citations to research data from papers indexed in the Web of Knowledge. It relies on the information provided by the data repository, as data citation practices are inconsistent or nonexistent in many cases. The findings of this study show that data citation practices are far from common in most research fields. Some differences have been reported in the way researchers cite data: while in the areas of Science and Engineering & Technology datasets were the most cited, in Social Sciences and Arts & Humanities data studies play a greater role. 88.1% of the records have received no citation, but some repositories show very low uncitedness rates. While data citation practices are rare in most fields, they have expanded in disciplines such as Crystallography or Genomics. We conclude by emphasizing the role the DCI may play in encouraging consistent and standardized citation of research data, which would allow its use for following the research process developed by researchers, from data collection to publication.

  11. Science Citation Knowledge Extractor - Dataset - CyVerse Data Commons

    • ckan.cyverse.rocks
    Updated Jun 23, 2024
    Cite
    ckan.cyverse.rocks (2024). Science Citation Knowledge Extractor - Dataset - CyVerse Data Commons [Dataset]. https://ckan.cyverse.rocks/dataset/science-citation-knowledge-extractor
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    SCKE is an open source tool that helps biomedical researchers understand how their work is being used by others, by analyzing the content in papers that cite them. This tool uses natural language processing and machine learning to extract the prominent themes and concepts discussed in the citing documents. By seeing what kinds of topics are discussed by citing articles, researchers can better understand how their work influences their peers and various disciplines in science. Additionally, SCKE allows biomedical researchers to explore other statistics about the publications that cite them, such as where citations are published (Journals), the distribution of keywords (Keywords), the similarity of papers to each other (Clustering), the similarity of papers to other famous works (TextCompare), and general statistics about the citations (Statistics). The data presented here are supplementary data required for anyone wishing to set up SCKE on their own system.

  12. A dataset from a survey investigating disciplinary differences in data...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, txt
    Updated Jul 12, 2024
    Cite
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein (2024). A dataset from a survey investigating disciplinary differences in data citation [Dataset]. http://doi.org/10.5281/zenodo.7853477
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein
    License

    Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Description

    GENERAL INFORMATION

    Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation

    Date of data collection: January to March 2022

    Collection instrument: SurveyMonkey

    Funding: Alfred P. Sloan Foundation


    SHARING/ACCESS INFORMATION

    Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license

    Links to publications that cite or use the data:

    Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437

    Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Zenodo. https://doi.org/10.5281/zenodo.7555266


    DATA & FILE OVERVIEW

    File List

    • Filename: MDCDatacitationReuse2021Codebookv2.pdf
      Codebook
    • Filename: MDCDataCitationReuse2021surveydatav2.csv
      Dataset format in csv
    • Filename: MDCDataCitationReuse2021surveydatav2.sav
      Dataset format in SPSS
    • Filename: MDCDataCitationReuseSurvey2021QNR.pdf
      Questionnaire

    Additional related data collected that was not included in the current data package: open-ended questions asked to respondents.


    METHODOLOGICAL INFORMATION

    Description of methods used for collection/generation of data:

    The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.

    The survey received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).

    Methods for processing the data:

    Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.

    Instrument- or software-specific information needed to interpret the data:

    The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format in CSV. The codebook is required to interpret the values.


    DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata

    Number of variables: 95

    Number of cases/rows: 2,492

    Missing data codes: 999 Not asked

    Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
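    A minimal loading sketch, assuming pyreadstat for the SPSS file and recoding the 999 "Not asked" code to missing, per the codebook note above:

```python
import pyreadstat

df, meta = pyreadstat.read_sav("MDCDataCitationReuse2021surveydatav2.sav")
df = df.replace(999, float("nan"))   # 999 = "Not asked" per the codebook
print(df.shape)                      # expected: (2492, 95)
print(meta.column_labels[:5])        # variable labels, to read with the codebook
```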

  13. Marine biological observation data from coastal and offshore surveys around...

    • gbif.org
    • compendiumkustenzee.be
    • +1 more
    Updated Jun 30, 2025
    Cite
    Kevin Mackay (2025). Marine biological observation data from coastal and offshore surveys around New Zealand [Dataset]. http://doi.org/10.15468/pzpgop
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    The National Institute of Water and Atmospheric Research (NIWA)
    Authors
    Kevin Mackay
    License

    Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Time period covered
    Jun 23, 1873 - May 30, 2014
    Description

    Occurrence details of New Zealand marine fauna and flora from around the coastline and offshore. Information is assimilated from a variety of sources including unpublished data sets and digitised from journal articles. Where the source is from a published paper the source paper citation is listed at the record level.

  14. Reference count CSV dataset of all bibliographic resources in OpenCitations...

    • figshare.com
    zip
    Updated Dec 11, 2023
    Cite
    OpenCitations ​ (2023). Reference count CSV dataset of all bibliographic resources in OpenCitations Index [Dataset]. http://doi.org/10.6084/m9.figshare.24747498.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download), from November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of outgoing references.

  15. Methodology data of "A qualitative and quantitative citation analysis toward...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 2, 2024
    Cite
    Peroni, Silvio (2024). Methodology data of "A qualitative and quantitative citation analysis toward retracted articles: a case of study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4024337
    Dataset provided by
    Heibi, Ivan
    Peroni, Silvio
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This document contains the datasets and visualizations generated after the application of the methodology defined in our work "A qualitative and quantitative citation analysis toward retracted articles: a case of study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step (i.e. "Data gathering") builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step (i.e. "Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.

    Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.

    Data gathering

    The data generated by this step are stored in "data/":

    "cits_features.csv": a dataset containing all the entities (rows in the CSV) which have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), the title ("title"), the venue identifier ("source_id"), the title of the venue ("source_title"), yes/no value in case the entity is retracted as well ("retracted"), the subject area ("area"), the subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value to denote whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention"). Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).

    "cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citations context ("intext_citation.context") for each citing entity identified using the DOI value ("doi"). Note: the data keep their original license (the one provided by their publisher). This dataset is provided in order to favor the reproducibility of the results obtained in our work.

    Topic modeling

    We ran a topic modeling analysis on the textual features gathered (i.e. abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling was done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two different folders: "abstracts/" for the abstracts, and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files:

    "mitao_workflows/": the workflows of MITAO. These are JSON files that could be reloaded in MITAO to reproduce the results following the same workflows.

    "corpus_and_dictionary/": it contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

    "coherence/coherence.csv": the coherence score of several topic models trained on a number of topics from 1 - 40.

    "datasets_and_views/": the datasets and visualizations generated using MITAO.

    References

    [1] Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0

    [2] Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw

    [3] Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3
    
  16. National Software Reference Library (NSRL) Reference Data Set (RDS) - NIST...

    • datasets.ai
    • datadiscoverystudio.org
    • +4more
    Updated Aug 6, 2024
    Cite
    National Institute of Standards and Technology (2024). National Software Reference Library (NSRL) Reference Data Set (RDS) - NIST Special Database 28 [Dataset]. https://datasets.ai/datasets/national-software-reference-library-nsrl-reference-data-set-rds-nist-special-database-28-72db0
    Dataset authored and provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The National Software Reference Library (NSRL) collects software from various sources and incorporates file profiles computed from this software into a Reference Data Set (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This alleviates much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations. The RDS is a collection of digital signatures of known, traceable software applications. There are application hash values in the hash set which may be considered malicious, e.g. steganography tools and hacking scripts. There are no hash values of illicit data, such as child abuse images.
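    In practice the RDS serves as a known-file filter: hash each file under review and set aside those whose profile is already in the RDS. A minimal sketch, assuming the classic NSRLFile.txt text distribution with a quoted "SHA-1" column (verify against the release you use):

```python
import csv, hashlib, pathlib

def sha1(path: pathlib.Path) -> str:
    # Hash in 1 MiB blocks so large evidence files do not fill memory.
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest().upper()

# Build the known-file set; "SHA-1" column name assumed from the RDS text format.
with open("NSRLFile.txt", newline="", encoding="latin-1") as fh:
    known = {row["SHA-1"] for row in csv.DictReader(fh)}

# Flag files that are NOT known software and so may merit examination.
for f in pathlib.Path("evidence/").rglob("*"):
    if f.is_file() and sha1(f) not in known:
        print("not in RDS (worth a look):", f)
```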

  17. Data from: Citation network of the knowledge co-production literature....

    • data.niaid.nih.gov
    Updated Dec 8, 2021
    Cite
    Megan Arthur (2021). Citation network of the knowledge co-production literature. Supplementary data. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5762450
    Dataset provided by
    Rhodri Ivor Leng
    Megan Arthur
    Justyna Bandola-Gill
    License

    Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Description

    Data description

    This data note describes the final citation network dataset analysed in the manuscript "What is co-production? Conceptualising and understanding co-production of knowledge and policy across different theoretical perspectives", accepted for publication in Quantitative Science Studies [1].

    The data collection strategy used to construct the following dataset can be found in the associated manuscript [1]. These data were originally downloaded from the Web of Science (WoS) Core Collection via the library subscription of the University of Edinburgh via a systematic search methodology that sought to capture literature relevant to ‘knowledge co-production’. The dataset consists of 1,893 unique document reference strings (nodes) interlinked together by 9,759 citation links (edges). The network dataset describes a directed citation network composed of papers relevant to 'knowledge co-production', and is split into two files: (i) ‘KnowCo_node_attribute_list.csv’ contains attributes of the 1,893 documents (nodes); and (ii) ‘KnowCo_edge_list.csv’ records the citation links (edges) between pairs of documents.

    1. ‘KnowCo_node_attribute_list.csv’ consists of attributes of the 1,893 nodes (documents) of the citation network. Due to the approach used to collect data, there are two types of node: (i) 525 nodes represent documents retrieved from WoS via the systematic search strategy, and these have full attribute data including their reference lists; and (ii) 1,368 documents that were cited >2 times by our 525 fully retrieved papers (see manuscript for full description [1]). The columns refer to:

    Id, the unique identifier. Fully retrieved documents are identified via a unique identifier that begins with ‘f’ followed by an integer (e.g. f1, f2, etc.). Non-retrieved documents are identified via a unique identifier beginning with ‘n’ followed by an integer (e.g. n1, n2, etc.).

    Label, contains the unique reference string of the document for which the attribute data in that row corresponds. Reference strings contain the last name of the first author, publication year, journal, volume, start page, and DOI (if available).

    authors, all author names. These are in the order that these names appear in the authorship list of the corresponding document. These data are only available for fully retrieved documents.

    title, document title. These data are only available for fully retrieved documents.

    journal, journal of publication. These data are only available for fully retrieved documents. For those interested in journal data for the remaining papers, this can be extracted from the reference string in the ‘Label’ column.

    year, year of publication. These data are available for all nodes.

    type, document type (e.g. article, review). Available only for fully retrieved documents.

    wos_total_citations, total citation count as recorded by Web of Science Core Collection as of May 2020. Available only for fully retrieved documents.

    wos_id, Web of Science accession number. Available for fully retrieved documents only; for non-retrieved documents, ‘CitedReference’ fills the cell.

    cluster, provides the cluster membership number as discussed within the manuscript, established via modularity maximisation via the Leiden algorithm (Res 0.8; Q=0.53|5 clusters). Available for all nodes.

    indegree, total count of within network citations to a given document. Due to the composition of the network, this figure tells us the total number of citations from 525 fully retrieved documents to each of the 1,893 documents within the network. Available for all nodes.

    outdegree, total count of within network references from a given document. Due to the composition of the network, only fully retrieved documents can have a value >0 because only these documents have their associated reference list data. Available for all nodes.

    2. ‘KnowCo_edge_list.csv’ is an edge list containing 9,759 citation links between the 1,893 documents. The columns refer to:

    Source, the citing document’s unique identifier.

    Target, the cited document’s unique identifier.
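    A minimal sketch of rebuilding the network from the two files with pandas and networkx, with the counts above as a sanity check:

```python
import pandas as pd
import networkx as nx

nodes = pd.read_csv("KnowCo_node_attribute_list.csv")
edges = pd.read_csv("KnowCo_edge_list.csv")

g = nx.DiGraph()
g.add_nodes_from(nodes["Id"])
g.add_edges_from(edges[["Source", "Target"]].itertuples(index=False, name=None))

print(g.number_of_nodes())   # expected: 1893
print(g.number_of_edges())   # expected: 9759

# Only the 525 fully retrieved documents (ids starting with 'f') carry
# reference lists, so only they can have outdegree > 0.
print(sum(1 for _, d in g.out_degree() if d > 0))
```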

    Notes

    [1] Bandola-Gill, J., Arthur, M., & Leng, R. I. (Under review). What is co-production? Conceptualising and understanding co-production of knowledge and policy across different theoretical perspectives. Evidence & Policy

  18. Sample CVs Dataset for Analysis

    • kaggle.com
    Updated Aug 19, 2024
    Cite
    lone (2024). Sample CVs Dataset for Analysis [Dataset]. https://www.kaggle.com/datasets/hussnainmushtaq/sample-cvs-dataset-for-analysis
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    lone
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)

    Description

    This dataset contains a small collection of 6 randomly selected CVs (Curricula Vitae), representing various professional backgrounds. The dataset is intended to serve as a resource for research in fields such as Human Resources (HR), data analysis, natural language processing (NLP), and machine learning. It can be used for tasks like resume parsing, skill extraction, job matching, and analyzing trends in professional qualifications and experiences.

    Potential Use Cases: This dataset can be used for various research and development purposes, including but not limited to:

    • Resume Parsing: Developing algorithms to automatically extract and categorize information from resumes.
    • Skill Extraction: Identifying key skills and competencies from text data within the CVs.
    • Job Matching: Creating models to match candidates to job descriptions based on their qualifications and experience.
    • NLP Research: Analyzing language patterns, sentence structure, and terminology used in professional resumes.
    • HR Analytics: Studying trends in career paths, education, and skill development across different professions.
    • Training Data for Machine Learning Models: Using the dataset as a sample for training and testing machine learning models in HR-related applications.

    Dataset Format: The dataset is available in a compressed file (ZIP) containing the 6 CVs in both PDF and DOCX formats. This allows for flexibility in how the data is processed and analyzed.

    Licensing: This dataset is shared under the CC BY-NC-SA 4.0 license. This means that you are free to:

    • Share: Copy and redistribute the material in any medium or format.
    • Adapt: Remix, transform, and build upon the material.

    Under the following terms:

    • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
    • NonCommercial: You may not use the material for commercial purposes.
    • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

    Citation: If you use this dataset in your research or projects, please cite it as follows:

    "Sample CVs Dataset for Analysis, Mushtaq et al., Kaggle, 2024."

    Limitations and Considerations:

    • Sample Size: The dataset contains only 6 CVs, which is a very small sample size. It is intended for educational and prototyping purposes rather than large-scale analysis.
    • Anonymization: Personal details such as names, contact information, and specific locations may be anonymized or altered to protect privacy.
    • Bias: The dataset is not representative of the entire population and may contain biases related to profession, education, and experience.

    This dataset is a useful starting point for developing models or conducting small-scale experiments in HR-related fields. However, users should be aware of its limitations and consider supplementing it with additional data for more robust analysis.
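    A minimal first-pass parsing sketch over the DOCX files, assuming python-docx and an unpacked cvs/ folder; the skill list is an illustrative stand-in:

```python
from pathlib import Path
import docx  # pip install python-docx

SKILLS = {"python", "sql", "machine learning", "excel"}  # hypothetical skill list

for path in Path("cvs/").glob("*.docx"):
    # Concatenate all paragraph text, lowercased, for simple keyword matching.
    text = "\n".join(p.text for p in docx.Document(path).paragraphs).lower()
    found = {s for s in SKILLS if s in text}
    print(path.name, "->", sorted(found))
```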

  19. INDIGO Change Detection Reference Dataset

    • researchdata.tuwien.at
    jpeg, png, zip
    Updated Jun 25, 2024
    Cite
    Benjamin Wild; Geert Verhoeven; Rafał Muszyński; Norbert Pfeifer (2024). INDIGO Change Detection Reference Dataset [Dataset]. http://doi.org/10.48436/ayj4e-v4864
    Dataset provided by
    TU Wien
    Authors
    Benjamin Wild; Geert Verhoeven; Rafał Muszyński; Norbert Pfeifer
    Description

    The INDIGO Change Detection Reference Dataset

    Description

    This graffiti-centred change detection dataset was developed in the context of INDIGO, a research project focusing on the documentation, analysis and dissemination of graffiti along Vienna's Donaukanal. The dataset aims to support the development and assessment of change detection algorithms.

    The dataset was collected from a test site approximately 50 meters in length along Vienna's Donaukanal during 11 days between 2022/10/21 and 2022/12/01. Various cameras with different settings were used, resulting in a total of 29 data collection sessions or "epochs" (see "EpochIDs.jpg" for details). Each epoch contains 17 images generated from 29 distinct 3D models with different textures. In total, the dataset comprises 6,902 unique image pairs, along with corresponding reference change maps. Additionally, exclusion masks are provided to ignore parts of the scene that might be irrelevant, such as the background.

    To summarise, the dataset, labelled as "Data.zip," includes the following:

    • Synthetic Images: These are colour images created within Agisoft Metashape Professional 1.8.4, generated by rendering views from 17 artificial cameras observing 29 differently textured versions of the same 3D surface model.
    • Change Maps: Binary images that were manually and programmatically generated, using a Python script, from two synthetic graffiti images. These maps highlight the areas where changes have occurred.
    • Exclusion Masks: Binary images, manually created from synthetic graffiti images, that identify "no data" areas or irrelevant ground pixels.

    Image Acquisition

    Image acquisition involved the use of two different camera setups. The first two datasets (ID 1 and 2; cf. "EpochIDs.jpg") were obtained using a Nikon Z 7II camera with a pixel count of 45.4 MP, paired with a Nikon NIKKOR Z 20 mm lens. For the remaining image datasets (ID 3-29), a triple GoPro setup was employed. This triple setup featured three GoPro cameras, comprising two GoPro HERO 10 cameras and one GoPro HERO 11, all securely mounted within a frame. This triple-camera setup was utilised on nine different days with varying camera settings, resulting in the acquisition of 27 image datasets in total (nine days with three datasets each).

    Data Structure

    The "Data.zip" file contains two subfolders:

    • 1_ImagesAndChangeMaps: This folder contains the primary dataset. Each subfolder corresponds to a specific epoch. Within each epoch folder resides a subfolder for every other epoch with which a distinct epoch pair can be created. It is important to note that the pairs "Epoch Y and Epoch Z" are equivalent to "Epoch Z and Epoch Y", so the latter combinations are not included in this dataset. Each sub-subfolder, organised by epoch, contains 17 more subfolders, which hold the image data. These subfolders consist of:
      • Two synthetic images rendered from the same synthetic camera ("X_Y.jpg" and "X_Z.jpg")
      • The corresponding binary reference change map depicting the graffiti-related differences between the two images ("X_YZ.png"). Black areas denote new graffiti (i.e. "change"), and white denotes "no change". "DataStructure.png" provides a visual explanation concerning the creation of the dataset.

        The filenames follow the following pattern:
        • X - Is the ID number of the synthetic camera. In total, 17 synthetic cameras were placed along the test site
        • Y - Corresponds to the reference epoch (i.e. the "older epoch")
        • Z - Corresponds to the "new epoch"
    • 2_ExclusionMasks: This folder contains the binary exclusion masks. They were manually created from synthetic graffiti images and identify "no data" areas or areas considered irrelevant, such as "ground pixels". Two exclusion masks were generated for each of the 17 synthetic cameras:
      • "groundMasks": depict ground pixels which are usually irrelevant for the detection of graffiti
      • "noDataMasks": depict "background" for which no data is available.

    A detailed dataset description (including detailed explanations of the data creation) is part of a journal paper currently in preparation. The paper will be linked here for further clarification as soon as it is available.
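    Until that paper appears, a plausible evaluation sketch: score a predicted change map against a reference map over valid pixels only. The paths follow the layout and X_YZ.png pattern described above but are placeholders, as is the assumed mask polarity:

```python
import numpy as np
from PIL import Image

def load_binary(path, thresh=128):
    # True where the pixel is dark; in the change maps, black = "change".
    return np.asarray(Image.open(path).convert("L")) < thresh

# Placeholder paths: camera 3, reference epoch 1, new epoch 2.
ref  = load_binary("Data/1_ImagesAndChangeMaps/1/2/3/3_12.png")
pred = load_binary("my_prediction.png")                         # your algorithm's output
mask = load_binary("Data/2_ExclusionMasks/noDataMasks/3.png")   # polarity assumed: True = excluded

valid = ~mask
inter = (ref & pred & valid).sum()
union = ((ref | pred) & valid).sum()
print("IoU over valid pixels:", inter / union if union else float("nan"))
```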

    Licensing

    Due to the nature of the three image types, this dataset comes with two licenses:

    Every synthetic image, change map and mask has this licensing information embedded as IPTC photo metadata. In addition, the images' IPTC metadata also provide a short image description, the image creator and the creator's identity (in the form of an ORCiD).


    If there are any questions, problems or suggestions for the dataset or the description, please do not hesitate to contact the corresponding author, Benjamin Wild.

  20. PASTA Data

    • kaggle.com
    Updated Dec 10, 2024
    Cite
    Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google Research
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)

    Description

    This dataset contains the human rater trajectories used in the paper "Preference Adaptive and Sequential Text-to-Image Generation".

    We use human raters to gather sequential user preference data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

    At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" instead of "a temple"). At each turn, raters then select the column of images preferred most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

    See our paper for a comprehensive description of the rater study.

    Citation

    Please cite our paper if you use it in your work.
