100+ datasets found
  1. POCI CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Dec 27, 2022
    Cite
    OpenCitations ​ (2022). POCI CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.21776351.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. Each line of the CSV file defines a citation and includes the following fields:

    • oci: the Open Citation Identifier (OCI) for the citation
    • citing: the PMID of the citing entity
    • cited: the PMID of the cited entity
    • creation: the creation date of the citation (i.e. the publication date of the citing entity)
    • timespan: the time span of the citation (i.e. the interval between the publication date of the cited entity and that of the citing entity)
    • journal_sc: whether the citation is a journal self-citation (i.e. the citing and cited entities are published in the same journal)
    • author_sc: whether the citation is an author self-citation (i.e. the citing and cited entities have at least one author in common)

    This version of the dataset contains:

    717,654,703 citations; 26,024,862 bibliographic resources.

    The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available on the official webpage.
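    Given the 50 GB unzipped size, the CSV is best processed as a stream rather than loaded whole. A minimal sketch: the file name is hypothetical, and the assumption that journal_sc/author_sc hold "yes"/"no" values should be checked against the actual dump.

```python
import csv

# Hypothetical file name; the CSV inside the zipped archive may be named differently.
PATH = "poci_citations.csv"

def summarize(path, limit=None):
    """Stream the CSV and tally self-citations without loading it into memory."""
    totals = {"citations": 0, "journal_sc": 0, "author_sc": 0}
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            totals["citations"] += 1
            # Assumed encoding of the boolean fields; verify against the data.
            if row.get("journal_sc") == "yes":
                totals["journal_sc"] += 1
            if row.get("author_sc") == "yes":
                totals["author_sc"] += 1
    return totals
```

    The limit parameter makes it easy to sanity-check a dump of this size on the first few thousand rows before a full pass.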

  2. Massive Scholarly Dataset: 5M Papers 36M Citations

    • kaggle.com
    zip
    Updated Mar 16, 2024
    Cite
    Agung Pambudi (2024). Massive Scholarly Dataset: 5M Papers 36M Citations [Dataset]. https://www.kaggle.com/datasets/agungpambudi/research-citation-network-5m-papers
    Available download formats: zip (7308773733 bytes)
    Authors
    Agung Pambudi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Explore a rich research dataset with 5.2M papers and 36.6M citations! Unleash your data science skills for clustering, influence analysis, topic modeling, and more. Dive into the world of research networks.

    • id (string): paper ID, e.g. 53e997ddb7602d9701fd3ad7
    • title (string): paper title, e.g. Rewrite-Based Satisfiability Procedures for Recursive Data Structures
    • authors.name (string): author name, e.g. Maria Paola Bonacina
    • author.org (string): author affiliation, e.g. Dipartimento di Informatica
    • author.id (string): author ID, e.g. 53f47275dabfaee43ed25965
    • venue.raw (string): paper venue name, e.g. Electronic Notes in Theoretical Computer Science (ENTCS)
    • year (int): published year, e.g. 2007
    • keywords (list of strings): keywords, e.g. ["theorem-proving strategy", "rewrite-based approach", ...]
    • fos.name (string): paper fields of study, e.g. Data structure
    • fos.w (float): fields of study weight, e.g. 0.48341
    • references (list of strings): paper references, e.g. ["53e9a31fb7602d9702c2c61e", "53e997f1b7602d9701fef4d1", ...]
    • n_citation (int): citation number, e.g. 19
    • page_start (string): page start, e.g. 55
    • page_end (string): page end, e.g. 70
    • doc_type (string): paper type (journal, conference), e.g. Journal
    • lang (string): detected language, e.g. en
    • volume (string): volume, e.g. 174
    • issue (string): issue, e.g. 8
    • issn (string): issn, e.g. Electronic Notes in Theoretical Computer Science
    • isbn (string): isbn
    • doi (string): doi, e.g. 10.1016/j.entcs.2006.11.039
    • url (list): external links, e.g. [https: ...]
    • abstract (string): abstract, e.g. Our ability to generate ...
    • indexed_abstract (dict): indexed abstract, e.g. {"IndexLength": 116, "InvertedIndex": {"data": [49], ...}}
    • v12_id (int): v12 paper id, e.g. 2027211529
    • v12_authors.name (string): v12 author name, e.g. Maria Paola Bonacina
    • v12_authors.org (string): v12 author affiliation, e.g. Dipartimento di Informatica, Università degli Studi di Verona, Italy
    • v12_authors.id (int): v12 author ID, e.g. 669130765
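    The indexed_abstract field stores the abstract as an inverted index (token to positions) rather than plain text. Assuming the {"IndexLength": ..., "InvertedIndex": {...}} layout shown in the field list, a short sketch to rebuild the readable abstract:

```python
def rebuild_abstract(indexed_abstract):
    """Rebuild plain abstract text from an inverted index of the assumed form
    {"IndexLength": N, "InvertedIndex": {token: [positions, ...]}}."""
    words = [""] * indexed_abstract["IndexLength"]
    for token, positions in indexed_abstract["InvertedIndex"].items():
        for pos in positions:
            words[pos] = token
    return " ".join(w for w in words if w)

# Invented toy example, not a record from the dataset.
example = {"IndexLength": 4,
           "InvertedIndex": {"Our": [0], "ability": [1], "to": [2], "generate": [3]}}
```

    Here rebuild_abstract(example) yields "Our ability to generate"; positions index into a word array sized by IndexLength.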
  3. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    U.S. National Institutes of Health (NIH)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)
    ----------------------------------------------------------------------
    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

    ## Introduction

    This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes; v4.reviewer contains models for analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics directly. However, we also provide a similar open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of them, please make sure you cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. See NLM's data Terms and Conditions for information on obtaining PubMed/MEDLINE. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
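    Since the training data ships with its header in a separate file (Training_data_2002_2005_pmc_pair_txt.header.txt), a reader has to join the two. A minimal sketch, assuming the header file's first line holds tab-separated column names; the actual column semantics are documented in COLUMNS_DESC.txt.

```python
import csv

# File names are from the dataset listing above.
HEADER_FILE = "Training_data_2002_2005_pmc_pair_txt.header.txt"
DATA_FILE = "Training_data_2002_2005_pmc_pair_First.txt"

def read_rows(data_path, header_path):
    """Yield each tab-separated data row as a dict keyed by the shared header.

    Assumes the first line of the header file is a tab-separated list of
    column names matching the data files."""
    with open(header_path, encoding="utf-8") as hf:
        columns = hf.readline().rstrip("\n").split("\t")
    with open(data_path, newline="", encoding="utf-8") as df:
        for row in csv.reader(df, delimiter="\t"):
            yield dict(zip(columns, row))
```

    Streaming with a generator keeps memory flat over the 1.2G files.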

  4. GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 9, 2019
    Cite
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel (2019). GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data] [Dataset]. http://doi.org/10.7910/DVN/LXQXAO
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Harvard Dataverse
    Authors
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LXQXAO

    Description

    Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT, a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the Citation Style Language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
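    A consumer of GIANT would parse each XML-labeled reference string into field/value pairs for training. The tag names below are hypothetical, not taken from the dataset's actual schema; the sketch only illustrates the general approach with Python's ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical labeled reference string; GIANT's real tag vocabulary may differ.
labeled = (
    '<sequence><author>M. Grennan</author>, <title>Deep citation parsing</title>, '
    '<year>2019</year>.</sequence>'
)

def extract_fields(xml_string):
    """Map each labeled segment's tag to its text content."""
    root = ET.fromstring(xml_string)
    return {child.tag: child.text for child in root}
```

    Punctuation between labeled segments survives as element tails, so a sequence-labeling pipeline would additionally tokenize those spans as unlabeled separators.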

  5. Reference count CSV dataset of all bibliographic resources in OpenCitations...

    • figshare.com
    zip
    Updated Dec 11, 2023
    Cite
    OpenCitations ​ (2023). Reference count CSV dataset of all bibliographic resources in OpenCitations Index [Dataset]. http://doi.org/10.6084/m9.figshare.24747498.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download) of November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of references.

  6. Data Citation Corpus Data File

    • zenodo.org
    zip
    Updated Oct 14, 2024
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773
    Dataset provided by
    DataCite (https://www.datacite.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.

    The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication

    The data file includes the following fields:

    • id (required): Internal identifier for the citation
    • created (required): Date of the item's incorporation into the corpus
    • updated (required): Date of the item's most recent update in the corpus
    • repository (optional): Repository where the cited data is stored
    • publisher (optional): Publisher of the article citing the data
    • journal (optional): Journal of the article citing the data
    • title (optional): Title of the cited data
    • publication (required): DOI of the article where the data is cited
    • dataset (required): DOI or accession number of the cited data
    • publishedDate (optional): Date when the citing article was published
    • source (required): Source where the citation was harvested
    • subjects (optional): Subject information for the cited data
    • affiliations (optional): Affiliation information for the creator of the cited data
    • funders (optional): Funding information for the cited data
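    The required/optional split in the field list suggests a simple validation pass when ingesting the JSON batches. A sketch with invented example values (only the field names come from the documentation above):

```python
# Required fields per the corpus field list; all other fields are optional.
REQUIRED = {"id", "created", "updated", "publication", "dataset", "source"}

def missing_required(record):
    """Return the required corpus fields absent (or null) in a citation record."""
    return sorted(REQUIRED - {k for k, v in record.items() if v is not None})

# Invented example record, not taken from the corpus.
record = {
    "id": "c0ffee00-0000-0000-0000-000000000000",
    "created": "2024-08-23",
    "updated": "2024-08-23",
    "publication": "10.1234/example-article",
    "dataset": "10.5678/example-dataset",
    "source": "datacite",
    "journal": None,
}
```

    Here missing_required(record) returns an empty list; optional fields such as journal may be null without failing the check.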

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

    The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    • Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024

    Remove citation records deemed out of scope for the corpus:

    • 273,567 records from DataCite Event Data with non-citation relationship types

    • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

    • 44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication

    • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

    • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)

    Metadata enhancements:

    • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

    • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

    Data structure updates to improve usability and eliminate redundancies:

    • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

    • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

    • Remove relationTypeId fields as these are specific to Event Data only

    Full details of the above changes, including the scripts used to perform them, are available on GitHub.

    While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


    Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

  7. Data from: Standards Incorporated by Reference (SIBR) Database

    • catalog.data.gov
    • data.nist.gov
    • +2more
    Updated Sep 30, 2023
    Cite
    National Institute of Standards and Technology (2023). Standards Incorporated by Reference (SIBR) Database [Dataset]. https://catalog.data.gov/dataset/standards-incorporated-by-reference-sibr-database
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a searchable historical collection of standards referenced in regulations: voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).

  8. UIEB Dataset-reference

    • kaggle.com
    zip
    Updated Aug 3, 2023
    Cite
    kaggle6 (2023). UIEB Dataset-reference [Dataset]. https://www.kaggle.com/datasets/larjeck/uieb-dataset-reference
    Available download formats: zip (823157904 bytes)
    Authors
    kaggle6
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    950 algorithmically restored underwater images, used for image enhancement, image generation, etc.

  9. Time Reference for Management Information

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Sep 19, 2025
    Cite
    Social Security Administration (2025). Time Reference for Management Information [Dataset]. https://catalog.data.gov/dataset/time-reference-for-management-information
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The application maintains a series of tables containing data for fiscal and processing years, located on the mainframe and data warehouse servers.

  10. dataset-8670

    • kaggle.com
    zip
    Updated Aug 24, 2022
    Cite
    e0650443145 (2022). dataset-8670 [Dataset]. https://www.kaggle.com/datasets/e0650443145/dataset-8670
    Available download formats: zip (2137977761 bytes)
    Authors
    e0650443145
    Description

    The dataset includes sensory data from the following sensors:

    • Triaxial acceleration force (in m/s^2) from the mobile phone accelerometer (30 Hz) and E4 accelerometer (32 Hz)
    • Triaxial rate of rotation (in rad/s) and degrees of rotation (in Degrees) from the mobile phone gyroscope (30 Hz)
    • Triaxial geomagnetic field strength (in μT) from the mobile phone magnetometer (30 Hz)
    • Latitude and longitude, and horizontal accuracy 1) (in meters) from the mobile phone GPS (every 5 seconds)
    • Blood volume pulse (BVP, in nanowatts) from the E4 photoplethysmography (PPG) sensor (64 Hz)
    • Electrodermal activity (skin conductance in μS) from E4 EDA sensor (4 Hz)
    • Average heart rate 2) values (in bpm) computed over a 10-second span based on the BVP analysis from the E4 (1 Hz)
    • Peripheral skin temperature (in Celsius degrees) from E4 infrared thermopile (4 Hz)

    1) The estimated horizontal accuracy is defined as the radius of 68% confidence according to the API. Reference: https://developer.android.com/reference/android/location/Location#getAccuracy()
    2) HR values are not derived from a real-time reading but are created after the data collection session. Reference: https://support.empatica.com/hc/en-us/articles/360029469772-E4-data-HR-csv-explanation

    The dataset includes files in a structure shown below:

    +----- USER-ID
    |  +----- timestamp (DAY 1)
    |  |  +----- e4Acc
    |  |  |  timestamp (e4-accelerometer-data).csv
    |  |  |  ...
    |  |  +----- e4Bvp
    |  |  |  timestamp (e4-blood-volume-pressure-data).csv
    |  |  |  ...
    |  |  +----- e4Eda
    |  |  |  timestamp (e4-electrodermal-activity-data).csv
    |  |  |  ...
    |  |  +----- e4Hr
    |  |  |  timestamp (e4-heart-rate-data).csv
    |  |  |  ...
    |  |  +----- e4Temp
    |  |  |  timestamp (e4-skin-temperature-data).csv
    |  |  |  ...
    |  |  +----- mAcc
    |  |  |  timestamp (mobile-accelerometer-data).csv
    |  |  |  ...
    |  |  +----- mGps
    |  |  |  timestamp (mobile-gps-data).csv
    |  |  |  ...
    |  |  +----- mGyr
    |  |  |  timestamp (mobile-gyroscope-data).csv
    |  |  |  ...
    |  |  +----- mMag
    |  |  |  timestamp (mobile-magnetometer-data).csv
    |  |  |  ...
    |  |  timestamp-label.csv
    |  +----- timestamp (DAY 2)
    |  |  +----- ...

    Directories (named by timestamps) located under the USER-ID directory indicate when the user started the experiment each day. Each day directory contains subdirectories named for the corresponding sensors, which include data files generated every minute. Each data file records raw sensor values at the designated sampling interval, with timestamps represented in second.millisecond format.
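    The layout above can be indexed with a straightforward directory walk. A minimal sketch, assuming exactly the USER-ID/day-timestamp/sensor hierarchy shown:

```python
import os
from collections import defaultdict

def index_sensor_files(root):
    """Map (user_id, day_timestamp, sensor) to the list of per-minute CSV paths,
    following the USER-ID/day-timestamp/sensor layout described above."""
    index = defaultdict(list)
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        for day in sorted(os.listdir(user_dir)):
            day_dir = os.path.join(user_dir, day)
            if not os.path.isdir(day_dir):
                continue
            for sensor in sorted(os.listdir(day_dir)):
                sensor_dir = os.path.join(day_dir, sensor)
                if not os.path.isdir(sensor_dir):
                    continue  # skips the per-day timestamp-label.csv
                for name in sorted(os.listdir(sensor_dir)):
                    if name.endswith(".csv"):
                        index[(user, day, sensor)].append(os.path.join(sensor_dir, name))
    return index
```

    Sorting the per-minute file names lexically works here because they are numeric timestamps of equal width within a session.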

    User label files are composed of 12 columns representing the physical, emotional, and contextual states as follows:

    • ts: timestamp
    • action: sleep, personal_care, work, study, household, care_housemem (caregiving), recreation_media, entertainment, outdoor_act (sports), hobby, recreation_etc (free time), shop, communitiy_interaction (regular activity), travel (includes commute), meal (includes snack), socialising
    • actionOption: Details of the selected action. See the description below.
    • actionSub: meal-amount when action=meal or snack, move-method when action=travel
    • actionSubOption: 1 (light), 2 (moderate), 3 (heavy) when actionSub=meal_amount, 1 (walk), 2 (driving), 3 (taxi, passenger), 4 (personal mobility), 5 (bus), 6 (train, subway), 7 (others) when actionSub=move-method
    • condition: ALONE, WITH-ONE, WITH-MANY
    • conditionSub1Option: 1 (with families), 2 (with friends), 3 (with colleagues), 4 (acquaintances), 5 (others)
    • conditionSub2Option: 1 (passive in conversation), 2 (moderate participation in conversation), 3 (active in conversation)
    • place: home, workplace, restaurant, outdoor, other-indoor
    • emotionPositive : (negative) 1-2-3-4-5-6-7 (positive)
    • emotionTension: (relaxed) 1-2-3-4-5-6-7 (aroused)
    • activity 3): 0 (IN-VEHICLE), 1 (ON-BICYCLE), 2 (ON-FOOT), 3 (STILL), 4 (UNKNOWN), 5 (TILTING), 7 (WALKING), 8 (RUNNING)

    3) Values in the activity column represent the detected activity of the mobile device using Google's Awareness API. Reference: https://developers.google.com/android/reference/com/google/android/gms/location/DetectedActivity?hl=en

    Descriptions for the actionOption field are as follows:

    111 Sleep
    112 Sleepless
    121 Meal
    122 Snack
    131 Medical services, treatments, sick rest
    132 Personal hygiene (bath)
    133 Appearance management (makeup, change of clothes)
    134 Beauty-related services
    211 Main job
    212 Side job
    213 Rest during work
    22 Job search
    311 School class / seminar (listening)
    312 Break between classes
    313 School homework, self-study (individual)
    314 Team project (in groups)
    321 Private tutoring (offline)
    322 Online courses
    41 Preparing food and washing dishes
    42 Laundry and ironing
    43 Housing management and cleaning
    44 Vehicle management
    45 Pet and plant caring
    46 Purchasing goods and services (grocery/take-out)
    51 Caring for children under 10 who live together
    52 Caring for elementary, middle, and high school students over 10 who ...
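    The numeric codes in actionSubOption only make sense together with the actionSub kind. A decoding sketch for the two code tables listed in the label-column description; the exact CSV layout of timestamp-label.csv is assumed, and only these two code tables are mapped.

```python
# Code tables transcribed from the actionSubOption description above.
MEAL_AMOUNT = {1: "light", 2: "moderate", 3: "heavy"}
MOVE_METHOD = {1: "walk", 2: "driving", 3: "taxi, passenger", 4: "personal mobility",
               5: "bus", 6: "train, subway", 7: "others"}

def decode_action_sub(row):
    """Translate a row's numeric actionSubOption code using its actionSub kind.

    Assumes row is a dict with string values, e.g. one produced by csv.DictReader
    over timestamp-label.csv."""
    code = int(row["actionSubOption"])
    if row["actionSub"] == "meal_amount":
        return MEAL_AMOUNT.get(code)
    if row["actionSub"] == "move-method":
        return MOVE_METHOD.get(code)
    return None
```

    Note the listing itself spells the meal kind both "meal-amount" and "meal_amount"; the sketch follows the spelling used in the actionSubOption condition and should be checked against the actual files.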

  11. Data From: TERRA-REF, An Open Reference Data Set From High Resolution Genomics, Phenomics, and Imaging Sensors

    • datasetcatalog.nlm.nih.gov
    • nde-dev.biothings.io
    • +4more
    Updated Feb 15, 2024
    Cite
    Burnette, Maxwell A.; Pauli, Duke; French, Andrew N.; Garnett, Roman; Maimaitijiang, Maitiniyazi; White, Jeffrey W.; Rohde, Gareth S; Newcomb, Maria; Rooney, William L.; Thorp, Kelly; Fahlgren, Noah; Lebauer, David; Pless, Robert; Paheding, Sidike; Ozersky, Philip; Willis, Craig; Kooper, Rob; Sagan, Vasit; Morris, Geoffrey; Ward, Richard; Demieville, Jeffrey; Li, Zongyang; Stylianou, Abby; Ottman, Michael J.; Shakoor, Nadia; Zender, Charles S.; Flinn, Barry; Riemer, Kristina (2024). Data From: TERRA-REF, An Open Reference Data Set From High Resolution Genomics, Phenomics, and Imaging Sensors [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001456820
    Description

    The ARPA-E funded TERRA-REF project generated open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot level phenotypes from these datasets. Data were collected at the University of Arizona Maricopa Agricultural Center in Maricopa, Arizona. This site hosts a large field scanner with fifteen sensors, many of which are capable of capturing mm-scale images and point clouds at daily to weekly intervals. These data are intended to be reused and are accessible as a combination of files and databases linked by spatial, temporal, and genomic information. In addition to providing open access data, the entire computational pipeline is open source, and we enable users to access high-performance computing environments. The study has evaluated a sorghum diversity panel, biparental cross populations, and elite lines and hybrids from structured sorghum breeding populations. In addition, a durum wheat diversity panel was grown and evaluated over three winter seasons. The initial release includes derived data from two seasons in which the sorghum diversity panel was evaluated. Future releases will include data from additional seasons and locations. The TERRA-REF reference dataset can be used to characterize phenotype-to-genotype associations, on a genomic scale, that will enable knowledge-driven breeding and the development of higher-yielding cultivars of sorghum and wheat. The data is also being used to develop new algorithms for machine learning, image analysis, genomics, and optical sensor engineering. 
    Resources in this dataset: Resource Title: Link to dataset at Datadryad.org. File Name: Web Page, url: https://datadryad.org/stash/dataset/doi:10.5061/dryad.4b8gtht99

  12. Datasets for Information Reference

    • figshare.com
    xlsx
    Updated Oct 20, 2021
    Cite
    Tongyang Zhang (2021). Datasets for Information Reference [Dataset]. http://doi.org/10.6084/m9.figshare.16834840.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tongyang Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We use journal articles published in Scientometrics between 2016 and 2020 as the data source. By analysing the dataset usage records of scientometric research, we rank each dataset by how frequently it is used for information reference, providing guidance for the selection of datasets in scientometric research.

  13. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    Available download formats: zip (4,293,465,577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    This beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people with infoboxes on English Wikipedia; it is output as JSON files (compressed in a tar.gz archive).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy included fields:
    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
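
The fields above can be read without unpacking the full 21.28 GB archive to disk. A minimal Python sketch, assuming one JSON record per line (JSON Lines) inside each archive member; the exact member layout and field names are taken from the description above and should be verified against the Data Dictionary:

```python
import json
import tarfile

# Field names taken from the "Noteworthy Included Fields" list above;
# treat them as assumptions until checked against the Data Dictionary.
WANTED = ("name", "identifier", "description", "abstract")

def summarize(record):
    """Keep only the lightweight fields of one article record."""
    return {key: record.get(key) for key in WANTED}

def iter_people(path):
    """Stream article records out of wme_people_infobox.tar.gz,
    assuming JSON Lines files inside the gzipped tar archive."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            handle = tar.extractfile(member)
            if handle is None:  # skip directory entries
                continue
            for line in handle:
                yield summarize(json.loads(line))
```

Streaming via tarfile keeps memory use flat regardless of archive size, which matters at this scale.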

    Stats

    Infoboxes:
    • Compressed: 2 GB
    • Uncompressed: 11 GB

    Infoboxes + sections + short description:
    • Compressed: 4.12 GB
    • Uncompressed: 21.28 GB

    Article analysis and filtering breakdown:
    • Total articles analyzed: 6,940,949
    • People found with QID: 1,778,226
    • People found with Category: 158,996
    • People found with Biography Project: 76,150
    • Total people articles found: 2,013,372
    • Total people articles with infoboxes: 1,559,985

    Of the 1,559,985 people articles in this dataset:
    • 1,416,701 have a short description
    • 1,559,985 have an infobox
    • 1,559,921 have article sections

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information it contains may be out of date. The dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  14. Data from: Global hydrological dataset of daily streamflow data from the...

    • catalogue.ceh.ac.uk
    • hosted-metadata.bgs.ac.uk
    • +3more
    zip
    Updated May 28, 2024
    + more versions
    Cite
    S. Turner; J. Hannaford; L.J. Barker; G. Suman; R. Armitage; A. Killeen; A. Griffin; H. Davies; A. Kumar; H. Dixon; M.T.D. Albuquerque; N. Almeida Ribeiro; C. Alvarez-Garreton; E. Amoussou; B. Arheimer; Y. Asano; T. Berezowski; A. Bodian; H. Boutaghane; R. Capell; H. Dakhaoui; J. Daňhelka; H.X. Do; C. Ekkawatpanit; E.M. El Khalki; A.K. Fleig; R. Fonseca; J.D. Giraldo-Osorio; A.B.T. Goula; M. Hanel; G Hodgkins; S. Horton; C. Kan; D.G. Kingston; G. Laaha; R. Laugesen; W. Lopes; S. Mager; Y. Markonis; L. Mediero; G. Midgley; C. Murphy; P. O'Connor; A.I. Pedersen; H.T. Pham; M. Piniewski; M. Rachdane; B. Renard; M.E. Saidi; P. Schmocker-Facker; K. Stahl; M. Thyler; M. Toucher; Y. Tramblay; J. Uusikivi; N. Venegas-Cordero; S. Vissesri; A. Watson; S. Westra; P.H. Whitfield (2024). Global hydrological dataset of daily streamflow data from the Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN), 1863 - 2022 [Dataset]. http://doi.org/10.5285/3b077711-f183-42f1-bac6-c892922c81f4
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2024
    Dataset provided by
    NERC EDS Environmental Information Data Centre
    Authors
    S. Turner; J. Hannaford; L.J. Barker; G. Suman; R. Armitage; A. Killeen; A. Griffin; H. Davies; A. Kumar; H. Dixon; M.T.D. Albuquerque; N. Almeida Ribeiro; C. Alvarez-Garreton; E. Amoussou; B. Arheimer; Y. Asano; T. Berezowski; A. Bodian; H. Boutaghane; R. Capell; H. Dakhaoui; J. Daňhelka; H.X. Do; C. Ekkawatpanit; E.M. El Khalki; A.K. Fleig; R. Fonseca; J.D. Giraldo-Osorio; A.B.T. Goula; M. Hanel; G Hodgkins; S. Horton; C. Kan; D.G. Kingston; G. Laaha; R. Laugesen; W. Lopes; S. Mager; Y. Markonis; L. Mediero; G. Midgley; C. Murphy; P. O'Connor; A.I. Pedersen; H.T. Pham; M. Piniewski; M. Rachdane; B. Renard; M.E. Saidi; P. Schmocker-Facker; K. Stahl; M. Thyler; M. Toucher; Y. Tramblay; J. Uusikivi; N. Venegas-Cordero; S. Vissesri; A. Watson; S. Westra; P.H. Whitfield
    License

    https://eidc.ac.uk/licences/ogl/plain

    Time period covered
    Jan 1, 1863 - Dec 31, 2022
    Area covered
    Earth
    Dataset funded by
    Natural Environment Research Council (https://www.ukri.org/councils/nerc)
    Description

    The Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN) dataset is a global hydrological dataset containing publicly available daily flow data for 2,386 gauging stations across the globe with natural or near-natural catchments. Metadata is also provided for the Full ROBIN Dataset of 3,060 gauging stations. Data were quality controlled by the central ROBIN team before being added to the dataset, and two levels of data quality are applied to guide users towards appropriate data usage. Most records span at least 40 years with minimal missing data; records start in the late 19th century for some sites and run through to 2022. ROBIN represents a significant advance in global-scale, accessible streamflow data. The project was funded by the UK Natural Environment Research Council Global Partnership Seedcorn Fund (NE/W004038/1) and the NC-International programme (NE/X006247/1), delivering National Capability.

  15. Citation Networks (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Citation Networks (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-cit
    Explore at:
    Available download formats: zip (95,620,457 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-energy physics citation network

    Dataset information

    Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
    e-print arXiv and covers all the citations within a dataset of 34,546 papers
    with 421,578 edges. If a paper i cites paper j, the graph contains a directed
    edge from i to j. If a paper cites, or is cited by, a paper outside the
    dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus
    represents essentially the complete history of its HEP-PH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 34546
    Edges 421578
    Nodes in largest WCC 34401 (0.996)
    Edges in largest WCC 421485 (1.000)
    Nodes in largest SCC 12711 (0.368)
    Edges in largest SCC 139981 (0.332)
    Average clustering coefficient 0.2962
    Number of triangles 1276868
    Fraction of closed triangles 0.1457
    Diameter (longest shortest path) 12
    90-percentile effective diameter 5

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
    Explorations 5(2): 149-151, 2003.

    Files
    File Description
    cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
    cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)
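
The edge list described above can be loaded with a few lines of Python. A sketch, assuming the standard SNAP plain-text format of '#'-prefixed comment lines followed by whitespace-separated FromNodeId/ToNodeId pairs:

```python
def parse_edges(lines):
    """Parse a SNAP edge list: '#' lines are comments; each data line
    is a 'FromNodeId ToNodeId' pair (citing paper -> cited paper)."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = map(int, line.split())
        edges.append((src, dst))
    return edges

def basic_stats(edges):
    """Node and edge counts, comparable to the statistics table above."""
    nodes = {n for edge in edges for n in edge}
    return len(nodes), len(edges)

# To load the real file:
#   import gzip
#   with gzip.open("cit-HepPh.txt.gz", "rt") as f:
#       edges = parse_edges(f)
```

On the full HEP-PH file, basic_stats should recover the 34,546 nodes and 421,578 edges reported in the statistics table.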

    High-energy physics theory citation network

    Dataset information

    Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
    arXiv and covers all the citations within a dataset of 27,770 papers with
    352,807 edges. If a paper i cites paper j, the graph contains a directed edge
    from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 27770
    Edges 352807
    Nodes in largest WCC 27400 (0.987) ...

  16. Per-Citation-Dataset

    • huggingface.co
    Updated Jul 23, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Salunkhe (2025). Per-Citation-Dataset [Dataset]. https://huggingface.co/datasets/Mithilss/Per-Citation-Dataset
    Explore at:
    Dataset updated
    Jul 23, 2025
    Authors
    Salunkhe
    Description

    Mithilss/Per-Citation-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Computer Forensic Reference Data Set Portal

    • data.nist.gov
    Updated May 2, 2022
    + more versions
    Cite
    Richard Ayers (2022). Computer Forensic Reference Data Set Portal [Dataset]. http://doi.org/10.18434/mds2-2635
    Explore at:
    Dataset updated
    May 2, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Richard Ayers
    License

    https://www.nist.gov/open/license

    Description

    This portal is your gateway to documented digital forensic image datasets. These datasets can assist in a variety of tasks, including tool testing, developing familiarity with tool behavior for given tasks, general practitioner training, and other unforeseen uses that the user of the datasets can devise. Most datasets have a description of the type and locations of significant artifacts present in the dataset. There are descriptions and finding aids to help you locate datasets by the year produced, by author, or by attributes of the dataset.

  18. NIST SAMATE Software Assurance Reference Dataset

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Sep 30, 2025
    Cite
    National Institute of Standards and Technology (2025). NIST SAMATE Software Assurance Reference Dataset [Dataset]. https://catalog.data.gov/dataset/nist-samate-software-assurance-reference-dataset
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset provides the NIST Software Assurance Metrics And Tool Evaluation (SAMATE) Software Assurance Reference Dataset (SARD): a set of programs with known security flaws that allows end users to evaluate tools and tool developers to test their methods.

  19. A circa 2010 global land cover reference dataset from commercial high...

    • data.usgs.gov
    • s.cnmilf.com
    • +1more
    + more versions
    Cite
    Bruce Pengra; Jordan Long; Devendra Dahal; Steve Stehman; Thomas Loveland, A circa 2010 global land cover reference dataset from commercial high resolution satellite data [Dataset]. http://doi.org/10.5066/P96FKANW
    Explore at:
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Bruce Pengra; Jordan Long; Devendra Dahal; Steve Stehman; Thomas Loveland
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    May 19, 2002 - May 29, 2014
    Description

    The data are 475 thematic land cover rasters at 2 m resolution. Land cover was classified into the classes Tree (1), Water (2), Barren (3), Other Vegetation (4), and Ice & Snow (8). Cloud cover and shadow were sometimes coded as Cloud (5) and Shadow (6); however, for any land cover application these would be considered NoData. Some rasters may have Cloud and Shadow pixels coded or recoded to NoData already. Commercial high-resolution satellite data were used to create the classifications. Usable image data for the target year (2010) were acquired for 475 of the 500 primary sample locations, with 90% of images acquired within ±2 years of the 2010 target. The remaining 25 of the 500 sample blocks had no usable data and so could not be mapped. Tabular data is included with the raster classifications indicating the specific high-resolution sensor and date of acquisition for source imagery, as well as the stratum to which each sample block belonged. Methods for this classifi ...

  20. MIEDT dataset

    • kaggle.com
    Updated Jan 12, 2025
    Cite
    机关鸢鸟 (2025). MIEDT dataset [Dataset]. https://www.kaggle.com/datasets/lidang78/miedt-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    机关鸢鸟
    Description
      1. Dataset Overview
      This dataset is organized around the edge detection task, aiming to provide rich image resources and corresponding edge detection annotations for related research and applications; it can be used to test edge detection algorithms. To evaluate the performance of edge detection methods comprehensively, we created the Medical Image Edge Detection Test (MIEDT) dataset. MIEDT contains 100 medical images randomly selected from three publicly available datasets: Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000.
      2. Dataset Structure
      Original image: this folder stores the original image data. It contains 15 Head CT images in PNG format with varying resolutions; 25 coronary heart disease images in JPG format at a resolution of 1024 × 1024; and 60 skin images in JPG format at a resolution of 600 × 450. It covers a variety of medical image materials with different imaging and contrast, providing diverse input data for edge detection algorithms.
      Ground truth: this folder contains the edge detection annotation images corresponding to the images in the "Original image" folder, in PNG format. White pixels represent the edges in the image and black pixels represent the non-edge areas. These annotations accurately outline the object contours and edge features in the original images.
      3. Usage Instructions
      Users who process images in Python can read the image data with the cv2 (OpenCV) library. Sample code:

    import cv2
    original_image = cv2.imread('Original image/IMG-001.png')  # Read original image
    ground_truth_image = cv2.imread('Ground truth/GT-001.png', cv2.IMREAD_GRAYSCALE)  # Read the corresponding Ground Truth image

    When training models with deep learning frameworks (such as TensorFlow or PyTorch), configure the dataset path in the framework's dataset loading class according to its data loading mechanism, so that the model correctly reads and processes the images and their annotations.
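
Once an algorithm has produced an edge map, it can be scored against the ground-truth images, whose white pixels mark edges. A minimal pixel-level F1 sketch using NumPy; the threshold of 128 and strict per-pixel matching are simplifying assumptions, since edge detection benchmarks often allow a small spatial tolerance when matching edge pixels:

```python
import numpy as np

def edge_f1(pred, gt, threshold=128):
    """Pixel-level F1 between a predicted edge map and a ground-truth
    image; both are grayscale arrays where values >= threshold count
    as edge pixels (ground truth uses white for edges)."""
    p = np.asarray(pred) >= threshold
    g = np.asarray(gt) >= threshold
    tp = np.logical_and(p, g).sum()          # true-positive edge pixels
    precision = tp / p.sum() if p.sum() else 0.0
    recall = tp / g.sum() if g.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, edge_f1(detected_edges, ground_truth_image) returns 1.0 for a perfect match and 0.0 when no edge pixels agree.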

      4. Data Sources and References
      Data sources: the original images are collected from the public image datasets Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000, to ensure the quality and diversity of the images. If you use this dataset in academic research, please cite the following literature.

    References: [1] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; https://arxiv.org/abs/1902.03368

    [2] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).

    [3] Classification of Brain Hemorrhage Using Deep Learning from CT Scan Images - https://link.springer.com/chapter/10.1007/978-981-19-7528-8_15

