100+ datasets found
  1. POCI CSV dataset of all the citation data

    • figshare.com
    zip
    Updated Dec 27, 2022
    Cite
    OpenCitations ​ (2022). POCI CSV dataset of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.21776351.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. Each line of the CSV file defines a citation and includes the following fields:

    • oci: the Open Citation Identifier (OCI) for the citation
    • citing: the PMID of the citing entity
    • cited: the PMID of the cited entity
    • creation: the creation date of the citation (i.e. the publication date of the citing entity)
    • timespan: the time span of the citation (i.e. the interval between the publication date of the cited entity and that of the citing entity)
    • journal_sc: whether the citation is a journal self-citation (i.e. the citing and cited entities are published in the same journal)
    • author_sc: whether the citation is an author self-citation (i.e. the citing and cited entities have at least one author in common)

    This version of the dataset contains:

    717,654,703 citations; 26,024,862 bibliographic resources.

    The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available on the official webpage.
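    Given the 50 GB unzipped size, the CSV is best processed as a stream rather than loaded whole. A minimal sketch: the file name is hypothetical, and the assumption that journal_sc/author_sc hold "yes"/"no" values should be checked against the actual dump.

```python
import csv

# Hypothetical file name; the CSV inside the zipped archive may be named differently.
PATH = "poci_citations.csv"

def summarize(path, limit=None):
    """Stream the CSV and tally self-citations without loading it into memory."""
    totals = {"citations": 0, "journal_sc": 0, "author_sc": 0}
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            totals["citations"] += 1
            # Assumed encoding of the boolean fields; verify against the data.
            if row.get("journal_sc") == "yes":
                totals["journal_sc"] += 1
            if row.get("author_sc") == "yes":
                totals["author_sc"] += 1
    return totals
```

    The limit parameter makes it easy to sanity-check a dump of this size on the first few thousand rows before a full pass.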

  2. Massive Scholarly Dataset: 5M Papers 36M Citations

    • kaggle.com
    zip
    Updated Mar 16, 2024
    Cite
    Agung Pambudi (2024). Massive Scholarly Dataset: 5M Papers 36M Citations [Dataset]. https://www.kaggle.com/datasets/agungpambudi/research-citation-network-5m-papers
    Available download formats: zip (7308773733 bytes)
    Authors
    Agung Pambudi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Explore a rich research dataset with 5.2M papers and 36.6M citations! Unleash your data science skills for clustering, influence analysis, topic modeling, and more. Dive into the world of research networks.

    • id (string): paper ID, e.g. 53e997ddb7602d9701fd3ad7
    • title (string): paper title, e.g. Rewrite-Based Satisfiability Procedures for Recursive Data Structures
    • authors.name (string): author name, e.g. Maria Paola Bonacina
    • author.org (string): author affiliation, e.g. Dipartimento di Informatica
    • author.id (string): author ID, e.g. 53f47275dabfaee43ed25965
    • venue.raw (string): paper venue name, e.g. Electronic Notes in Theoretical Computer Science (ENTCS)
    • year (int): published year, e.g. 2007
    • keywords (list of strings): keywords, e.g. ["theorem-proving strategy", "rewrite-based approach", ...]
    • fos.name (string): paper fields of study, e.g. Data structure
    • fos.w (float): fields of study weight, e.g. 0.48341
    • references (list of strings): paper references, e.g. ["53e9a31fb7602d9702c2c61e", "53e997f1b7602d9701fef4d1", ...]
    • n_citation (int): citation number, e.g. 19
    • page_start (string): page start, e.g. 55
    • page_end (string): page end, e.g. 70
    • doc_type (string): paper type (journal, conference), e.g. Journal
    • lang (string): detected language, e.g. en
    • volume (string): volume, e.g. 174
    • issue (string): issue, e.g. 8
    • issn (string): issn, e.g. Electronic Notes in Theoretical Computer Science
    • isbn (string): isbn
    • doi (string): doi, e.g. 10.1016/j.entcs.2006.11.039
    • url (list): external links, e.g. [https: ...]
    • abstract (string): abstract, e.g. Our ability to generate ...
    • indexed_abstract (dict): indexed abstract, e.g. {"IndexLength": 116, "InvertedIndex": {"data": [49], ...}}
    • v12_id (int): v12 paper id, e.g. 2027211529
    • v12_authors.name (string): v12 author name, e.g. Maria Paola Bonacina
    • v12_authors.org (string): v12 author affiliation, e.g. Dipartimento di Informatica, Università degli Studi di Verona, Italy
    • v12_authors.id (int): v12 author ID, e.g. 669130765
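    The indexed_abstract field stores the abstract as an inverted index (token to positions) rather than plain text. Assuming the {"IndexLength": ..., "InvertedIndex": {...}} layout shown in the field list, a short sketch to rebuild the readable abstract:

```python
def rebuild_abstract(indexed_abstract):
    """Rebuild plain abstract text from an inverted index of the assumed form
    {"IndexLength": N, "InvertedIndex": {token: [positions, ...]}}."""
    words = [""] * indexed_abstract["IndexLength"]
    for token, positions in indexed_abstract["InvertedIndex"].items():
        for pos in positions:
            words[pos] = token
    return " ".join(w for w in words if w)

# Invented toy example, not a record from the dataset.
example = {"IndexLength": 4,
           "InvertedIndex": {"Our": [0], "ability": [1], "to": [2], "generate": [3]}}
```

    Here rebuild_abstract(example) yields "Our ability to generate"; positions index into a word array sized by IndexLength.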
  3. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Dataset funded by
    U.S. National Science Foundation (NSF)
    U.S. National Institutes of Health (NIH)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)
    ----------------------------------------------------------------------
    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

    ## Introduction

    This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes; v4.reviewer contains models for analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data from Clarivate Analytics directly. However, we also provide a similar open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of them, please make sure you cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. See NLM's data Terms and Conditions for information on obtaining PubMed/MEDLINE. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
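    Since the training data ships with its header in a separate file (Training_data_2002_2005_pmc_pair_txt.header.txt), a reader has to join the two. A minimal sketch, assuming the header file's first line holds tab-separated column names; the actual column semantics are documented in COLUMNS_DESC.txt.

```python
import csv

# File names are from the dataset listing above.
HEADER_FILE = "Training_data_2002_2005_pmc_pair_txt.header.txt"
DATA_FILE = "Training_data_2002_2005_pmc_pair_First.txt"

def read_rows(data_path, header_path):
    """Yield each tab-separated data row as a dict keyed by the shared header.

    Assumes the first line of the header file is a tab-separated list of
    column names matching the data files."""
    with open(header_path, encoding="utf-8") as hf:
        columns = hf.readline().rstrip("\n").split("\t")
    with open(data_path, newline="", encoding="utf-8") as df:
        for row in csv.reader(df, delimiter="\t"):
            yield dict(zip(columns, row))
```

    Streaming with a generator keeps memory flat over the 1.2G files.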

  4. GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 9, 2019
    Cite
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel (2019). GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data] [Dataset]. http://doi.org/10.7910/DVN/LXQXAO
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Harvard Dataverse
    Authors
    Mark Grennan; Martin Schibel; Andrew Collins; Joeran Beel
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LXQXAO

    Description

    Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT, a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the Citation Style Language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will be able to significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
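    A consumer of GIANT would parse each XML-labeled reference string into field/value pairs for training. The tag names below are hypothetical, not taken from the dataset's actual schema; the sketch only illustrates the general approach with Python's ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical labeled reference string; GIANT's real tag vocabulary may differ.
labeled = (
    '<sequence><author>M. Grennan</author>, <title>Deep citation parsing</title>, '
    '<year>2019</year>.</sequence>'
)

def extract_fields(xml_string):
    """Map each labeled segment's tag to its text content."""
    root = ET.fromstring(xml_string)
    return {child.tag: child.text for child in root}
```

    Punctuation between labeled segments survives as element tails, so a sequence-labeling pipeline would additionally tokenize those spans as unlabeled separators.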

  5. Reference count CSV dataset of all bibliographic resources in OpenCitations...

    • figshare.com
    zip
    Updated Dec 11, 2023
    Cite
    OpenCitations ​ (2023). Reference count CSV dataset of all bibliographic resources in OpenCitations Index [Dataset]. http://doi.org/10.6084/m9.figshare.24747498.v1
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations ​
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download) of November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of references.

  6. Data Citation Corpus Data File

    • zenodo.org
    zip
    Updated Oct 14, 2024
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773
    Dataset provided by
    DataCite (https://www.datacite.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.

    The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication

    The data file includes the following fields:

    • id (required): Internal identifier for the citation
    • created (required): Date of the item's incorporation into the corpus
    • updated (required): Date of the item's most recent update in the corpus
    • repository (optional): Repository where the cited data is stored
    • publisher (optional): Publisher of the article citing the data
    • journal (optional): Journal of the article citing the data
    • title (optional): Title of the cited data
    • publication (required): DOI of the article where the data is cited
    • dataset (required): DOI or accession number of the cited data
    • publishedDate (optional): Date when the citing article was published
    • source (required): Source where the citation was harvested
    • subjects (optional): Subject information for the cited data
    • affiliations (optional): Affiliation information for the creator of the cited data
    • funders (optional): Funding information for the cited data
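    The required/optional split in the field list suggests a simple validation pass when ingesting the JSON batches. A sketch with invented example values (only the field names come from the documentation above):

```python
# Required fields per the corpus field list; all other fields are optional.
REQUIRED = {"id", "created", "updated", "publication", "dataset", "source"}

def missing_required(record):
    """Return the required corpus fields absent (or null) in a citation record."""
    return sorted(REQUIRED - {k for k, v in record.items() if v is not None})

# Invented example record, not taken from the corpus.
record = {
    "id": "c0ffee00-0000-0000-0000-000000000000",
    "created": "2024-08-23",
    "updated": "2024-08-23",
    "publication": "10.1234/example-article",
    "dataset": "10.5678/example-dataset",
    "source": "datacite",
    "journal": None,
}
```

    Here missing_required(record) returns an empty list; optional fields such as journal may be null without failing the check.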

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

    The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    • Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024

    Remove citation records deemed out of scope for the corpus:

    • 273,567 records from DataCite Event Data with non-citation relationship types

    • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

    • 44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication

    • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

    • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)

    Metadata enhancements:

    • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

    • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

    Data structure updates to improve usability and eliminate redundancies:

    • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

    • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

    • Remove relationTypeId fields as these are specific to Event Data only

    Full details of the above changes, including the scripts used to perform them, are available on GitHub.

    While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


    Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

  7. Data from: Standards Incorporated by Reference (SIBR) Database

    • catalog.data.gov
    • data.nist.gov
    • +2more
    Updated Sep 30, 2023
    Cite
    National Institute of Standards and Technology (2023). Standards Incorporated by Reference (SIBR) Database [Dataset]. https://catalog.data.gov/dataset/standards-incorporated-by-reference-sibr-database
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is a searchable historical collection of standards referenced in regulations: voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).

  8. UIEB Dataset-reference

    • kaggle.com
    zip
    Updated Aug 3, 2023
    Cite
    kaggle6 (2023). UIEB Dataset-reference [Dataset]. https://www.kaggle.com/datasets/larjeck/uieb-dataset-reference
    Available download formats: zip (823157904 bytes)
    Authors
    kaggle6
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    950 algorithmically restored underwater images, used for image enhancement, image generation, etc.

  9. Time Reference for Management Information

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Sep 19, 2025
    Cite
    Social Security Administration (2025). Time Reference for Management Information [Dataset]. https://catalog.data.gov/dataset/time-reference-for-management-information
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The application maintains a series of tables containing data for fiscal and processing years, located on the mainframe and data warehouse servers.

  10. dataset-8670

    • kaggle.com
    zip
    Updated Aug 24, 2022
    Cite
    e0650443145 (2022). dataset-8670 [Dataset]. https://www.kaggle.com/datasets/e0650443145/dataset-8670
    Available download formats: zip (2137977761 bytes)
    Authors
    e0650443145
    Description

    The dataset includes sensory data from the following sensors:

    • Triaxial acceleration force (in m/s^2) from the mobile phone accelerometer (30 Hz) and E4 accelerometer (32 Hz)
    • Triaxial rate of rotation (in rad/s) and degrees of rotation (in Degrees) from the mobile phone gyroscope (30 Hz)
    • Triaxial geomagnetic field strength (in μT) from the mobile phone magnetometer (30 Hz)
    • Latitude and longitude, and horizontal accuracy 1) (in meters) from the mobile phone GPS (every 5 seconds)
    • Blood volume pulse (BVP, in nanowatts) from the E4 photoplethysmography (PPG) sensor (64 Hz)
    • Electrodermal activity (skin conductance in μS) from E4 EDA sensor (4 Hz)
    • Average heart rate 2) values (in bpm) computed over a 10-second span based on the BVP analysis from the E4 (1 Hz)
    • Peripheral skin temperature (in Celsius degrees) from E4 infrared thermopile (4 Hz)

    1) The estimated horizontal accuracy is defined as the radius of 68% confidence according to the API. Reference: https://developer.android.com/reference/android/location/Location#getAccuracy()
    2) HR values are not derived from a real-time reading but are created after the data collection session. Reference: https://support.empatica.com/hc/en-us/articles/360029469772-E4-data-HR-csv-explanation

    The dataset includes files in a structure shown below:

    +----- USER-ID
    |  +----- timestamp (DAY 1)
    |  |  +----- e4Acc
    |  |  |  timestamp (e4-accelerometer-data).csv
    |  |  |  ...
    |  |  +----- e4Bvp
    |  |  |  timestamp (e4-blood-volume-pressure-data).csv
    |  |  |  ...
    |  |  +----- e4Eda
    |  |  |  timestamp (e4-electrodermal-activity-data).csv
    |  |  |  ...
    |  |  +----- e4Hr
    |  |  |  timestamp (e4-heart-rate-data).csv
    |  |  |  ...
    |  |  +----- e4Temp
    |  |  |  timestamp (e4-skin-temperature-data).csv
    |  |  |  ...
    |  |  +----- mAcc
    |  |  |  timestamp (mobile-accelerometer-data).csv
    |  |  |  ...
    |  |  +----- mGps
    |  |  |  timestamp (mobile-gps-data).csv
    |  |  |  ...
    |  |  +----- mGyr
    |  |  |  timestamp (mobile-gyroscope-data).csv
    |  |  |  ...
    |  |  +----- mMag
    |  |  |  timestamp (mobile-magnetometer-data).csv
    |  |  |  ...
    |  |  timestamp-label.csv
    |  +----- timestamp (DAY 2)
    |  |  +----- ...

    Directories (named by timestamps) located under the USER-ID directory indicate when the user started the experiment each day. Each day directory contains subdirectories named for the corresponding sensors, which include data files generated every minute. Each data file records raw sensor values at the designated sampling interval, with timestamps represented in second.millisecond format.
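    The layout above can be indexed with a straightforward directory walk. A minimal sketch, assuming exactly the USER-ID/day-timestamp/sensor hierarchy shown:

```python
import os
from collections import defaultdict

def index_sensor_files(root):
    """Map (user_id, day_timestamp, sensor) to the list of per-minute CSV paths,
    following the USER-ID/day-timestamp/sensor layout described above."""
    index = defaultdict(list)
    for user in sorted(os.listdir(root)):
        user_dir = os.path.join(root, user)
        if not os.path.isdir(user_dir):
            continue
        for day in sorted(os.listdir(user_dir)):
            day_dir = os.path.join(user_dir, day)
            if not os.path.isdir(day_dir):
                continue
            for sensor in sorted(os.listdir(day_dir)):
                sensor_dir = os.path.join(day_dir, sensor)
                if not os.path.isdir(sensor_dir):
                    continue  # skips the per-day timestamp-label.csv
                for name in sorted(os.listdir(sensor_dir)):
                    if name.endswith(".csv"):
                        index[(user, day, sensor)].append(os.path.join(sensor_dir, name))
    return index
```

    Sorting the per-minute file names lexically works here because they are numeric timestamps of equal width within a session.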

    User label files are composed of 12 columns representing the physical, emotional, and contextual states as follows:

    • ts: timestamp
    • action: sleep, personal_care, work, study, household, care_housemem (caregiving), recreation_media, entertainment, outdoor_act (sports), hobby, recreation_etc (free time), shop, communitiy_interaction (regular activity), travel (includes commute), meal (includes snack), socialising
    • actionOption: Details of the selected action. See the description below.
    • actionSub: meal-amount when action=meal or snack, move-method when action=travel
    • actionSubOption: 1 (light), 2 (moderate), 3 (heavy) when actionSub=meal_amount, 1 (walk), 2 (driving), 3 (taxi, passenger), 4 (personal mobility), 5 (bus), 6 (train, subway), 7 (others) when actionSub=move-method
    • condition: ALONE, WITH-ONE, WITH-MANY
    • conditionSub1Option: 1 (with families), 2 (with friends), 3 (with colleagues), 4 (acquaintances), 5 (others)
    • conditionSub2Option: 1 (passive in conversation), 2 (moderate participation in conversation), 3 (active in conversation)
    • place: home, workplace, restaurant, outdoor, other-indoor
    • emotionPositive : (negative) 1-2-3-4-5-6-7 (positive)
    • emotionTension: (relaxed) 1-2-3-4-5-6-7 (aroused)
    • activity 3): 0 (IN-VEHICLE), 1 (ON-BICYCLE), 2 (ON-FOOT), 3 (STILL), 4 (UNKNOWN), 5 (TILTING), 7 (WALKING), 8 (RUNNING)

    3) Values in the activity column represent the detected activity of the mobile device using Google's Awareness API. Reference: https://developers.google.com/android/reference/com/google/android/gms/location/DetectedActivity?hl=en

    Descriptions for the actionOption field are as follows:

    111 Sleep
    112 Sleepless
    121 Meal
    122 Snack
    131 Medical services, treatments, sick rest
    132 Personal hygiene (bath)
    133 Appearance management (makeup, change of clothes)
    134 Beauty-related services
    211 Main job
    212 Side job
    213 Rest during work
    22 Job search
    311 School class / seminar (listening)
    312 Break between classes
    313 School homework, self-study (individual)
    314 Team project (in groups)
    321 Private tutoring (offline)
    322 Online courses
    41 Preparing food and washing dishes
    42 Laundry and ironing
    43 Housing management and cleaning
    44 Vehicle management
    45 Pet and plant caring
    46 Purchasing goods and services (grocery/take-out)
    51 Caring for children under 10 who live together
    52 Caring for elementary, middle, and high school students over 10 who ...
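    The numeric codes in actionSubOption only make sense together with the actionSub kind. A decoding sketch for the two code tables listed in the label-column description; the exact CSV layout of timestamp-label.csv is assumed, and only these two code tables are mapped.

```python
# Code tables transcribed from the actionSubOption description above.
MEAL_AMOUNT = {1: "light", 2: "moderate", 3: "heavy"}
MOVE_METHOD = {1: "walk", 2: "driving", 3: "taxi, passenger", 4: "personal mobility",
               5: "bus", 6: "train, subway", 7: "others"}

def decode_action_sub(row):
    """Translate a row's numeric actionSubOption code using its actionSub kind.

    Assumes row is a dict with string values, e.g. one produced by csv.DictReader
    over timestamp-label.csv."""
    code = int(row["actionSubOption"])
    if row["actionSub"] == "meal_amount":
        return MEAL_AMOUNT.get(code)
    if row["actionSub"] == "move-method":
        return MOVE_METHOD.get(code)
    return None
```

    Note the listing itself spells the meal kind both "meal-amount" and "meal_amount"; the sketch follows the spelling used in the actionSubOption condition and should be checked against the actual files.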

  11. Data From: TERRA-REF, An Open Reference Data Set From High Resolution Genomics, Phenomics, and Imaging Sensors

    • datasetcatalog.nlm.nih.gov
    • nde-dev.biothings.io
    • +4more
    Updated Feb 15, 2024
    Cite
    Burnette, Maxwell A.; Pauli, Duke; French, Andrew N.; Garnett, Roman; Maimaitijiang, Maitiniyazi; White, Jeffrey W.; Rohde, Gareth S; Newcomb, Maria; Rooney, William L.; Thorp, Kelly; Fahlgren, Noah; Lebauer, David; Pless, Robert; Paheding, Sidike; Ozersky, Philip; Willis, Craig; Kooper, Rob; Sagan, Vasit; Morris, Geoffrey; Ward, Richard; Demieville, Jeffrey; Li, Zongyang; Stylianou, Abby; Ottman, Michael J.; Shakoor, Nadia; Zender, Charles S.; Flinn, Barry; Riemer, Kristina (2024). Data From: TERRA-REF, An Open Reference Data Set From High Resolution Genomics, Phenomics, and Imaging Sensors [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001456820
    Description

    The ARPA-E funded TERRA-REF project generated open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot level phenotypes from these datasets. Data were collected at the University of Arizona Maricopa Agricultural Center in Maricopa, Arizona. This site hosts a large field scanner with fifteen sensors, many of which are capable of capturing mm-scale images and point clouds at daily to weekly intervals. These data are intended to be reused and are accessible as a combination of files and databases linked by spatial, temporal, and genomic information. In addition to providing open access data, the entire computational pipeline is open source, and we enable users to access high-performance computing environments. The study has evaluated a sorghum diversity panel, biparental cross populations, and elite lines and hybrids from structured sorghum breeding populations. In addition, a durum wheat diversity panel was grown and evaluated over three winter seasons. The initial release includes derived data from two seasons in which the sorghum diversity panel was evaluated. Future releases will include data from additional seasons and locations. The TERRA-REF reference dataset can be used to characterize phenotype-to-genotype associations, on a genomic scale, that will enable knowledge-driven breeding and the development of higher-yielding cultivars of sorghum and wheat. The data is also being used to develop new algorithms for machine learning, image analysis, genomics, and optical sensor engineering. 
    Resources in this dataset: Resource Title: Link to dataset at Datadryad.org. File Name: Web Page, url: https://datadryad.org/stash/dataset/doi:10.5061/dryad.4b8gtht99

  12. Datasets for Information Reference

    • figshare.com
    xlsx
    Updated Oct 20, 2021
    Cite
    Tongyang Zhang (2021). Datasets for Information Reference [Dataset]. http://doi.org/10.6084/m9.figshare.16834840.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tongyang Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We use journal articles published in Scientometrics between 2016 and 2020 as the data source. By analysing the dataset usage records of scientometric research, we rank each dataset by how frequently it is used for information reference, providing guidance for the selection of datasets in scientometric research.

  13. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    Available download formats: zip (4,293,465,577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    This beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people with infoboxes on English Wikipedia; it is output as JSON files (compressed in a tar.gz archive).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy included fields:
    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
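
The fields above can be read without unpacking the full 21.28 GB archive to disk. A minimal Python sketch, assuming one JSON record per line (JSON Lines) inside each archive member; the exact member layout and field names are taken from the description above and should be verified against the Data Dictionary:

```python
import json
import tarfile

# Field names taken from the "Noteworthy Included Fields" list above;
# treat them as assumptions until checked against the Data Dictionary.
WANTED = ("name", "identifier", "description", "abstract")

def summarize(record):
    """Keep only the lightweight fields of one article record."""
    return {key: record.get(key) for key in WANTED}

def iter_people(path):
    """Stream article records out of wme_people_infobox.tar.gz,
    assuming JSON Lines files inside the gzipped tar archive."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            handle = tar.extractfile(member)
            if handle is None:  # skip directory entries
                continue
            for line in handle:
                yield summarize(json.loads(line))
```

Streaming via tarfile keeps memory use flat regardless of archive size, which matters at this scale.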

    Stats

    Infoboxes:
    • Compressed: 2 GB
    • Uncompressed: 11 GB

    Infoboxes + sections + short description:
    • Compressed: 4.12 GB
    • Uncompressed: 21.28 GB

    Article analysis and filtering breakdown:
    • Total articles analyzed: 6,940,949
    • People found with QID: 1,778,226
    • People found with Category: 158,996
    • People found with Biography Project: 76,150
    • Total people articles found: 2,013,372
    • Total people articles with infoboxes: 1,559,985

    Of the 1,559,985 people articles in this dataset:
    • 1,416,701 have a short description
    • 1,559,985 have an infobox
    • 1,559,921 have article sections

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information it contains may be out of date. The dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  14. Data from: Global hydrological dataset of daily streamflow data from the...

    • catalogue.ceh.ac.uk
    • hosted-metadata.bgs.ac.uk
    • +3more
    zip
    Updated May 28, 2024
    + more versions
    Cite
    S. Turner; J. Hannaford; L.J. Barker; G. Suman; R. Armitage; A. Killeen; A. Griffin; H. Davies; A. Kumar; H. Dixon; M.T.D. Albuquerque; N. Almeida Ribeiro; C. Alvarez-Garreton; E. Amoussou; B. Arheimer; Y. Asano; T. Berezowski; A. Bodian; H. Boutaghane; R. Capell; H. Dakhaoui; J. Daňhelka; H.X. Do; C. Ekkawatpanit; E.M. El Khalki; A.K. Fleig; R. Fonseca; J.D. Giraldo-Osorio; A.B.T. Goula; M. Hanel; G Hodgkins; S. Horton; C. Kan; D.G. Kingston; G. Laaha; R. Laugesen; W. Lopes; S. Mager; Y. Markonis; L. Mediero; G. Midgley; C. Murphy; P. O'Connor; A.I. Pedersen; H.T. Pham; M. Piniewski; M. Rachdane; B. Renard; M.E. Saidi; P. Schmocker-Facker; K. Stahl; M. Thyler; M. Toucher; Y. Tramblay; J. Uusikivi; N. Venegas-Cordero; S. Vissesri; A. Watson; S. Westra; P.H. Whitfield (2024). Global hydrological dataset of daily streamflow data from the Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN), 1863 - 2022 [Dataset]. http://doi.org/10.5285/3b077711-f183-42f1-bac6-c892922c81f4
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2024
    Dataset provided by
    NERC EDS Environmental Information Data Centre
    Authors
    S. Turner; J. Hannaford; L.J. Barker; G. Suman; R. Armitage; A. Killeen; A. Griffin; H. Davies; A. Kumar; H. Dixon; M.T.D. Albuquerque; N. Almeida Ribeiro; C. Alvarez-Garreton; E. Amoussou; B. Arheimer; Y. Asano; T. Berezowski; A. Bodian; H. Boutaghane; R. Capell; H. Dakhaoui; J. Daňhelka; H.X. Do; C. Ekkawatpanit; E.M. El Khalki; A.K. Fleig; R. Fonseca; J.D. Giraldo-Osorio; A.B.T. Goula; M. Hanel; G Hodgkins; S. Horton; C. Kan; D.G. Kingston; G. Laaha; R. Laugesen; W. Lopes; S. Mager; Y. Markonis; L. Mediero; G. Midgley; C. Murphy; P. O'Connor; A.I. Pedersen; H.T. Pham; M. Piniewski; M. Rachdane; B. Renard; M.E. Saidi; P. Schmocker-Facker; K. Stahl; M. Thyler; M. Toucher; Y. Tramblay; J. Uusikivi; N. Venegas-Cordero; S. Vissesri; A. Watson; S. Westra; P.H. Whitfield
    License

    https://eidc.ac.uk/licences/ogl/plain

    Time period covered
    Jan 1, 1863 - Dec 31, 2022
    Area covered
    Earth
    Dataset funded by
    Natural Environment Research Council (https://www.ukri.org/councils/nerc)
    Description

    The Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN) dataset is a global hydrological dataset containing publicly available daily flow data for 2,386 gauging stations across the globe with natural or near-natural catchments. Metadata is also provided for the Full ROBIN Dataset of 3,060 gauging stations. Data were quality controlled by the central ROBIN team before being added to the dataset, and two levels of data quality are applied to guide users towards appropriate data usage. Most records span at least 40 years with minimal missing data; records start in the late 19th century for some sites and run through to 2022. ROBIN represents a significant advance in global-scale, accessible streamflow data. The project was funded by the UK Natural Environment Research Council Global Partnership Seedcorn Fund (NE/W004038/1) and the NC-International programme (NE/X006247/1), delivering National Capability.

  15. Citation Networks (SNAP)

    • kaggle.com
    zip
    Updated Dec 16, 2021
    Cite
    Subhajit Sahu (2021). Citation Networks (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-cit
    Explore at:
    Available download formats: zip (95,620,457 bytes)
    Dataset updated
    Dec 16, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-energy physics citation network

    Dataset information

    Arxiv HEP-PH (high energy physics phenomenology) citation graph is from the
    e-print arXiv and covers all the citations within a dataset of 34,546 papers
    with 421,578 edges. If a paper i cites paper j, the graph contains a directed
    edge from i to j. If a paper cites, or is cited by, a paper outside the
    dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus
    represents essentially the complete history of its HEP-PH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 34546
    Edges 421578
    Nodes in largest WCC 34401 (0.996)
    Edges in largest WCC 421485 (1.000)
    Nodes in largest SCC 12711 (0.368)
    Edges in largest SCC 139981 (0.332)
    Average clustering coefficient 0.2962
    Number of triangles 1276868
    Fraction of closed triangles 0.1457
    Diameter (longest shortest path) 12
    90-percentile effective diameter 5

    Source (citation)

    J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
    Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD), 2005.

    J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
    Explorations 5(2): 149-151, 2003.

    Files
    File Description
    cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
    cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)
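
The edge list described above can be loaded with a few lines of Python. A sketch, assuming the standard SNAP plain-text format of '#'-prefixed comment lines followed by whitespace-separated FromNodeId/ToNodeId pairs:

```python
def parse_edges(lines):
    """Parse a SNAP edge list: '#' lines are comments; each data line
    is a 'FromNodeId ToNodeId' pair (citing paper -> cited paper)."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = map(int, line.split())
        edges.append((src, dst))
    return edges

def basic_stats(edges):
    """Node and edge counts, comparable to the statistics table above."""
    nodes = {n for edge in edges for n in edge}
    return len(nodes), len(edges)

# To load the real file:
#   import gzip
#   with gzip.open("cit-HepPh.txt.gz", "rt") as f:
#       edges = parse_edges(f)
```

On the full HEP-PH file, basic_stats should recover the 34,546 nodes and 421,578 edges reported in the statistics table.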

    High-energy physics theory citation network

    Dataset information

    Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
    arXiv and covers all the citations within a dataset of 27,770 papers with
    352,807 edges. If a paper i cites paper j, the graph contains a directed edge
    from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

    The data covers papers in the period from January 1993 to April 2003 (124
    months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its HEP-TH section.

    The data was originally released as a part of 2003 KDD Cup.

    Dataset statistics
    Nodes 27770
    Edges 352807
    Nodes in largest WCC 27400 (0.987) ...

  16. Per-Citation-Dataset

    • huggingface.co
    Updated Jul 23, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Salunkhe (2025). Per-Citation-Dataset [Dataset]. https://huggingface.co/datasets/Mithilss/Per-Citation-Dataset
    Explore at:
    Dataset updated
    Jul 23, 2025
    Authors
    Salunkhe
    Description

    Mithilss/Per-Citation-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Computer Forensic Reference Data Set Portal

    • data.nist.gov
    Updated May 2, 2022
    + more versions
    Cite
    Richard Ayers (2022). Computer Forensic Reference Data Set Portal [Dataset]. http://doi.org/10.18434/mds2-2635
    Explore at:
    Dataset updated
    May 2, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Richard Ayers
    License

    https://www.nist.gov/open/license

    Description

    This portal is your gateway to documented digital forensic image datasets. These datasets can assist in a variety of tasks, including tool testing, developing familiarity with tool behavior for given tasks, general practitioner training, and other unforeseen uses that the user of the datasets can devise. Most datasets have a description of the type and locations of significant artifacts present in the dataset. There are descriptions and finding aids to help you locate datasets by the year produced, by author, or by attributes of the dataset.

  18. NIST SAMATE Software Assurance Reference Dataset

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Sep 30, 2025
    Cite
    National Institute of Standards and Technology (2025). NIST SAMATE Software Assurance Reference Dataset [Dataset]. https://catalog.data.gov/dataset/nist-samate-software-assurance-reference-dataset
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This dataset provides the NIST Software Assurance Metrics And Tool Evaluation (SAMATE) Software Assurance Reference Dataset (SARD): a set of programs with known security flaws that allows end users to evaluate tools and tool developers to test their methods.

  19. A circa 2010 global land cover reference dataset from commercial high...

    • data.usgs.gov
    • s.cnmilf.com
    • +1more
    + more versions
    Cite
    Bruce Pengra; Jordan Long; Devendra Dahal; Steve Stehman; Thomas Loveland, A circa 2010 global land cover reference dataset from commercial high resolution satellite data [Dataset]. http://doi.org/10.5066/P96FKANW
    Explore at:
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Bruce Pengra; Jordan Long; Devendra Dahal; Steve Stehman; Thomas Loveland
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    May 19, 2002 - May 29, 2014
    Description

    The data are 475 thematic land cover rasters at 2 m resolution. Land cover was classified into the classes Tree (1), Water (2), Barren (3), Other Vegetation (4), and Ice & Snow (8). Cloud cover and shadow were sometimes coded as Cloud (5) and Shadow (6); however, for any land cover application these would be considered NoData. Some rasters may have Cloud and Shadow pixels coded or recoded to NoData already. Commercial high-resolution satellite data were used to create the classifications. Usable image data for the target year (2010) were acquired for 475 of the 500 primary sample locations, with 90% of images acquired within ±2 years of the 2010 target. The remaining 25 of the 500 sample blocks had no usable data and so could not be mapped. Tabular data is included with the raster classifications indicating the specific high-resolution sensor and date of acquisition for source imagery, as well as the stratum to which each sample block belonged. Methods for this classifi ...

  20. MIEDT dataset

    • kaggle.com
    Updated Jan 12, 2025
    Cite
    机关鸢鸟 (2025). MIEDT dataset [Dataset]. https://www.kaggle.com/datasets/lidang78/miedt-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    机关鸢鸟
    Description
      1. Dataset Overview
      This dataset is organized around the edge detection task, aiming to provide rich image resources and corresponding edge detection annotations for related research and applications; it can be used to test edge detection algorithms. To evaluate the performance of edge detection methods comprehensively, we created the Medical Image Edge Detection Test (MIEDT) dataset. MIEDT contains 100 medical images randomly selected from three publicly available datasets: Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000.
      2. Dataset Structure
      Original image: this folder stores the original image data. It contains 15 Head CT images in PNG format with varying resolutions; 25 coronary heart disease images in JPG format at a resolution of 1024 × 1024; and 60 skin images in JPG format at a resolution of 600 × 450. It covers a variety of medical image materials with different imaging and contrast, providing diverse input data for edge detection algorithms.
      Ground truth: this folder contains the edge detection annotation images corresponding to the images in the "Original image" folder, in PNG format. White pixels represent the edges in the image and black pixels represent the non-edge areas. These annotations accurately outline the object contours and edge features in the original images.
      3. Usage Instructions
      Users who process images in Python can read the image data with the cv2 (OpenCV) library. Sample code:

    import cv2
    original_image = cv2.imread('Original image/IMG-001.png')  # Read original image
    ground_truth_image = cv2.imread('Ground truth/GT-001.png', cv2.IMREAD_GRAYSCALE)  # Read the corresponding Ground Truth image

    When training models with deep learning frameworks (such as TensorFlow or PyTorch), configure the dataset path in the framework's dataset loading class according to its data loading mechanism, so that the model correctly reads and processes the images and their annotations.
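
Once an algorithm has produced an edge map, it can be scored against the ground-truth images, whose white pixels mark edges. A minimal pixel-level F1 sketch using NumPy; the threshold of 128 and strict per-pixel matching are simplifying assumptions, since edge detection benchmarks often allow a small spatial tolerance when matching edge pixels:

```python
import numpy as np

def edge_f1(pred, gt, threshold=128):
    """Pixel-level F1 between a predicted edge map and a ground-truth
    image; both are grayscale arrays where values >= threshold count
    as edge pixels (ground truth uses white for edges)."""
    p = np.asarray(pred) >= threshold
    g = np.asarray(gt) >= threshold
    tp = np.logical_and(p, g).sum()          # true-positive edge pixels
    precision = tp / p.sum() if p.sum() else 0.0
    recall = tp / g.sum() if g.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, edge_f1(detected_edges, ground_truth_image) returns 1.0 for a perfect match and 0.0 when no edge pixels agree.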

      4. Data Sources and References
      Data sources: the original images are collected from the public image datasets Head CT-hemorrhage, Coronary Artery Diseases DataSet, and Skin Cancer MNIST: HAM10000, to ensure the quality and diversity of the images. If you use this dataset in academic research, please cite the following literature.

    References: [1] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; https://arxiv.org/abs/1902.03368

    [2] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).

    [3] Classification of Brain Hemorrhage Using Deep Learning from CT Scan Images - https://link.springer.com/chapter/10.1007/978-981-19-7528-8_15

