62 datasets found

w
Dataset of artists who created Paper folder
workwithdata.com
Updated May 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of artists who created Paper folder [Dataset]. https://www.workwithdata.com/datasets/artists?f=1&fcol0=j0-artwork&fop0=%3D&fval0=Paper+folder&j=1&j0=artworks
Explore at:
Dataset updated
May 8, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about artists. It has 1 row and is filtered where the artworks is Paper folder. It features 9 columns including birth date, death date, country, and gender.
Meta Kaggle Code
kaggle.com
zip
Updated Jul 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
Explore at:
zip(148301844275 bytes)Available download formats
Dataset updated
Jul 10, 2025
Dataset authored and provided by
Kagglehttp://kaggle.com/
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Explore our public notebook content!

Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebooks versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

Why we’re releasing this dataset

By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

Sensitive data

While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

Joining with Meta Kaggle

The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

File organization

The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.

The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

Questions / Comments

We love feedback! Let us know in the Discussion tab.

Happy Kaggling!
I
Dataset for "Continued use of retracted papers: Temporal trends in citations...
databank.illinois.edu
Updated Apr 6, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tzu-Kun Hsiao; Jodi Schneider (2021). Dataset for "Continued use of retracted papers: Temporal trends in citations and (lack of) awareness of retractions shown in citation contexts in biomedicine" [Dataset]. http://doi.org/10.13012/B2IDB-8255619_V2
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-8255619_V2
Dataset updated
Apr 6, 2021
Authors
Tzu-Kun Hsiao; Jodi Schneider
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset funded by
Alfred P. Sloan Foundation
U.S. National Institutes of Health (NIH)
Description
This dataset includes five files. Descriptions of the files are given as follows: FILENAME: PubMed_retracted_publication_full_v3.tsv - Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT] ). - Except for the information in the "cited_by" column, all the data is from PubMed. - PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses: [1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file). [2] Citing paper and the cited retracted paper have the same PMID. ROW EXPLANATIONS - Each row is a retracted paper. There are 7,813 retracted papers. COLUMN HEADER EXPLANATIONS 1) PMID - PubMed ID 2) Title - Paper title 3) Authors - Author names 4) Citation - Bibliographic information of the paper 5) First Author - First author's name 6) Journal/Book - Publication name 7) Publication Year 8) Create Date - The date the record was added to the PubMed database 9) PMCID - PubMed Central ID (if applicable, otherwise blank) 10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank) 11) DOI - Digital object identifier (if applicable, otherwise blank) 12) retracted_in - Information of retraction notice (given by PubMed) 13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank) 14) cited_by - PMIDs of the citing papers. (if applicable, otherwise blank) Data collected from iCite. 15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank) FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv - This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles. - This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis. - Citation contexts that meet either of the two conditions below have been excluded from analyses: [1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file). [2] Citing paper and the cited retracted paper have the same PMID. ROW EXPLANATIONS - Each row is a citation context associated with one retracted paper that's cited. - In the manuscript, we count each citation context once, even if it cites multiple retracted papers. COLUMN HEADER EXPLANATIONS 1) pmcid - PubMed Central ID of the citing paper 2) pmid - PubMed ID of the citing paper 3) year - Publication year of the citing paper 4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions) 5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified) 6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively. 7) total_sentences - Total number of sentences in a given location 8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper. 9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper. 10) citation - The citation context 11) progression - Position of a citation context by centile within the citing paper. 12) retracted_yr - Retraction year of the retracted paper 13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction. FILENAME: 724_knowingly_post_retraction_cit.csv (updated) - The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv". - Two citation contexts from retraction notices have been excluded from analyses. ROW EXPLANATIONS - Each row is a citation context. COLUMN HEADER EXPLANATIONS 1) pmcid - PubMed Central ID of the citing paper 2) pmid - PubMed ID of the citing paper 3) pub_type - Publication type collected from the metadata in the PMCOA XML files. 4) pub_type2 - Specific article types. Please see the manuscript for explanations. 5) year - Publication year of the citing paper 6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions) 7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper. 8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper. 9) citation - The citation context 10) retracted_yr - Retraction year of the retracted paper 11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation. 12) longer_context - A extended version of the citation context. (if applicable, otherwise blank) Manually pulled from the full-texts in the process of annotation. FILENAME: Annotation manual.pdf - The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.tsv. FILENAME: retraction_notice_PMID.csv (new file added for this version) - A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT] ).
E
Methodology data of "Twenty years of research in Digital Humanities: a topic...
live.european-language-grid.eu
data.niaid.nih.gov
json
Updated Sep 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). Methodology data of "Twenty years of research in Digital Humanities: a topic modeling study" [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18310
Explore at:
jsonAvailable download formats
Dataset updated
Sep 10, 2022
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This document contains the datasets created in the thesis "Twenty years of research in Digital Humanities: a topic modeling study". The methodological approach of the work is based on two datasets built by web scraping DH journals’ official web pages and API requests to popular academic databases (Crossref, Datacite). The datasets constitute a corpus of DH research and include research papers abstracts and abstract papers from DH journals and international DH conferences published between 2000 and 2020. Probabilistic topic modeling with latent Dirichlet allocation is then performed on both datasets to identify relevant research subfields.
Data
Folder "data/" contains four folders which relate to two datasets:
The first dataset, which will be referred to as the journals dataset, contains original research papers published in journals exclusively devoted to digital humanities scholarshipis [1] and is composed of 2,464 articles from 26 journals.
The second dataset, the conference dataset, contains abstract papers available in ADHO conference archives and is composed of 2,160 articles from 15 years of ADHO conferences and 4 conferences promoted by journals
Both datasets are provided with: URL (if available); identifier and related scheme (if available); abstract or abstract paper; title; authors’ given name, family name; author’s affiliation name, found within the document metadata or text; normalized affiliation name, country of the affiliation, identifiers of the affiliation provided by the Research Organization Registry Community (ROR, https://ror.org); publisher (if available); publishing date (complete date when provided or only the year); keywords (if available); journal title; volume and issue (if available); electronic and/or print ISSN (if available).
The two folders "data/no_abstracts..." are licensed under a Creative Commons public domain dedication (CC0), while the others keep their original license (the one provided by their publisher) because they contain full abstracts of the papers. These latter datasets are provided in order to favor the reproducibility of the results obtained in our work.
Topic modeling

"topic_modeling/" directory contains input and output data used within MITAO, a tool for mashing up automatic text analysis tools, and creating a completely customizable visual workflow [2]. The topic modeling results are divided in two folders, one for each of the datasets.
Note: It's necessary to unzip the file to get access to all the files and directories listed below.
References
Spinaci, G., Colavizza, G., Peroni, S., Preliminary Results on Mapping Digital Humanities Research, in: Atti del IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica, Milan, Università Cattolica del Sacro Cuore, 2020, pp. 246 - 252 (atti di: IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica, Milano, Italy, 15-17 gennaio 2020)
Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling [JD]. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3
h
ml-arxiv-papers
huggingface.co
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aleksei Artemiev (2023). ml-arxiv-papers [Dataset]. https://huggingface.co/datasets/aalksii/ml-arxiv-papers
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 1, 2023
Authors
Aleksei Artemiev
Description
Dataset Card for "ml-arxiv-papers"

This is a dataset containing ML ArXiv papers. The dataset is a version of the original one from CShorten, which is a part of the ArXiv papers dataset from Kaggle. Three steps are made to process the source data:

useless columns removal; train-test split; ' ' removal and trimming spaces on sides of the text.

More Information needed
I
Conceptual novelty scores for PubMed articles
databank.illinois.edu
Updated Apr 27, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shubhanshu Mishra; Vetle I. Torvik (2018). Conceptual novelty scores for PubMed articles [Dataset]. http://doi.org/10.13012/B2IDB-5060298_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-5060298_V1
Dataset updated
Apr 27, 2018
Authors
Shubhanshu Mishra; Vetle I. Torvik
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
National Science Foundationhttp://www.nsf.gov/
Description
Conceptual novelty analysis data based on PubMed Medical Subject Headings ---------------------------------------------------------------------- Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018 ## Introduction This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015. The dataset is distributed in the form of the following tab separated text files: * PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follow: - PMID: PubMed ID - Year: year of publication - TimeNovelty: time novelty score of the paper based on individual concepts (see paper) - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper) - PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper) - PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper) * mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follow: - MeshTerm: Name of the MeSH term - Year: year - AbsVal: Total publications with that MeSH term in the given year - TimeNovelty: age (in years since first publication) of MeSH term in the given year - VolumeNovelty: : age (in number of papers since first publication) of MeSH term in the given year * meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term for all years - Mesh1: Name of the first MeSH term (alphabetically sorted) - Mesh2: Name of the second MeSH term (alphabetically sorted) - Year: year - AbsVal: Total publications with that MeSH pair in the given year - TimeNovelty: age (in years since first publication) of MeSH pair in the given year - VolumeNovelty: : age (in number of papers since first publication) of MeSH pair in the given year * README.txt file ## Dataset creation This dataset was constructed using multiple datasets described in the following locations: * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html * MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/ * Source code provided at: https://github.com/napsternxg/Novelty Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions: Additional data related updates can be found at: Torvik Research Group ## Acknowledgments This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742 . The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## License Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
w
Dataset of artists who created Lunes en papier (Paper Moons)
workwithdata.com
Updated May 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of artists who created Lunes en papier (Paper Moons) [Dataset]. https://www.workwithdata.com/datasets/artists?f=1&fcol0=j0-artwork&fop0=%3D&fval0=Lunes+en+papier+%28Paper+Moons%29&j=1&j0=artworks
Explore at:
Dataset updated
May 8, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about artists. It has 1 row and is filtered where the artworks is Lunes en papier (Paper Moons). It features 9 columns including birth date, death date, country, and gender.
P
GSM8K Dataset
paperswithcode.com
tensorflow.org
+2more
Updated Dec 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman (2024). GSM8K Dataset [Dataset]. https://paperswithcode.com/dataset/gsm8k
Explore at:
Dataset updated
Dec 31, 2024
Authors
Karl Cobbe; Vineet Kosaraju; Mohammad Bavarian; Mark Chen; Heewoo Jun; Lukasz Kaiser; Matthias Plappert; Jerry Tworek; Jacob Hilton; Reiichiro Nakano; Christopher Hesse; John Schulman
Description
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning.
h
internal-datasets
huggingface.co
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ivan Rivaldo Marbun (2023). internal-datasets [Dataset]. https://huggingface.co/datasets/Marbyun/internal-datasets
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 1, 2023
Authors
Ivan Rivaldo Marbun
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.

In this work, we use a synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.

For full details on how the dataset was created, kindly refer to the paper.
f
Data from: Table 4 -
plos.figshare.com
xls
Updated Jan 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David J. Tonjes; Yiyi Wang; Firman Firmansyah; Sameena Manzur; Matthew Johnston; Griffin Walker; Krista L. Thyberg; Elizabeth Hewitt (2025). Table 4 - [Dataset]. http://doi.org/10.1371/journal.pone.0308367.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0308367.t004
Dataset updated
Jan 15, 2025
Dataset provided by
PLOS ONE
Authors
David J. Tonjes; Yiyi Wang; Firman Firmansyah; Sameena Manzur; Matthew Johnston; Griffin Walker; Krista L. Thyberg; Elizabeth Hewitt
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
a. NYC data sets: 12 nearest and farthest comparison data sets to NYC 1990 and NYC 2004. b. NYC data sets: 12 nearest and farthest comparison data sets to NYC 2013 and NYC 2017.
F
Data from: A Neural Approach for Text Extraction from Scholarly Figures
data.uni-hannover.de
zip
Updated Jan 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
Explore at:
zip(798357692)Available download formats
Dataset updated
Jan 20, 2022
Dataset authored and provided by
TIB
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.

You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

If you found this dataset useful, please consider citing our paper:

@inproceedings{DBLP:conf/icdar/MorrisTE19, author = {David Morris and Peichen Tang and Ralph Ewerth}, title = {A Neural Approach for Text Extraction from Scholarly Figures}, booktitle = {2019 International Conference on Document Analysis and Recognition, {ICDAR} 2019, Sydney, Australia, September 20-25, 2019}, pages = {1438--1443}, publisher = {{IEEE}}, year = {2019}, url = {https://doi.org/10.1109/ICDAR.2019.00231}, doi = {10.1109/ICDAR.2019.00231}, timestamp = {Tue, 04 Feb 2020 13:28:39 +0100}, biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

Datasets

We used different sources of data for testing, validation, and training. Our testing set was assembled by the work we cited by Böschen et al. We excluded the DeGruyter dataset, and use it as our validation dataset.

Testing

These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

Validation

The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

Training

We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

Code

We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.

We used a java script provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
The Maestro Dataset v2
kaggle.com
zip
Updated Jun 14, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jack Vial (2020). The Maestro Dataset v2 [Dataset]. https://www.kaggle.com/datasets/jackvial/themaestrodatasetv2
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Jun 14, 2020
Authors
Jack Vial
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Note

I did not have any part in creating this dataset I am only uploading it here to make it easily available to others on Kaggle. More info about the dataset can be found here https://magenta.tensorflow.org/datasets/maestro

Wav -> mp3 Conversion

I had to convert the wav audio files to mp3 so the dataset would fit within Kaggle's 20gb limit, therefore all audio files have the extension .mp3 which is inconsistent with the .wav extensions in the .csv meta files.

Summary

MAESTRO (MIDI and Audio Edited for Synchronous Tracks and Organization) is a dataset composed of over 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.

Dataset (from the Magenta site https://magenta.tensorflow.org/datasets/maestro )

We partnered with organizers of the International Piano-e-Competition for the raw data used in this dataset. During each installment of the competition virtuoso pianists perform on Yamaha Disklaviers which, in addition to being concert-quality acoustic grand pianos, utilize an integrated high-precision MIDI capture and playback system. Recorded MIDI data is of sufficient fidelity to allow the audition stage of the competition to be judged remotely by listening to contestant performances reproduced over the wire on another Disklavier instrument.

The dataset contains over 200 hours of paired audio and MIDI recordings from ten years of International Piano-e-Competition. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. Audio and MIDI files are aligned with ∼3 ms accuracy and sliced to individual musical pieces, which are annotated with composer, title, and year of performance. Uncompressed audio is of CD quality or higher (44.1–48 kHz 16-bit PCM stereo).

A train/validation/test split configuration is also proposed, so that the same composition, even if performed by multiple contestants, does not appear in multiple subsets. Repertoire is mostly classical, including composers from the 17th to early 20th century.

For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset.

For an example application of the dataset, see our blog post on Wave2Midi2Wave.

License

The dataset is made available by Google LLC under a Creative Commons Attribution Non-Commercial Share-Alike 4.0 (CC BY-NC-SA 4.0) license.

Acknowledgements

More info on the MAESTRO dataset https://magenta.tensorflow.org/datasets/maestro Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset https://arxiv.org/abs/1810.12247

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset." In International Conference on Learning Representations, 2019.
P
Reddit Dataset
paperswithcode.com
opendatalab.com
Updated Jun 9, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
William L. Hamilton; Rex Ying; Jure Leskovec (2017). Reddit Dataset [Dataset]. https://paperswithcode.com/dataset/reddit
Explore at:
Dataset updated
Jun 9, 2017
Authors
William L. Hamilton; Rex Ying; Jure Leskovec
Description
The Reddit dataset is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. 50 large communities have been sampled to build a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. The first 20 days are used for training and the remaining days for testing (with 30% used for validation). For features, off-the-shelf 300-dimensional GloVe CommonCrawl word vectors are used.
Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL
zenodo.org
bin, json, txt
Updated Aug 16, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson; Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
Explore at:
txt, json, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5205322
Dataset updated
Aug 16, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson; Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

It contains the following files:

- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license

The Spider-Realistic dataset is created based on the dev split of the Spider dataset realsed by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The sql queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.

This dataset is distributed under the CC BY-SA 4.0 license.

If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}

@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}

@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}

@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}

@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}

@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}

@inproceedings{data-geography-original
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}

@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Intergrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}

@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}

@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
Z
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
zenodo.org
Updated Jun 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Charalampos Alexopoulos (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Explore at:
Dataset updated
Jun 23, 2023
Dataset provided by
Anastasija Nikiforova
Charalampos Alexopoulos
Magdalena Ciesielska
Nina Rizun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb) It being made public both to act as supplementary data for "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out to by searching digital libraries covered by Scopus, Web of Science (WoS), Digital Government Research library (DGRL).

These databases were queried for keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those, where these objects were primary research objects rather than mentioned in the body, e.g., as a future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure Each study was independently examined by at least two authors, where after the in-depth examination of the full-text of the article, the structured protocol has been filled for each study. The structure of the survey is available in the supplementary file available (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx) The data collected for each study by two researchers were then synthesized in one final version by the third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol Spreadsheets #1 provides the filled protocol for relevant studies. Spreadsheet#2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet 2) Complete reference - the complete source information to refer to the study 3) Year of publication - the year in which the study was published 4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter} 5) DOI / Website- a link to the website where the study can be found 6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science 7) Availability in OA - availability of an article in the Open Access 8) Keywords - keywords of the paper as indicated by the authors 9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited infor-mation about the research methods used)? 18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, sec-ondary - mentioned but not studied (e.g., as part of discus-sion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term? 20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output") 21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the rela-tionships between these components? (detailed description) 22) Stakeholders and their roles - what stakeholders or actors does HVD determination in-volve? What are their roles? 23) Data - what data do HVD cover? 24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt
f
Comparison site multivariate distance summaries.
plos.figshare.com
xls
Updated Jan 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David J. Tonjes; Yiyi Wang; Firman Firmansyah; Sameena Manzur; Matthew Johnston; Griffin Walker; Krista L. Thyberg; Elizabeth Hewitt (2025). Comparison site multivariate distance summaries. [Dataset]. http://doi.org/10.1371/journal.pone.0308367.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0308367.t005
Dataset updated
Jan 15, 2025
Dataset provided by
PLOS ONE
Authors
David J. Tonjes; Yiyi Wang; Firman Firmansyah; Sameena Manzur; Matthew Johnston; Griffin Walker; Krista L. Thyberg; Elizabeth Hewitt
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The composition of solid waste affects technology choices and policy decisions regarding its management. Analyses of waste composition studies are almost always made on a parameter by parameter basis. Multivariate distance techniques can create wholisitic determinations of similarities and differences and were applied here to enhance a series of waste composition comparisons. A set of New York City residential waste composition studies conducted in 1990, 2004, 2013, and 2017 were compared to EPA data and to 88 studies conducted in other US jurisdictions from 1987–2021. The total residential waste stream and the disposed wastes in NYC were found to be similar in nature, and very different from the composition of wastes set out for recycling. Disposed wastes were more similar across the five NYC boroughs in a single year than in one borough over the 28-year time period, but recyclables were more similar across 14 years than across the boroughs in a single year. Food and plastics percentages in total and disposed waste streams increased over time, and paper percentages fell. The food disposal rate in NYC over time increased much less than EPA data show. The rate of plastics and paper disposal in NYC decreased. NYC data largely conformed to trends from the 88 other waste composition studies and did not generally agree with EPA data sets. The use of novel-to-waste studies multivariate distance analyses offers the promise of simplifying the identification of overall similarities and differences across waste studies, and so improving management and planning for solid waste.
P
Something-Something V2 Dataset
paperswithcode.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic, Something-Something V2 Dataset [Dataset]. https://paperswithcode.com/dataset/something-something-v2
Explore at:
Authors
Raghav Goyal; Samira Ebrahimi Kahou; Vincent Michalski; Joanna Materzyńska; Susanne Westphal; Heuna Kim; Valentin Haenel; Ingo Fruend; Peter Yianilos; Moritz Mueller-Freitag; Florian Hoppe; Christian Thurau; Ingo Bax; Roland Memisevic
Description
The 20BN-SOMETHING-SOMETHING V2 dataset is a large collection of labeled video clips that show humans performing pre-defined basic actions with everyday objects. The dataset was created by a large number of crowd workers. It allows machine learning models to develop fine-grained understanding of basic actions that occur in the physical world. It contains 220,847 videos, with 168,913 in the training set, 24,777 in the validation set and 27,157 in the test set. There are 174 labels.

Source

Image Source
Floating Waste 8deje Lrbq Dataset
universe.roboflow.com
zip
Updated Mar 13, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roboflow 100-VL (2025). Floating Waste 8deje Lrbq Dataset [Dataset]. https://universe.roboflow.com/rf-100-vl/floating-waste-8deje-lrbq/2
Explore at:
zipAvailable download formats
Dataset updated
Mar 13, 2025
Dataset provided by
Roboflow
Authors
Roboflow 100-VL
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Floating Waste 8deje Lrbq Bounding Boxes
Description
Overview

Introduction

Object Classes

Bottle

Can

Carton

Paper

Plastic

Introduction

This dataset focuses on annotating various types of floating waste in water. The task involves identifying specific waste items to assist in environmental monitoring and cleanup efforts. The classes included are:

Bottle: Typically cylindrical containers made of plastic or glass.

Can: Usually cylindrical containers made of metal.

Carton: Box-shaped containers, often used for liquids.

Paper: Flat or crumpled material, lightweight and prone to water absorption.

Plastic: Items made of flexible synthetic material, varied shapes.

Object Classes

Bottle

Description

Bottles are cylindrical containers with a neck and usually a cap or lid. They often float neck up in the water or can be partially submerged.

Instructions

Annotate the entire visible part of the bottle, ensuring to include the neck and cap if visible. Do not label reflections or shadows of the bottle. If only part of the bottle is visible, capture as much of the identifiable cylindrical shape as possible.

Can

Description

Cans are typically cylindrical and made of metal, often used for beverages. They maintain their shape even when partially submerged.

Instructions

Capture the full outline of the can, including any visible top or bottom surfaces. Avoid annotating shadows or reflections. If the can is partially submerged, ensure the visible portion’s edges are well-defined.

Carton

Description

Cartons are box-shaped, primarily used for packaging liquids like milk or juice. They generally have straight edges and right angles.

Instructions

Annotate the complete visible surface area of the carton. Pay attention to its boxy shape and straight edges. Exclude any reflections or partially obscured areas that prevent accurate identification.

Paper

Description

Paper is characterized by its flat, often wrinkled texture. It may float unevenly and absorb water, causing it to change shape.

Instructions

Outline the visible part of the paper accurately, including any wrinkles or folds. Avoid areas where the paper is too waterlogged to be distinguishable from the water’s surface.

Plastic

Description

Plastic encompasses various synthetic materials, frequently found in flexible, deformable items or packaging. The shapes can vary greatly.

Instructions

Identify and annotate the full extent of visible plastic items. Look for distinctive synthetic textures and forms. Ignore reflections and focus on the actual item, ensuring not to include anything embedding into the water beyond recognition.
PERSON Dataset
figshare.com
zip
Updated Sep 25, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shayan A. Tabrizi; Azadeh Shakery; Hamed Zamani; Mohammad Ali Tavallaie (2016). PERSON Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.3858291.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3858291.v2
Dataset updated
Sep 25, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Shayan A. Tabrizi; Azadeh Shakery; Hamed Zamani; Mohammad Ali Tavallaie
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PERSON dataset:Dataset created for paper "PERSON: Personalized Information Retrieval Evaluation Based on Citation Networks." Please cite the paper for any usage.The dataset is produced by data cleaning of AMiner's citation network V2 dataset (https://aminer.org/citation). Anyone who wants to use PERSON dataset must cite Aminer's dataset (as explained in its homepage: Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008). pp.990-998) as well as the aforementioned paper.It includes two files: 1- authors_giant.txt: the information of authors and their co-authors. The format is as follows: author ID author name the list of coauthors delimited by "," (Each entry contains the ID of the coauthor followed by the number of times they co-authored a paper) ... 2- papers_giant.txt: the information of papers and references. The format is as follows: paper ID Is paper merged (See the first paper for details) original paper ID (in Aminer's dataset) blank blank blank blank title abstract time (only the year part is important) blank references to papers out of the PERSON dataset (indicated by Aminer's IDs) references to papers inside the PERSON dataset (indicated by PERSON's IDs) author IDs ...
m
Mean annual climate data clipped to BA_SYD extent
demo.dev.magda.io
cloud.csiss.gmu.edu
+3more
Updated Aug 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bioregional Assessment Program (2023). Mean annual climate data clipped to BA_SYD extent [Dataset]. https://demo.dev.magda.io/dataset/ds-dga-f488c53d-af5a-4d93-9975-25979e11de56
Explore at:
Dataset updated
Aug 8, 2023
Dataset provided by
Bioregional Assessment Program
Description
Abstract The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this …Show full descriptionAbstract The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived. This dataset includes the following parameters clipped to BA_SYD extent. Mean annual BAWAP (Bureau of Meteorology Australian Water Availability Project) rainfall of year 1981 - 2013 Mean annual penman PET (potential evapotranspiration) of year 1981 - 2013 Mean annual runoff using the 'Budyko-framework' implementation of Choudhury Dataset History Lineage is as per the BA All mean climate data for Australia except the national data has been clipped to BA SYD extent. The mean annual rainfall data is created from monthly BAWAP grids which is created from daily BILO rainfall. Jones, D. A., W. Wang and R. Fawcett (2009). "High-quality spatial climate data-sets for Australia." Australian Meteorological and Oceanographic Journal 58(4): 233-248. The Mean annual penman PET is created as per the Donohue et al (2010) paper using the fully physically based Penman formulation of potential evapotranspiration, exept that daily wind speed grids used here were generated with a spline (i.e., ANUSPLIN) as per McVicar et al (2008), not the TIN as per Donohue et al (2010). For comprehensive details regarding the generation of some of these datasets (i.e., net radiation, Rn) see the details provided in Donohue et al (2009). Donohue, R.J., McVicar, T.R. and Roderick, M.L. (2010) Assessing the ability of potential evaporation formulations to capture the dynamics in evaporative demand within a changing climate. Journal of Hydrology. 386(1-4), 186-197. doi:10.1016/j.jhydrol.2010.03.020 Donohue, R.J., McVicar, T.R. and Roderick, M.L., (2009) Generating Australian potential evaporation data suitable for assessing the dynamics in evaporative demand within a changing climate. CSIRO: Water for a Healthy Country Flagship, pp 43. http://www.clw.csiro.au/publications/waterforahealthycountry/2009/wfhc-evaporative-demand-dynamics.pdf McVicar, T.R., Van Niel, T.G., Li, L.T., Roderick, M.L., Rayner, D.P., Ricciardulli, L. and Donohue, R.J. (2008) Wind speed climatology and trends for Australia, 1975-2006: Capturing the stilling phenomenon and comparison with near-surface reanalysis output. Geophysical Research Letters. 35, L20403, doi:10.1029/2008GL035627 The Mean annual runoff was created as per the Donohue et al (2010) paper. The data represent the runoff expected from the steady-state 'Budyko curve' longterm mean annual water-energy limit approach using BAWAP precipitation and the Penman potential ET described above. Choudhury BJ (1999) Evaluation of an empirical equation for annual evaporation using field observations and results from a biophysical model. Journal of Hydrology 216, 99-110. Donohue, R.J., McVicar, T.R. and Roderick, M.L. (2010) Assessing the ability of potential evaporation formulations to capture the dynamics in evaporative demand within a changing climate. Journal of Hydrology. 386(1-4), 186-197. doi:10.1016/j.jhydrol.2010.03.020 Donohue, R.J., McVicar, T.R. and Roderick, M.L., (2009) Generating Australian potential evaporation data suitable for assessing the dynamics in evaporative demand within a changing climate. CSIRO: Water for a Healthy Country Flagship, pp 43. http://www.clw.csiro.au/publications/waterforahealthycountry/2009/wfhc-evaporative-demand-dynamics.pdf McVicar, T.R., Van Niel, T.G., Li, L.T., Roderick, M.L., Rayner, D.P., Ricciardulli, L. and Donohue, R.J. (2008) Wind speed climatology and trends for Australia, 1975-2006: Capturing the stilling phenomenon and comparison with near-surface reanalysis output. Geophysical Research Letters. 35, L20403, doi:10.1029/2008GL035627 Dataset Citation Bioregional Assessment Programme (2014) Mean annual climate data clipped to BA_SYD extent. Bioregional Assessment Derived Dataset. Viewed 18 June 2018, http://data.bioregionalassessments.gov.au/dataset/a8393a45-5e86-431b-b504-c0b2953296f4. Dataset Ancestors Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012 Derived From Mean Annual Climate Data of Australia 1981 to 2012

Facebook

Twitter

Click to copy link

Link copied

Cite

Work With Data (2025). Dataset of artists who created Paper folder [Dataset]. https://www.workwithdata.com/datasets/artists?f=1&fcol0=j0-artwork&fop0=%3D&fval0=Paper+folder&j=1&j0=artworks

Dataset of artists who created Paper folder

Explore at:

Dataset updated

May 8, 2025

Dataset authored and provided by

Work With Data

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is about artists. It has 1 row and is filtered where the artworks is Paper folder. It features 9 columns including birth date, death date, country, and gender.

Clear search

Close search

Google apps

Main menu

Dataset of artists who created Paper folder

Meta Kaggle Code

Explore our public notebook content!

Why we’re releasing this dataset

Sensitive data

Joining with Meta Kaggle

File organization

Questions / Comments

Dataset for "Continued use of retracted papers: Temporal trends in citations...

Methodology data of "Twenty years of research in Digital Humanities: a topic...

ml-arxiv-papers

Conceptual novelty scores for PubMed articles

Dataset of artists who created Lunes en papier (Paper Moons)

GSM8K Dataset

internal-datasets

Data from: Table 4 -

Data from: A Neural Approach for Text Extraction from Scholarly Figures

A Neural Approach for Text Extraction from Scholarly Figures

Datasets

Testing

Validation

Training

Code

The Maestro Dataset v2

Note

Wav -> mp3 Conversion

Summary

Dataset (from the Magenta site https://magenta.tensorflow.org/datasets/maestro )

License

Acknowledgements

Reddit Dataset

Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

Dataset: A Systematic Literature Review on the topic of High-value datasets

Comparison site multivariate distance summaries.

Something-Something V2 Dataset

Floating Waste 8deje Lrbq Dataset

Overview

Introduction

Object Classes

Bottle

Description

Instructions

Can

Description

Instructions

Carton

Description

Instructions

Paper

Description

Instructions

Plastic

Description

Instructions

PERSON Dataset

Mean annual climate data clipped to BA_SYD extent

Dataset of artists who created Paper folder