Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

Data

The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program.

The following table summarizes the number of training, dev, and test AMRs for each dataset in the release, with totals by partition and by dataset:

| Dataset | Training | Dev | Test | Totals |
| --- | --- | --- | --- | --- |
| BOLT DF MT | 1061 | 133 | 133 | 1327 |
| Broadcast conversation | 214 | 0 | 0 | 214 |
| Weblog and WSJ | 0 | 100 | 100 | 200 |
| BOLT DF English | 6455 | 210 | 229 | 6894 |
| DEFT DF English | 19558 | 0 | 0 | 19558 |
| Guidelines AMRs | 819 | 0 | 0 | 819 |
| 2009 Open MT | 204 | 0 | 0 | 204 |
| Proxy reports | 6603 | 826 | 823 | 8252 |
| Weblog | 866 | 0 | 0 | 866 |
| Xinhua MT | 741 | 99 | 86 | 926 |
| Totals | 36521 | 1368 | 1371 | 39260 |

For those interested in using a standard community partition for AMR research (for instance, in the development of semantic parsers), the "split" directory contains the 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 39,260 AMRs with no train/dev/test partition.
Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

Data

The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program. New source data in AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.

The following table summarizes the number of training, dev, and test AMRs for each dataset in the release, with totals by partition and by dataset:

| Dataset | Training | Dev | Test | Totals |
| --- | --- | --- | --- | --- |
| BOLT DF MT | 1061 | 133 | 133 | 1327 |
| Broadcast conversation | 214 | 0 | 0 | 214 |
| Weblog and WSJ | 0 | 100 | 100 | 200 |
| BOLT DF English | 7379 | 210 | 229 | 7818 |
| DEFT DF English | 32915 | 0 | 0 | 32915 |
| Aesop fables | 49 | 0 | 0 | 49 |
| Guidelines AMRs | 970 | 0 | 0 | 970 |
| LORELEI | 4441 | 354 | 527 | 5322 |
| 2009 Open MT | 204 | 0 | 0 | 204 |
| Proxy reports | 6603 | 826 | 823 | 8252 |
| Weblog | 866 | 0 | 0 | 866 |
| Wikipedia | 192 | 0 | 0 | 192 |
| Xinhua MT | 741 | 99 | 86 | 926 |
| Totals | 55635 | 1722 | 1898 | 59255 |

Data in the "split" directory contains the 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.
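For readers unfamiliar with the representation, AMR graphs are conventionally written in PENMAN notation. The following illustrative sketch, using the open-source penman Python package (not part of this release), decodes the classic guideline example "The boy wants to go":

```python
# Illustrative only: decode an AMR graph written in PENMAN notation using
# the open-source `penman` package (pip install penman). The graph is the
# classic guideline example for "The boy wants to go", not an annotation
# drawn from this release.
import penman

amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""

graph = penman.decode(amr)
for source, role, target in graph.triples:
    print(source, role, target)
# ('w', ':instance', 'want-01'), ('w', ':ARG0', 'b'), ...
```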
The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobook speech drawn from the LibriVox project; most of the audiobooks come from Project Gutenberg. The training data is split into three partitions of 100, 360, and 500 hours, while the dev and test data are each split into 'clean' and 'other' categories depending on how well or poorly automatic speech recognition systems are expected to perform on them. Each of the dev and test sets is around 5 hours of audio. The corpus also provides n-gram language models and the corresponding texts, excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.
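As a hedged starting point, torchaudio provides a built-in loader for this corpus; the sketch below assumes torchaudio is installed and uses the standard subset name "train-clean-100":

```python
# Hedged sketch: torchaudio ships a loader for LibriSpeech. Assumes
# torchaudio is installed; "train-clean-100" is one of the standard subset
# names ("dev-clean", "test-other", etc. work the same way).
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True
)

# Each item is (waveform, sample_rate, transcript, speaker_id,
# chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = dataset[0]
print(sample_rate, waveform.shape, transcript[:60])
```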
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure their effects on the benchmark results, and propose corrections to the datasets. By making our analysis pipeline and the accompanying code publicly available, we ensure the reproducibility of our analysis and aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
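As a purely illustrative sketch (not the authors' pipeline; their actual code is linked below), cross-partition duplicates of the kind discussed above can be screened for with perceptual hashing, here using the imagehash package and hypothetical train/ and test/ image directories:

```python
# Illustrative sketch only (not the paper's pipeline; see the GitHub
# repository below for the authors' code): flag potential train/test
# duplicates via perceptual hashing. Assumes the `imagehash` and Pillow
# packages and hypothetical train/ and test/ image directories.
from pathlib import Path

import imagehash
from PIL import Image

def folder_hashes(folder):
    return {p.name: imagehash.phash(Image.open(p))
            for p in Path(folder).glob("*.jpg")}

train, test = folder_hashes("train"), folder_hashes("test")
for test_name, test_hash in test.items():
    for train_name, train_hash in train.items():
        # A small Hamming distance suggests a (near-)duplicate pair.
        if test_hash - train_hash <= 4:
            print(f"possible leakage: {test_name} ~ {train_name}")
```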
If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.
Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.
The corresponding BibTeX entry is:
@article{abhishek2024investigating,
title={Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
author={Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
journal={arXiv preprint arXiv:2401.14497},
doi = {10.48550/ARXIV.2401.14497},
url = {https://arxiv.org/abs/2401.14497},
year={2024}
}
The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.
The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.
The DermaMNIST-E, DermaMNIST-C, and Fitzpatrick17k-C datasets contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.
The code hosted on GitHub is licensed under the Apache License 2.0.
MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It comprises approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. In both phases, spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System (FOSS), a research-based science curriculum for grades K-8. The eight FOSS science modules represented in this data set consisted of an average of 16 small-group classroom science investigations. Following the investigations, students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.

Data

Speech data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. All data collected in Phase I was transcribed using rich transcription guidelines; data collected in Phase II was partially transcribed using a reduced version of those guidelines. The transcription guidelines are included in this release. Data is divided into development, test, and train partitions for use with ASR systems. Speech is presented in single-channel, 16 kHz, 16-bit FLAC-compressed WAV format. Transcripts are UTF-8 encoded plain text.
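As a hedged sanity check on the documented audio format, the sketch below (with a hypothetical file name) reads one utterance using the soundfile package, which decodes FLAC natively:

```python
# Hedged sanity check: read one utterance and confirm the documented
# format (mono, 16 kHz). The file name is hypothetical; `soundfile`
# (pip install soundfile) decodes FLAC natively.
import soundfile as sf

audio, sample_rate = sf.read("myst_student_utterance.flac")
assert sample_rate == 16000      # 16 kHz
assert audio.ndim == 1           # single channel
print(f"{len(audio) / sample_rate:.2f} s of speech")
```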
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset for journal recommendation; includes title, abstract, keywords, and journal.
We extracted the journals and additional information from:
Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.
Dataset Components:
data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.
data_pubmed: To focus on recent and relevant publications, we filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022 (the latest date in the dataset). Additionally, we retained only journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.
data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
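For illustration, a per-journal 60/20/20 split of this kind can be sketched with pandas as follows; the DataFrame and 'journal' column follow the schema above, while the function name and random seed are our own assumptions:

```python
# Sketch of a per-journal 60/20/20 split with pandas. Assumes a DataFrame
# with the 'journal' column described above; the function name and random
# seed are our own choices, not taken from the dataset documentation.
import pandas as pd

def split_per_journal(df: pd.DataFrame, seed: int = 42):
    train, val, test = [], [], []
    for _, group in df.groupby("journal"):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within journal
        n_train = int(0.6 * len(group))
        n_val = int(0.2 * len(group))
        train.append(group.iloc[:n_train])
        val.append(group.iloc[n_train:n_train + n_val])
        test.append(group.iloc[n_train + n_val:])
    return pd.concat(train), pd.concat(val), pd.concat(test)
```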
Description:
This dataset comprises 17 distinct classes of agricultural pests, specifically various insects and mites that affect jute crops. The data is divided into three partitions: train, validation (val), and test sets, ensuring a robust framework for developing and evaluating machine learning models.
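Because the data ships pre-partitioned into folders, it can be loaded directly with torchvision's ImageFolder. A minimal sketch follows; the root path and folder names are assumptions based on the partition description above:

```python
# Minimal sketch: load the pre-partitioned pest images with torchvision.
# The root path and folder names ("train", "val", "test") are assumptions
# based on the partition description above.
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("jute_pest/train", transform=transform)
val_set = datasets.ImageFolder("jute_pest/val", transform=transform)
test_set = datasets.ImageFolder("jute_pest/test", transform=transform)
print(len(train_set.classes))  # expect 17 pest classes
```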
Key Features
Comprehensive Coverage: The dataset includes images of 17 different pest classes, providing a broad spectrum for pest identification and classification.
Structured Partitions: Data is divided into training, validation, and testing sets, facilitating the development of accurate and generalizable models.
High-Quality Images: The dataset contains high-resolution images, ensuring the detailed features of each pest are captured, which is crucial for precise classification.
Usage
This dataset is ideal for:
Training Machine Learning Models: Suitable for developing and refining models aimed at pest detection and classification in agricultural settings.
Research on Pest Management: A valuable resource for studying pest behavior, distribution, and impact on crops, contributing to better pest management strategies.
Educational Purposes: Providing a rich dataset for educational projects in entomology, agriculture, and machine learning.
Additional Applications
Automated Pest Detection: Enhancing the capabilities of automated systems for early pest detection and management in agriculture.
Precision Agriculture: Supporting precision agriculture techniques by enabling targeted pest control measures based on accurate pest identification.
Cross-Domain Studies: Facilitating research on the generalization of pest detection models across different crops and agricultural environments.
This dataset is sourced from Kaggle.
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
OVERVIEW:
The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.
The PodcastFillers dataset homepage: PodcastFillers.github.io
The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils
LICENSE:
The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.
Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.
## License for PodcastFillers Dataset metadata
This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.
1. GRANT OF LICENSE.
1.1 Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution.
1.2 You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
1.3 For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
2. OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
3. DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
4. LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
5. TERM AND TERMINATION.
5.1 The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2.
5.2 Any breach of any material provision of this License will automatically terminate the rights granted herein.
5.3 Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.
## License for PodcastFillers Dataset audio files
All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The episode licenses include CC BY 3.0, CC BY-SA 3.0, and CC BY-ND 3.0.
ACKNOWLEDGEMENT:
Please cite the following paper in work that makes use of this dataset:
Filler Word Detection and Classification: A Dataset and Benchmark
Ge Zhu, Juan-Pablo Caceres and Justin Salamon
In 23rd Annual Cong. of the Int. Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.
Bibtex
@inproceedings{Zhu:FillerWords:INTERSPEECH:22,
title = {Filler Word Detection and Classification: A Dataset and Benchmark},
booktitle = {23rd Annual Cong.~of the Int.~Speech Communication Association (INTERSPEECH)},
address = {Incheon, Korea},
month = {Sep.},
url = {https://arxiv.org/abs/2203.15135},
author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
year = {2022},
}
ANNOTATIONS:
The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.
Full label vocabulary
Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).
Fillers
- Uh: 17,907
- Um: 17,078
- You know: 668
- Other: 315
- Like: 157
Non-fillers
- Words: 12,709
- Repetitions: 9,024
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
- Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
- Noise: 2,735
- Overlap (overlapping speakers): 1,484
Total: 85,803
Consolidated label vocabulary
76,689 of the audio events are also labeled with a smaller, consolidated vocabulary of 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise) and grouping "repetitions" and "words" into a single "words" class (a mapping sketch follows the list below).
- Words: 21,733
- Uh: 17,907
- Um: 17,078
- Breath: 8,288
- Laughter: 6,623
- Music: 5,060
Total: 76,689
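The consolidation amounts to a simple label mapping. The sketch below is our own rendering of the rule described above; the exact label strings used in the dataset's CSV files are an assumption:

```python
# Our rendering of the consolidation rule described above: drop classes
# with fewer than 5,000 annotations and merge "Repetitions" into "Words".
# The exact label strings in the dataset's CSV files are an assumption.
CONSOLIDATED = {
    "Uh": "Uh",
    "Um": "Um",
    "Words": "Words",
    "Repetitions": "Words",  # 12,709 + 9,024 = 21,733 consolidated "Words"
    "Breath": "Breath",
    "Laughter": "Laughter",
    "Music": "Music",
}

def consolidate(label):
    # Returns None for dropped classes (like, you know, other, agreement
    # sounds, overlapping speakers, noise).
    return CONSOLIDATED.get(label)
```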
The consolidated vocabulary was used to train FillerNet.
For a detailed description of how the dataset was created, please see our paper.
Data Split for Machine Learning:
To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and pre-processed 1-second clips) are pre-arranged into "train", "validation", and "test" folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test) to prevent speaker leakage. We also ensured that each subset in this split remains gender-balanced, as is the complete dataset.
We strongly recommend using this split in your experiments. It will ensure that your results are not inflated due to overfitting and that they are comparable to the results published in the FillerNet paper.
AUDIO FILES:
1. Full-length podcast episodes (MP3)
199 audio files of the full-length podcast episode recordings in MP3 format (stereo, 44.1 kHz sample rate, 32-bit depth). Filename format: [show name]_[episode name].mp3.
2. Pre-processed full-length podcast episodes (WAV)
199 audio files of the full-length podcast episode recordings in WAV format (mono, 16 kHz sample rate, 32-bit depth). The files are split into train, validation and test partitions (folders); see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav.
3. Pre-processed WAV clips
Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the midpoint of the event. Annotated events longer than 1 second are truncated around the center to 1 second. The clips are in the same format as the pre-processed full-length podcast episodes: WAV, mono, 16 kHz sample rate, and 32-bit depth.
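The clip construction rule can be sketched as follows; the function and variable names are our illustrative assumptions, not the dataset's own preprocessing code (see PodcastFillersUtils for that):

```python
# Sketch of the clip rule described above: a 1-second window centered on
# the event midpoint, truncating longer events. Function and variable
# names are illustrative; see PodcastFillersUtils for the dataset's own
# preprocessing code.
import soundfile as sf

def extract_clip(episode_wav, event_start, event_end, clip_dur=1.0):
    audio, sr = sf.read(episode_wav)   # pre-processed episodes: mono, 16 kHz
    center = (event_start + event_end) / 2.0
    start = max(0, int((center - clip_dur / 2.0) * sr))
    stop = min(len(audio), start + int(clip_dur * sr))
    return audio[start:stop], sr
```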
The clips that have consolidated vocabulary labels (76,689) are split into "train", "validation" and "test" partitions (folders); see Data Split for Machine Learning above. The remainder of the clips (9,114) are placed in an "extra" folder.