8 datasets found
  1. Abstract Meaning Representation (AMR) Annotation Release 2.0

    • abacus.library.ubc.ca
    iso, txt
    Updated Jun 15, 2017
    Cite
    Abacus Data Network (2017). Abstract Meaning Representation (AMR) Annotation Release 2.0 [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/8MN4GE
    Explore at:
    iso (157806592), txt (1308)
    Available download formats
    Dataset updated
    Jun 15, 2017
    Dataset provided by
    Abacus Data Network
    Time period covered
    1997 - 2017
    Area covered
    France, Taiwan, Province of China, Israel, China, United States
    Dataset funded by
    National Science Foundation
    Defense Advanced Research Projects Agency
    Description

    Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado’s Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums. AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

    Data

    The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program.

    The following table summarizes the number of training, dev, and test AMRs for each dataset in the release, with totals by partition and dataset:

    Dataset                  Training   Dev    Test   Totals
    BOLT DF MT                   1061    133    133     1327
    Broadcast conversation        214      0      0      214
    Weblog and WSJ                  0    100    100      200
    BOLT DF English              6455    210    229     6894
    DEFT DF English             19558      0      0    19558
    Guidelines AMRs               819      0      0      819
    2009 Open MT                  204      0      0      204
    Proxy reports                6603    826    823     8252
    Weblog                        866      0      0      866
    Xinhua MT                     741     99     86      926
    Totals                      36521   1368   1371    39260

    For those interested in using a standard/community partition for AMR research (for instance in the development of semantic parsers), the “split” directory contains the 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The “unsplit” directory contains the same 39,260 AMRs with no train/dev/test partition.
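    For illustration, the release's AMR text files follow the standard layout of blank-line-separated entries, each with "# ::" metadata lines (including "# ::snt" for the source sentence) above a PENMAN graph. The following is a minimal sketch of reading such a file; the helper name and the sample entry are illustrative, not taken from the corpus:

    ```python
    # Minimal sketch: parse AMR-formatted text into (sentence, graph) pairs.
    # Assumes the standard AMR release layout: entries separated by blank
    # lines, "# ::" comment lines for metadata, PENMAN graph body below.

    def parse_amr_entries(text):
        """Yield (sentence, graph) pairs from AMR-formatted text."""
        for block in text.strip().split("\n\n"):
            sentence, graph_lines = None, []
            for line in block.splitlines():
                if line.startswith("# ::snt "):
                    sentence = line[len("# ::snt "):].strip()
                elif not line.startswith("#"):
                    graph_lines.append(line)
            if graph_lines:
                yield sentence, "\n".join(graph_lines)

    # Illustrative entry in the release's format (not a real corpus item):
    sample = """# ::id example.1
    # ::snt The boy wants to go.
    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (g / go-02
          :ARG0 b))"""

    entries = list(parse_amr_entries(sample))
    ```

    The same reader works for either the "split" or "unsplit" directory, since both use the same entry format.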

  2. Abstract Meaning Representation (AMR) Annotation Release 3.0

    • abacus.library.ubc.ca
    iso, txt
    Updated Sep 3, 2021
    Cite
    Abacus Data Network (2021). Abstract Meaning Representation (AMR) Annotation Release 3.0 [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl%3A11272.1%2FAB2%2F82CVJF&version=&q=&fileAccess=Restricted&fileTag=%22Data%22&fileSortField=name&fileSortOrder=desc
    Explore at:
    iso (276281344), txt (1308)
    Available download formats
    Dataset updated
    Sep 3, 2021
    Dataset provided by
    Abacus Data Network
    Description

    Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release adds new data to, and updates material contained in, Abstract Meaning Representation 2.0 (LDC2017T10), specifically: more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations. AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independently of its syntax. LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12) and Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

    Data

    The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program. New source data in AMR 3.0 includes sentences from Aesop's Fables, parallel text and the situation frame data set developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.

    The following table summarizes the number of training, dev, and test AMRs for each dataset in the release, with totals by partition and dataset:

    Dataset                  Training   Dev    Test   Totals
    BOLT DF MT                   1061    133    133     1327
    Broadcast conversation        214      0      0      214
    Weblog and WSJ                  0    100    100      200
    BOLT DF English              7379    210    229     7818
    DEFT DF English             32915      0      0    32915
    Aesop fables                   49      0      0       49
    Guidelines AMRs               970      0      0      970
    LORELEI                      4441    354    527     5322
    2009 Open MT                  204      0      0      204
    Proxy reports                6603    826    823     8252
    Weblog                        866      0      0      866
    Wikipedia                     192      0      0      192
    Xinhua MT                     741     99     86      926
    Totals                      55635   1722   1898    59255

    The "split" directory contains the 59,255 AMRs split roughly 93.9%/2.9%/3.2% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The "unsplit" directory contains the same 59,255 AMRs with no train/dev/test partition.

  3. LibriSpeech Dataset

    • paperswithcode.com
    Updated Oct 22, 2024
    Cite
    Vassil Panayotov; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur (2021). LibriSpeech Dataset [Dataset]. https://paperswithcode.com/dataset/librispeech
    Explore at:
    Dataset updated
    Oct 22, 2024
    Authors
    Vassil Panayotov; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur
    Description

    The LibriSpeech corpus is a collection of approximately 1,000 hours of audiobook recordings from the LibriVox project; most of the audiobooks come from Project Gutenberg. The training data is split into three partitions of 100 hr, 360 hr, and 500 hr, while the dev and test data are each split into 'clean' and 'other' categories depending on how challenging the recordings are for automatic speech recognition systems. Each of the dev and test sets is around 5 hr of audio. The corpus also provides n-gram language models and the corresponding texts excerpted from Project Gutenberg books, which contain 803M tokens and 977K unique words.

  4. Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated May 12, 2024
    Cite
    Kumar Abhishek; Aditi Jain; Ghassan Hamarneh (2024). Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [Dataset]. http://doi.org/10.5281/zenodo.11101338
    Explore at:
    bin, csv, zip
    Available download formats
    Dataset updated
    May 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kumar Abhishek; Aditi Jain; Ghassan Hamarneh
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Abstract

    The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k, uncovering these data quality issues, measuring their effects on the benchmark results, and proposing corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.

    Citation

    If you find this project useful or if you use our newly proposed datasets and/or our analyses, please cite our paper.

    Kumar Abhishek, Aditi Jain, Ghassan Hamarneh. "Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets". arXiv preprint arXiv:2401.14497, 2024. DOI: 10.48550/ARXIV.2401.14497.

    The corresponding BibTeX entry is:

    @article{abhishek2024investigating,
    title={Investigating the Quality of {DermaMNIST} and {Fitzpatrick17k} Dermatological Image Datasets},
    author={Abhishek, Kumar and Jain, Aditi and Hamarneh, Ghassan},
    journal={arXiv preprint arXiv:2401.14497},
    doi = {10.48550/ARXIV.2401.14497},
    url = {https://arxiv.org/abs/2401.14497},
    year={2024}
    }

    Project Website

    The results of the analysis, including the visualizations, are available on the project website: https://derm.cs.sfu.ca/critique/.

    Code

    The accompanying code for this project is hosted on GitHub at https://github.com/kakumarabhishek/Corrected-Skin-Image-Datasets.

    License

    The DermaMNIST-E, DermaMNIST-C, and Fitzpatrick17k-C datasets contained in this repository are licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

    The code hosted on GitHub is licensed under the Apache License 2.0.

  5. MyST Children's Conversational Speech

    • abacus.library.ubc.ca
    pdf, txt
    Updated Aug 18, 2023
    Cite
    Abacus Data Network (2023). MyST Children's Conversational Speech [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl%3A11272.1%2FAB2%2FQUHJRW&version=&q=&fileTypeGroupFacet=%22Text%22&fileAccess=Restricted
    Explore at:
    pdf (31909), txt (3132)
    Available download formats
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    Abacus Data Network
    Description

    MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It comprises approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. In both phases, spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System (FOSS), a research-based science curriculum for grades K-8. The eight FOSS science modules represented in this data set consisted of an average of 16 small-group classroom science investigations. Following the investigations, students conversed with the virtual science tutor for 15-20 minutes; the tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.

    Data

    Speech data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. All data collected in Phase I was transcribed using rich transcription guidelines; data collected in Phase II was partially transcribed using a reduced version of those guidelines. The transcription guidelines are included in this release. Data is divided into development, test, and train partitions for use with ASR systems. Speech is presented in single-channel, 16 kHz, 16-bit flac-compressed wav format. Transcripts are UTF-8 encoded plain text.

  6. Pubmed Journal Recommendation System dataset

    • data.niaid.nih.gov
    Updated Mar 25, 2025
    Cite
    Jiayun Liu (2025). Pubmed Journal Recommendation System dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8386010
    Explore at:
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Raúl García Castro
    Manuel Castillo Cara
    Jiayun Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for Journal recommendation, includes title, abstract, keywords, and journal.

    We extracted the journals and additional information from:

    Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.

    Dataset Components:

    data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.

    data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.

    data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.
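    The per-journal 60/20/20 partitioning described above can be sketched in a few lines of plain Python. The record structure and the 'journal' key below are assumptions for illustration, not the dataset's actual schema:

    ```python
    # Sketch of a per-journal (stratified) 60/20/20 train/val/test split,
    # so each journal's articles are proportionally represented in all sets.
    import random

    def split_per_journal(articles, seed=0, ratios=(0.6, 0.2, 0.2)):
        """Partition article records into train/val/test, per journal."""
        by_journal = {}
        for art in articles:
            by_journal.setdefault(art["journal"], []).append(art)
        rng = random.Random(seed)  # fixed seed for reproducibility
        train, val, test = [], [], []
        for group in by_journal.values():
            rng.shuffle(group)
            n_train = int(len(group) * ratios[0])
            n_val = int(len(group) * ratios[1])
            train += group[:n_train]
            val += group[n_train:n_train + n_val]
            test += group[n_train + n_val:]
        return train, val, test

    # Toy records standing in for the real articles:
    articles = [{"journal": j, "idx": i} for j in ("A", "B") for i in range(10)]
    train, val, test = split_per_journal(articles)
    ```

    With 10 articles per journal, each journal contributes 6/2/2 articles to train/val/test, mirroring the 60/20/20 design described above.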

  7. Jute Pest Dataset

    • paperswithcode.com
    • gts.ai
    Updated May 8, 2025
    Cite
    (2025). Jute Pest Dataset [Dataset]. https://paperswithcode.com/dataset/jute-pest
    Explore at:
    Dataset updated
    May 8, 2025
    Description


    This dataset comprises 17 distinct classes of agricultural pests, specifically various insects and mites that affect jute crops. The data is divided into three partitions: train, validation (val), and test sets, ensuring a robust framework for developing and evaluating machine learning models.


    Key Features

    Comprehensive Coverage: The dataset includes images of 17 different pest classes, providing a broad spectrum for pest identification and classification.

    Structured Partitions: Data is divided into training, validation, and testing sets, facilitating the development of accurate and generalizable models.

    High-Quality Images: The dataset contains high-resolution images, ensuring the detailed features of each pest are captured, which is crucial for precise classification.

    Usage

    This dataset is ideal for:

    Training Machine Learning Models: Suitable for developing and refining models aimed at pest detection and classification in agricultural settings.

    Research on Pest Management: A valuable resource for studying pest behavior, distribution, and impact on crops, contributing to better pest management strategies.

    Educational Purposes: Providing a rich dataset for educational projects in entomology, agriculture, and machine learning.

    Additional Applications

    Automated Pest Detection: Enhancing the capabilities of automated systems for early pest detection and management in agriculture.

    Precision Agriculture: Supporting precision agriculture techniques by enabling targeted pest control measures based on accurate pest identification.

    Cross-Domain Studies: Facilitating research on the generalization of pest detection models across different crops and agricultural environments.

    This dataset is sourced from Kaggle.

  8. PodcastFillers

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Oct 9, 2022
    Cite
    Ge Zhu; Juan-Pablo Caceres; Justin Salamon (2022). PodcastFillers [Dataset]. http://doi.org/10.5281/zenodo.7121457
    Explore at:
    zip, bin, csv
    Available download formats
    Dataset updated
    Oct 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ge Zhu; Juan-Pablo Caceres; Justin Salamon
    License

    Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
    License information was derived automatically

    Description

    OVERVIEW:
    The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.

    The PodcastFillers dataset homepage: PodcastFillers.github.io
    The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils

    LICENSE:

    The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.

    Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.

    ## License for PodcastFillers Dataset metadata

    This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.

    1. GRANT OF LICENSE.
    1.1 Adobe grants you a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials for noncommercial research purposes only; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, for noncommercial research purposes only, provided that you give recipients a copy of this License upon redistribution.
    1.2 You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
    1.3 For purposes of this License, noncommercial research purposes include academic research and teaching only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
    2. OWNERSHIP AND ATTRIBUTION. Adobe and its licensors own all right, title, and interest in the Licensed Materials. You must retain all copyright notices and/or disclaimers in the Licensed Materials.
    3. DISCLAIMER OF WARRANTIES. THE LICENSED MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS.
    4. LIMITATION OF LIABILITY. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    5. TERM AND TERMINATION.
    5.1 The License is effective upon acceptance by you and will remain in effect unless terminated earlier in accordance with Section 5.2.
    5.2 Any breach of any material provision of this License will automatically terminate the rights granted herein.
    5.3 Sections 2 (Ownership and Attribution), 3 (Disclaimer of Warranties), 4 (Limitation of Liability) will survive termination of this License.
    ## License for PodcastFillers Dataset audio files

    All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The episode licenses include CC-BY-3.0, CC-BY-SA 3.0 and CC-BY-ND-3.0.

    ACKNOWLEDGEMENT:
    Please cite the following paper in work that makes use of this dataset:

    Filler Word Detection and Classification: A Dataset and Benchmark
    Ge Zhu, Juan-Pablo Caceres and Justin Salamon
    In 23rd Annual Cong. of the Int. Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.

    Bibtex

    @inproceedings{Zhu:FillerWords:INTERSPEECH:22,
     title = {Filler Word Detection and Classification: A Dataset and Benchmark},
     booktitle = {23rd Annual Cong.~of the Int.~Speech Communication Association (INTERSPEECH)},
     address = {Incheon, Korea}, 
     month = {Sep.},
     url = {https://arxiv.org/abs/2203.15135},
     author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
     year = {2022},
    }

    ANNOTATIONS:
    The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically-generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.
    Full label vocabulary
    Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).

    Fillers
    - Uh: 17,907
    - Um: 17,078
    - You know: 668
    - Other: 315
    - Like: 157

    Non-fillers
    - Words: 12,709
    - Repetitions: 9,024
    - Breath: 8,288
    - Laughter: 6,623
    - Music: 5,060
    - Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755
    - Noise: 2,735
    - Overlap (overlapping speakers): 1,484

    Total: 85,803
    Consolidated label vocabulary
    76,689 of the audio events are also labeled with a smaller, consolidated vocabulary of 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (Like, You know, Other, Agree, Overlap, Noise), and grouping “Repetitions” and “Words” into “Words”.

    - Words: 21,733
    - Uh: 17,907
    - Um: 17,078
    - Breath: 8,288
    - Laughter: 6,623
    - Music: 5,060

    - Total: 76,689
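    The consolidation rule above (drop classes with fewer than 5,000 annotations, then merge “Repetitions” into “Words”) can be checked with a few lines of Python against the counts listed in this description:

    ```python
    # Counts from the full label vocabulary above.
    counts = {
        "Uh": 17907, "Um": 17078, "You know": 668, "Other": 315, "Like": 157,
        "Words": 12709, "Repetitions": 9024, "Breath": 8288, "Laughter": 6623,
        "Music": 5060, "Agree": 3755, "Noise": 2735, "Overlap": 1484,
    }

    # Step 1: drop classes with fewer than 5,000 annotations.
    kept = {label: n for label, n in counts.items() if n >= 5000}

    # Step 2: merge "Repetitions" and "Words" into a single "Words" class.
    consolidated = {}
    for label, n in kept.items():
        label = "Words" if label in ("Words", "Repetitions") else label
        consolidated[label] = consolidated.get(label, 0) + n
    ```

    This reproduces the six consolidated classes above, including Words = 21,733 (12,709 + 9,024) and the total of 76,689 events.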

    The consolidated vocabulary was used to train FillerNet.

    For a detailed description of how the dataset was created, please see our paper.
    Data Split for Machine Learning:
    To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-sec clips) are pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset in this split remains gender balanced, same as the complete dataset.

    We strongly recommend using this split in your experiments. It will ensure your results are not inflated due to overfitting, and that they are comparable to the results published in the FillerNet paper.

    AUDIO FILES:

    1. Full-length podcast episodes (MP3)
    199 audio files of the full-length podcast episode recordings in mp3 format, stereo channels, 44.1 kHz sample rate and 32 bit depth. Filename format: [show name]_[episode name].mp3.

    2. Pre-processed full-length podcast episodes (WAV)
    199 audio files of the full-length podcast episode recordings in wav format, mono channel, 16 kHz sample rate and 32 bit depth. The files are split into train, validation and test partitions (folders), see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav

    3. Pre-processed WAV clips
    Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. Annotated events longer than 1 second are truncated to 1 second around their center. The clips are in the same format as the pre-processed full-length podcast episodes: wav format, mono channel, 16 kHz sample rate and 32 bit depth.

    The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation” and “test” partitions (folders); see Data Split for Machine Learning above. The remainder of the clips (9,114) are placed in an “extra” folder.
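    The clip geometry described above (1-second clips at 16 kHz, centered on the event center, clamped to the episode boundaries) can be sketched as a small index calculation. The function and variable names are illustrative, not taken from the dataset's tooling:

    ```python
    # Sketch: sample indices of a 1-second clip centered on an annotated
    # event in a pre-processed 16 kHz episode. Events longer than 1 second
    # are effectively truncated from their center by this windowing.

    SR = 16000          # sample rate of the pre-processed WAV episodes
    CLIP_SAMPLES = SR   # 1-second clips

    def clip_bounds(event_start_s, event_end_s, n_samples):
        """Return (start, end) sample indices of the 1-s clip for an event."""
        center = int((event_start_s + event_end_s) / 2 * SR)
        # Clamp so the clip stays inside the episode.
        start = max(0, min(center - CLIP_SAMPLES // 2,
                           n_samples - CLIP_SAMPLES))
        return start, start + CLIP_SAMPLES

    # e.g. an "um" annotated from 3.2 s to 3.5 s in a 60-s episode:
    start, end = clip_bounds(3.2, 3.5, 60 * SR)
    ```

    The returned indices can then be used to slice the decoded waveform of the corresponding pre-processed episode.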
