MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:
- ISBI Particle Tracking Challenge: microtubule, vesicle, receptor
- Custom synthetic (based on http://smal.ws): particle
- Custom fixed cell: smfish
- Custom live cell: suntag
The csv files determine which image in the test splits corresponds to which original image, SNR, and density.
CSIRO Data Licence https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).
It is not classed as an 'official statistic'. This means it's not subject to scrutiny and assessment by the UK Statistics Authority.
The MOT test checks that your vehicle meets road safety and environmental standards. Different types of vehicles (for example, cars and motorcycles) fall into different 'classes'.
This data table shows the number of initial tests. It does not include abandoned tests, aborted tests, or retests.
The initial fail rate is the rate for vehicles as they were brought for the MOT. The final fail rate excludes vehicles that pass the test after rectification of minor defects at the time of the test.
This data table is updated every 3 months.
Ref: DVSA/MOT/01. Download CSV (16.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060287/dvsa-mot-01-mot-test-results-by-class-of-vehicle1.csv
These tables give data for the following classes of vehicles:
All figures are for vehicles as they were brought in for the MOT.
A failed test usually has multiple failure items.
The percentage of tests is worked out as the number of tests with one or more failure items in the defect category as a percentage of total tests.
The percentage of defects is worked out as the total defects in the category as a percentage of total defects for all categories.
The average defects per initial test failure is worked out as the total number of failure items divided by the total of tests failed plus tests that passed after rectification of a minor defect at the time of the test.
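Expressed as formulas, the three calculations above might look like the following sketch (variable names are illustrative only, not the published column headings):

def initial_fail_rate(initial_failures: int, total_tests: int) -> float:
    """Failures as vehicles were brought in, as a percentage of all initial tests."""
    return 100 * initial_failures / total_tests

def final_fail_rate(initial_failures: int, passed_after_rectification: int,
                    total_tests: int) -> float:
    """Excludes vehicles that passed after rectification of minor defects at the test."""
    return 100 * (initial_failures - passed_after_rectification) / total_tests

def avg_defects_per_initial_failure(total_failure_items: int, tests_failed: int,
                                    passed_after_rectification: int) -> float:
    """Total failure items divided by tests failed plus tests that passed after
    rectification of a minor defect at the time of the test."""
    return total_failure_items / (tests_failed + passed_after_rectification)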
These data tables are updated every 3 months.
Ref: DVSA/MOT/02. Download CSV (19.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060255/dvsa-mot-02-mot-class-1-and-2-vehicles-initial-failures-by-defect-category-.csv
http://researchdatafinder.qut.edu.au/display/n15252
This file contains the features for the test portion of the movie dataset. The data has been converted into average word vectors. This is 50% of the total movie results. QUT Research Data Repository. Dataset resource available for download.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
programs.csv:
testing-files.csv:
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
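A minimal sketch of this root-file detection rule (illustrative only; the actual logic lives in search-repositories.py):

from pathlib import Path

# File names that identify an IaC project type in a repository's root folder
# (case-sensitive), following the rules above.
MARKERS = {
    "Pulumi": ["Pulumi.yaml", "Pulumi.yml"],
    "AWS CDK": ["cdk.json"],
    "CDKTF": ["cdktf.json"],
}

def detect_iac_types(repo_root: str) -> list[str]:
    # Returns the IaC project types whose marker file exists in the root folder.
    root = Path(repo_root)
    return [kind for kind, names in MARKERS.items()
            if any((root / name).is_file() for name in names)]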
The script uses the GitHub code search API and inherits its limitations:
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
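The equivalent git invocation is roughly the following; this is a sketch of the behaviour described above, not the script's exact code:

import subprocess

def shallow_clone(clone_url: str, repo_id: str) -> None:
    # Shallow copy of HEAD only (most recent state of the default branch),
    # including submodules, without the rest of the git history.
    # The repository is placed in a subfolder named by its ID.
    subprocess.run(
        ["git", "clone", "--depth", "1",
         "--recurse-submodules", "--shallow-submodules",
         clone_url, repo_id],
        check=True,
    )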
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
10
"""
_HOMEPAGE = "https://gitee.com/didi233/test_date_gitee"
_LICENSE = "Creative Commons Attribution 4.0 International"
_TRAIN_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/train.csv"
_TEST_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/test.csv"
class test_data_huggingface(datasets.GeneratorBasedBuilder):
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an open source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up uploads and downloads and mostly for the sake of convenience. If you have any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
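For example, the tabular PCA features can be fed directly to any standard classifier; a minimal sketch, assuming the label column is named "label" (adjust to the actual column name in the CSV files):

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("pca_features.csv")
test = pd.read_csv("pca_test_features.csv")

# 80 PCA components extracted from ResNet50 features; "label" column name is an assumption.
X_train, y_train = train.drop(columns=["label"]), train["label"]
X_test, y_test = test.drop(columns=["label"]), test["label"]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))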
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
QASPER: NLP Questions and Evidence Discovering Answers with Expertise By Huggingface Hub [source]
About this dataset QASPER is an incredible collection of over 5,000 questions and answers on a vast range of Natural Language Processing (NLP) papers -- all crowdsourced from experienced NLP practitioners. Each question in the dataset is written based only on the titles and abstracts of the corresponding paper, providing an insight into how the experts understood and parsed various materials. The answers to each query have been expertly enriched by evidence taken directly from the full text of each paper. Moreover, QASPER comes with carefully crafted fields that contain relevant information including "qas" (questions and answers); "evidence" (evidence provided for answering questions); title; abstract; figures_and_tables; and full_text. All this adds up to create a remarkable dataset for researchers looking to gain insights into how practitioners interpret NLP topics while providing effective validation when it comes to finding clear-cut solutions to problems encountered in existing literature.
More Datasets For more datasets, click here.
How to use the dataset This guide will provide instructions on how to use the QASPER dataset of Natural Language Processing (NLP) questions and evidence. The QASPER dataset contains 5,049 questions over 1,585 papers that have been crowdsourced by NLP practitioners. To get the most out of this dataset we will show you how to access the questions and evidence, as well as provide tips for getting started.
Step 1: Accessing the Dataset To access the data you can download it from Kaggle's website or through a code version control system like GitHub. Once downloaded, you will find five files: two test data sets (test.csv and validation.csv), two train data sets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and one figure data set (figures_and_tables_.json). Each file contains a different data set, with columns representing titles, abstracts, full texts, and Q&A fields with evidence for each paper mentioned in each row.
**Step 2: Analyzing Your Data Sets** Now would be a good time to explore your datasets using basic descriptive statistics, or more advanced predictive analytics such as logistic regression or naive Bayes models, depending on what kind of analysis you would like to undertake with this dataset. You can start simple by summarizing some basic crosstabs between any two variables that comprise your dataset (titles, abstracts, etc.). As an example, try correlating title lengths with the number of words in the corresponding abstracts, then check if there is anything worth investigating further.
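A minimal sketch of that example correlation, assuming the file contains columns named "title" and "abstract" (the column names are assumptions, not confirmed headers):

import pandas as pd

df = pd.read_csv("train-v2-0_lessons_only_.csv")

title_len = df["title"].str.len()              # title length in characters
abstract_words = df["abstract"].str.split().str.len()  # abstract length in words

print("Pearson correlation:", title_len.corr(abstract_words))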
**Step 3: Define Your Research Questions & Perform Further Analysis** Once satisfied with your initial exploration, it is time to dig deeper into the underlying question-and-answer relationships among the different variables comprising your main documents. One way would be to use text-mining technologies such as topic modeling or other machine learning techniques, or even automated processes that may help summarize any underlying patterns. Yet another approach could involve filtering terms that are relevant to a specific research hypothesis, then processing such terms via web crawlers, search engines, document similarity algorithms, etc.
Finally, once all relevant parameters are defined, analyzed, and searched, it would make sense to draw preliminary conclusions linking them back together before conducting replicable tests to ensure reproducible results.
Research Ideas Developing AI models to automatically generate questions and answers from paper titles and abstracts. Enhancing machine learning algorithms by combining the answers with the evidence provided in the dataset to find relationships between papers. Creating online forums for NLP practitioners that use questions from this dataset to spark discussion within the community.
CC0
Original Data Source: QASPER: NLP Questions and Evidence
http://researchdatafinder.qut.edu.au/display/n15252
This file contains the labels for the test portion of the movie dataset. This is 50% of the total movie results. QUT Research Data Repository. Dataset resource available for download.
The test set from "Titanic - Machine Learning from Disaster" doesn't include labels.
This is an augmented version of the test set with the correct labels, retrieved from the original Titanic dataset at: https://www.openml.org/search?type=data&sort=runs&id=40945&status=active
The accuracy of the labels was validated by getting a 1.0 score in the competition with them.
This dataset is provided for educational purposes, and is not intended to help people cheat in the competition. If the only reason you want to download this is so you can get a shiny 1.0 on the leaderboards, don't do it.
"""
_HOMEPAGE = "https://github.com/freeziyou/live_stream_dataset"
_LICENSE = "Creative Commons Attribution 4.0 International"
_TRAIN_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/train.csv"
_TEST_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/test.csv"
class live_stream_dataset_huggingface(datasets.GeneratorBasedBuilder):
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Acknowledgement
These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union's Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union's Horizon 2020 research and innovation programme and the Shift2Rail JU members other than the Union.
Disclaimers
The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU's behalf may be held responsible for the use which may be made of the information contained therein.
This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.
General Info
The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.
When using any of these data, please mention:
De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023
Content of the folder
This folder contains the following subfolders and files.
"Data Files" contains all the CSV files related to the data composing the LCWB Dataset:
"LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:
"Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:
CSV Files Structure
Each "XX_labels.csv" file contains, for each entry, the following information:
It is worth mentioning that sub-classes do not have a specific purpose in our task. They have been kept to preserve, as much as possible, the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structure as the "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.
Indeed, each "XX_data.csv" file contains, for each entry, the following information:
Credits
The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Particularly, from AudioSet, we retrieved:
Pointers contained in "XX_data.csv" files other than GE_data.csv have been retrieved manually from scratch. Then, the related "XX_labels.csv" files have been created consequently.
More about downloading the AudioSet dataset can be found here.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
GFM-Attack
How to download this dataset
git lfs install
git clone https://huggingface.co/datasets/magicslabnu/GFM-Attack
How to use data
Each directory corresponds to a dataset and contains the standard files: train.csv, dev.csv, and test.csv. You may select any of these dataset folders to perform adversarial training. For example, to use the tf1 dataset for adversarial training, utilize the train.csv file located within the tf1 folder.
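A minimal sketch of loading one dataset folder with pandas (the GFM-Attack folder path follows from the clone step above; adjust the paths as needed):

import pandas as pd

dataset = "tf1"  # any dataset folder in the cloned repository
train = pd.read_csv(f"GFM-Attack/{dataset}/train.csv")
dev = pd.read_csv(f"GFM-Attack/{dataset}/dev.csv")
test = pd.read_csv(f"GFM-Attack/{dataset}/test.csv")
# use `train` for adversarial training and `dev`/`test` for evaluation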
Paper… See the full description on the dataset page: https://huggingface.co/datasets/magicslabnu/GenoAdv.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset compares fixed-line broadband internet speeds across five cities:
- Melbourne, AU
- Bangkok, TH
- Shanghai, CN
- Los Angeles, US
- Alice Springs, AU
ERRATA: 1. Data is for Q3 2020, but some files are labelled incorrectly as 02-20 or June 20. They should all read Sept 20, or 09-20 (Q3 20), rather than Q2. Will rename and reload. Amended in v7.
Lines of data for each geojson file; a line equates to a 600m^2 location, including total tests, devices used, and average upload and download speed:
- MEL: 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
- SHG: 31745 lines => 0.65M speedtests (2.5/100pp)
- BKK: 29296 lines => 1.5M speedtests (14.3/100pp)
- LAX: 15899 lines => 1.3M speedtests (10.4/100pp)
- ALC: 76 lines => 500 speedtests (2/100pp)
Geojsons of these 2-degree by 2-degree extracts for MEL, BKK, SHG now added, and LAX added in v6. Alice Springs added in v15.
This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.
** To Do: Will add Google Map versions so everyone can see without installing Jupyter. - Link to Google Map (BKK) added below. Key: Green > 100 Mbps (Superfast); Black > 500 Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook. - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100 Mbps. Suggest plotting the Top 20% on a map for the community. Google Map link now added (and tweet).
** Python
melb = au_tiles.cx[144:146, -39:-37]   # Lat/Lon extract
shg = tiles.cx[120:122, 30:32]         # Lat/Lon extract
bkk = tiles.cx[100:102, 13:15]         # Lat/Lon extract
lax = tiles.cx[-118:-120, 33:35]       # Lat/Lon extract
ALC = tiles.cx[132:134, -22:-24]       # Lat/Lon extract
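A minimal end-to-end sketch of such an extract, assuming the Speedtest tiles have been loaded from a geojson file; the file name and the avg_d_kbps column name are assumptions based on the Speedtest Open Data tile schema, so see the notebooks in this record for the actual workflow:

import geopandas as gpd

tiles = gpd.read_file("melbourne_tiles.geojson")   # file name is an assumption

# 2-degree lat/lon extract around Melbourne, as in the snippet above
melb = tiles.cx[144:146, -39:-37]

# average download speed; Speedtest tiles report avg_d_kbps, converted here to Mbps
print("Mean download (Mbps):", melb["avg_d_kbps"].mean() / 1000)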
Histograms (v9) and data visualisations (v3, 5, 9, 11) will be provided. Data sourced from: this is an extract of Speedtest Open Data available at Amazon AWS (link below - opendata.aws).
**VERSIONS
v24. Add tweet and Google Map of Top 20% (over 100Mbps locations) in Mel Q3 22. Add v1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
v23. Add graph of 2022 broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
v22. Add Import ipynb; workflow-import-4cities.
v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)
v20. Speedtest - Five Cities inc ALC.
v19. Add ALC2.ipynb.
v18. Add ALC line graph.
v17. Added ipynb for ALC. Added ALC to title.
v16. Load Alice Springs Data Q2 21 - csv. Added Google Map link of ALC.
v15. Load Melb Q1 2021 data - csv.
v14. Added Melb Q1 2021 data - geojson.
v13. Added Twitter link to pics.
v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
v11. Add Line-Compare pic, plotting Four Cities on a graph.
v10. Add Four Histograms in one pic.
v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
v8. Renamed LAX file to Q3, rather than 03.
v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
v6. Added LAX file.
v5. Add screenshot of BKK Google Map.
v4. Add BKK Google Map (link below), and BKK csv mapping files.
v3. Replaced MEL map with big key version. Previous key was very tiny in top right corner.
v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
v1. Metadata record.
** LICENCE: The AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be:
- non-commercial (NC)
- reuse must be share-alike (SA) (add the same licence).
This restricts the standard CC-BY Figshare licence.
** Other uses of Speedtest Open Data; - see link at Speedtest below.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Checkmynet.lu is a measurement tool to test the speed and quality of internet connections. It was published by ILR (Luxembourg Institute of Regulation, www.ilr.lu). The published data, accessible via API or simple *.csv download, contains the test results such as download speed, upload speed, operator, equipment used, GPS coordinates, and so on. Checkmynet.lu is an independent, crowd-sourced, open-source and open-data based solution:
• Designed to measure availability, quality and neutrality of the internet
• Generates and processes all results objectively, securely and transparently
• Tests 150+ parameters: speed, Quality of Service & Quality of Experience
• Runs on Android, iOS, web browsers
• Displays results on a map with several filter options
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
RAVDESS
This is an audio classification dataset for Emotion Recognition. Classes = 8 , Split = Train-Test
Structure
The audios folder contains the audio files; train.csv lists the training split and test.csv the testing split.
Download
import os
import huggingface_hub
audio_datasets_path = "DATASET_PATH/Audio-Datasets"
if not os.path.exists(audio_datasets_path):
    print(f"Given {audio_datasets_path=} does not exist. Specify a valid path ending with…
See the full description on the dataset page: https://huggingface.co/datasets/MahiA/RAVDESS.
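Alternatively, a minimal sketch of downloading the whole dataset repository with huggingface_hub (the target folder is an assumption):

from huggingface_hub import snapshot_download

# Downloads the full dataset repository (audios folder, train.csv, test.csv) locally.
snapshot_download(
    repo_id="MahiA/RAVDESS",
    repo_type="dataset",
    local_dir="Audio-Datasets/RAVDESS",
)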
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.
Data curators
Eduardo Fonseca and Mercedes Collado
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
Citation
If you use this dataset or part of it, please cite the following ICASSP 2019 paper:
Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, "Learning Sound Event Classifiers from Web Audio with Noisy Labels", arXiv preprint arXiv:1901.01189, 2019
You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
FSDnoisy18k description
What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.
The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group, hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.
We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).
The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., "Fire", "Glass", "Wind" or "Walk, footsteps".
The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.
Code
We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.
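For illustration, log-mel energies of a single clip can be computed along these lines; this is a sketch, and the file name and parameters below are assumptions rather than the repository's exact configuration:

import librosa

# Load one audio clip (file name and sample rate are assumptions).
y, sr = librosa.load("some_clip.wav", sr=44100, mono=True)

# Mel spectrogram followed by a log (dB) scaling yields the log-mel energies.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=96)
log_mel = librosa.power_to_db(mel)   # shape: (n_mels, frames)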
Label noise characteristics
FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.
FSDnoisy18k basic characteristics
The dataset's most relevant characteristics are as follows:
License
FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of audio clips and their corresponding license in the LICENSE-INDIVIDUAL-CLIPS.csv file downloaded with the dataset.
In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.
Files
FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDnoisy18k.audio_train/          Audio clips in the train set
│
└───FSDnoisy18k.audio_test/           Audio clips in the test set
│
└───FSDnoisy18k.meta/                 Files for evaluation setup
│   │
│   └───train.csv                     Data split and ground truth for the train set
│   │
│   └───test.csv                      Ground truth for the test set
│
└───FSDnoisy18k.doc/
    │
    └───README.md                     The dataset description file that you are reading
    │
    └───LICENSE-DATASET               License of the FSDnoisy18k dataset as an entity
    │
    └───LICENSE-INDIVIDUAL-CLIPS.csv  Licenses of the individual audio clips from Freesound
Each row (i.e., audio clip) of the train.csv file contains the following
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The train set and test set of NSL-KDD
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Details: The INCLUDE dataset has 4292 videos (the paper mentions 4287 videos, but 5 videos were added later). The videos used for training are listed in train.csv (3475 files), while those used for testing are listed in test.csv (817 files). Each video is a recording of 1 ISL sign, signed by deaf students from St. Louis School for the Deaf, Adyar, Chennai.
INCLUDE50 has 766 train videos and 192 test videos.
Train-Test Split: Please download the train-test split for INCLUDE and INCLUDE50 from here: Train-Test Split
Publication Link: https://dl.acm.org/doi/10.1145/3394171.3413528
AI4Bharat website: https://sign-language.ai4bharat.org/
Download Instructions
For ease of access, we have prepared a Shell Script to download all the parts of the dataset and extract them to form the complete INCLUDE dataset.
You can find the script here: http://bit.ly/include_dl
Paper Abstract: Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset