39 datasets found
  1. Datasets

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Bastian Eichenberger; YinXiu Zhan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits and can be used directly. The files belong to the following challenges / classes:

    • ISBI Particle tracking challenge: microtubule, vesicle, receptor
    • Custom synthetic (based on http://smal.ws): particle
    • Custom fixed cell: smfish
    • Custom live cell: suntag

    The csv files map each image in the test splits to its original image, SNR, and density.
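
    As a rough sketch of how one might check which splits an npz file contains before loading it: an .npz file is a zip archive holding one .npy member per array, so the array names can be listed with the standard library alone. The array names inside the deepBlink files are not stated here, and the file name in the usage note is hypothetical.

```python
import zipfile


def list_npz_arrays(path):
    """Return the names of the arrays stored in an .npz archive."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as a zip member named "<array_name>.npy".
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )
```

    For example, list_npz_arrays("smfish.npz") (hypothetical file name) would return the stored array names, which can then be loaded with numpy.load.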

  2. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1more
    Updated Oct 13, 2014
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIRO (http://www.csiro.au/)
    License

    https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  3. MOT testing data for Great Britain

    • s3.amazonaws.com
    • gov.uk
    Updated Mar 24, 2022
    Cite
    Driver and Vehicle Standards Agency (2022). MOT testing data for Great Britain [Dataset]. https://s3.amazonaws.com/thegovernmentsays-files/content/179/1797262.html
    Dataset updated
    Mar 24, 2022
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Driver and Vehicle Standards Agency
    Area covered
    United Kingdom, Great Britain
    Description

    About this data set

    This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).

    It is not classed as an ‘official statistic’. This means it’s not subject to scrutiny and assessment by the UK Statistics Authority.

    MOT test results by class

    The MOT test checks that your vehicle meets road safety and environmental standards. Different types of vehicles (for example, cars and motorcycles) fall into different ‘classes’.

    This data table shows the number of initial tests. It does not include abandoned tests, aborted tests, or retests.

    The initial fail rate is the rate for vehicles as they were brought for the MOT. The final fail rate excludes vehicles that pass the test after rectification of minor defects at the time of the test.

    This data table is updated every 3 months.

    Initial failures by defect category

    These tables give data for the following classes of vehicles:

    • class 1 and 2 vehicles - motorcycles
    • class 3 and 4 vehicles - cars and light vans up to 3,000kg
    • class 5 vehicles - private passenger vehicles with more than 12 seats
    • class 7 vehicles - goods vehicles between 3,000kg and 3,500kg gross vehicle weight

    All figures are for vehicles as they were brought in for the MOT.

    A failed test usually has multiple failure items.

    The percentage of tests is worked out as the number of tests with one or more failure items in the defect as a percentage of total tests.

    The percentage of defects is worked out as the total defects in the category as a percentage of total defects for all categories.

    The average defects per initial test failure is worked out as the total failure items as a percentage of total tests failed plus tests that passed after rectification of a minor defect at the time of the test.
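
    The three definitions above can be sketched as simple arithmetic. This is only an illustration: the function and parameter names are hypothetical, the counts in the example are made up, and the average is computed as a plain ratio (failure items per failed test, counting tests that passed after rectification as failures), which is an interpretation of the wording above.

```python
def percentage_of_tests(tests_with_defect, total_tests):
    """Tests with one or more failure items in the category, as a % of all tests."""
    return 100 * tests_with_defect / total_tests


def percentage_of_defects(defects_in_category, total_defects):
    """Defects in the category, as a % of defects across all categories."""
    return 100 * defects_in_category / total_defects


def average_defects_per_failure(total_failure_items, tests_failed,
                                tests_passed_after_rectification):
    """Failure items per initial failure (failed tests plus passes after rectification)."""
    return total_failure_items / (tests_failed + tests_passed_after_rectification)


# Illustrative counts only, not taken from the DVSA tables.
pct_tests = percentage_of_tests(150, 1000)
pct_defects = percentage_of_defects(200, 800)
avg_defects = average_defects_per_failure(900, 250, 50)
```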

    These data tables are updated every 3 months.

    MOT class 3 and 4 vehicles: initial failures by defect category

  4. Movie Data - X - Test - w2v

    • data.researchdatafinder.qut.edu.au
    Updated Apr 8, 2018
    Cite
    (2018). Movie Data - X - Test - w2v [Dataset]. https://data.researchdatafinder.qut.edu.au/dataset/survey-word-vector/resource/e638fc06-7ef3-4a41-85e2-21f7fad2dfb3
    Dataset updated
    Apr 8, 2018
    License

    http://researchdatafinder.qut.edu.au/display/n15252

    Description

    This file contains the features for the test portion of the movie dataset. The data has been converted into an average word vector. This is 50% of the total movie results. QUT Research Data Repository dataset resource, available for download.

  5. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program
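
    A minimal sketch of how the metadata could be summarized, here counting IaC programs per solution and language from programs.csv. It assumes the file is comma-separated with a header row that uses the column names listed above; the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_programs(csv_file):
    """Count IaC programs per (solution, language) pair."""
    reader = csv.DictReader(csv_file)
    return Counter((row["solution"], row["language"]) for row in reader)


# Invented sample rows; the real programs.csv has more columns.
SAMPLE = """ID,repository,directory,solution,language
a/Pulumi.yaml,1,a,Pulumi,typescript
b/cdk.json,2,b,AWS CDK,python
c/Pulumi.yaml,3,c,Pulumi,typescript
"""

counts = count_programs(io.StringIO(SAMPLE))
```

    With the real file, the same function could be called on open("programs.csv", newline="").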

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
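
    The argument order and the per-solution values of (3) and (4) described above can be sketched as follows. The helper only assembles a command line; invoking the script through a "python" interpreter, the token placeholder, and the output file name are assumptions made for illustration.

```python
# Values of argument (3) and (4) per PL-IaC solution, as described above.
SEARCH_TARGETS = {
    "Pulumi": ("Pulumi", "yml,yaml"),
    "AWS CDK": ("cdk", "json"),
    "CDKTF": ("cdktf", "json"),
}


def build_search_command(token, output_csv, solution, min_size="0", max_size="*"):
    """Assemble the six positional arguments of search-repositories.py in order."""
    filename, extensions = SEARCH_TARGETS[solution]
    return [
        "python", "search-repositories.py",  # interpreter name is an assumption
        token,        # (1) GitHub access token
        output_csv,   # (2) CSV output file
        filename,     # (3) filename to search for
        extensions,   # (4) comma-separated file extensions
        min_size,     # (5) min file size (0 for all files)
        max_size,     # (6) max file size (* for unlimited)
    ]


# Hypothetical token placeholder and output file name.
command = build_search_command("<TOKEN>", "pulumi-repos.csv", "Pulumi")
```

    The resulting list could be passed to subprocess.run; passing it as a list avoids shell expansion of the "*" argument.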

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
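
    A shallow recursive copy of this kind can be obtained with standard git options. The sketch below is not taken from download-repositories.py; it only shows one way to build an equivalent git clone command, with a hypothetical repository URL and ID.

```python
from pathlib import Path


def shallow_clone_command(repo_url, repo_id, output_dir):
    """Build a git command that clones only the latest state, including submodules."""
    target = Path(output_dir) / str(repo_id)  # subfolder named by the repository's ID
    return [
        "git", "clone",
        "--depth", "1",           # no history beyond the most recent commit
        "--recurse-submodules",   # also fetch submodules
        "--shallow-submodules",   # submodules are cloned with depth 1 as well
        repo_url,
        str(target),
    ]


# Hypothetical repository URL and ID.
clone_cmd = shallow_clone_command("https://github.com/example/repo.git", 123, "out")
```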

  6. Sample Dataset for Testing

    • ieee-dataport.org
    Updated Apr 28, 2025
    Cite
    Alex Outman (2025). Sample Dataset for Testing [Dataset]. https://ieee-dataport.org/documents/sample-dataset-testing
    Dataset updated
    Apr 28, 2025
    Authors
    Alex Outman
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    10

  7. test_data_huggingface

    • huggingface.co
    Updated Jun 18, 2023
    Cite
    dylan guo (2023). test_data_huggingface [Dataset]. https://huggingface.co/datasets/dd123/test_data_huggingface
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 18, 2023
    Authors
    dylan guo
    Description

    """

    _HOMEPAGE = "https://gitee.com/didi233/test_date_gitee"

    _LICENSE = "Creative Commons Attribution 4.0 International"

    _TRAIN_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/train.csv"

    _TRAIN_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/train.csv"

    _TEST_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/test.csv"

    _TEST_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/test.csv"

    class test_data_huggingface(datasets.GeneratorBasedBuilder):

  8. Dataset

    • figshare.com
    application/x-gzip
    Updated May 31, 2023
    Cite
    Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
    Available download formats: application/x-gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Moynuddin Ahmed Shibly
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is an open-source, publicly available dataset, which can be found at https://shahariarrabby.github.io/ekush/. We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50, we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The files were compressed to speed up upload and download, and for convenience. If you have any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.

  9. QASPER: NLP Questions and Evidence

    • opendatabay.com
    Updated Jun 22, 2025
    Cite
    Datasimple (2025). QASPER: NLP Questions and Evidence [Dataset]. https://www.opendatabay.com/data/ai-ml/c030902d-7b02-48a2-b32f-8f7140dd1de7
    Dataset updated
    Jun 22, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    QASPER: NLP Questions and Evidence
    Discovering Answers with Expertise
    By Huggingface Hub [source]

    About this dataset

    QASPER is a collection of over 5,000 questions and answers on a wide range of Natural Language Processing (NLP) papers, crowdsourced from experienced NLP practitioners. Each question is written based only on the title and abstract of the corresponding paper, which gives insight into how the experts understood the material. The answers are supported by evidence taken directly from the full text of each paper. The dataset fields include ‘qas’ (questions and answers), ‘evidence’ (evidence provided for answering questions), title, abstract, figures_and_tables, and full_text. This makes the dataset useful for researchers studying how practitioners interpret NLP topics and for validating answers against the existing literature.

    How to use the dataset

    This guide provides instructions on how to use the QASPER dataset of NLP questions and evidence. The dataset contains 5,049 questions over 1,585 papers, crowdsourced by NLP practitioners. It shows how to access the questions and evidence and gives tips for getting started.

    Step 1: Accessing the Dataset

    To access the data, download it from Kaggle's website or through a code version control system such as GitHub. Once downloaded, you will find five files: two test data sets (test.csv and validation.csv), two train data sets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and one figure data set (figures_and_tables_.json). Each .csv file has columns for titles, abstracts, full texts, and Q&A fields with evidence, with one paper per row.

    Step 2: Analyzing Your Data Sets

    Now is a good time to explore the data using basic descriptive statistics or more advanced predictive models such as logistic regression or naive Bayes, depending on the kind of analysis you want to undertake. You can start simple by summarizing basic crosstabs between any two variables in the dataset (titles, abstracts, etc.). For example, try correlating title lengths with the number of words in the corresponding abstracts, then check whether anything is worth investigating further.

    Step 3: Define Your Research Questions & Perform Further Analysis

    Once satisfied with your initial exploration, dig deeper into the underlying QR relationship among the variables in your main documents. One way is to use text mining techniques such as topic modeling, machine learning, or automated processes that help summarize underlying patterns. Another approach is to filter terms relevant to a specific research hypothesis and then process those terms via web crawlers, search engines, document similarity algorithms, etc.

    Finally, once all relevant parameters have been defined and analyzed, draw preliminary conclusions that link them back together before conducting replicable tests to ensure reproducible results.

    Research Ideas

    • Developing AI models to automatically generate questions and answers from paper titles and abstracts.
    • Enhancing machine learning algorithms by combining the answers with the evidence provided in the dataset to find relationships between papers.
    • Creating online forums for NLP practitioners that use questions from this dataset to spark discussion within the community.

    License

    CC0

    Original Data Source: QASPER: NLP Questions and Evidence

  10. Movie Data - Y - Test

    • data.researchdatafinder.qut.edu.au
    Updated Apr 16, 2024
    Cite
    (2024). Movie Data - Y - Test [Dataset]. https://data.researchdatafinder.qut.edu.au/am/dataset/survey-word-vector/resource/bd278820-eb30-4fad-af8a-46c956261fa0
    Dataset updated
    Apr 16, 2024
    License

    http://researchdatafinder.qut.edu.au/display/n15252

    Description

    This file contains the labels for the test portion of the movie dataset. This is 50% of the total movie results. QUT Research Data Repository dataset resource, available for download.

  11. Titanic - Labelled Test Set

    • kaggle.com
    Updated May 30, 2023
    Cite
    Wesley Howe (2023). Titanic - Labelled Test Set [Dataset]. https://www.kaggle.com/datasets/wesleyhowe/titanic-labelled-test-set
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wesley Howe
    Description

    The test set from "Titanic - Machine Learning from Disaster" doesn't include labels.

    This is an augmented version of the test set with the correct labels, retrieved from the original Titanic dataset at: https://www.openml.org/search?type=data&sort=runs&id=40945&status=active

    The accuracy of the labels was validated by getting a 1.0 score in the competition with them.

    This dataset is provided for educational purposes, and is not intended to help people cheat in the competition. If the only reason you want to download this is so you can get a shiny 1.0 on the leaderboards, don't do it.

  12. live_stream_dataset_huggingface

    • huggingface.co
    Updated Jul 26, 2023
    Cite
    dylan guo (2023). live_stream_dataset_huggingface [Dataset]. https://huggingface.co/datasets/dd123/live_stream_dataset_huggingface
    Dataset updated
    Jul 26, 2023
    Authors
    dylan guo
    Description

    """

    _HOMEPAGE = "https://github.com/freeziyou/live_stream_dataset"

    _LICENSE = "Creative Commons Attribution 4.0 International"

    _TRAIN_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/train.csv"

    _TRAIN_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/train.csv"

    _TEST_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/test.csv"

    _TEST_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/test.csv"

    class live_stream_dataset_huggingface(datasets.GeneratorBasedBuilder):

  13. Level Crossing Warning Bell (LCWB) Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated May 20, 2023
    Cite
    Lorenzo De Donato; Valeria Vittorini; Francesco Flammini; Stefano Marrone (2023). Level Crossing Warning Bell (LCWB) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7945412
    Dataset updated
    May 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorenzo De Donato; Valeria Vittorini; Francesco Flammini; Stefano Marrone
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Acknowledgement
    These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.

    Disclaimers
    The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.

    This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.

    General Info
    The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.

    When using any of these data, please mention:

    De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023

    Content of the folder
    This folder contains the following subfolders and files.

    "Data Files" contains all the CSV files related to the data composing the LCWB Dataset:

    • WB_data.csv (WB_labels.csv): representing data of the "Warning Bell (WB)" class;
    • NA_data.csv (NA_labels.csv): representing data of the "No Alarm (NA)" class;
    • GE_data.csv (GE_labels.csv): representing data of the "GEneric alarm (GE)" class.

    "LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:

    • IT_Distribution.json and UK_distribution.json respectively show how Italian (IT) WBs and British (UK) WBs have been distributed;
    • The same goes for NA_Distribution.json and GE_Distribution.json, which show the distribution of NA and GE data respectively;
    • DatasetDistribution.json simply combines the content of the aforementioned JSON files in a single file that can be used to obtain exactly the same dataset we adopted in our analyses.

    "Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:

    • FR_DE_data.csv (FR_DE_labels.csv): representing data that have been used to test the generalisation performances of the network we exploited on LC WBs related to countries that were not considered in the training phase.
    • Noises_data.csv (Noises_labels.csv): representing the noises that were considered to study the behaviour of the network in case of noisy data.

    CSV Files Structure
    Each "XX_labels.csv" file contains, for each entry, the following information:

    • The identifier ("index") of the sub-class (which is not relevant in our case);
    • The code-name ("mid") of the class, which is used in the "XX_data.csv" file to indicate the sub-class of a specific audio;
    • The extended name of the class ("display_name").

    It is worth mentioning that sub-classes do not have a specific purpose in our task. They have been kept to preserve, as far as possible, the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structure as the "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.

    Indeed, each "XX_data.csv" file contains, for each entry, the following information:

    • ID: the identifier of the entry;
    • YTID: the YouTube identifier of the video;
    • start_seconds and end_seconds: the start and end of the portion of audio (extracted from YTID) that is of interest for this task;
    • positive_labels: the label(s) associated with the audio.
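
    The entry layout listed above can be read with a few lines of Python. This is only a minimal sketch: the header names, the sample rows, and the assumption that multiple labels are comma-separated inside one quoted field are illustrative guesses modelled on the AudioSet CSV style, not details confirmed by the dataset files.

```python
import csv
import io

# Hypothetical sample in the layout described above; the header names and
# the comma-separated label encoding are assumptions, not taken from the files.
SAMPLE = '''ID,YTID,start_seconds,end_seconds,positive_labels
0,abc123XYZ_0,30.0,40.0,"/m/aaa,/m/bbb"
1,def456UVW_1,0.0,10.0,/m/ccc
'''

def parse_entries(text):
    """Return one dict per row with typed times and a list of labels."""
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        start = float(row["start_seconds"])
        end = float(row["end_seconds"])
        labels = [part.strip() for part in row["positive_labels"].split(",")]
        entries.append({
            "id": row["ID"],
            "ytid": row["YTID"],
            "start": start,
            "end": end,
            "duration": end - start,
            "labels": [label for label in labels if label],
        })
    return entries

entries = parse_entries(SAMPLE)
for entry in entries:
    print(entry["ytid"], entry["duration"], entry["labels"])
```

    If the real files carry comment lines or a different delimiter, the reader setup would need adjusting accordingly.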


    Credits
    The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

    Particularly, from AudioSet, we retrieved:

    • The structure of the CSV files as discussed above.
    • Data contained in GE_data.csv (which is a minimal portion of data made available by AudioSet) as well as the related 19 classes (in GE_labels.csv) which we selected among the hundreds of classes included in the AudioSet ontology.

    Pointers contained in "XX_data.csv" files other than GE_data.csv were collected manually from scratch, and the related "XX_labels.csv" files were then created accordingly.

    More about downloading the AudioSet dataset can be found here.

  14. GenoAdv

    • huggingface.co
    Updated May 4, 2025
    Cite
    MagicsLab (2025). GenoAdv [Dataset]. https://huggingface.co/datasets/magicslabnu/GenoAdv
    Explore at:
    Dataset updated
    May 4, 2025
    Dataset authored and provided by
    MagicsLab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    GFM-Attack

      How to download this dataset
    

    git lfs install

    git clone https://huggingface.co/datasets/magicslabnu/GFM-Attack

      How to use data
    

    Each directory corresponds to a dataset and contains the standard files: train.csv, dev.csv, and test.csv. You may select any of these dataset folders to perform adversarial training. For example, to use the tf1 dataset for adversarial training, utilize the train.csv file located within the tf1 folder.
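
    As a rough illustration of the folder layout just described, the sketch below builds the three split paths for one dataset folder and reads a split with the standard csv module. The helper names (split_paths, load_split) and the column names "sequence" and "label" are hypothetical placeholders; the real CSV schema is not stated here. A temporary directory stands in for the cloned repository.

```python
import csv
import tempfile
from pathlib import Path

SPLITS = ("train", "dev", "test")

def split_paths(root, dataset):
    """Map each split name to <root>/<dataset>/<split>.csv."""
    folder = Path(root) / dataset
    return {name: folder / f"{name}.csv" for name in SPLITS}

def load_split(path):
    """Read one split into a list of rows (dicts keyed by the header line)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))

# Demo on a throwaway directory standing in for the cloned repository.
# The column names "sequence" and "label" are placeholders, not the real schema.
with tempfile.TemporaryDirectory() as root:
    paths = split_paths(root, "tf1")
    paths["train"].parent.mkdir(parents=True)
    for path in paths.values():
        path.write_text("sequence,label\nACGT,1\nTTGA,0\n")
    train_rows = load_split(paths["train"])

print(len(train_rows), train_rows[0])
```

    To use it on the real data, root would point at the cloned directory and "tf1" could be swapped for any other dataset folder.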

      Paper… See the full description on the dataset page: https://huggingface.co/datasets/magicslabnu/GenoAdv.
    
  15. Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset compares FIXED-line broadband internet speeds for four international cities, plus Alice Springs:

    • Melbourne, AU
    • Bangkok, TH
    • Shanghai, CN
    • Los Angeles, US
    • Alice Springs, AU

    ERRATA:

    1. Data is for Q3 2020, but some files are incorrectly labelled 02-20 or June 20. They should all read Sept 20, or 09-20 (Q3 20 rather than Q2). Will rename and reload. Amended in v7.
    2. The LAX file was named 0320 when it should be Q320. Amended in v8.

    Lines of data for each geojson file; a line equates to a 600m^2 location, including total tests, devices used, and average upload and download speed:

    • MEL: 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
    • SHG: 31745 lines => 0.65M speedtests (2.5/100pp)
    • BKK: 29296 lines => 1.5M speedtests (14.3/100pp)
    • LAX: 15899 lines => 1.3M speedtests (10.4/100pp)
    • ALC: 76 lines => 500 speedtests (2/100pp)
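
    The "tests per 100 people" figures above can be inverted to see roughly what population each rate implies. The sketch below is plain arithmetic on the quoted numbers; the resulting populations are back-calculated estimates, not values stated in the dataset, and the function name is made up for illustration.

```python
# (speedtests, tests per 100 people) as quoted above for the 2020 extract.
QUOTED = {
    "MEL": (0.85e6, 16.7),
    "SHG": (0.65e6, 2.5),
    "BKK": (1.5e6, 14.3),
    "LAX": (1.3e6, 10.4),
    "ALC": (500, 2.0),
}

def implied_population(tests, tests_per_100):
    """Invert 'tests per 100 people' to recover the population figure used."""
    return tests / tests_per_100 * 100

implied = {city: implied_population(*vals) for city, vals in QUOTED.items()}
for city, people in implied.items():
    print(f"{city}: about {people:,.0f} people")
```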

    Geojsons of these 2° by 2° extracts for MEL, BKK and SHG are now added; LAX was added in v6 and Alice Springs in v15.

    This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

    ** To Do

    • Will add Google Map versions so everyone can see without installing Jupyter. Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast). Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
    • Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plot Top 20% on map for community. Google Map link now added (and tweet).

    ** Python

    melb = au_tiles.cx[144:146, -39:-37]  # Lat/Lon extract
    shg = tiles.cx[120:122, 30:32]  # Lat/Lon extract
    bkk = tiles.cx[100:102, 13:15]  # Lat/Lon extract
    lax = tiles.cx[-118:-120, 33:35]  # Lat/Lon extract
    ALC = tiles.cx[132:134, -22:-24]  # Lat/Lon extract
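
    The slices above each pick out a 2-by-2 degree box. To make the selection logic explicit without any geospatial library, here is a plain-Python sketch of the same idea applied to single points. It assumes the first slice is longitude and the second latitude (which matches the values used), and it sorts each pair so that bounds written in descending order, as for LAX and ALC, are still handled; both points are assumptions about intent. The city-centre coordinates are approximate and serve only as a sanity check.

```python
def in_box(lon, lat, lon_range, lat_range):
    """True if (lon, lat) lies inside the box; bounds may be given in either order."""
    lon_min, lon_max = sorted(lon_range)
    lat_min, lat_max = sorted(lat_range)
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max

# Boxes copied from the slices above: (first slice, second slice).
BOXES = {
    "melb": ((144, 146), (-39, -37)),
    "shg": ((120, 122), (30, 32)),
    "bkk": ((100, 102), (13, 15)),
    "lax": ((-118, -120), (33, 35)),
    "alc": ((132, 134), (-22, -24)),
}

def which_city(lon, lat):
    """Return the name of the first box containing the point, or None."""
    for name, (lon_range, lat_range) in BOXES.items():
        if in_box(lon, lat, lon_range, lat_range):
            return name
    return None

# Approximate city-centre coordinates, used only as a sanity check.
print(which_city(144.96, -37.81))
print(which_city(100.50, 13.75))
print(which_city(0.0, 0.0))
```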

    Histograms (v9), and data visualisations (v3,5,9,11) will be provided. Data Sourced from - This is an extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).

    ** VERSIONS

    v24. Add tweet and google map of Top 20% (over 100Mbps locations) in Mel Q322. Add v.1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
    v23. Add graph of 2022 Broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
    v22. Add Import ipynb; workflow-import-4cities.
    v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)

    Melb 14784 lines Avg download speed 69.4M Tests 0.39M

    SHG 31207 lines Avg 233.7M Tests 0.56M

    ALC 113 lines Avg 51.5M Test 1092

    BKK 29684 lines Avg 215.9M Tests 1.2M

    LAX 15505 lines Avg 218.5M Tests 0.74M

    v20. Speedtest - Five Cities inc ALC.
    v19. Add ALC2.ipynb.
    v18. Add ALC line graph.
    v17. Added ipynb for ALC. Added ALC to title.
    v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC.
    v15. Load Melb Q1 2021 data - csv.
    v14. Added Melb Q1 2021 data - geojson.
    v13. Added Twitter link to pics.
    v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
    v11. Add Line-Compare pic, plotting Four Cities on a graph.
    v10. Add Four Histograms in one pic.
    v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
    v8. Renamed LAX file to Q3, rather than 03.
    v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
    v6. Added LAX file.
    v5. Add screenshot of BKK Google Map.
    v4. Add BKK Google map (link below), and BKK csv mapping files.
    v3. Replaced MEL map with big key version. Prev key was very tiny in top right corner.
    v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
    v1. Metadata record.

    ** LICENCE

    AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be:

    • non-commercial (NC)
    • share-alike (SA): reuse must add the same licence.

    This restricts the standard CC-BY Figshare licence.

    ** Other uses of Speedtest Open Data; - see link at Speedtest below.

  16. Data from checkmynet.lu, ILR's internet access measuring tool

    • data.public.lu
    zip
    Updated Jan 24, 2025
    Cite
    Institut Luxembourgeois de Régulation (ILR) (2025). Data from checkmynet.lu, ILR's internet access measuring tool [Dataset]. https://data.public.lu/en/datasets/data-from-checkmynet-lu-ilrs-internet-access-measuring-tool/
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    Institut Luxembourgeois de Régulation (ILR)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Checkmynet.lu is a measurement tool to test the speed and quality of internet connections. It was published by ILR (Luxembourg Institute of Regulation, www.ilr.lu). The published data, accessible via API or simple *.csv download, contains test results such as download speed, upload speed, operator, equipment used, GPS coordinates, and so on. Checkmynet.lu is an independent, crowd-sourced, open-source and open-data based solution:

    • Designed to measure availability, quality and neutrality of the internet
    • Generates and processes all results objectively, securely and transparently
    • Tests 150+ parameters: speed, Quality of Service & Quality of Experience
    • Runs on Android, iOS, web browsers
    • Displays results on a map with several filter options

  17. RAVDESS

    • huggingface.co
    Updated Oct 12, 2024
    Cite
    Maha Tufail Agro (2024). RAVDESS [Dataset]. https://huggingface.co/datasets/MahiA/RAVDESS
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2024
    Authors
    Maha Tufail Agro
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    RAVDESS

    This is an audio classification dataset for Emotion Recognition. Classes = 8 , Split = Train-Test

      Structure
    

    The audios folder contains the audio files; train.csv lists the training split and test.csv the testing split.

      Download
    

    import os
    import huggingface_hub
    audio_datasets_path = "DATASET_PATH/Audio-Datasets"
    if not os.path.exists(audio_datasets_path): print(f"Given {audio_datasets_path=} does not exist. Specify a valid path ending with…

    See the full description on the dataset page: https://huggingface.co/datasets/MahiA/RAVDESS.

  18. FSDnoisy18k

    • zenodo.org
    • paperswithcode.com
    • +3more
    zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Mercedes Collado; Manoj Plakal; Daniel P. W. Ellis; Frederic Font; Xavier Favory; Xavier Serra (2020). FSDnoisy18k [Dataset]. http://doi.org/10.5281/zenodo.2529934
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Mercedes Collado; Manoj Plakal; Daniel P. W. Ellis; Frederic Font; Xavier Favory; Xavier Serra
    Description

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    Data curators

    Eduardo Fonseca and Mercedes Collado

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    Citation

    If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

    You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    FSDnoisy18k description

    What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

    The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group that hosts over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

    We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

    The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

    The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

    Code

    We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.

    Label noise characteristics

    FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. An analysis of a random, per-class 15% sample of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

    FSDnoisy18k basic characteristics

    The dataset's most relevant characteristics are as follows:

    • FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.
    • The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
    • The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).
    • The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.
    • The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.
    • The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.
    • FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as other approaches, from semi-supervised learning, e.g., self-training to learning with minimal supervision.
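
    The clip counts and durations quoted in the list above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures from the text; the clean/noisy clip counts it derives come from the rounded 10%/90% proportion and are therefore estimates, not official numbers.

```python
# Figures quoted in the list above.
total_clips, total_hours = 18_532, 42.5
train_clips, train_hours = 17_585, 41.1
test_clips, test_hours = 947, 1.4

# The two splits should add up to the whole dataset.
assert train_clips + test_clips == total_clips
assert abs((train_hours + test_hours) - total_hours) < 0.05

# Approximate size of the clean and noisy train subsets from the 10%/90% split.
# The 10% figure is rounded in the text, so these counts are only estimates.
approx_clean = train_clips * 10 // 100
approx_noisy = train_clips - approx_clean

test_fraction = test_clips / total_clips
print(approx_clean, approx_noisy, f"{test_fraction:.1%}")
```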

    License

    FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. To facilitate attribution of these files to third parties, we include a list of the audio clips and their corresponding licenses in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

    In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

    Files

    FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

    root
    │ 
    └───FSDnoisy18k.audio_train/     Audio clips in the train set
    │  
    └───FSDnoisy18k.audio_test/      Audio clips in the test set
    │  
    └───FSDnoisy18k.meta/         Files for evaluation setup
    │  │      
    │  └───train.csv           Data split and ground truth for the train set
    │  │      
    │  └───test.csv           Ground truth for the test set     
    │  
    └───FSDnoisy18k.doc/
      │      
      └───README.md           The dataset description file that you are reading
      │      
      └───LICENSE-DATASET        License of the FSDnoisy18k dataset as an entity  
      │      
      └───LICENSE-INDIVIDUAL-CLIPS.csv Licenses of the individual audio clips from Freesound 
    

    Each row (i.e. audio clip) of the train.csv file contains the following

  19. NSL-KDD

    • ieee-dataport.org
    Updated Feb 2, 2022
    Cite
    RUIZHE ZHAO (2022). NSL-KDD [Dataset]. https://ieee-dataport.org/documents/nsl-kdd-0
    Explore at:
    Dataset updated
    Feb 2, 2022
    Authors
    RUIZHE ZHAO
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The train set and test set of NSL-KDD

  20. Data from: INCLUDE: A Large Scale Dataset for Indian Sign Language...

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    Updated Dec 19, 2021
    Cite
    Ganesan, Rohith Gandhi (2021). INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4010759
    Explore at:
    Dataset updated
    Dec 19, 2021
    Dataset provided by
    Ganesan, Rohith Gandhi
    Khapra, Mitesh
    Sridhar, Advaith
    Kumar, Pratyush
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    India
    Description

    Dataset Details: The INCLUDE dataset has 4292 videos (the paper mentions 4287 videos, but 5 videos were added later). The videos used for training are listed in train.csv (3475 videos), and those used for testing are listed in test.csv (817 videos). Each video is a recording of 1 ISL sign, signed by deaf students from St. Louis School for the Deaf, Adyar, Chennai.

    INCLUDE50 has 766 train videos and 192 test videos.
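
    As a quick consistency check on the split sizes quoted above, the following sketch adds up the train and test counts and reports the share held out for testing. It relies only on the numbers in the description; the dictionary and function names are made up for illustration.

```python
# Counts quoted above: (train, test) videos per split file.
SPLITS = {"INCLUDE": (3475, 817), "INCLUDE50": (766, 192)}

def summarize(train, test):
    """Return the total number of videos and the fraction used for testing."""
    total = train + test
    return total, test / total

summary = {name: summarize(*counts) for name, counts in SPLITS.items()}
for name, (total, test_share) in summary.items():
    print(f"{name}: {total} videos, {test_share:.1%} held out for testing")

# The full-dataset total should match the 4292 videos stated in the description.
assert summary["INCLUDE"][0] == 4292
```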

    Train-Test Split: Please download the train-test split for INCLUDE and INCLUDE50 from here: Train-Test Split

    Publication Link: https://dl.acm.org/doi/10.1145/3394171.3413528

    AI4Bharat website: https://sign-language.ai4bharat.org/

    Download Instructions

    For ease of access, we have prepared a Shell Script to download all the parts of the dataset and extract them to form the complete INCLUDE dataset.

    You can find the script here: http://bit.ly/include_dl

    Paper Abstract: Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset
