39 datasets found

Datasets
figshare.com
zip
Updated May 31, 2023
Cite
Bastian Eichenberger; YinXiu Zhan (2023). Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.12958037.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.12958037.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Bastian Eichenberger; YinXiu Zhan
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits and can be used directly. The files belong to the following challenges / classes:
- ISBI Particle tracking challenge: microtubule, vesicle, receptor
- Custom synthetic (based on http://smal.ws): particle
- Custom fixed cell: smfish
- Custom live cell: suntag
The csv files determine which image in the test splits corresponds to which original image, SNR, and density.
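Since an .npz file is a zip archive of .npy members, the arrays it holds can be listed before loading anything. The sketch below does that with the standard library only; the member names in the stand-in archive (x_train and so on) are assumptions for illustration, not the documented keys of the deepBlink files.

```python
import os
import tempfile
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz file (a zip of .npy members)."""
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )


# Stand-in archive for demonstration; these member names are assumptions.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
with zipfile.ZipFile(demo_path, "w") as archive:
    for member in ("x_train.npy", "x_valid.npy", "x_test.npy"):
        archive.writestr(member, b"")

array_names = list_npz_arrays(demo_path)
print(array_names)
```

Once the names are known, the arrays themselves would typically be read with numpy.load on the same path.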
CSV file used in statistical analyses
data.csiro.au
researchdata.edu.au
Updated Oct 13, 2014
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Explore at:
Unique identifier
https://doi.org/10.4225/08/543B4B4CA92E6
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIRO (http://www.csiro.au/)
License
https://research.csiro.au/dap/licences/csiro-data-licence/
Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIRO (http://www.csiro.au/)
Description
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
MOT testing data for Great Britain
s3.amazonaws.com
gov.uk
Updated Mar 24, 2022
Cite
Driver and Vehicle Standards Agency (2022). MOT testing data for Great Britain [Dataset]. https://s3.amazonaws.com/thegovernmentsays-files/content/179/1797262.html
Explore at:
Dataset updated
Mar 24, 2022
Dataset provided by
GOV.UK (http://gov.uk/)
Authors
Driver and Vehicle Standards Agency
Area covered
United Kingdom, Great Britain
Description
About this data set

This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).

It is not classed as an ‘official statistic’. This means it’s not subject to scrutiny and assessment by the UK Statistics Authority.

MOT test results by class

The MOT test checks that your vehicle meets road safety and environmental standards. Different types of vehicles (for example, cars and motorcycles) fall into different ‘classes’.

This data table shows the number of initial tests. It does not include abandoned tests, aborted tests, or retests.

The initial fail rate is the rate for vehicles as they were brought for the MOT. The final fail rate excludes vehicles that pass the test after rectification of minor defects at the time of the test.

This data table is updated every 3 months.

MOT test results by class of vehicle

Ref: DVSA/MOT/01. View online or download CSV (16.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060287/dvsa-mot-01-mot-test-results-by-class-of-vehicle1.csv

Initial failures by defect category

These tables give data for the following classes of vehicles:

class 1 and 2 vehicles - motorcycles

class 3 and 4 vehicles - cars and light vans up to 3,000kg

class 5 vehicles - private passenger vehicles with more than 12 seats

class 7 vehicles - goods vehicles between 3,000kg and 3,500kg gross vehicle weight

All figures are for vehicles as they were brought in for the MOT.

A failed test usually has multiple failure items.

The percentage of tests is worked out as the number of tests with one or more failure items in the defect as a percentage of total tests.

The percentage of defects is worked out as the total defects in the category as a percentage of total defects for all categories.

The average defects per initial test failure is worked out as the total failure items as a percentage of total tests failed plus tests that passed after rectification of a minor defect at the time of the test.
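The three measures defined above can be worked through with a small sketch. All counts below are hypothetical and are not DVSA figures. The average is computed here as a plain ratio, which is one reading of the wording above.

```python
def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Hypothetical counts for one defect category (not DVSA figures).
total_tests = 1000
tests_with_category_defect = 50
category_defects = 80
all_defects = 400
tests_failed = 200
passed_after_rectification = 50

# Share of all initial tests with at least one failure item in the category.
pct_of_tests = percentage(tests_with_category_defect, total_tests)

# Share of all recorded defects that fall in the category.
pct_of_defects = percentage(category_defects, all_defects)

# Failure items per initial failure, counting tests that passed after
# rectification of a minor defect as initial failures.
avg_defects_per_failure = all_defects / (tests_failed + passed_after_rectification)

print(pct_of_tests, pct_of_defects, avg_defects_per_failure)
```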

These data tables are updated every 3 months.

MOT class 1 and 2 vehicles: initial failures by defect category

Ref: DVSA/MOT/02. View online or download CSV (19.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060255/dvsa-mot-02-mot-class-1-and-2-vehicles-initial-failures-by-defect-category-.csv

MOT class 3 and 4 vehicles: initial failures by defect category
Movie Data - X - Test - w2v
data.researchdatafinder.qut.edu.au
Updated Apr 8, 2018
Cite
(2018). Movie Data - X - Test - w2v [Dataset]. https://data.researchdatafinder.qut.edu.au/dataset/survey-word-vector/resource/e638fc06-7ef3-4a41-85e2-21f7fad2dfb3
Explore at:
Dataset updated
Apr 8, 2018
License
http://researchdatafinder.qut.edu.au/display/n15252
Description
This file contains the features for the test portion of the movie dataset. The data has been changed into an average word vector. This is 50% of the total movie results. QUT Research Data Repository Dataset Resource available for download.
PIPr: A Dataset of Public Infrastructure as Code Programs
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Nov 28, 2023
Cite
Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
Explore at:
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.10173400
Dataset updated
Nov 28, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Daniel Sokolowski; David Spielmann; Guido Salvaneschi
License
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Description
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
metadata.zip: The dataset metadata and analysis results as CSV files.
scripts-and-logs.zip: Scripts and logs of the dataset creation.
LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
README.md: This document.
redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
Metadata
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
ID (integer): GitHub repository ID
url (string): GitHub repository URL
downloaded (boolean): Whether cloning the repository succeeded
name (string): Repository name
description (string): Repository description
licenses (string, list of strings): Repository licenses
redistributable (boolean): Whether the repository's licenses permit redistribution
created (string, date & time): Time of the repository's creation
updated (string, date & time): Time of the last update to the repository
pushed (string, date & time): Time of the last push to the repository
fork (boolean): Whether the repository is a fork
forks (integer): Number of forks
archive (boolean): Whether the repository is archived
programs (string, list of strings): Project file path of each IaC program in the repository
programs.csv:
ID (string): Project file path of the IaC program
repository (integer): GitHub repository ID of the repository containing the IaC program
directory (string): Path of the directory containing the IaC program's project file
solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
name (string): IaC program name
description (string): IaC program description
runtime (string): Runtime string of the IaC program
testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
tests (string, list of strings): File paths of IaC program's tests
testing-files.csv:
file (string): Testing file path
language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
program (string): Project file path of the testing file's IaC program
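As a minimal sketch of working with these tables, the snippet below counts IaC programs per PL-IaC solution using the solution column described for programs.csv. It assumes the file has a header row with the column names listed above. The sample rows are invented stand-ins, not data from the dataset.

```python
import csv
import io
from collections import Counter


def count_by_solution(csv_file):
    """Count IaC programs per PL-IaC solution in a programs.csv-like file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["solution"] for row in reader)


# Tiny stand-in for programs.csv; the rows are invented for illustration.
sample = io.StringIO(
    "ID,repository,solution,language\n"
    "a/Pulumi.yaml,1,Pulumi,python\n"
    "b/cdk.json,2,AWS CDK,typescript\n"
    "c/Pulumi.yaml,3,Pulumi,go\n"
)
counts = count_by_solution(sample)
print(counts)
```

With the real file, the same function could be called on an open file handle, for example `open("programs.csv", newline="")`.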
Dataset Creation
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.
Searching Repositories
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. GitHub access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
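The argument order above can be illustrated by assembling the command line for a Pulumi search. This is only a sketch: the helper name, the token placeholder, and the output file name are hypothetical, and the script itself is not invoked here.

```python
def build_search_args(token, output_csv, filename, extensions,
                      min_size=0, max_size="*"):
    """Assemble the positional arguments in the documented order (1) to (6)."""
    return [
        "python", "search-repositories.py",
        token,                  # (1) GitHub access token
        output_csv,             # (2) name of the CSV output file
        filename,               # (3) filename to search for
        ",".join(extensions),   # (4) comma-separated file extensions
        str(min_size),          # (5) min file size, 0 for all files
        str(max_size),          # (6) max file size, * for unlimited
    ]


# Hypothetical token placeholder and output name.
pulumi_args = build_search_args("<TOKEN>", "pulumi.csv", "Pulumi", ["yml", "yaml"])
print(" ".join(pulumi_args))
```

The resulting list could be passed to `subprocess.run` once a real token is supplied.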
Limitations
The script uses the GitHub code search API and inherits its limitations:
Only forks with more stars than the parent repository are included.
Only the repositories' default branches are considered.
Only files smaller than 384 KB are searchable.
Only repositories with fewer than 500,000 files are considered.
Only repositories that have had activity or have been returned in search results in the last year are considered.
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
Downloading Repositories
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
Name of the repositories CSV files generated through search-repositories.py, separated by commas.
Output directory to download the repositories to.
Name of the CSV output file.
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
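A shallow recursive copy of this kind can be requested from git with standard clone options. The sketch below only builds such a command. It is an assumption about how the behaviour could be reproduced, not the actual contents of download-repositories.py, and the repository URL and ID are hypothetical.

```python
import os


def shallow_clone_command(repo_url, repo_id, output_dir):
    """Build a git command that fetches only the latest commit, with submodules."""
    target = os.path.join(output_dir, str(repo_id))  # subfolder named by repository ID
    return [
        "git", "clone",
        "--depth", "1",            # no history beyond the most recent commit
        "--recurse-submodules",    # include submodules
        "--shallow-submodules",    # keep the submodules shallow as well
        repo_url, target,
    ]


# Hypothetical repository URL and ID.
command = shallow_clone_command("https://github.com/example/repo.git", 12345, "repos")
print(" ".join(command))
```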
Sample Dataset for Testing
ieee-dataport.org
Updated Apr 28, 2025
Cite
Alex Outman (2025). Sample Dataset for Testing [Dataset]. https://ieee-dataport.org/documents/sample-dataset-testing
Explore at:
Dataset updated
Apr 28, 2025
Authors
Alex Outman
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
10
test_data_huggingface
huggingface.co
Updated Jun 18, 2023
Cite
dylan guo (2023). test_data_huggingface [Dataset]. https://huggingface.co/datasets/dd123/test_data_huggingface
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 18, 2023
Authors
dylan guo
Description
"""

_HOMEPAGE = "https://gitee.com/didi233/test_date_gitee"

_LICENSE = "Creative Commons Attribution 4.0 International"

_TRAIN_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/train.csv"

_TRAIN_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/train.csv"

_TEST_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/test.csv"

_TEST_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/test.csv"

class test_data_huggingface(datasets.GeneratorBasedBuilder):
Dataset
figshare.com
application/x-gzip
Updated May 31, 2023
Cite
Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.13577873.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Moynuddin Ahmed Shibly
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This is an open-source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The files were compressed to speed up upload and download, and mostly for convenience. If you have any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
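The two resampling schemes mentioned above, ten folds and ten bootstrap bags, can be sketched on sample indices alone. This is an illustration under simple assumptions (no shuffling or stratification, a fixed seed, a hypothetical sample count of 100) and not the procedure the author used to build Fold.tar.gz or Bagging.tar.gz.

```python
import random


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    return [indices[i::k] for i in range(k)]


def bootstrap_bags(n_samples, n_bags, seed=0):
    """Draw n_bags index samples of size n_samples, with replacement."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_bags)
    ]


folds = k_fold_indices(100, 10)   # each index lands in exactly one fold
bags = bootstrap_bags(100, 10)    # indices may repeat within a bag
print(len(folds), len(bags))
```

In cross-validation, each fold serves once as the held-out part while the other nine are used for training; in bagging, one model is trained per bag.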
QASPER: NLP Questions and Evidence
opendatabay.com
Updated Jun 22, 2025
Cite
Datasimple (2025). QASPER: NLP Questions and Evidence [Dataset]. https://www.opendatabay.com/data/ai-ml/c030902d-7b02-48a2-b32f-8f7140dd1de7
Explore at:
Dataset updated
Jun 22, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Data Science and Analytics
Description
QASPER: NLP Questions and Evidence Discovering Answers with Expertise By Huggingface Hub [source]

About this dataset

QASPER is an incredible collection of over 5,000 questions and answers on a vast range of Natural Language Processing (NLP) papers -- all crowdsourced from experienced NLP practitioners. Each question in the dataset is written based only on the titles and abstracts of the corresponding paper, providing an insight into how the experts understood and parsed various materials. The answers to each query have been expertly enriched by evidence taken directly from the full text of each paper. Moreover, QASPER comes with carefully crafted fields that contain relevant information including ‘qas’ – questions and answers; ‘evidence’ – evidence provided for answering questions; title; abstract; figures_and_tables, and full_text. All this adds up to create a remarkable dataset for researchers looking to gain insights into how practitioners interpret NLP topics while providing effective validation when it comes to finding clear-cut solutions to problems encountered in existing literature.

How to use the dataset

This guide provides instructions on how to use the QASPER dataset of Natural Language Processing (NLP) questions and evidence. The QASPER dataset contains 5,049 questions over 1,585 papers, crowdsourced from NLP practitioners. To get the most out of this dataset, we show how to access the questions and evidence and provide tips for getting started.

Step 1: Accessing the Dataset

To access the data, you can download it from Kaggle's website or through a code version control system like GitHub. Once downloaded, you will find five files: two test data sets (test.csv and validation.csv), two train data sets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and one figure data set (figures_and_tables_.json). Each .csv file contains columns for titles, abstracts, full texts, and Q&A fields with evidence, with one paper per row.

Step 2: Analyzing Your Data Sets

Explore the datasets using basic descriptive statistics or more advanced predictive models such as logistic regression or naive Bayes, depending on the analysis you want to undertake. You can start simple by building crosstabs between any two variables in the dataset (titles, abstracts, etc.). As an example, try correlating title lengths with the number of words in the corresponding abstracts, then check whether anything is worth investigating further.

Step 3: Define Your Research Questions and Perform Further Analysis

Once satisfied with your initial exploration, dig deeper into the underlying relationships among the variables in your main documents. One way is to use text mining techniques such as topic modeling, machine learning, or automated processes that help summarize underlying patterns. Another approach is to filter terms relevant to a specific research hypothesis and then process those terms via web crawlers, search engines, document similarity algorithms, etc.

Finally, once all relevant parameters are defined and analyzed, draw preliminary conclusions linking them back together before conducting replicable tests that ensure reproducible results.

Research Ideas

Developing AI models to automatically generate questions and answers from paper titles and abstracts. Enhancing machine learning algorithms by combining the answers with the evidence provided in the dataset to find relationships between papers. Creating online forums for NLP practitioners that use questions from this dataset to spark discussion within the community.

License

CC0

Original Data Source: QASPER: NLP Questions and Evidence
Movie Data - Y - Test
data.researchdatafinder.qut.edu.au
Updated Apr 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Movie Data - Y - Test [Dataset]. https://data.researchdatafinder.qut.edu.au/am/dataset/survey-word-vector/resource/bd278820-eb30-4fad-af8a-46c956261fa0
Explore at:
Dataset updated
Apr 16, 2024
License
http://researchdatafinder.qut.edu.au/display/n15252
Description
This file contains the labels for the test portion of the movie dataset. This is 50% of the total movie results. QUT Research Data Repository Dataset Resource available for download.
Titanic - Labelled Test Set
kaggle.com
Updated May 30, 2023
Cite
Wesley Howe (2023). Titanic - Labelled Test Set [Dataset]. https://www.kaggle.com/datasets/wesleyhowe/titanic-labelled-test-set
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Wesley Howe
Description
The test set from "Titanic - Machine Learning from Disaster" doesn't include labels.

This is an augmented version of the test set with the correct labels, retrieved from the original Titanic dataset at: https://www.openml.org/search?type=data&sort=runs&id=40945&status=active

The accuracy of the labels was validated by getting a 1.0 score in the competition with them.

This dataset is provided for educational purposes, and is not intended to help people cheat in the competition. If the only reason you want to download this is so you can get a shiny 1.0 on the leaderboards, don't do it.
live_stream_dataset_huggingface
huggingface.co
Updated Jul 26, 2023
Cite
dylan guo (2023). live_stream_dataset_huggingface [Dataset]. https://huggingface.co/datasets/dd123/live_stream_dataset_huggingface
Explore at:
Dataset updated
Jul 26, 2023
Authors
dylan guo
Description
"""

_HOMEPAGE = "https://github.com/freeziyou/live_stream_dataset"

_LICENSE = "Creative Commons Attribution 4.0 International"

_TRAIN_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/train.csv"

_TRAIN_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/train.csv"

_TEST_DOWNLOAD_URL = "https://raw.githubusercontent.com/freeziyou/live_stream_dataset/main/test.csv"

_TEST_DOWNLOAD_URL = "https://gitee.com/didi233/test_date_gitee/raw/master/test.csv"

class live_stream_dataset_huggingface(datasets.GeneratorBasedBuilder):
Level Crossing Warning Bell (LCWB) Dataset
zenodo.org
data.niaid.nih.gov
Updated May 20, 2023
Cite
Lorenzo De Donato; Valeria Vittorini; Francesco Flammini; Stefano Marrone (2023). Level Crossing Warning Bell (LCWB) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7945412
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7945412
Dataset updated
May 20, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Lorenzo De Donato; Valeria Vittorini; Francesco Flammini; Stefano Marrone
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Acknowledgement
These data are a product of a research activity conducted in the context of the RAILS (Roadmaps for AI integration in the raiL Sector) project which has received funding from the Shift2Rail Joint Undertaking under the European Union’s Horizon 2020 research and innovation programme under grant agreement n. 881782 Rails. The JU receives support from the European Union’s Horizon 2020 research and innovation program and the Shift2Rail JU members other than the Union.

Disclaimers
The information and views set out in this document are those of the author(s) and do not necessarily reflect the official opinion of Shift2Rail Joint Undertaking. The JU does not guarantee the accuracy of the data included in this document. Neither the JU nor any person acting on the JU’s behalf may be held responsible for the use which may be made of the information contained therein.

This "dataset" has been created for scientific purposes only - and WITHOUT ANY COMMERCIAL purposes - to study the potentials of Deep Learning and Transfer Learning approaches. We are NOT re-distributing any video or audio; our files just contain pointers and indications needed to reproduce our study. The authors DO NOT ASSUME any responsibility for the use that other researchers or users will make of these data.

General Info
The CSV files contained in this folder (and subfolders) compose the Level Crossing (LC) Warning Bell (WB) Dataset.

When using any of these data, please mention:

De Donato, L., Marrone, S., Flammini, F., Sansone, C., Vittorini, V., Nardone, R., Mazzariello, C., and Bernaudine, F., "Intelligent Detection of Warning Bells at Level Crossings through Deep Transfer Learning for Smarter Railway Maintenance", Engineering Applications of Artificial Intelligence, Elsevier, 2023

Content of the folder
This folder contains the following subfolders and files.

"Data Files" contains all the CSV files related to the data composing the LCWB Dataset:

WB_data.csv (WB_labels.csv): representing data of the "Warning Bell (WB)" class;

NA_data.csv (NA_labels.csv): representing data of the "No Alarm (NA)" class;

GE_data.csv (GE_labels.csv): representing data of the "GEneric alarm (GE)" class.

"LCWB Dataset" contains all the JSON files that show how the aforementioned data have been distributed among training, validation, and test sets:

IT_Distribution.json and UK_distribution.json respectively show how Italian (IT) WBs and British (UK) WBs have been distributed;

The same goes for NA_Distribution.json and GE_Distribution.json, which show the distribution of NA and GE data respectively;

DatasetDistribution.json simply incorporates the content of the aforementioned JSON files in a unique file that can be exploited to obtain exactly the same dataset we adopted in our analyses.

"Additional Files" contains some CSV files related to data we adopted to further test the deep neural network leveraged in the aforementioned manuscript:

FR_DE_data.csv (FR_DE_labels.csv): representing data that have been used to test the generalisation performances of the network we exploited on LC WBs related to countries that were not considered in the training phase.

Noises_data.csv (Noises_labels.csv): representing the noises that were considered to study the behaviour of the network in case of noisy data.

CSV Files Structure
Each "XX_labels.csv" file contains, for each entry, the following information:

The identifier ("index") of the sub-class (which is not relevant in our case);

The code-name ("mid") of the class, which is used in the "XX_data.csv" file to indicate the sub-class of a specific audio;

The extended name of the class ("display_name").

Worth mentioning, sub-classes do not have a specific purpose in our task. They have been kept to maintain as much as possible the structure of the "class_labels_indices.csv" file provided by AudioSet. The same applies to the "XX_data.csv" files, which have roughly the same structures of "Evaluation", "Balanced train", and "Unbalanced train" AudioSet CSV files.

Indeed, each "XX_data.csv" file contains, for each entry, the following information:

ID: the identifier of the entry;

YTID: the YouTube identifier of the video;

start_seconds and end_seconds: which delimit the portion of audio (extracted from YTID) which is of interest for this task;

positive_labels: the label(s) associated with the audio.
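To make the "XX_data.csv" layout concrete, here is a minimal parsing sketch. It assumes a header row with exactly the field names listed above and, following the AudioSet convention, a quoted positive_labels field holding comma-separated label codes. Both are assumptions, and the sample row and label codes are invented.

```python
import csv
import io


def read_segments(csv_file):
    """Parse XX_data.csv-style rows into clip descriptions."""
    segments = []
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        start = float(row["start_seconds"])
        end = float(row["end_seconds"])
        segments.append({
            "ytid": row["YTID"],
            "duration": end - start,                       # clip length in seconds
            "labels": row["positive_labels"].split(","),   # one or more label codes
        })
    return segments


# Invented sample row; real files may differ in header and quoting.
sample = io.StringIO(
    "ID,YTID,start_seconds,end_seconds,positive_labels\n"
    '0,abc123XYZ_0,30.0,40.0,"/m/label_a,/m/label_b"\n'
)
segments = read_segments(sample)
print(segments)
```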

Credits
The structure of the CSV files contained in this dataset, as well as part of their content, was inspired by the CSV files composing the AudioSet dataset which is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while its ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Particularly, from AudioSet, we retrieved:

The structure of the CSV files as discussed above.

Data contained in GE_data.csv (which is a minimal portion of data made available by AudioSet) as well as the related 19 classes (in GE_labels.csv) which we selected among the hundreds of classes included in the AudioSet ontology.

Pointers contained in "XX_data.csv" files other than GE_data.csv have been retrieved manually from scratch. Then, the related "XX_labels.csv" files have been created consequently.

More about downloading the AudioSet dataset can be found here.
GenoAdv
huggingface.co
Updated May 4, 2025
Cite
MagicsLab (2025). GenoAdv [Dataset]. https://huggingface.co/datasets/magicslabnu/GenoAdv
Explore at:
Dataset updated
May 4, 2025
Dataset authored and provided by
MagicsLab
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
GFM-Attack

How to download this dataset

git lfs install

git clone https://huggingface.co/datasets/magicslabnu/GFM-Attack

How to use data

Each directory corresponds to a dataset and contains the standard files: train.csv, dev.csv, and test.csv. You may select any of these dataset folders to perform adversarial training. For example, to use the tf1 dataset for adversarial training, utilize the train.csv file located within the tf1 folder.
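As a rough sketch of the layout described above, the helper below builds the path to one split of one dataset folder and reads it with the standard csv module. The function names are introduced here for illustration, and nothing is assumed about the columns: rows are returned as raw lists, since the source does not describe the CSV schema.

```python
import csv
from pathlib import Path

# The three standard files the description says each dataset folder contains.
SPLITS = ("train", "dev", "test")


def split_path(root, dataset, split):
    """Return the path of <root>/<dataset>/<split>.csv, e.g. tf1/train.csv."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    return Path(root) / dataset / f"{split}.csv"


def read_rows(path):
    """Read every row of a CSV file as a list of strings (header included, if any)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))
```

For the example in the text, `split_path("GFM-Attack", "tf1", "train")` points at the train.csv file inside the tf1 folder of the cloned repository.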

Paper… See the full description on the dataset page: https://huggingface.co/datasets/magicslabnu/GenoAdv.
Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...
figshare.com
txt
Updated May 30, 2023
Cite
Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.13621169.v24
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Richard Ferrers; Speedtest Global Index
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset compares FIXED-line broadband internet speeds in four international cities plus Alice Springs:
- Melbourne, AU
- Bangkok, TH
- Shanghai, CN
- Los Angeles, US
- Alice Springs, AU

ERRATA:
1. Data is for Q3 2020, but some files are labelled incorrectly as 02-20 or June 20. They should all read Sept 20, or 09-20 (Q3 20), rather than Q2. Will rename and reload. Amended in v7.
2. LAX file named 0320, when it should be Q320. Amended in v8.

*Lines of data for each geojson file; a line equates to a 600m^2 location, inc total tests, devices used, and average upload and download speed:
- MEL 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
- SHG 31745 lines => 0.65M speedtests (2.5/100pp)
- BKK 29296 lines => 1.5M speedtests (14.3/100pp)
- LAX 15899 lines => 1.3M speedtests (10.4/100pp)
- ALC 76 lines => 500 speedtests (2/100pp)
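The "tests per 100 people" figures above relate the number of speedtests to a population. As an illustrative piece of arithmetic (not part of the original notebook), the sketch below inverts that ratio to show the population each figure implies. The function name is introduced here, and because the source numbers are rounded, the results are only approximate.

```python
def implied_population(speedtests, tests_per_100_people):
    """Population implied by a test count and a tests-per-100-people rate."""
    if tests_per_100_people <= 0:
        raise ValueError("rate must be positive")
    return speedtests / tests_per_100_people * 100


# (speedtests, tests per 100 people), copied from the list above.
CITIES = {
    "MEL": (0.85e6, 16.7),
    "SHG": (0.65e6, 2.5),
    "BKK": (1.5e6, 14.3),
    "LAX": (1.3e6, 10.4),
    "ALC": (500, 2),
}

for city, (tests, rate) in CITIES.items():
    print(f"{city}: about {implied_population(tests, rate):,.0f} people")
```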

Geojsons of these 2* by 2* extracts for MEL, BKK, SHG now added, and LAX added v6. Alice Springs added v15.

This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

** To Do
Will add Google Map versions so everyone can see without installing Jupyter.
- Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast). Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
- Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plot Top 20% on map for community. Google Map link - now added (and tweet).

** Python
melb = au_tiles.cx[144:146, -39:-37]  # Lon/Lat extract
shg = tiles.cx[120:122, 30:32]  # Lon/Lat extract
bkk = tiles.cx[100:102, 13:15]  # Lon/Lat extract
lax = tiles.cx[-118:-120, 33:35]  # Lon/Lat extract
ALC = tiles.cx[132:134, -22:-24]  # Lon/Lat extract
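The `.cx` calls above are coordinate-based slices: the first range is x (longitude) and the second is y (latitude). As a rough illustration of the same bounding-box selection without GeoPandas, the sketch below filters plain (lon, lat) points against those boxes. It is a simplification: `.cx` selects geometries that intersect the box, whereas this checks single points only. The `BOXES` dict and the `in_box` and `select` helpers are names introduced here, not part of the notebook.

```python
# (lon range, lat range) per city, copied from the slices above.
# Some ranges are written high-to-low in the source, so bounds are normalised below.
BOXES = {
    "melb": ((144, 146), (-39, -37)),
    "shg": ((120, 122), (30, 32)),
    "bkk": ((100, 102), (13, 15)),
    "lax": ((-118, -120), (33, 35)),
    "alc": ((132, 134), (-22, -24)),
}


def in_box(lon, lat, box):
    """True if the point lies inside the box, whatever order its bounds are given in."""
    (x0, x1), (y0, y1) = box
    return min(x0, x1) <= lon <= max(x0, x1) and min(y0, y1) <= lat <= max(y0, y1)


def select(points, city):
    """Keep only the (lon, lat) points that fall inside the named city's box."""
    return [p for p in points if in_box(p[0], p[1], BOXES[city])]


# Approximate city-centre coordinates, used here only as sample points.
samples = [(144.96, -37.81), (-118.24, 34.05), (133.88, -23.70)]
melb_points = select(samples, "melb")
```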

Histograms (v9), and data visualisations (v3,5,9,11) will be provided. Data Sourced from - This is an extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).

**VERSIONS
v24. Add tweet and google map of Top 20% (over 100Mbps locations) in Mel Q322. Add v.1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
v23. Add graph of 2022 Broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
v22. Add Import ipynb; workflow-import-4cities.
v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020; 4.3M tests 2022; 2.9M tests)

Melb 14784 lines Avg download speed 69.4M Tests 0.39M

SHG 31207 lines Avg 233.7M Tests 0.56M

ALC 113 lines Avg 51.5M Test 1092

BKK 29684 lines Avg 215.9M Tests 1.2M

LAX 15505 lines Avg 218.5M Tests 0.74M

v20. Speedtest - Five Cities inc ALC.
v19. Add ALC2.ipynb.
v18. Add ALC line graph.
v17. Added ipynb for ALC. Added ALC to title.
v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC.
v15. Load Melb Q1 2021 data - csv.
v14. Added Melb Q1 2021 data - geojson.
v13. Added Twitter link to pics.
v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
v11. Add Line-Compare pic, plotting Four Cities on a graph.
v10. Add Four Histograms in one pic.
v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
v8. Renamed LAX file to Q3, rather than 03.
v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
v6. Added LAX file.
v5. Add screenshot of BKK Google Map.
v4. Add BKK Google map (link below), and BKK csv mapping files.
v3. Replaced MEL map with big key version. Prev key was very tiny in top right corner.
v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
v1. Metadata record.

** LICENCE
AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be:
- non-commercial (NC)
- reuse must be share-alike (SA) (add same licence).
This restricts the standard CC-BY Figshare licence.

** Other uses of Speedtest Open Data; - see link at Speedtest below.
Data from checkmynet.lu, ILR's internet access measuring tool
data.public.lu
zip
Updated Jan 24, 2025
Cite
Institut Luxembourgeois de Régulation (ILR) (2025). Data from checkmynet.lu, ILR's internet access measuring tool [Dataset]. https://data.public.lu/en/datasets/data-from-checkmynet-lu-ilrs-internet-access-measuring-tool/
Explore at:
zip (available download formats)
Dataset updated
Jan 24, 2025
Dataset authored and provided by
Institut Luxembourgeois de Régulation (ILR)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Checkmynet.lu is a measurement tool to test the speed and quality of internet connections. It was published by ILR (Luxembourg Institute of Regulation, www.ilr.lu). The published data, accessible via API or simple *.csv download, contains the test results such as download speed, upload speed, operator, equipment used, GPS coordinates, and so on. Checkmynet.lu is an independent, crowd-sourced, open-source and open-data based solution:
• Designed to measure availability, quality and neutrality of the internet
• Generates and processes all results objectively, securely and transparently
• Tests 150+ parameters: speed, Quality of Service & Quality of Experience
• Runs on Android, iOS, web browsers
• Displays results on a map with several filter options
RAVDESS
huggingface.co
Updated Oct 12, 2024
Cite
Maha Tufail Agro (2024). RAVDESS [Dataset]. https://huggingface.co/datasets/MahiA/RAVDESS
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 12, 2024
Authors
Maha Tufail Agro
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
RAVDESS

This is an audio classification dataset for Emotion Recognition. Classes = 8, Split = Train-Test

Structure

The audios folder contains the audio files; train.csv lists the training split and test.csv the testing split.

Download

import os import huggingface_hub audio_datasets_path = "DATASET_PATH/Audio-Datasets" if not os.path.exists(audio_datasets_path): print(f"Given {audio_datasets_path=} does not exist. Specify a valid path ending with… See the full description on the dataset page: https://huggingface.co/datasets/MahiA/RAVDESS.
FSDnoisy18k
zenodo.org
paperswithcode.com
+3more
zip
Updated Jan 24, 2020
Cite
Eduardo Fonseca; Mercedes Collado; Manoj Plakal; Daniel P. W. Ellis; Frederic Font; Xavier Favory; Xavier Serra (2020). FSDnoisy18k [Dataset]. http://doi.org/10.5281/zenodo.2529934
Explore at:
zip (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.2529934
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Eduardo Fonseca; Mercedes Collado; Manoj Plakal; Daniel P. W. Ellis; Frederic Font; Xavier Favory; Xavier Serra
Description
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

Data curators

Eduardo Fonseca and Mercedes Collado

Contact

Should you have any questions, you are welcome to contact Eduardo Fonseca at eduardo.fonseca@upf.edu.

Citation

If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

FSDnoisy18k description

What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/

the description provided in Section 2 of our ICASSP 2019 paper

The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group that hosts over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

Code

We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.

Label noise characteristics

FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

FSDnoisy18k basic characteristics

The dataset's most relevant characteristics are as follows:

FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.

The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).

The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.

The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.

The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.

FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as other approaches, ranging from semi-supervised learning (e.g., self-training) to learning with minimal supervision.
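The figures quoted in the characteristics above can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the text. The clean-subset size is an estimate derived from the stated 10%/90% proportion, not an official count, and the variable names are introduced here.

```python
# Clip counts and durations as stated in the description.
TOTAL_CLIPS, TRAIN_CLIPS, TEST_CLIPS = 18_532, 17_585, 947
TOTAL_HOURS, TRAIN_HOURS, TEST_HOURS = 42.5, 41.1, 1.4
NUM_CLASSES = 20

# Train and test sets together should account for the whole dataset.
clips_match = TRAIN_CLIPS + TEST_CLIPS == TOTAL_CLIPS
hours_match = round(TRAIN_HOURS + TEST_HOURS, 1) == TOTAL_HOURS

# Rough size of the clean subset of the train set from the stated ~10% share.
clean_estimate = TRAIN_CLIPS // 10

# The per-class clean range (51 to 170 clips) bounds the clean subset size.
clean_lower, clean_upper = NUM_CLASSES * 51, NUM_CLASSES * 170
estimate_in_bounds = clean_lower <= clean_estimate <= clean_upper

print(clips_match, hours_match, clean_estimate, estimate_in_bounds)
```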

License

FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of the audio clips and their corresponding licenses in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

Files

FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

root
│
└───FSDnoisy18k.audio_train/          Audio clips in the train set
│
└───FSDnoisy18k.audio_test/           Audio clips in the test set
│
└───FSDnoisy18k.meta/                 Files for evaluation setup
│   │
│   └───train.csv                     Data split and ground truth for the train set
│   │
│   └───test.csv                      Ground truth for the test set
│
└───FSDnoisy18k.doc/
    │
    └───README.md                     The dataset description file that you are reading
    │
    └───LICENSE-DATASET               License of the FSDnoisy18k dataset as an entity
    │
    └───LICENSE-INDIVIDUAL-CLIPS.csv  Licenses of the individual audio clips from Freesound
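After unpacking the zip files, the directory structure above can be checked with a short script. This is a sketch under one assumption: the nesting of the README and license files under FSDnoisy18k.doc/ is inferred from the tree as shown, whose indentation was flattened on this page. `EXPECTED` and `missing_entries` are names introduced here.

```python
from pathlib import Path

# Relative paths taken from the directory tree in the description.
EXPECTED = [
    "FSDnoisy18k.audio_train",
    "FSDnoisy18k.audio_test",
    "FSDnoisy18k.meta/train.csv",
    "FSDnoisy18k.meta/test.csv",
    "FSDnoisy18k.doc/README.md",
    "FSDnoisy18k.doc/LICENSE-DATASET",
    "FSDnoisy18k.doc/LICENSE-INDIVIDUAL-CLIPS.csv",
]


def missing_entries(root):
    """Return the expected files and folders that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list from `missing_entries("path/to/root")` would indicate that every entry of the tree is in place.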

Each row (i.e. audio clip) of the train.csv file contains the following
NSL-KDD
ieee-dataport.org
Updated Feb 2, 2022
Cite
RUIZHE ZHAO (2022). NSL-KDD [Dataset]. https://ieee-dataport.org/documents/nsl-kdd-0
Explore at:
Dataset updated
Feb 2, 2022
Authors
RUIZHE ZHAO
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The train set and test set of NSL-KDD
Data from: INCLUDE: A Large Scale Dataset for Indian Sign Language...
data.niaid.nih.gov
live.european-language-grid.eu
Updated Dec 19, 2021
Cite
Ganesan, Rohith Gandhi (2021). INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4010759
Explore at:
Dataset updated
Dec 19, 2021
Dataset provided by
Ganesan, Rohith Gandhi
Khapra, Mitesh
Sridhar, Advaith
Kumar, Pratyush
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
Dataset Details: The INCLUDE dataset has 4292 videos (the paper mentions 4287 videos, but 5 videos were added later). The videos used for training are listed in train.csv (3475 files), while those used for testing are listed in test.csv (817 files). Each video is a recording of 1 ISL sign, signed by deaf students from St. Louis School for the Deaf, Adyar, Chennai.

INCLUDE50 has 766 train videos and 192 test videos.

Train-Test Split: Please download the train-test split for INCLUDE and INCLUDE50 from here: Train-Test Split

Publication Link: https://dl.acm.org/doi/10.1145/3394171.3413528

AI4Bharat website: https://sign-language.ai4bharat.org/

Download Instructions

For ease of access, we have prepared a Shell Script to download all the parts of the dataset and extract them to form the complete INCLUDE dataset.

You can find the script here: http://bit.ly/include_dl

Paper Abstract: Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset
