MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
"Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."
This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.
While this is a simulated dataset, it was inspired by patterns observed in real protein datasets, such as:
- UniProt: A comprehensive database of protein sequences and annotations.
- Kyte-Doolittle Scale: Calculations of hydrophobicity.
- Biopython: A tool for analyzing biological sequences.
This dataset is ideal for:
- Training classification models for proteins.
- Exploratory analysis of physicochemical properties of proteins.
- Building machine learning pipelines in bioinformatics.
The dataset is divided into two subsets:
- Training: 16,000 samples (proteinas_train.csv).
- Testing: 4,000 samples (proteinas_test.csv).
This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
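As a quick-start sketch for the classification task above (the two file names come from the dataset description; the column names are assumptions for illustration):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv('proteinas_train.csv')
test = pd.read_csv('proteinas_test.csv')

# Assumed layout: physicochemical feature columns plus a 'functional_class' label
# and a raw 'sequence' column, which is dropped for this simple baseline.
feature_cols = [c for c in train.columns if c not in ('sequence', 'functional_class')]
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(train[feature_cols], train['functional_class'])
print(accuracy_score(test['functional_class'], clf.predict(test[feature_cols])))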
As hydraulic fracturing and other hydrocarbon production methods produce large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than the PWGD9 dataset, suggesting that a larger sample size, fewer attributes, or both lead to a more successful predictive algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
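As an illustrative sketch only (not the study's actual workflow), an extremely-randomized-trees classifier in scikit-learn, the closest analogue to a random forest with split rule = extratrees and mtry = 5, could be applied to a PWGD7-style table as follows. The file name and column names are assumptions.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('pwgd7.csv')  # hypothetical extract: 7 geochemical attributes + province label
X = df[['specific_gravity', 'pH', 'HCO3', 'Na', 'Cl', 'SO4', 'TDS']]  # assumed attribute columns
y = df['province']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)
clf = ExtraTreesClassifier(n_estimators=500, max_features=5, random_state=0)  # max_features=5 mirrors mtry = 5
clf.fit(X_tr, y_tr)
print('Accuracy:', accuracy_score(y_te, clf.predict(X_te)))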
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the Multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The MNB model achieved 97% accuracy; by comparison, the GNB classifier achieved 100% accuracy and the RF classifier also achieved 100% accuracy.
Methods
Prior to data collection, the researcher was guided by all ethical training certifications on data collection and the right to confidentiality and privacy, as required by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory-confirmed diagnosis result. The data were divided into two tables: the first table, data1, contains the data used in phase 1 of the classification, while the second table, data2, contains the data used in phase 2 of the classification.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals and covers 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing is performed to remove noise and outliers.
Transformation:
The data are transformed from analog (paper) records to electronic records.
Data Partitioning
The collected data are divided into two portions: one portion is extracted as a training set, while the other is used for testing. The training portion taken from the first database table is called training set 1, while the training portion taken from the second table is called training set 2.
For the purpose of this research, the dataset was split into two parts: 70% of the data for training and the remaining 30% for testing. Then, using the MNB classification algorithm implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with other machine learning models using standard metrics.
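A minimal sketch of the 70/30 split and MNB training described above, assuming the MySQL table has been exported to a CSV file with a hypothetical label column:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv('malaria.csv')               # assumed export of the 'malaria' database
X = data.drop(columns=['malaria_status'])       # assumed label column name
y = data['malaria_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))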
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification techniques in two phases: Classification phase 1 and Classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing are performed.
ii. Preprocessed data are stored in training set 1 and training set 2. These datasets are used during classification.
iii. The test dataset is stored in the database table test data set.
iv. Part of the test dataset is classified using classifier 1 and the remaining part is classified with classifier 2, as follows:
Classifier phase 1: classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while the patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: classifies only the records that classifier 1 labeled positive, further assigning them to the complicated or uncomplicated class label. The classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that values must be supplied for the core parameters that act as determining factors.
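A minimal sketch of the two-phase cascade described above, assuming the two training tables (data1, data2) have been exported to CSV files with hypothetical column names:

import pandas as pd
from sklearn.naive_bayes import MultinomialNB

data1 = pd.read_csv('data1.csv')      # phase 1 table: features + 'result' in {positive, negative}
data2 = pd.read_csv('data2.csv')      # phase 2 table: features + 'severity' in {complicated, uncomplicated}
test = pd.read_csv('test_data.csv')   # assumed export of the test dataset

clf1 = MultinomialNB().fit(data1.drop(columns=['result']), data1['result'])
clf2 = MultinomialNB().fit(data2.drop(columns=['severity']), data2['severity'])

phase1 = clf1.predict(test)                 # positive / negative
positives = test[phase1 == 'positive']      # only positive cases are passed to phase 2
phase2 = clf2.predict(positives)            # complicated / uncomplicated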
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for training and evaluating machine learning models to recognize American Sign Language (ASL) hand gestures, including both numbers (0-9) and English alphabet letters (a-z). It is a well-organized dataset that can be used for computer vision tasks, particularly image classification and gesture recognition.
The dataset contains two main folders:
1. Train:
- Used for training the model.
- Includes 36 subdirectories (one for each class: 0-9 and a-z).
- Each subdirectory contains 56 images of the corresponding class.
2. Test:
- Used for evaluating the model.
- Includes 36 subdirectories (one for each class: 0-9 and a-z).
- Each subdirectory contains 14 images of the corresponding class.
Folder | Number of Classes | Total Images per Class | Total Images |
---|---|---|---|
Train | 36 | 56 | 2,016 |
Test | 36 | 14 | 504 |
This dataset is ideal for:
- Training convolutional neural networks (CNNs) for ASL recognition.
- Exploring data augmentation techniques for image classification.
- Developing real-world AI applications like sign language translators.
This dataset is curated to facilitate the development of models for sign language recognition and gesture-based interaction systems. If you use this dataset in your research or projects, please consider sharing your findings or improvements!
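A minimal loading-and-training sketch using torchvision's ImageFolder on the Train/Test folders described above; the image size and the small CNN are illustrative choices, not part of the dataset:

import torch
from torch import nn
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((64, 64)), transforms.ToTensor()])
train_ds = datasets.ImageFolder('Train', transform=tfm)   # 36 classes: 0-9 and a-z
test_ds = datasets.ImageFolder('Test', transform=tfm)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 16 * 16, len(train_ds.classes)),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for x, y in train_dl:          # one pass over the training images
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()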
Description:
This dataset was created to serve as an easy-to-use image dataset, perfect for experimenting with object detection algorithms. The main goal was to provide a simplified dataset that allows for quick setup and minimal effort in exploratory data analysis (EDA). This dataset is ideal for users who want to test and compare object detection models without spending too much time navigating complex data structures. Unlike datasets like chest x-rays, which require expert interpretation to evaluate model performance, the simplicity of balloon detection enables users to visually verify predictions without domain expertise.
The original Balloon dataset was more complex, as it was split into separate training and testing sets, with annotations stored in two separate JSON files. To streamline the experience, this updated version of the dataset merges all images into a single folder and replaces the JSON annotations with a single, easy-to-use CSV file. This new format ensures that the dataset can be loaded seamlessly with tools like Pandas, simplifying the workflow for researchers and developers.
Download Dataset
The dataset contains a total of 74 high-quality JPG images, each featuring one or more balloons in different scenes and contexts. Accompanying the images is a CSV file that provides annotation data, such as bounding box coordinates and labels for each balloon within the images. This structure makes the dataset easily accessible for a range of machine learning and computer vision tasks, including object detection and image classification. The dataset is versatile and can be used to test algorithms like YOLO, Faster R-CNN, SSD, or other popular object detection models.
Key Features:
Image Format: 74 JPG images, ensuring high compatibility with most machine learning frameworks.
Annotations: A single CSV file that contains structured data, including bounding box coordinates, class labels, and image file names, which can be loaded with Python libraries like Pandas.
Simplicity: Designed so that users can quickly start training object detection models without needing to preprocess or deeply explore the dataset.
Variety: The images feature balloons in various sizes, colors, and scenes, making it suitable for testing the robustness of detection models.
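A minimal sketch of loading the annotation CSV and drawing one image's boxes; the CSV file name and column names below are assumptions about the layout, not documented values:

import pandas as pd
from PIL import Image, ImageDraw

df = pd.read_csv('balloon_annotations.csv')     # assumed file name
fname = df['filename'].iloc[0]                  # assumed column names: filename, xmin, ymin, xmax, ymax
img = Image.open(fname)
draw = ImageDraw.Draw(img)
for _, row in df[df['filename'] == fname].iterrows():
    draw.rectangle([row['xmin'], row['ymin'], row['xmax'], row['ymax']], outline='red', width=3)
img.save('preview.png')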
This dataset is sourced from Kaggle.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is derived from TIGER-Lab/MMLU-Pro as part of our MMLU-Leagues Encoder benchmark series, containing:
MMLU-Amateur, where the train set contains all questions Llama-3-8B-Instruct (5-shot) gets wrong and the test set contains all questions it gets right. The aim is to measure the ability of an encoder, with relatively limited training data, to match the performance of a small frontier model. MMLU-SemiPro (this dataset), where the data is evenly split between a train and a test set.… See the full description on the dataset page: https://huggingface.co/datasets/answerdotai/MMLU-SemiPro.
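A minimal loading sketch with the Hugging Face datasets library, using the repository id given above:

from datasets import load_dataset

ds = load_dataset('answerdotai/MMLU-SemiPro')
print(ds)   # shows the available train/test splits and their sizes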
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.
The original dataset has a 75/25 train-test split.
Example Image:
https://i.imgur.com/7spoIJT.png
One could use this dataset to, for example, build a classifier of workers that are abiding by safety code within a workplace versus those that may not be. It is also a good general dataset for practice.
Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Image Preprocessing | Image Augmentation | Modify Classes
* v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
* v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
* v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
* v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
* v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
* v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
* v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
* v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
* v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
* v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
* v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head
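If you prefer to pull a specific version programmatically rather than via the Download button, the Roboflow Python package can fetch an export; the API key, workspace, project slug, version number, and export format below are placeholders, not values from this page:

from roboflow import Roboflow

rf = Roboflow(api_key='YOUR_API_KEY')
project = rf.workspace('your-workspace').project('hard-hat-sample')   # placeholder identifiers
dataset = project.version(2).download('coco')                         # e.g., a COCO-format export
print(dataset.location)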
Choosing Between Computer Vision Model Sizes | Roboflow Train
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K), and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for:
- object detection: bounding boxes and per-instance segmentation masks with 80 object categories;
- captioning: natural language descriptions of the images (see MS COCO Captions);
- keypoints detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle);
- stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff);
- panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road);
- dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations; each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model.
The annotations are publicly available only for training and validation images.
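A short sketch of browsing these annotations with the pycocotools COCO API, assuming the standard 2017 annotation file layout:

from pycocotools.coco import COCO

coco = COCO('annotations/instances_val2017.json')
cat_ids = coco.getCatIds(catNms=['person'])           # category id for 'person'
img_ids = coco.getImgIds(catIds=cat_ids)              # images containing people
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids, iscrowd=None)
anns = coco.loadAnns(ann_ids)                         # bounding boxes and segmentation for that image
print(len(img_ids), 'images contain people;', len(anns), 'person annotations in the first one')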
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This dataset contains 3000+ images generated from an OOC (organ-on-a-chip) setup with different cell types. The images were generated by an automated brightfield microscopy setup; for each image, parameters such as cell type, time after seeding, and class label ('good' or 'bad' sample quality as assessed by a biology expert) are provided. Furthermore, for some images, seeding density and flow rate are given as well. The dataset can be used for training machine learning classifiers for the automated analysis of data generated with an OOC setup, allowing users to create more reliable tissue models and to automate decision-making processes for growing OOC.
The dataset comprises images of OOC samples from the following cell lines:
Structure of the dataset: The dataset is split into three main folders that correspond to the data split for training machine learning models, i.e., 'train', 'val', and 'test'. The train/val/test split is done proportionally with respect to the class labels, cell lines, and time after seeding (see below), yet the data can be split or merged in other ways to suit the needs of prospective users of the dataset. Within each of the main folders, there are a 'bad' and a 'good' folder with the images corresponding to the respective class labels (see 'Overview' above). The images in 'bad' / 'good' folders are further subdivided into folders corresponding to respective cell lines, which are in their turn subdivided into folders corresponding to the different times after seeding. Therefore, it is easy to find images of interest, e.g., '4+ days' 'good' images of the cell line A549 from the 'train' dataset. Further information about the images is available in the file 'OOC_datasheet.xlsx'.
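A minimal loading sketch based on the folder structure described above: torchvision's ImageFolder treats 'bad' and 'good' as the two classes and finds the images recursively inside the cell-line and time-after-seeding subfolders (the image size is an illustrative choice):

from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder('train', transform=tfm)
val_ds = datasets.ImageFolder('val', transform=tfm)
test_ds = datasets.ImageFolder('test', transform=tfm)
print(train_ds.classes)   # ['bad', 'good']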
Acknowledgement: The work presented in this paper was supported by the project 'AI-improved organ on chip cultivation for personalised medicine (AimOOC)' (contract with Central Finance and Contracting Agency of Republic of Latvia no. 1.1.1.1/21/A/079; the project is co-financed by REACT-EU funding for mitigating the consequences of the pandemic crisis).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The ability to assess adverse outcomes in patients with community-acquired pneumonia (CAP) could improve clinical decision-making and enhance clinical practice, but studies remain insufficient and, similarly, few machine learning (ML) models have been developed.
Objective: We aimed to explore the effectiveness of predicting adverse outcomes in CAP through ML models.
Methods: A total of 2,302 adults with CAP who were prospectively recruited between January 2012 and March 2015 across three cities in South America were extracted from DryadData. After a 70:30 training:test split of the data, nine ML algorithms were executed and their diagnostic accuracy was measured mainly by the area under the curve (AUC). The nine ML algorithms included decision trees, random forests, extreme gradient boosting (XGBoost), support vector machines, Naïve Bayes, K-nearest neighbors, ridge regression, logistic regression without regularization, and neural networks. The adverse outcomes included hospital admission, mortality, ICU admission, and one-year post-enrollment status.
Results: The XGBoost algorithm had the best performance in predicting hospital admission. Its AUC reached 0.921, and its accuracy, precision, recall, and F1-score were better than those of other models. In the prediction of ICU admission, a model trained with the XGBoost algorithm showed the best performance with an AUC of 0.801. The XGBoost algorithm also performed well in predicting one-year post-enrollment status; the AUC, accuracy, precision, recall, and F1-score indicated the algorithm had high accuracy and precision. In addition, the best performance in predicting death was achieved by the neural network algorithm (AUC 0.831).
Conclusions: ML algorithms, particularly the XGBoost algorithm, were feasible and effective in predicting adverse outcomes of CAP patients. ML models based on available common clinical features have great potential to guide individual treatment and subsequent clinical decisions.
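An illustrative sketch of the kind of 70:30 split, XGBoost training, and AUC evaluation the abstract describes; the data file and column names are hypothetical, not from the study:

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv('cap_cohort.csv')                                        # hypothetical cohort extract
X, y = df.drop(columns=['hospital_admission']), df['hospital_admission']  # assumed binary outcome column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0, stratify=y)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric='logloss')
model.fit(X_tr, y_tr)
print('AUC:', roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))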
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into building and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify buildings is useful in applications such as high-quality 3D basemap creation, urban planning, and planning climate change response. Buildings can have complex, irregular geometrical structures that are hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract buildings in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.
Licensing requirements: ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro.
Using the model: The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.
Input: The model is trained with classified LiDAR that follows the LINZ base specification. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use the 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives. The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block, and extra attributes should match those of the data originally used for training this model (see Training data section below).
Output: The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS): 0 Background, 6 Building.
Applicable geographies: The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions; however, results can vary for datasets that are statistically dissimilar to the training data.
Dataset | City |
---|---|
Training | Auckland, Christchurch, Kapiti, Wellington |
Testing | Auckland, Wellington |
Validating | Hutt |
Model architecture: This model uses the SemanticQueryNetwork model architecture implemented in ArcGIS Pro.
Accuracy metrics: The table below summarizes the accuracy of the predictions on the validation dataset.
Class | Precision | Recall | F1-score |
---|---|---|---|
Never Classified | 0.984921 | 0.975853 | 0.979762 |
Building | 0.951285 | 0.967563 | 0.9584 |
Training data: This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 75%, Test: 25%}. This ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement.
The training data used has the following characteristics:
X, Y, and Z linear unit | Meter |
---|---|
Z range | -137.74 m to 410.50 m |
Number of Returns | 1 to 5 |
Intensity | 16 to 65520 |
Point spacing | 0.2 ± 0.1 |
Scan angle | -17 to +17 |
Maximum points per block | 8192 |
Block Size | 50 Meters |
Class structure | [0, 6] |
Sample results: The model was used to classify the Wellington city dataset with a density of 23 pts/m. The model's performance is directly proportional to the dataset point density and to the exclusion of noise from the point clouds. To learn how to use this model, see this story.
Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa, Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.
We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.
This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).
This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.
Python version:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Mount Google Drive in Colab to access the data file.
from google.colab import auth
auth.authenticate_user()
from google.colab import drive
drive.mount('/content/gdrive')

dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
dataset.head()

# Target: vault at 1 month after surgery; features: the remaining columns.
y = dataset['Vault_1M']
X = dataset.drop(['Vault_1M'], axis=1)

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 500,
              'criterion': 'mae',
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

RF_model = RandomForestRegressor(**parameters)
RF_model.fit(train_X, train_y)
RF_predictions = RF_model.predict(test_X)
importance = RF_model.feature_importances_
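A short follow-up to evaluate the fitted model on the held-out 20% split (an illustrative addition, not part of the published script):

from sklearn.metrics import mean_absolute_error, r2_score
print('MAE:', mean_absolute_error(test_y, RF_predictions))
print('R2:', r2_score(test_y, RF_predictions))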
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have complex, irregular geometrical structures that are hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.
Licensing requirements: ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro.
Using the model: The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.
Input: The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification. Note: the model is dependent on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y, and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use the 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives. The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block, and extra attributes should match those of the data originally used for training this model (see Training data section below).
Output: The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS): 0 Background, 5 Trees / High-vegetation.
Applicable geographies: The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions; however, results can vary for datasets that are statistically dissimilar to the training data.
Dataset | City |
---|---|
Training | Wellington |
Testing | Tawa |
Validating | Christchurch |
Model architecture: This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.
Accuracy metrics: The table below summarizes the accuracy of the predictions on the validation dataset.
Class | Precision | Recall | F1-score |
---|---|---|---|
Never Classified | 0.991200 | 0.975404 | 0.983239 |
High Vegetation | 0.933569 | 0.975559 | 0.954102 |
Training data: This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 80%, Test: 20%}. This ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement.
The training data used has the following characteristics:
X, Y, and Z linear unit | Meter |
---|---|
Z range | -121.69 m to 26.84 m |
Number of Returns | 1 to 5 |
Intensity | 16 to 65520 |
Point spacing | 0.2 ± 0.1 |
Scan angle | -15 to +15 |
Maximum points per block | 8192 |
Block Size | 20 Meters |
Class structure | [0, 5] |
Sample results: The model was used to classify the Christchurch city dataset with a density of 5 pts/m. The model's performance is directly proportional to the dataset point density and to the exclusion of noise from the point clouds. To learn how to use this model, see this story.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
StreetSurfaceVis
StreetSurfaceVis is an image dataset containing 9,122 street-level images from Germany with labels on road surface type and quality. The CSV file streetSurfaceVis_v1_0.csv contains all image metadata, and four folders contain the image files. All images are available in four different sizes, based on the image width: 256px, 1024px, 2048px, and the original size. Folders containing the images are named according to the respective image size. Image files are named based on the mapillary_image_id.
Image metadata
Each CSV record contains information about one street-level image with the following attributes:
mapillary_image_id: ID provided by Mapillary (see information below on Mapillary)
user_id: Mapillary user ID of contributor
user_name: Mapillary user name of contributor
captured_at: timestamp, capture time of image
longitude, latitude: location the image was taken at
train: Suggestion for splitting train and test data. True for train data and False for test data. Test data contains data from 5 cities which are excluded in the training data.
surface_type: Surface type of the road in the focal area (the center of the lower image half) of the image. Possible values: asphalt, concrete, paving_stones, sett, unpaved
surface_quality: Surface quality of the road in the focal area of the image. Possible values: (1) excellent, (2) good, (3) intermediate, (4) bad, (5) very bad (see the attached Labeling Guide document for details)
Image source
Images are obtained from Mapillary, a crowd-sourcing platform for street-level imagery. More metadata about each image can be obtained via the Mapillary API. User-generated images are shared by Mapillary under the CC-BY-SA License.
For each image, the dataset contains the mapillary_image_id and user_name. You can access user information on the Mapillary website by https://www.mapillary.com/app/user/
If you use the provided images, please adhere to the terms of use of Mapillary.
Instances per class
Total number of images: 9,122
excellent good intermediate bad very bad
asphalt 971 1697 821
concrete 314 350 250
paving stones 385 1063 519
129 694
-
326 387 303
For modeling, we recommend using a train-test split where the test data includes geospatially distinct areas, thereby ensuring the model's ability to generalize to unseen regions is tested. We propose five cities varying in population size and from different regions in Germany for testing - images are tagged accordingly.
Number of test images (train-test split): 776
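A minimal sketch of applying the provided split with pandas, using the metadata file and the train flag described above:

import pandas as pd

meta = pd.read_csv('streetSurfaceVis_v1_0.csv')
train_df = meta[meta['train']]       # images suggested for training
test_df = meta[~meta['train']]       # 776 images from the 5 held-out cities
print(len(train_df), len(test_df))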
Inter-rater reliability
Three annotators labeled the dataset, such that each image was annotated by one person. Annotators were encouraged to consult each other for a second opinion when uncertain. 1,800 images were annotated by all three annotators, resulting in a Krippendorff's alpha of 0.96 for surface type and 0.74 for surface quality.
Recommended image preprocessing
As the focal road located in the bottom center of the street-level image is labeled, it is recommended to crop images to their lower middle half prior to using them for classification tasks.
This is an exemplary code for recommended image preprocessing in Python:
from PIL import Image

img = Image.open(image_path)
width, height = img.size
img_cropped = img.crop((0.25 * width, 0.5 * height, 0.75 * width, height))
License
CC-BY-SA
This is part of the SurfaceAI project at the University of Applied Sciences, HTW Berlin.
Contact: surface-ai@htw-berlin.de
https://surfaceai.github.io/surfaceai/
Funding: SurfaceAI is an mFund project funded by the Federal Ministry for Digital and Transport, Germany.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Training Dataset for HNTSMRG 2024 Challenge
Overview
This repository houses the publicly available training dataset for the Head and Neck Tumor Segmentation for MR-Guided Applications (HNTSMRG) 2024 Challenge.
Patient cohorts correspond to patients with histologically proven head and neck cancer who underwent radiotherapy (RT) at The University of Texas MD Anderson Cancer Center. The cancer types are predominately oropharyngeal cancer or cancer of unknown primary. Images include a pre-RT T2w MRI scan (1-3 weeks before start of RT) and a mid-RT T2w MRI scan (2-4 weeks intra-RT) for each patient. Segmentation masks of primary gross tumor volumes (abbreviated GTVp) and involved metastatic lymph nodes (abbreviated GTVn) are provided for each image (derived from multi-observer STAPLE consensus).
HNTSMRG 2024 is split into 2 tasks:
Task 1: Segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI.
Task 2: Segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI.
The same patient cases will be used for the training and test sets of both tasks of this challenge. Therefore, we are releasing a single training dataset that can be used to construct solutions for either segmentation task. The test data provided (via Docker containers), however, will be different for the two tasks. Please consult the challenge website for more details.
Data Details
DICOM files (images and structure files) have been converted to NIfTI format (.nii.gz) for ease of use by participants via DICOMRTTool v. 1.0.
Images are a mix of fat-suppressed and non-fat-suppressed MRI sequences. Pre-RT and mid-RT image pairs for a given patient are consistently either fat-suppressed or non-fat-suppressed.
Though some sequences may appear to be contrast enhancing, no exogenous contrast is used.
All images have been manually cropped from the top of the clavicles to the bottom of the nasal septum (~ oropharynx region to shoulders), allowing for more consistent image field of views and removal of identifiable facial structures.
The mask files have one of three possible values: background = 0, GTVp = 1, GTVn = 2 (in the case of multiple lymph nodes, they are concatenated into one single label). This labeling convention is similar to the 2022 HECKTOR Challenge.
150 unique patients are included in this dataset. Anonymized patient numeric identifiers are utilized.
The entire training dataset is ~15 GB.
Dataset Folder/File Structure
The dataset is uploaded as a ZIP archive. Please unzip before use. NIfTI files conform to the following standardized nomenclature: ID_timepoint_image/mask.nii.gz. For mid-RT files, a "registered" suffix (ID_timepoint_image/mask_registered.nii.gz) indicates the image or mask has been registered to the mid-RT image space (see more details in Additional Notes below).
The data is provided with the following folder hierarchy:
Top-level folder (named "HNTSMRG24_train")
Patient-level folder (anonymized patient ID, example: "2")
Pre-radiotherapy data folder ("preRT")
Original pre-RT T2w MRI volume (example: "2_preRT_T2.nii.gz").
Original pre-RT tumor segmentation mask (example: "2_preRT_mask.nii.gz").
Mid-radiotherapy data folder ("midRT")
Original mid-RT T2w MRI volume (example: "2_midRT_T2.nii.gz").
Original mid-RT tumor segmentation mask (example: "2_midRT_mask.nii.gz").
Registered pre-RT T2w MRI volume (example: "2_preRT_T2_registered.nii.gz").
Registered pre-RT tumor segmentation mask (example: "2_preRT_mask_registered.nii.gz").
Note: Cases will exhibit variable presentation of ground truth mask structures. For example, a case could have only a GTVp label present, only a GTVn label present, both GTVp and GTVn labels present, or a completely empty mask (i.e., complete tumor response at mid-RT). The following case IDs have empty masks at mid-RT (indicating a complete response): 21, 25, 29, 42. These empty masks are not errors. There will similarly be some cases in the test set for Task 2 that have empty masks.
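A quick way to inspect which labels are present in a given mask (for example, to spot the empty mid-RT masks noted above); this sketch assumes nibabel is installed and uses the example file names from the folder structure:

import nibabel as nib
import numpy as np

mask = nib.load('HNTSMRG24_train/2/midRT/2_midRT_mask.nii.gz')
labels = np.unique(mask.get_fdata())
print(labels)   # subset of [0., 1., 2.]; only [0.] for complete-response cases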
Details Relevant for Algorithm Building
The goal of Task 1 is to generate a pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz" is the relevant label). During blind testing for Task 1, only the pre-RT MRI (e.g., "2_preRT_T2.nii.gz") will be provided to the participants algorithms.
The goal of Task 2 is to generate a mid-RT segmentation mask (e.g., "2_midRT_mask.nii.gz" is the relevant label). During blind testing for Task 2, the mid-RT MRI (e.g., "2_midRT_T2.nii.gz"), original pre-RT MRI (e.g., "2_preRT_T2.nii.gz"), original pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz"), registered pre-RT MRI (e.g., "2_preRT_T2_registered.nii.gz"), and registered pre-RT tumor segmentation mask (e.g., "2_preRT_mask_registered.nii.gz") will be provided to the participants algorithms.
When building models, the resolution of the generated prediction masks should be the same as the corresponding MRI for the given task. In other words, the generated masks should be in the correct pixel spacing and origin with respect to the original reference frame (i.e., pre-RT image for Task 1, mid-RT image for Task 2). More details on the submission of models will be located on the challenge website.
Additional Notes
General notes.
NIfTI format images and segmentations may be easily visualized in any NIfTI viewing software such as 3D Slicer.
Test data will not be made public until the completion of the challenge. The complete training and test data will be published together (along with all original multi-observer annotations and relevant clinical data) at a later date via The Cancer Imaging Archive. Expected date ~ Spring 2025.
Task 1 related notes.
When training their algorithms for Task 1, participants can choose to use only pre-RT data or add in mid-RT data as well. Initially, our plan was to limit participants to utilizing only pre-RT data for training their algorithms in Task 1. However, upon reflection, we recognized that in a practical setting, individuals aiming to develop auto-segmentation algorithms could theoretically train models using any accessible data at their disposal. Based on current literature, we actually don't know what the best solution would be! Would the incorporation of mid-RT data for training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear. Therefore, we leave this choice to the participants.
Remember, though, during testing, you will ONLY have the pre-RT image as an input to your model (naturally, since Task 1 is a pre-RT segmentation task and you won't know what mid-RT data for a patient will look like).
Task 2 related notes.
In addition to the mid-RT MRI and segmentation mask, we have also provided a registered pre-RT MRI and the corresponding registered pre-RT segmentation mask for each patient. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms for Task 2 but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish (or ignore the pre-RT images/masks altogether).
Registrations were generated using SimpleITK, where the mid-RT image serves as the fixed image and the pre-RT image serves as the moving image. Specifically, we utilized the following steps: 1. Apply a centered transformation, 2. Apply a rigid transformation, 3. Apply a deformable transformation with Elastix using a preset parameter map (Parameter map 23 in the Elastix Zoo). This particular deformable transformation was selected as it is open-source and was benchmarked in a previous similar application (https://doi.org/10.1002/mp.16128). For cases where excessive warping was noted during deformable registration (a small minority of cases), only the rigid transformation was applied.
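For orientation only, a minimal centered-plus-rigid registration sketch with SimpleITK is shown below; it is not the organizers' exact pipeline (which additionally used an Elastix deformable step), and the metric and optimizer settings are illustrative:

import SimpleITK as sitk

fixed = sitk.ReadImage('2_midRT_T2.nii.gz', sitk.sitkFloat32)    # mid-RT is the fixed image
moving = sitk.ReadImage('2_preRT_T2.nii.gz', sitk.sitkFloat32)   # pre-RT is the moving image

initial = sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetOptimizerAsRegularStepGradientDescent(learningRate=1.0, minStep=1e-4, numberOfIterations=200)
reg.SetInitialTransform(initial, inPlace=False)
reg.SetInterpolator(sitk.sitkLinear)
rigid = reg.Execute(fixed, moving)

# Resample the pre-RT image into the mid-RT image space (use nearest-neighbour
# interpolation instead when resampling a segmentation mask).
moved = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0, moving.GetPixelID())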
Contact
We have set up a general email address that you can message to notify all organizers at: hntsmrg2024@gmail.com. Additional specific organizer contacts:
Kareem A. Wahid, PhD (kawahid@mdanderson.org)
Cem Dede, MD (cdede@mdanderson.org)
Mohamed A. Naser, PhD (manaser@mdanderson.org)
Dataset from Schiaffino S, Codari M, Cozzi A, Albano D, Alì M, Arioli R, Avola E, Bnà C, Cariati M, Carriero S, Cressoni M, Danna PSC, Della Pepa G, Di Leo G, Dolci F, Falaschi Z, Flor N, Foà RA, Gitto S, Leati G, Magni V, Malavazos AE, Mauri G, Messina C, Monfardini L, Paschè A, Pesapane F, Sconfienza LM, Secchi F, Segalini E, Spinazzola A, Tombini V, Tresoldi S, Vanzulli A, Vicentin I, Zagaria D, Fleischmann D, Sardanelli F. Machine Learning to Predict In-Hospital Mortality in COVID-19 Patients Using Computed Tomography-Derived Pulmonary and Vascular Features. J Pers Med. 2021 Jun 3;11(6):501. doi: 10.3390/jpm11060501. PMID: 34204911; PMCID: PMC8230339.
Abstract
Pulmonary parenchymal and vascular damage are frequently reported in COVID-19 patients and can be assessed with unenhanced chest computed tomography (CT), widely used as a triaging exam. Integrating clinical data, chest CT features, and CT-derived vascular metrics, we aimed to build a predictive model of in-hospital mortality using univariate analysis (Mann-Whitney U test) and machine learning models (support vector machines (SVM) and multilayer perceptrons (MLP)). Patients with RT-PCR-confirmed SARS-CoV-2 infection and unenhanced chest CT performed on emergency department admission were included after retrieving their outcome (discharge or death), with an 85/15% training/test dataset split. Out of 897 patients, the 229 (26%) patients who died during hospitalization had a higher median pulmonary artery diameter (29.0 mm) than patients who survived (27.0 mm, p < 0.001) and a higher median ascending aortic diameter (36.6 mm versus 34.0 mm, p < 0.001). The best SVM and MLP models considered the same ten input features, yielding a 0.747 (precision 0.522, recall 0.800) and 0.844 (precision 0.680, recall 0.567) area under the curve, respectively. In this model integrating clinical and radiological data, pulmonary artery diameter was the third most important predictor after age and parenchymal involvement extent, contributing to reliable in-hospital mortality prediction and highlighting the value of vascular metrics in improving patient stratification.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update: New version includes additional samples taken in November 2022.
Dataset Description
This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.
The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.
Dataset Description | Sample Count | Receiver Count |
---|---|---|
No-Tx Samples | 46 | 10 to 25 |
1-Tx Samples | 4822 | 10 to 25 |
2-Tx Samples | 346 | 11 to 12 |
The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:
\(RSS = \frac{10}{N} \log_{10}\left(\sum_i^N x_i^2 \right) \)
Measurement Parameters | Description |
---|---|
Frequency | 462.7 MHz |
Radio Gain | 35 dB |
Receiver Sample Rate | 2 MHz |
Sample Length | N=10,000 |
Band-pass Filter | 6 kHz |
Transmitters | 0 to 2 |
Transmission Power | 1 W |
Receivers consist of Ettus USRP X310 and B210 radios, and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and are only relative to the device.
Usage Instructions
Data is provided in .json format, both as one file and as split files.
import json
data_file = 'powder_462.7_rss_data.json'
with open(data_file) as f:
data = json.load(f)
The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:
rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
File Separations and Train/Test Splits
In the separated_data.zip folder there are several train/test separations of the data.
- all_data contains all the data in the main JSON file, separated by the number of transmitters.
- stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
- train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of the splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
  - The random split is a random 80/20 split of the data.
  - special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
  - The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test. One such random assignment of grid squares makes up the grid split.
  - The seasonal split contains data separated by the month of collection: April, July, or November.
  - The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
  - campus.json contains the on-campus data, so it is equivalent to the union of each split, not including unused.json.
Digital Surface Model
The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.
To read the data in python:
import rasterio as rio
import numpy as np
import utm
dsm_object = rio.open('dsm.tif')
dsm_map = dsm_object.read(1) # a np.array containing elevation values
dsm_resolution = dsm_object.res # a tuple containing x,y resolution (0.5 meters)
dsm_transform = dsm_object.transform # an Affine transform for conversion to UTM-12 coordinates
utm_transform = np.array(dsm_transform).reshape((3,3))[:2]
utm_top_left = utm_transform @ np.array([0,0,1])
utm_bottom_right = utm_transform @ np.array([dsm_object.shape[0], dsm_object.shape[1], 1])
latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
Dataset Acknowledgement: This DSM file is acquired by the State of Utah and its partners, and is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners makes no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.
DSM DOI: https://doi.org/10.5069/G9TH8JNQ
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We recorded a dataset of more than 280,000 close-up eye images with ground truth annotation of the gaze location. A total of 17 participants were recorded, covering a wide range of appearances:
- Gender: Five (29%) female and 12 (71%) male
- Nationality: Seven (41%) German, seven (41%) Indian, one (6%) Bangladeshi, one (6%) Iranian, and one (6%) Greek
- Eye Color: 12 (70%) brown, four (23%) blue, and one (5%) green
- Glasses: Four participants (23%) wore regular glasses and one (6%) wore contact lenses
For each participant, two sets of data were recorded: one set of training data and a separate set of test data. For each set, a series of gaze targets was shown on a display that participants were instructed to look at. For both training and test data the gaze targets covered a uniform grid in a random order, where the grid corresponding to the test data was positioned to lie in between the training points. Since the NanEye cameras record at about 44 FPS, we gathered approximately 22 frames per camera and gaze target. The training data was recorded using a uniform 24 × 17 grid of points, with an angular distance in gaze angle of 1.45° horizontally and 1.30° vertically between the points. In total the training set contained about 8,800 images per camera and participant. The test set’s points belonged to a 23 × 16 grid of points and it contains about 8,000 images per camera and participant. This way, the gaze targets covered a field of view of 35° horizontally and 22° vertically.
The recording procedure was split into two parts for training and test data. For both parts, participants were instructed to put on the prototype and rest their head on a chin rest positioned exactly 510 mm in front of a display. The display was a 30-inch LED monitor with a pixel pitch of 0.25 mm and viewable image dimensions of 641.3 × 400.8 mm, set to 2560 × 1600-pixel resolution. On the display, the grid of gaze targets was shown, which the participants were instructed to look at. Each point appeared as a big circle 300 pixels in diameter and shrunk to a circle of 8 pixels diameter over the course of 700 ms. The small circle was then displayed for another 500 ms, until the display of the next point started. Data was only recorded during the latter 500 ms, i.e. while the small circle was shown (see Figure 7a). It is important to note that the chin rest did not fully restrain participants and we noticed that their head sometimes moved noticeably, thus resulting in a certain amount of label noise. Using the shrinking animation for the circle helps the participants to locate the circle on the screen and gives them time to relocate their gaze. Similar to [30], we also showed an “L” or an “R” in between every 20th pair of points in the sequence. The letter was displayed for 500 ms at the position of the last point. Participants were asked to confirm the letter they had seen by pressing the corresponding left or right arrow-key. This was done to ensure participants focused on the gaze targets and task at hand throughout the recording.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
DESCRIPTION:
The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array format (MIC) and a first-order Ambisonics format (FOA). These recordings serve as the development dataset for the DCASE 2023 Sound Event Localization and Detection Task of the DCASE 2023 Challenge.
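As a minimal sketch of how one might inspect these 4-channel recordings (the file path is purely illustrative and the soundfile package is an assumed dependency, not part of the dataset or its official tooling):

```python
import soundfile as sf  # assumed dependency for reading WAV files

# Load one 4-channel development recording in the FOA format
# (the path below is illustrative, not an actual file name).
audio, sample_rate = sf.read("foa_dev/example_recording.wav")
print(audio.shape, sample_rate)  # expected shape: (n_samples, 4)
```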
The STARSS23 dataset is a continuation of the STARSS22 dataset. It extends the previous version with the following:
Contrary to the three previous datasets of synthetic spatial sound scenes (TAU Spatial Sound Events 2019 development/evaluation, TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021) associated with previous iterations of the DCASE Challenge, the STARSS23 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic scene generation. Some of its key properties are:
The first round of recordings was collected between September 2021 and January 2022. A second round of recordings was collected between November 2022 and February 2023.
Collection of data from the TAU side has received funding from Google.
REPORT & REFERENCE:
If you use this dataset, please cite the following report on its design, capturing, and annotation process:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.
A more detailed report on the properties of the new dataset and its audiovisual processing with a suitable baseline for DCASE2023 will be published soon.
AIM:
The STARSS22-23 dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.
Specifically, STARSS23 additionally allows evaluation of audiovisual processing methods, such as audiovisual source localization.
SPECIFICATIONS:
General:
Volume, duration, and data split:
Audio:
Video:
More detailed information on the dataset can be found in the included README file.
SOUND CLASSES:
13 target sound event classes are annotated. The classes loosely follow the AudioSet ontology.
0. Female speech, woman speaking
1. Male speech, man speaking
2. Clapping
3. Telephone
4. Laughter
5. Domestic sounds
6. Walk, footsteps
7. Door, open or close
8. Music
9. Musical instrument
10. Water tap, faucet
11. Bell
12. Knock
The content of some of these classes corresponds to events from a limited range of AudioSet-related subclasses. For more information see the README file.
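For convenience, a minimal Python mapping of these class indices to names is sketched below; the indices follow the list above, while the exact label encoding used in the metadata files is documented in the dataset README.

```python
# Integer-to-name lookup for the 13 target classes listed above.
STARSS23_CLASSES = {
    0: "Female speech, woman speaking",
    1: "Male speech, man speaking",
    2: "Clapping",
    3: "Telephone",
    4: "Laughter",
    5: "Domestic sounds",
    6: "Walk, footsteps",
    7: "Door, open or close",
    8: "Music",
    9: "Musical instrument",
    10: "Water tap, faucet",
    11: "Bell",
    12: "Knock",
}

def class_name(index: int) -> str:
    """Return the class name for an annotated class index."""
    return STARSS23_CLASSES.get(index, "unknown")
```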
EXAMPLE APPLICATION:
An implementation of a trainable convolutional recurrent neural network model performing joint SELD, trained and evaluated with this dataset, is provided here. This implementation will serve as the baseline method for the audio-only track of the DCASE 2023 Sound Event Localization and Detection Task.
A baseline for the audiovisual track of DCASE 2023 Sound Event Localization and Detection Task will be published soon and referenced here.
DEVELOPMENT AND EVALUATION:
The current version (Version 1.0) of the dataset includes only the 168 development audio/video recordings and labels, used by the participants of Task 3 of the DCASE2023 Challenge to train and validate their submitted systems. Version 1.1 will additionally include the evaluation audio and video recordings without labels, for the evaluation phase of DCASE2023.
Researchers who wish to compare their systems against the DCASE2023 Challenge submissions will obtain directly comparable results if they use the evaluation data as their testing set.
DOWNLOAD INSTRUCTIONS:
The file foa_dev.zip corresponds to the audio data in the FOA recording format.
The file mic_dev.zip corresponds to the audio data in the MIC recording format.
The file video_dev.zip corresponds to the video data of the development recordings.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity rover using three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera (Mastcam) Right Eye; and (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
Directory Contents
images - contains all 6,820 images
class_map.csv - string-integer class mappings
train-set-v2.1.txt - label file for the training set
val-set-v2.1.txt - label file for the validation set
test-set-v2.1.txt - label file for the test set
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
Labeling Process
Each image was labeled with help from three different volunteers (see the Contributor list). The final labels were determined using the following process:
If all three labels agree with each other, then use the label as the final label.
If the three labels do not agree with each other, then we manually review the labels and decide the final label.
As a post-processing step, we also performed error analysis to identify and correct noisy/incorrect labels in the data set.
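A minimal sketch of the agreement rule above (illustrative only; the manual review step itself is not automated):

```python
from collections import Counter

def resolve_label(votes):
    """Return the final label if all volunteer labels agree,
    otherwise None to flag the image for manual review."""
    counts = Counter(votes)
    label, n_agree = counts.most_common(1)[0]
    return label if n_agree == len(votes) else None

print(resolve_label(["Sand", "Sand", "Sand"]))    # -> "Sand"
print(resolve_label(["Sand", "Layers", "Sand"]))  # -> None (manual review)
```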
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The class names, their per-split counts, and their integer representations are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
Image Augmentation
Only the training set contains augmented images. 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using only the horizontal flipping method, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set-v2.1.txt file to obtain the set of non-augmented images (see the sketch after this list).
90 degrees clockwise rotation (file name ends with -r90.jpg)
180 degrees clockwise rotation (file name ends with -r180.jpg)
270 degrees clockwise rotation (file name ends with -r270.jpg)
Horizontal flip (file name ends with -fh.jpg)
Vertical flip (file name ends with -fv.jpg)
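A minimal sketch of that filtering idea, assuming the whitespace-separated label-file layout described earlier (function and variable names are illustrative):

```python
# Suffixes that mark augmented images, as listed above.
AUG_SUFFIXES = ("-r90.jpg", "-r180.jpg", "-r270.jpg", "-fh.jpg", "-fv.jpg")

def original_entries(label_lines):
    """Keep only label-file lines whose image is not an augmented copy."""
    return [line for line in label_lines
            if line.strip() and not line.split()[0].endswith(AUG_SUFFIXES)]

with open("train-set-v2.1.txt") as f:
    originals = original_entries(f.readlines())
print(len(originals))  # roughly 2,000 per the description above
```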
Acknowledgment
The authors would like to thank the volunteers (see the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.