100+ datasets found
  1. Bioinformatics Protein Dataset - Simulated

    • kaggle.com
    Updated Dec 27, 2024
    Cite
    Rafael Gallo (2024). Bioinformatics Protein Dataset - Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10315204
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rafael Gallo
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Subtitle

    "Synthetic protein dataset with sequences, physical properties, and functional classification for machine learning tasks."


    Introduction

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    Columns Included

    • ID_Protein: Unique identifier for each protein.
    • Sequence: String of amino acids.
    • Molecular_Weight: Molecular weight calculated from the sequence.
    • Isoelectric_Point: Estimated isoelectric point based on the sequence composition.
    • Hydrophobicity: Average hydrophobicity calculated from the sequence.
    • Total_Charge: Sum of the charges of the amino acids in the sequence.
    • Polar_Proportion: Percentage of polar amino acids in the sequence.
    • Nonpolar_Proportion: Percentage of nonpolar amino acids in the sequence.
    • Sequence_Length: Total number of amino acids in the sequence.
    • Class: The functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    Inspiration and Sources

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets and tools, such as:
    • UniProt: a comprehensive database of protein sequences and annotations.
    • Kyte-Doolittle Scale: used for the hydrophobicity calculations.
    • Biopython: a tool for analyzing biological sequences.

    Proposed Uses

    This dataset is ideal for:
    • Training classification models for proteins.
    • Exploratory analysis of physicochemical properties of proteins.
    • Building machine learning pipelines in bioinformatics.

    How This Dataset Was Created

    1. Sequence Generation: Amino acid chains were randomly generated with lengths between 50 and 300 residues.
    2. Property Calculation: Physicochemical properties were calculated using the Biopython library.
    3. Class Assignment: Classes were randomly assigned for classification purposes.
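
    For reference, step 2 can be reproduced with Biopython's ProtParam module. The snippet below is a minimal sketch, not the author's original script; the exact settings used to build the dataset are not documented, so values may differ slightly from the published columns.

    # Minimal sketch of the property-calculation step with Biopython's ProtParam module.
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example amino acid string
    analysis = ProteinAnalysis(sequence)

    molecular_weight = analysis.molecular_weight()    # ~ Molecular_Weight column
    isoelectric_point = analysis.isoelectric_point()  # ~ Isoelectric_Point column
    hydrophobicity = analysis.gravy()                 # Kyte-Doolittle average (~ Hydrophobicity)
    sequence_length = len(sequence)                   # ~ Sequence_Length column

    print(molecular_weight, isoelectric_point, hydrophobicity, sequence_length)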

    Limitations

    • The sequences and properties do not represent real proteins but follow patterns observed in natural proteins.
    • The functional classes are simulated and do not correspond to actual biological characteristics.

    Data Split

    The dataset is divided into two subsets:
    • Training: 16,000 samples (proteinas_train.csv).
    • Testing: 4,000 samples (proteinas_test.csv).
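
    A minimal loading sketch is shown below; it assumes the two CSV files are in the working directory, uses the column names listed above, and picks a random forest purely as an illustrative baseline.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    train = pd.read_csv("proteinas_train.csv")
    test = pd.read_csv("proteinas_test.csv")

    feature_cols = ["Molecular_Weight", "Isoelectric_Point", "Hydrophobicity",
                    "Total_Charge", "Polar_Proportion", "Nonpolar_Proportion",
                    "Sequence_Length"]

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(train[feature_cols], train["Class"])
    pred = model.predict(test[feature_cols])
    print(accuracy_score(test["Class"], pred))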

    Acknowledgment

    This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.

  2. Input Files and Code for: Machine learning can accurately assign geologic...

    • s.cnmilf.com
    • data.usgs.gov
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Input Files and Code for: Machine learning can accurately assign geologic basin to produced water samples using major geochemical parameters [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/input-files-and-code-for-machine-learning-can-accurately-assign-geologic-basin-to-produced
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    As hydrocarbon production from hydraulic fracturing and other methods produces large volumes of water, innovative methods must be explored for treatment and reuse of these waters. However, understanding the general water chemistry of these fluids is essential to providing the best treatment options optimized for each producing area. Machine learning algorithms can often be applied to datasets to solve complex problems. In this study, we used the U.S. Geological Survey’s National Produced Waters Geochemical Database (USGS PWGD) in an exploratory exercise to determine if systematic variations exist between produced waters and geologic environment that could be used to accurately classify a water sample to a given geologic province. Two datasets were used, one with fewer attributes (n = 7) but more samples (n = 58,541) named PWGD7, and another with more attributes (n = 9) but fewer samples (n = 33,271) named PWGD9. The attributes of interest were specific gravity, pH, HCO3, Na, Mg, Ca, Cl, SO4, and total dissolved solids. The two datasets, PWGD7 and PWGD9, contained samples from 20 and 19 geologic provinces, respectively. Outliers across all attributes for each province were removed at a 99% confidence interval. Both datasets were divided into a training and test set using an 80/20 split and a 90/10 split, respectively. Random forest, Naïve Bayes, and k-Nearest Neighbors algorithms were applied to the two different training datasets and used to predict on three different testing datasets. Overall model accuracies across the two datasets and three applied models ranged from 23.5% to 73.5%. A random forest algorithm (split rule = extratrees, mtry = 5) performed best on both datasets, producing an accuracy of 67.1% for a training set based on the PWGD7 dataset, and 73.5% for a training set based on the PWGD9 dataset. Overall, the three algorithms predicted more accurately on the PWGD7 dataset than the PWGD9 dataset, suggesting that either a larger sample size and/or fewer attributes leads to a more successful predicting algorithm. Individual balanced accuracies for each producing province ranged from 50.6% (Anadarko) to 100% (Raton) for PWGD7, and from 44.5% (Gulf Coast) to 99.8% (Sedgwick) for PWGD9. Results from testing the model on recently published data outside of the USGS PWGD suggest that some provinces may be lacking information about their true geochemical diversity while others included in this dataset are well described. Expanding on this effort could lead to predictive tools that provide ranges of contaminants or other chemicals of concern within each province to design future treatment facilities to reclaim wastewater. We anticipate that this classification model will be improved over time as more diverse data are added to the USGS PWGD.
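
    The sketch below illustrates the kind of workflow described above (random forest on major-ion chemistry with an 80/20 split). It is not the authors' code: the file name and column names are placeholders, and scikit-learn's max_features is only an approximate stand-in for the reported mtry setting.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("pwgd7.csv")  # hypothetical export of the PWGD7 table
    features = ["specific_gravity", "pH", "Na", "Mg", "Ca", "Cl", "TDS"]  # placeholder names

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["province"], test_size=0.2, random_state=0, stratify=df["province"])

    clf = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))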

  3. Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Nov 10, 2023
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    CC0 1.0 Universal (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the Multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospitals data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

    Methods

    Prior to collection of data, the researcher was guided by all ethical training certification on data collection and the right to confidentiality and privacy, overseen by an Institutional Review Board (IRB). Data were collected from the manual archives of the hospitals, purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for laboratory confirmation of the diagnosis. The data were divided into two tables: the first table, called data1, contains data for use in phase 1 of the classification, while the second table, data2, contains data for use in phase 2 of the classification.

    Data Source Collection

    The malaria incidence dataset was obtained from public hospitals from 2017 to 2021. These are the data used for modeling and analysis, also keeping in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading accordingly.

    Data Preprocessing: Data preprocessing shall be done to remove noise and outliers.

    Transformation: The data shall be transformed from analog to electronic records.

    Data Partitioning

    The collected data will be divided into two portions; one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion shall be taken from a table stored in the database and called training set 1, while the training portion taken from another table stored in the database shall be called training set 2. For the purpose of this research, the dataset was split into two parts: a sample containing 70% of the data for training and the remaining 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using the standard metrics.

    Classification and Prediction

    Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification techniques: classification phase 1 and classification phase 2.
    The operation of the framework is illustrated as follows: i. Data collection and preprocessing shall be done. ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification. iii. The test dataset shall be stored in the database test dataset. iv. Part of the test dataset must be compared for classification using classifier 1 and the remaining part must be classified with classifier 2, as follows:
    Classifier phase 1: It classifies records into positive or negative classes. If the patient has malaria, then the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
    Classifier phase 2: It classifies only records that have been classified as positive by classifier 1, and then further classifies them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed such that the core parameters acting as determining factors must be supplied with values.
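
    The sketch below shows the phase-1 step with scikit-learn's Multinomial Naive Bayes and a 70/30 split, as described above. It is illustrative only: the file and column names are placeholders, and MultinomialNB expects non-negative, count-like features.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    data1 = pd.read_csv("data1.csv")            # hypothetical export of the 'data1' table
    X = data1.drop(columns=["malaria_result"])  # the 15 sign/symptom attributes (non-negative)
    y = data1["malaria_result"]                 # P (positive) / N (negative)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    mnb = MultinomialNB()
    mnb.fit(X_train, y_train)
    print(accuracy_score(y_test, mnb.predict(X_test)))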

  4. American Sign Language Dataset

    • kaggle.com
    Updated Dec 30, 2024
    Cite
    M Rasol Esfandiari (2024). American Sign Language Dataset [Dataset]. https://www.kaggle.com/datasets/esfiam/american-sign-language-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Rasol Esfandiari
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United States
    Description

    About Dataset

    This dataset is designed for training and evaluating machine learning models to recognize American Sign Language (ASL) hand gestures, including both numbers (0-9) and English alphabet letters (a-z). It is a well-organized dataset that can be used for computer vision tasks, particularly image classification and gesture recognition.

    Dataset Structure:

    The dataset contains two main folders:

    1. Train:
      • Used for training the model.
      • Includes 36 subdirectories (one for each class: 0-9 and a-z).
      • Each subdirectory contains 56 images of the corresponding class.

    2. Test:
      • Used for evaluating the model.
      • Includes 36 subdirectories (one for each class: 0-9 and a-z).
      • Each subdirectory contains 14 images of the corresponding class.

    Dataset Summary:

    Folder | Number of Classes | Total Images per Class | Total Images
    Train | 36 | 56 | 2,016
    Test | 36 | 14 | 504

    Features:

    • Number of Classes: 36 (10 digits + 26 letters).
    • Image Format: JPEG.

    Applications:

    This dataset is ideal for:
    • Training convolutional neural networks (CNNs) for ASL recognition.
    • Exploring data augmentation techniques for image classification.
    • Developing real-world AI applications like sign language translators.

    Suggested Workflow:

    1. Load the dataset and split it into training and testing sets.
    2. Apply data augmentation to enhance diversity in training data.
    3. Train a CNN model to classify the 36 ASL hand gestures.
    4. Evaluate the model's performance using the provided test set.
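
    A minimal Keras sketch of this workflow is shown below. It assumes the folder names "Train" and "Test" from the structure above and uses a small CNN purely for illustration; the architecture and image size are arbitrary choices, not part of the dataset.

    import tensorflow as tf

    train_ds = tf.keras.utils.image_dataset_from_directory("Train", image_size=(64, 64), batch_size=32)
    test_ds = tf.keras.utils.image_dataset_from_directory("Test", image_size=(64, 64), batch_size=32)

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(36),  # 36 classes: 0-9 and a-z
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=test_ds, epochs=10)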

    Credits:

    This dataset is curated to facilitate the development of models for sign language recognition and gesture-based interaction systems. If you use this dataset in your research or projects, please consider sharing your findings or improvements!

  5. V2 Balloon Detection Dataset

    • paperswithcode.com
    Updated Sep 5, 2024
    Cite
    (2024). V2 Balloon Detection Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/v2-balloon-detection-dataset
    Explore at:
    Dataset updated
    Sep 5, 2024
    Description

    Description:


    This dataset was created to serve as an easy-to-use image dataset, perfect for experimenting with object detection algorithms. The main goal was to provide a simplified dataset that allows for quick setup and minimal effort in exploratory data analysis (EDA). This dataset is ideal for users who want to test and compare object detection models without spending too much time navigating complex data structures. Unlike datasets like chest x-rays, which require expert interpretation to evaluate model performance, the simplicity of balloon detection enables users to visually verify predictions without domain expertise.

    The original Balloon dataset was more complex, as it was split into separate training and testing sets, with annotations stored in two separate JSON files. To streamline the experience, this updated version of the dataset merges all images into a single folder and replaces the JSON annotations with a single, easy-to-use CSV file. This new format ensures that the dataset can be loaded seamlessly with tools like Pandas, simplifying the workflow for researchers and developers.

    Download Dataset

    The dataset contains a total of 74 high-quality JPG images, each featuring one or more balloons in different scenes and contexts. Accompanying the images is a CSV file that provides annotation data, such as bounding box coordinates and labels for each balloon within the images. This structure makes the dataset easily accessible for a range of machine learning and computer vision tasks, including object detection and image classification. The dataset is versatile and can be used to test algorithms like YOLO, Faster R-CNN, SSD, or other popular object detection models.

    Key Features:

    Image Format: 74 JPG images, ensuring high compatibility with most machine learning frameworks.

    Annotations: A single CSV file that contains structured data, including bounding box coordinates, class labels, and image file names, which can be loaded with Python libraries like Pandas.

    Simplicity: Designed so users can quickly start training object detection models without needing to preprocess or deeply explore the dataset.

    Variety: The images feature balloons in various sizes, colors, and scenes, making it suitable for testing the robustness of detection models.
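
    A quick-start sketch for loading the annotations with Pandas is shown below; the file and column names are placeholders, so check the actual CSV header after downloading.

    import pandas as pd

    ann = pd.read_csv("balloon_annotations.csv")   # hypothetical file name
    print(ann.head())

    # e.g., count bounding boxes per image before feeding an object detection pipeline
    boxes_per_image = ann.groupby("filename").size()   # "filename" is an assumed column name
    print(boxes_per_image.describe())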

    This dataset is sourced from Kaggle.

  6. MMLU-SemiPro

    • huggingface.co
    Updated Jul 8, 2024
    Cite
    Answer.AI (2024). MMLU-SemiPro [Dataset]. https://huggingface.co/datasets/answerdotai/MMLU-SemiPro
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Answer.AI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset is derived from TIGER-Lab/MMLU-Pro as part of our MMLU-Leagues Encoder benchmark series, containing:

    MMLU-Amateur, where the train set contains all questions Llama-3-8B-Instruct (5-shot) gets wrong and the test set contains all questions it gets right. The aim is to measure the ability of an encoder, with relatively limited training data, to match the performance of a small frontier model. MMLU-SemiPro (this dataset), where the data is evenly split between a train and a test set.… See the full description on the dataset page: https://huggingface.co/datasets/answerdotai/MMLU-SemiPro.
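
    A minimal loading sketch with the Hugging Face datasets library is shown below; the split names are not documented here, so inspect the returned DatasetDict.

    from datasets import load_dataset

    ds = load_dataset("answerdotai/MMLU-SemiPro")
    print(ds)                          # shows available splits and columns
    first_split = list(ds.keys())[0]
    print(ds[first_split][0])          # first example of the first split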

  7. Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges

    • public.roboflow.com
    zip
    Updated Sep 30, 2022
    Cite
    Northeastern University - China (2022). Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges [Dataset]. https://public.roboflow.com/object-detection/hard-hat-workers/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 30, 2022
    Dataset authored and provided by
    Northeastern University - China
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Variables measured
    Bounding Boxes of Workers
    Description

    Overview

    The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat.

    The original dataset has a 75/25 train-test split.

    Example Image: https://i.imgur.com/7spoIJT.png

    Use Cases

    One could use this dataset to, for example, build a classifier of workers that are abiding by safety code within a workplace versus those that may not be. It is also a good general dataset for practice.

    Using this Dataset

    Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    Dataset Versions:

    Image Preprocessing | Image Augmentation | Modify Classes
    • v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations
    • v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images
    • v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied
    • v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class
    • v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes
    • v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes
    • v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images
    • v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model
    • v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model
    • v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model
    • v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

    Choosing Between Computer Vision Model Sizes | Roboflow Train

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  8. MS COCO Dataset

    • paperswithcode.com
    Updated Apr 15, 2024
    Cite
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár, MS COCO Dataset [Dataset]. https://paperswithcode.com/dataset/coco
    Explore at:
    Dataset updated
    Apr 15, 2024
    Authors
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár
    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

    Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.

    Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.

    Annotations: The dataset has annotations for

    • object detection: bounding boxes and per-instance segmentation masks with 80 object categories,
    • captioning: natural language descriptions of the images (see MS COCO Captions),
    • keypoints detection: containing more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle),
    • stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff),
    • panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road),
    • dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations; each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model.

    The annotations are publicly available only for training and validation images.
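
    The detection annotations can be read with the COCO API (pycocotools); the sketch below is illustrative and assumes the 2017 annotation files have been downloaded locally to the path shown.

    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_val2017.json")  # assumed local path
    cat_ids = coco.getCatIds(catNms=["person"])
    img_ids = coco.getImgIds(catIds=cat_ids)

    ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
    anns = coco.loadAnns(ann_ids)
    print(len(img_ids), "images contain 'person';", len(anns), "person annotations in the first one")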

  9. Organ-on-a-Chip (OOC) Image Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 24, 2023
    Cite
    Valērija Movčana; Arnis Strods; Karīna Narbute; Fēlikss Rūmnieks; Roberts Rimša; Gatis Mozolevskis; Roberts Kadiķis; Maksims Ivanovs; Kārlis Gustavs Zviedris; Laura Leja; Anastasija Zujeva; Tamāra Laimiņa; Arturs Abols (2023). Organ-on-a-Chip (OOC) Image Dataset [Dataset]. http://doi.org/10.5281/zenodo.10203721
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valērija Movčana; Arnis Strods; Karīna Narbute; Fēlikss Rūmnieks; Roberts Rimša; Gatis Mozolevskis; Roberts Kadiķis; Maksims Ivanovs; Kārlis Gustavs Zviedris; Laura Leja; Anastasija Zujeva; Tamāra Laimiņa; Arturs Abols
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Overview: This dataset contains 3000+ images generated from an OOC (organ-on-a-chip) setup with different cell types. The images were generated by an automated brightfield microscopy setup; for each image, parameters such as cell type, time after seeding, and class label ('good' or 'bad' sample quality as assessed by a biology expert) are provided. Furthermore, for some images, seeding density and flow rate are given as well. The dataset can be used for training machine learning classifiers for the automated analysis of data generated with an OOC setup, allowing the creation of more reliable tissue models and the automation of decision-making processes for growing OOC.

    The dataset comprises images of OOC samples from the following cell lines:

    • A549 (human lung adenocarcinoma alveolar basal epithelial cells, CCL-185, ATCC)
    • Caco-2 (colorectal adenocarcinoma epithelial cells, HTB-37, ATCC)
    • HPMEC (human pulmonary microvascular endothelial cells; 3000, ScienCell)
    • HUVEC (human umbilical vein endothelial cells, CRL-1730, ATCC)
    • NHBE (normal human bronchial epithelial cells, CC-2541, Lonza)
    • HSAEC (human small airway epithelial cells, PCS-301-010, ATCC)

    Structure of the dataset: The dataset is split into three main folders that correspond to the data split for training machine learning models, i.e., 'train', 'val', and 'test'. The train/val/test split is done proportionally with respect to the class labels, cell lines, and time after seeding (see below), yet the data can be split or merged in other ways to suit the needs of prospective users of the dataset. Within each of the main folders, there are a 'bad' and a 'good' folder with the images corresponding to the respective class labels (see 'Overview' above). The images in 'bad' / 'good' folders are further subdivided into folders corresponding to the respective cell lines, which are in their turn subdivided into folders corresponding to the different times after seeding. Therefore, it is easy to find images of interest, e.g., '4+ days' 'good' images of the cell line A549 from the 'train' dataset. Further information about the images is available in the file 'OOC_datasheet.xlsx'.
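
    The sketch below indexes the images by split, label, and cell line from the folder layout described above; the root folder name and image file extension are assumptions, so adjust them to the unpacked archive.

    from pathlib import Path

    root = Path("OOC_dataset")  # hypothetical name of the unpacked dataset folder
    records = []
    for split in ["train", "val", "test"]:
        for label in ["good", "bad"]:
            for img_path in (root / split / label).rglob("*.png"):  # extension assumed
                cell_line = img_path.relative_to(root / split / label).parts[0]
                records.append((split, label, cell_line, str(img_path)))

    print(len(records), "images indexed")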

    Acknowledgement: The work presented in this paper was supported by the project 'AI-improved organ on chip cultivation for personalised medicine (AimOOC)' (contract with Central Finance and Contracting Agency of Republic of Latvia no. 1.1.1.1/21/A/079; the project is co-financed by REACT-EU funding for mitigating the consequences of the pandemic crisis).

  10. Table1_Performance of Machine Learning Algorithms for Predicting Adverse...

    • frontiersin.figshare.com
    • figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Zhixiao Xu; Kun Guo; Weiwei Chu; Jingwen Lou; Chengshui Chen (2023). Table1_Performance of Machine Learning Algorithms for Predicting Adverse Outcomes in Community-Acquired Pneumonia.DOCX [Dataset]. http://doi.org/10.3389/fbioe.2022.903426.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Zhixiao Xu; Kun Guo; Weiwei Chu; Jingwen Lou; Chengshui Chen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Background: The ability to assess adverse outcomes in patients with community-acquired pneumonia (CAP) could improve clinical decision-making and enhance clinical practice, but studies remain insufficient, and similarly, few machine learning (ML) models have been developed.
    Objective: We aimed to explore the effectiveness of predicting adverse outcomes in CAP through ML models.
    Methods: A total of 2,302 adults with CAP who were prospectively recruited between January 2012 and March 2015 across three cities in South America were extracted from DryadData. After a 70:30 train:test split of the data, nine ML algorithms were executed and their diagnostic accuracy was measured mainly by the area under the curve (AUC). The nine ML algorithms included decision trees, random forests, extreme gradient boosting (XGBoost), support vector machines, Naïve Bayes, K-nearest neighbors, ridge regression, logistic regression without regularization, and neural networks. The adverse outcomes included hospital admission, mortality, ICU admission, and one-year post-enrollment status.
    Results: The XGBoost algorithm had the best performance in predicting hospital admission. Its AUC reached 0.921, and accuracy, precision, recall, and F1-score were better than those of the other models. In the prediction of ICU admission, a model trained with the XGBoost algorithm showed the best performance with an AUC of 0.801. The XGBoost algorithm also did a good job of predicting one-year post-enrollment status; the AUC, accuracy, precision, recall, and F1-score indicated the algorithm had high accuracy and precision. In addition, the best performance was seen with the neural network algorithm when predicting death (AUC 0.831).
    Conclusions: ML algorithms, particularly the XGBoost algorithm, were feasible and effective in predicting adverse outcomes in CAP patients. The ML models based on available common clinical features have great potential to guide individual treatment and subsequent clinical decisions.
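
    The sketch below mirrors the described setup (70:30 split, XGBoost, AUC as the main metric) but is purely illustrative: the file name, outcome coding (assumed 0/1), and feature columns are placeholders, not the study's variables.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    df = pd.read_csv("cap_cohort.csv")            # hypothetical file
    y = df["hospital_admission"]                  # one adverse outcome, assumed coded 0/1
    X = df.drop(columns=["hospital_admission"])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="logloss")
    model.fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))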

  11. Building Point Classification - New Zealand

    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    • pacificgeoportal.com
    Updated Sep 17, 2023
    Cite
    Eagle Technology Group Ltd (2023). Building Point Classification - New Zealand [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/content/ebc54f498df94224990cf5f6598a5665
    Explore at:
    Dataset updated
    Sep 17, 2023
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    New Zealand
    Description

    This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into building and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify buildings is useful in applications such as high-quality 3D basemap creation, urban planning, and planning climate change response. Buildings can have complex, irregular geometrical structures that are hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract buildings in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the pattern of common NZ building architecture.

    Licensing requirements

    ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

    Using the model

    The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

    The model is trained with classified LiDAR that follows the LINZ base specification. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and scenarios with false positives. The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see Training data section below).

    Output

    The model will classify the point cloud into the following classes with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS) described below:
    0 Background
    6 Building

    Applicable geographies

    The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

    Training dataset - Auckland, Christchurch, Kapiti, Wellington
    Testing dataset - Auckland, Wellington
    Validation/Evaluation dataset - Hutt City

    Dataset | City
    Training | Auckland, Christchurch, Kapiti, Wellington
    Testing | Auckland, Wellington
    Validating | Hutt

    Model architecture

    This model uses the SemanticQueryNetwork model architecture implemented in ArcGIS Pro.

    Accuracy metrics

    The table below summarizes the accuracy of the predictions on the validation dataset.
    Class | Precision | Recall | F1-score
    Never Classified | 0.984921 | 0.975853 | 0.979762
    Building | 0.951285 | 0.967563 | 0.9584

    Training data

    This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 75%, Test: 25%}. This ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement.

    The training data used has the following characteristics:
    X, Y, and Z linear unit | Meter
    Z range | -137.74 m to 410.50 m
    Number of returns | 1 to 5
    Intensity | 16 to 65520
    Point spacing | 0.2 ± 0.1
    Scan angle | -17 to +17
    Maximum points per block | 8192
    Block size | 50 Meters
    Class structure | [0, 6]

    Sample results

    Model used to classify the Wellington city dataset with 23 pts/m density. The model's performance is directly proportional to the dataset point density and the exclusion of noise from the point clouds. To learn how to use this model, see this story.

  12. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa, Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    connect data in your google drive

    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    Change the path for the custom data

    In this case, we used ICL vault prediction using preop measurement

    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    Optimal features (sorted by importance):

    1. ICL size
    2. ICL power
    3. LV
    4. CLR
    5. ACD
    6. ATA
    7. MSE
    8. Age
    9. Pupil size
    10. WTW
    11. CCT
    12. ACW

    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    Split the dataset into train and test data, if necessary.

    For example, we can split the data 8:2 as a simple validation test

    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    In our study, we already defined the training (B&VIIT Eye Center, n=1455) and test (Kitasato University, n=290) dataset, this code was not necessary to perform our analysis.

    Optimal parameter search could be performed in this section

    parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500,
                  'criterion': 'mae', 'min_samples_split': 10, 'max_features': 'sqrt',
                  'max_depth': 6, 'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_
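
    To sanity-check the regression on the held-out split from the simple 8:2 example above (this check is not part of the original snippet), the mean absolute error can be computed with scikit-learn:

    from sklearn.metrics import mean_absolute_error

    # Compare held-out predictions against the observed 1-month vault values
    print('MAE:', mean_absolute_error(test_y, RF_predictions))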

  13. Tree Point Classification - New Zealand

    • pacificgeoportal.com
    • geoportal-pacificcore.hub.arcgis.com
    Updated Jul 25, 2022
    Cite
    Eagle Technology Group Ltd (2022). Tree Point Classification - New Zealand [Dataset]. https://www.pacificgeoportal.com/content/0e2e3d0d0ef843e690169cac2f5620f9
    Explore at:
    Dataset updated
    Jul 25, 2022
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response. Trees can have complex, irregular geometrical structures that are hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the pattern of common NZ building architecture.

    Licensing requirements

    ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

    Using the model

    The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS. Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

    Input

    The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification. Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and scenarios with false positives. The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see Training data section below).

    Output

    The model will classify the point cloud into the following classes with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS) described below:
    0 Background
    5 Trees / High-vegetation

    Applicable geographies

    The model is expected to work well in New Zealand. It has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

    Training dataset - Wellington City
    Testing dataset - Tawa City
    Validation/Evaluation dataset - Christchurch City

    Dataset | City
    Training | Wellington
    Testing | Tawa
    Validating | Christchurch

    Model architecture

    This model uses the PointCNN model architecture implemented in ArcGIS API for Python.

    Accuracy metrics

    The table below summarizes the accuracy of the predictions on the validation dataset.
    Class | Precision | Recall | F1-score
    Never Classified | 0.991200 | 0.975404 | 0.983239
    High Vegetation | 0.933569 | 0.975559 | 0.954102

    Training data

    This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction. Train-test split percentage: {Train: 80%, Test: 20%}. This ratio was chosen based on the analysis of previous epoch statistics, which showed a decent improvement.

    The training data used has the following characteristics:
    X, Y, and Z linear unit | Meter
    Z range | -121.69 m to 26.84 m
    Number of returns | 1 to 5
    Intensity | 16 to 65520
    Point spacing | 0.2 ± 0.1
    Scan angle | -15 to +15
    Maximum points per block | 8192
    Block size | 20 Meters
    Class structure | [0, 5]

    Sample results

    Model used to classify the Christchurch city dataset with 5 pts/m density. The model's performance is directly proportional to the dataset point density and the exclusion of noise from the point clouds. To learn how to use this model, see this story.

  14. StreetSurfaceVis: a dataset of street-level imagery with annotations of road...

    • data.niaid.nih.gov
    Updated Jun 4, 2024
    Cite
    Mihaljevic, Helena (2024). StreetSurfaceVis: a dataset of street-level imagery with annotations of road surface type and quality [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11449976
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Hoffmann, Edith
    Weigmann, Esther
    Mihaljevic, Helena
    Kapp, Alexandra
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    StreetSurfaceVis

    StreetSurfaceVis is an image dataset containing 9,122 street-level images from Germany with labels on road surface type and quality. The CSV file streetSurfaceVis_v1_0.csv contains all image metadata, and four folders contain the image files. All images are available in four different sizes, based on the image width: 256px, 1024px, 2048px and the original size. Folders containing the images are named according to the respective image size. Image files are named based on the mapillary_image_id.

    Image metadata

    Each CSV record contains information about one street-level image with the following attributes:

    mapillary_image_id: ID provided by Mapillary (see information below on Mapillary)

    user_id: Mapillary user ID of contributor

    user_name: Mapillary user name of contributor

    captured_at: timestamp, capture time of image

    longitude, latitude: location the image was taken at

    train: Suggestion for splitting into train and test data. True for train data and False for test data. The test data contains images from 5 cities which are excluded from the training data.

    surface_type: Surface type of the road in the focal area (the center of the lower image half) of the image. Possible values: asphalt, concrete, paving_stones, sett, unpaved

    surface_quality: Surface quality of the road in the focal area of the image. Possible values: (1) excellent, (2) good, (3) intermediate, (4) bad, (5) very bad (see the attached Labeling Guide document for details)
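
    A minimal sketch for loading the metadata and using the provided train flag is shown below; the image folder name and file extension are assumptions, so verify them against the downloaded archive.

    import pandas as pd

    meta = pd.read_csv("streetSurfaceVis_v1_0.csv")
    train_meta = meta[meta["train"]]        # the 'train' column is documented as True/False
    test_meta = meta[~meta["train"]]

    print(len(train_meta), "training images,", len(test_meta), "test images")
    print(meta.groupby(["surface_type", "surface_quality"]).size())

    # Image files are named by mapillary_image_id, e.g. in the 1024px folder (name/extension assumed):
    paths = "1024/" + train_meta["mapillary_image_id"].astype(str) + ".jpg"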

    Image source

    Images are obtained from Mapillary, a crowd-sourcing platform for street-level imagery. More metadata about each image can be obtained via the Mapillary API. User-generated images are shared by Mapillary under the CC-BY-SA License.

    For each image, the dataset contains the mapillary_image_id and user_name. You can access user information on the Mapillary website by https://www.mapillary.com/app/user/

    If you use the provided images, please adhere to the terms of use of Mapillary.

    Instances per class

    Total number of images: 9,122

    Surface type | excellent | good | intermediate | bad | very bad
    asphalt | 971 | 1697 | 821 | 246 | -
    concrete | 314 | 350 | 250 | 58 | -
    paving stones | 385 | 1063 | 519 | 70 | -
    sett | - | 129 | 694 | 540 | -
    unpaved | - | - | 326 | 387 | 303

    For modeling, we recommend using a train-test split where the test data includes geospatially distinct areas, thereby ensuring the model's ability to generalize to unseen regions is tested. We propose five cities varying in population size and from different regions in Germany for testing - images are tagged accordingly.

    Number of test images (train-test split): 776

    Inter-rater reliability

    Three annotators labeled the dataset, such that each image was annotated by one person. Annotators were encouraged to consult each other for a second opinion when uncertain. 1,800 images were annotated by all three annotators, resulting in a Krippendorff's alpha of 0.96 for surface type and 0.74 for surface quality.

    Recommended image preprocessing

    As the focal road located in the bottom center of the street-level image is labeled, it is recommended to crop images to their lower and middle half prior to using them for classification tasks.

    This is an exemplary code for recommended image preprocessing in Python:

    from PIL import Image

    img = Image.open(image_path)
    width, height = img.size
    img_cropped = img.crop((0.25 * width, 0.5 * height, 0.75 * width, height))

    License

    CC-BY-SA

    This is part of the SurfaceAI project at the University of Applied Sciences, HTW Berlin.

    • Prof. Dr. Helena Mihaljević
    • Alexandra Kapp
    • Edith Hoffmann
    • Esther Weigmann

    Contact: surface-ai@htw-berlin.de

    https://surfaceai.github.io/surfaceai/

    Funding: SurfaceAI is an mFund project funded by the Federal Ministry for Digital and Transport, Germany.

  15. Training Dataset for HNTSMRG 2024 Challenge

    • data.niaid.nih.gov
    Updated Jun 21, 2024
    Cite
    Dede, Cem (2024). Training Dataset for HNTSMRG 2024 Challenge [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199558
    Explore at:
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    Dede, Cem
    Fuller, Clifton
    Naser, Mohamed
    Wahid, Kareem
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Training Dataset for HNTSMRG 2024 Challenge

    Overview

    This repository houses the publicly available training dataset for the Head and Neck Tumor Segmentation for MR-Guided Applications (HNTSMRG) 2024 Challenge.

    Patient cohorts correspond to patients with histologically proven head and neck cancer who underwent radiotherapy (RT) at The University of Texas MD Anderson Cancer Center. The cancer types are predominately oropharyngeal cancer or cancer of unknown primary. Images include a pre-RT T2w MRI scan (1-3 weeks before start of RT) and a mid-RT T2w MRI scan (2-4 weeks intra-RT) for each patient. Segmentation masks of primary gross tumor volumes (abbreviated GTVp) and involved metastatic lymph nodes (abbreviated GTVn) are provided for each image (derived from multi-observer STAPLE consensus).

    HNTSMRG 2024 is split into 2 tasks:

    Task 1: Segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI.

    Task 2: Segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI.

    The same patient cases will be used for the training and test sets of both tasks of this challenge. Therefore, we are releasing a single training dataset that can be used to construct solutions for either segmentation task. The test data provided (via Docker containers), however, will be different for the two tasks. Please consult the challenge website for more details.

    Data Details

    DICOM files (images and structure files) have been converted to NIfTI format (.nii.gz) for ease of use by participants via DICOMRTTool v. 1.0.

    Images are a mix of fat-suppressed and non-fat-suppressed MRI sequences. Pre-RT and mid-RT image pairs for a given patient are consistently either fat-suppressed or non-fat-suppressed.

    Though some sequences may appear to be contrast enhancing, no exogenous contrast is used.

    All images have been manually cropped from the top of the clavicles to the bottom of the nasal septum (~ oropharynx region to shoulders), allowing for more consistent image field of views and removal of identifiable facial structures.

    The mask files have one of three possible values: background = 0, GTVp = 1, GTVn = 2 (in the case of multiple lymph nodes, they are concatenated into one single label). This labeling convention is similar to the 2022 HECKTOR Challenge.

    150 unique patients are included in this dataset. Anonymized patient numeric identifiers are utilized.

    The entire training dataset is ~15 GB.

    Dataset Folder/File Structure

    The dataset is uploaded as a ZIP archive. Please unzip before use. NIfTI files conform to the following standardized nomenclature: ID_timepoint_image/mask.nii.gz. For mid-RT files, a "registered" suffix (ID_timepoint_image/mask_registered.nii.gz) indicates the image or mask has been registered to the mid-RT image space (see more details in Additional Notes below).

    The data is provided with the following folder hierarchy:

    Top-level folder (named "HNTSMRG24_train")

      • Patient-level folder (anonymized patient ID, example: "2")

        • Pre-radiotherapy data folder ("preRT")
          • Original pre-RT T2w MRI volume (example: "2_preRT_T2.nii.gz").
          • Original pre-RT tumor segmentation mask (example: "2_preRT_mask.nii.gz").

        • Mid-radiotherapy data folder ("midRT")
          • Original mid-RT T2w MRI volume (example: "2_midRT_T2.nii.gz").
          • Original mid-RT tumor segmentation mask (example: "2_midRT_mask.nii.gz").
          • Registered pre-RT T2w MRI volume (example: "2_preRT_T2_registered.nii.gz").
          • Registered pre-RT tumor segmentation mask (example: "2_preRT_mask_registered.nii.gz").

    Note: Cases will exhibit variable presentation of ground truth mask structures. For example, a case could have only a GTVp label present, only a GTVn label present, both GTVp and GTVn labels present, or a completely empty mask (i.e., complete tumor response at mid-RT). The following case IDs have empty masks at mid-RT (indicating a complete response): 21, 25, 29, 42. These empty masks are not errors. There will similarly be some cases in the test set for Task 2 that have empty masks.
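
    A minimal sketch for reading one image/mask pair is shown below; it assumes the nibabel package and uses the example file names from the nomenclature above.

    import nibabel as nib
    import numpy as np

    img = nib.load("HNTSMRG24_train/2/preRT/2_preRT_T2.nii.gz")
    mask = nib.load("HNTSMRG24_train/2/preRT/2_preRT_mask.nii.gz")

    volume = img.get_fdata()
    labels = mask.get_fdata()
    print(volume.shape, img.header.get_zooms())   # array size and voxel spacing
    print(np.unique(labels))                      # subset of {0, 1, 2}; may be only {0} for empty masks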

    Details Relevant for Algorithm Building

    The goal of Task 1 is to generate a pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz" is the relevant label). During blind testing for Task 1, only the pre-RT MRI (e.g., "2_preRT_T2.nii.gz") will be provided to the participants' algorithms.

    The goal of Task 2 is to generate a mid-RT segmentation mask (e.g., "2_midRT_mask.nii.gz" is the relevant label). During blind testing for Task 2, the mid-RT MRI (e.g., "2_midRT_T2.nii.gz"), original pre-RT MRI (e.g., "2_preRT_T2.nii.gz"), original pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz"), registered pre-RT MRI (e.g., "2_preRT_T2_registered.nii.gz"), and registered pre-RT tumor segmentation mask (e.g., "2_preRT_mask_registered.nii.gz") will be provided to the participants' algorithms.

    When building models, the resolution of the generated prediction masks should be the same as the corresponding MRI for the given task. In other words, the generated masks should be in the correct pixel spacing and origin with respect to the original reference frame (i.e., pre-RT image for Task 1, mid-RT image for Task 2). More details on the submission of models will be located on the challenge website.
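    As a sketch of that requirement (assuming SimpleITK; the prediction file name is hypothetical), a predicted mask can be resampled onto the reference image geometry with nearest-neighbor interpolation:

    import SimpleITK as sitk

    reference = sitk.ReadImage("2_preRT_T2.nii.gz")    # reference frame for Task 1
    prediction = sitk.ReadImage("prediction.nii.gz")   # hypothetical model output
    # Identity transform: only the grid (spacing, origin, direction) is changed.
    resampled = sitk.Resample(prediction, reference, sitk.Transform(),
                              sitk.sitkNearestNeighbor, 0)
    sitk.WriteImage(resampled, "prediction_in_reference_space.nii.gz")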

    Additional Notes

    General notes.

    NIfTI format images and segmentations may be easily visualized in any NIfTI viewing software such as 3D Slicer.

    Test data will not be made public until the completion of the challenge. The complete training and test data will be published together (along with all original multi-observer annotations and relevant clinical data) at a later date via The Cancer Imaging Archive. Expected date ~ Spring 2025.

    Task 1 related notes.

    When training their algorithms for Task 1, participants can choose to use only pre-RT data or add in mid-RT data as well. Initially, our plan was to limit participants to utilizing only pre-RT data for training their algorithms in Task 1. However, upon reflection, we recognized that in a practical setting, individuals aiming to develop auto-segmentation algorithms could theoretically train models using any accessible data at their disposal. Based on current literature, we actually don't know what the best solution would be! Would the incorporation of mid-RT data for training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear. Therefore, we leave this choice to the participants.

    Remember, though, during testing, you will ONLY have the pre-RT image as an input to your model (naturally, since Task 1 is a pre-RT segmentation task and you won't know what mid-RT data for a patient will look like).

    Task 2 related notes.

    In addition to the mid-RT MRI and segmentation mask, we have also provided a registered pre-RT MRI and the corresponding registered pre-RT segmentation mask for each patient. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms for Task 2 but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish (or ignore the pre-RT images/masks altogether).

    Registrations were generated using SimpleITK, where the mid-RT image serves as the fixed image and the pre-RT image serves as the moving image. Specifically, we utilized the following steps:

    1. Apply a centered transformation.
    2. Apply a rigid transformation.
    3. Apply a deformable transformation with Elastix using a preset parameter map (Parameter map 23 in the Elastix Zoo).

    This particular deformable transformation was selected as it is open-source and was benchmarked in a previous similar application (https://doi.org/10.1002/mp.16128). For cases where excessive warping was noted during deformable registration (a small minority of cases), only the rigid transformation was applied.
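    The snippet below is not the organizers' registration code; it is a minimal SimpleITK sketch of the first two steps (centered initialization and rigid registration). The third, deformable Elastix step is omitted and would require an Elastix wrapper such as ITKElastix.

    import SimpleITK as sitk

    fixed = sitk.ReadImage("2_midRT_T2.nii.gz", sitk.sitkFloat32)   # mid-RT = fixed
    moving = sitk.ReadImage("2_preRT_T2.nii.gz", sitk.sitkFloat32)  # pre-RT = moving

    # Step 1: centered initialization of a rigid (Euler) transform.
    initial = sitk.CenteredTransformInitializer(
        fixed, moving, sitk.Euler3DTransform(),
        sitk.CenteredTransformInitializerFilter.GEOMETRY)

    # Step 2: rigid registration driven by mutual information.
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetOptimizerAsRegularStepGradientDescent(
        learningRate=1.0, minStep=1e-4, numberOfIterations=200)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetInitialTransform(initial, inPlace=False)
    rigid = reg.Execute(fixed, moving)

    # Resample the pre-RT image into the mid-RT space.
    registered = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0)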

    Contact

    We have set up a general email address that you can message to notify all organizers at: hntsmrg2024@gmail.com. Additional specific organizer contacts:

    Kareem A. Wahid, PhD (kawahid@mdanderson.org)

    Cem Dede, MD (cdede@mdanderson.org)

    Mohamed A. Naser, PhD (manaser@mdanderson.org)

  16. Data from: Machine Learning to Predict In-Hospital Mortality in COVID-19 Patients Using Computed Tomography-Derived Pulmonary and Vascular Features

    • zenodo.org
    • data.niaid.nih.gov
    Updated Feb 25, 2022
    Cite
    Simone Schiaffino; Marina Codari; Andrea Cozzi; Domenico Albano; Marco Alì; Roberto Arioli; Emanuele Avola; Claudio Bnà; Maurizio Cariati; Serena Carriero; Massimo Cressoni; Pietro S C Danna; Gianmarco Della Pepa; Giovanni Di Leo; Francesco Dolci; Zeno Falaschi; Nicola Flor; Riccardo A Foà; Salvatore Gitto; Giovanni Leati; Veronica Magni; Alexis E Malavazos; Giovanni Mauri; Carmelo Messina; Lorenzo Monfardini; Alessio Paschè; Filippo Pesapane; Luca M Sconfienza; Francesco Secchi; Edoardo Segalini; Angelo Spinazzola; Valeria Tombini; Silvia Tresoldi; Angelo Vanzulli; Ilaria Vicentin; Domenico Zagaria; Dominik Fleischmann; Francesco Sardanelli (2022). Machine Learning to Predict In-Hospital Mortality in COVID-19 Patients Using Computed Tomography-Derived Pulmonary and Vascular Features [Dataset]. http://doi.org/10.5281/zenodo.6277756
    Explore at:
    Dataset updated
    Feb 25, 2022
    Dataset provided by
    Zenodo
    Authors
    Simone Schiaffino; Marina Codari; Andrea Cozzi; Domenico Albano; Marco Alì; Roberto Arioli; Emanuele Avola; Claudio Bnà; Maurizio Cariati; Serena Carriero; Massimo Cressoni; Pietro S C Danna; Gianmarco Della Pepa; Giovanni Di Leo; Francesco Dolci; Zeno Falaschi; Nicola Flor; Riccardo A Foà; Salvatore Gitto; Giovanni Leati; Veronica Magni; Alexis E Malavazos; Giovanni Mauri; Carmelo Messina; Lorenzo Monfardini; Alessio Paschè; Filippo Pesapane; Luca M Sconfienza; Francesco Secchi; Edoardo Segalini; Angelo Spinazzola; Valeria Tombini; Silvia Tresoldi; Angelo Vanzulli; Ilaria Vicentin; Domenico Zagaria; Dominik Fleischmann; Francesco Sardanelli
    Description

    Dataset from Schiaffino S, Codari M, Cozzi A, Albano D, Alì M, Arioli R, Avola E, Bnà C, Cariati M, Carriero S, Cressoni M, Danna PSC, Della Pepa G, Di Leo G, Dolci F, Falaschi Z, Flor N, Foà RA, Gitto S, Leati G, Magni V, Malavazos AE, Mauri G, Messina C, Monfardini L, Paschè A, Pesapane F, Sconfienza LM, Secchi F, Segalini E, Spinazzola A, Tombini V, Tresoldi S, Vanzulli A, Vicentin I, Zagaria D, Fleischmann D, Sardanelli F. Machine Learning to Predict In-Hospital Mortality in COVID-19 Patients Using Computed Tomography-Derived Pulmonary and Vascular Features. J Pers Med. 2021 Jun 3;11(6):501. doi: 10.3390/jpm11060501. PMID: 34204911; PMCID: PMC8230339.

    Abstract

    Pulmonary parenchymal and vascular damage are frequently reported in COVID-19 patients and can be assessed with unenhanced chest computed tomography (CT), widely used as a triaging exam. Integrating clinical data, chest CT features, and CT-derived vascular metrics, we aimed to build a predictive model of in-hospital mortality using univariate analysis (Mann-Whitney U test) and machine learning models (support vector machines (SVM) and multilayer perceptrons (MLP)). Patients with RT-PCR-confirmed SARS-CoV-2 infection and unenhanced chest CT performed on emergency department admission were included after retrieving their outcome (discharge or death), with an 85/15% training/test dataset split. Out of 897 patients, the 229 (26%) patients who died during hospitalization had higher median pulmonary artery diameter (29.0 mm) than patients who survived (27.0 mm, p < 0.001) and higher median ascending aortic diameter (36.6 mm versus 34.0 mm, p < 0.001). SVM and MLP best models considered the same ten input features, yielding a 0.747 (precision 0.522, recall 0.800) and 0.844 (precision 0.680, recall 0.567) area under the curve, respectively. In this model integrating clinical and radiological data, pulmonary artery diameter was the third most important predictor after age and parenchymal involvement extent, contributing to reliable in-hospital mortality prediction, highlighting the value of vascular metrics in improving patient stratification.
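    The code below is not the authors' pipeline; it is a minimal scikit-learn sketch of the reported setup (85/15 split, SVM and MLP classifiers, AUC evaluation), with a synthetic feature table standing in for the clinical and CT-derived features:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score

    # Placeholder data: replace with the actual feature table and mortality outcome.
    X, y = make_classification(n_samples=897, n_features=10,
                               weights=[0.74, 0.26], random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, stratify=y, random_state=0)

    for model in (SVC(probability=True), MLPClassifier(max_iter=1000)):
        clf = make_pipeline(StandardScaler(), model)
        clf.fit(X_train, y_train)
        auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        print(type(model).__name__, "AUC:", round(auc, 3))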

  17. A Dataset of Outdoor RSS Measurements for Localization

    • zenodo.org
    • data.niaid.nih.gov
    tiff, zip
    Updated Jul 6, 2024
    Cite
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara (2024). A Dataset of Outdoor RSS Measurements for Localization [Dataset]. http://doi.org/10.5281/zenodo.10962857
    Explore at:
    tiff, zipAvailable download formats
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update: New version includes additional samples taken in November 2022.

    Dataset Description

    This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.

    The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.

    Dataset Description   Sample Count   Receiver Count
    No-Tx Samples         46             10 to 25
    1-Tx Samples          4,822          10 to 25
    2-Tx Samples          346            11 to 12

    The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:

    \(RSS = \frac{10}{N} \log_{10}\left(\sum_i^N x_i^2 \right) \)

    Measurement Parameter   Description
    Frequency               462.7 MHz
    Radio Gain              35 dB
    Receiver Sample Rate    2 MHz
    Sample Length           N = 10,000
    Band-pass Filter        6 kHz
    Transmitters            0 to 2
    Transmission Power      1 W

    Receivers consist of Ettus USRP X310 and B210 radios with a mix of wide- and narrow-band antennas. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and only relative to the device.

    Usage Instructions

    Data is provided in .json format, both as one file and as split files.

    import json
    data_file = 'powder_462.7_rss_data.json'
    with open(data_file) as f:
      data = json.load(f)
    

    The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:

    • rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
    • tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
    • metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
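    Continuing from the loading snippet above, the following is a minimal parsing sketch for one sample. It assumes each rx_data entry is ordered as [RSS, latitude, longitude, device name] and each tx_coords entry as [latitude, longitude], as listed above; check the dataset documentation if your copy orders the fields differently.

    import json
    import numpy as np

    with open('powder_462.7_rss_data.json') as f:
        data = json.load(f)

    timestamp, sample = next(iter(data.items()))  # one sample, keyed by timestamp
    rss = np.array([rx[0] for rx in sample['rx_data']], dtype=float)
    rx_latlon = np.array([rx[1:3] for rx in sample['rx_data']], dtype=float)
    devices = [rx[3] for rx in sample['rx_data']]
    tx_latlon = np.array(sample['tx_coords'], dtype=float)
    print(timestamp, rss.shape, rx_latlon.shape, tx_latlon.shape)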

    File Separations and Train/Test Splits

    In the separated_data.zip folder there are several train/test separations of the data.

    • all_data contains all the data in the main JSON file, separated by the number of transmitters.
    • stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
    • train_test_splits contains unique data splits used for training and evaluating ML models. These splits only use data from the single-tx case. In other words, the union of the splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
      • The random split is a random 80/20 split of the data.
      • special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
      • The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test accordingly. One such random assignment of grid squares makes up the grid split.
      • The seasonal split contains data separated by the month of collection: April, July, or November.
      • The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
      • campus.json contains the on-campus data, so it is equivalent to the union of the splits, excluding unused.json.

    Digital Surface Model

    The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.

    To read the data in python:

    import rasterio as rio
    import numpy as np
    import utm

    dsm_object = rio.open('dsm.tif')
    dsm_map = dsm_object.read(1)          # a np.array containing elevation values
    dsm_resolution = dsm_object.res       # a tuple containing x,y resolution (0.5 meters)
    dsm_transform = dsm_object.transform  # an Affine transform for conversion to UTM-12 coordinates
    utm_transform = np.array(dsm_transform).reshape((3, 3))[:2]
    # The affine transform maps (col, row, 1) -> (easting, northing); rasterio shapes
    # are (rows, cols), so the column count comes from shape[1] and the row count from shape[0].
    utm_top_left = utm_transform @ np.array([0, 0, 1])
    utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])
    latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
    latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
    

    Dataset Acknowledgement: This DSM file is acquired by the State of Utah and its partners, and is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners makes no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.

    DSM DOI: https://doi.org/10.5069/G9TH8JNQ

  18. InvisibleEye

    • darus.uni-stuttgart.de
    zip
    Updated May 16, 2024
    + more versions
    Cite
    Andreas Bulling (2024). InvisibleEye [Dataset]. http://doi.org/10.18419/DARUS-3288
    Explore at:
    zip(1235691219), zip(2370006865), zip(1832535377), zip(2691875234), zip(1163575562), zip(1307174209), zip(1479061695), zip(1756841357), zip(1336569492), zip(2151001053), zip(1793491707), zip(1282196267), zip(1688704726), zip(1236630637), zip(2411744938), zip(1215164360), zip(1062125923), zip(1554962973), zip(1686975553), zip(1580571025), zip(1234665098), zip(2274454879), zip(1601682439), zip(1379202941)Available download formats
    Dataset updated
    May 16, 2024
    Dataset provided by
    DaRUS
    Authors
    Andreas Bulling
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Alexander von Humboldt Foundation
    Description

    We recorded a dataset of more than 280,000 close-up eye images with ground truth annotation of the gaze location. A total of 17 participants were recorded, covering a wide range of appearances:

    • Gender: Five (29%) female and 12 (71%) male
    • Nationality: Seven (41%) German, seven (41%) Indian, one (6%) Bangladeshi, one (6%) Iranian, and one (6%) Greek
    • Eye Color: 12 (70%) brown, four (23%) blue, and one (5%) green
    • Glasses: Four participants (23%) wore regular glasses and one (6%) wore contact lenses

    For each participant, two sets of data were recorded: one set of training data and a separate set of test data. For each set, a series of gaze targets was shown on a display that participants were instructed to look at. For both training and test data the gaze targets covered a uniform grid in a random order, where the grid corresponding to the test data was positioned to lie in between the training points. Since the NanEye cameras record at about 44 FPS, we gathered approximately 22 frames per camera and gaze target. The training data was recorded using a uniform 24 × 17 grid of points, with an angular distance in gaze angle of 1.45° horizontally and 1.30° vertically between the points. In total the training set contained about 8,800 images per camera and participant. The test set's points belonged to a 23 × 16 grid of points and it contains about 8,000 images per camera and participant. This way, the gaze targets covered a field of view of 35° horizontally and 22° vertically.

    The recording procedure was split into two parts for training and test data. For both parts, participants were instructed to put on the prototype and rest their head on a chin rest positioned exactly 510 mm in front of a display. The display was a 30-inch LED monitor with a pixel pitch of 0.25 mm and viewable image dimensions of 641.3 × 400.8 mm, set to 2560 × 1600-pixel resolution. On the display, the grid of gaze targets was shown, which the participants were instructed to look at. Each point appeared as a big circle 300 pixels in diameter and shrunk to a circle of 8 pixels diameter over the course of 700 ms. The small circle was then displayed for another 500 ms, until the display of the next point started. Data was only recorded during the latter 500 ms, i.e. while the small circle was shown (see Figure 7a). It is important to note that the chin rest did not fully restrain participants and we noticed that their head sometimes moved noticeably, thus resulting in a certain amount of label noise. Using the shrinking animation for the circle helps the participants to locate the circle on the screen and gives them time to relocate their gaze. Similar to [30], we also showed an "L" or an "R" in between every 20th pair of points in the sequence. The letter was displayed for 500 ms at the position of the last point. Participants were asked to confirm the letter they had seen by pressing the corresponding left or right arrow-key. This was done to ensure participants focused on the gaze targets and task at hand throughout the recording.
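    A back-of-the-envelope sketch of the recording geometry described above (510 mm viewing distance, 0.25 mm pixel pitch); it is illustrative only and not part of the released data:

    import math

    VIEWING_DISTANCE_MM = 510.0
    PIXEL_PITCH_MM = 0.25

    def pixel_offset_to_gaze_angle_deg(offset_px: float) -> float:
        """Gaze angle (degrees) for an on-screen offset from the fixation axis."""
        return math.degrees(math.atan2(offset_px * PIXEL_PITCH_MM, VIEWING_DISTANCE_MM))

    # Horizontal spacing of the 24 x 17 training grid (1.45 degrees) in screen pixels:
    spacing_px = math.tan(math.radians(1.45)) * VIEWING_DISTANCE_MM / PIXEL_PITCH_MM
    print(round(spacing_px), "px between neighbouring targets")  # roughly 52 px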

  19. STARSS23: Sony-TAu Realistic Spatial Soundscapes 2023

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jun 26, 2023
    + more versions
    Cite
    Archontis Politis; Kazuki Shimada; Parthasaarathy Sudarsanam; Aapo Hakala; Shusuke Takahashi; Daniel Alexander Krause; Naoya Takahashi; Sharath Adavanne; Yuichiro Koyama; Kengo Uchida; Yuki Mitsufuji; Tuomas Virtanen (2023). STARSS23: Sony-TAu Realistic Spatial Soundscapes 2023 [Dataset]. http://doi.org/10.5281/zenodo.7709052
    Explore at:
    bin, zipAvailable download formats
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Archontis Politis; Kazuki Shimada; Parthasaarathy Sudarsanam; Aapo Hakala; Shusuke Takahashi; Daniel Alexander Krause; Naoya Takahashi; Sharath Adavanne; Yuichiro Koyama; Kengo Uchida; Yuki Mitsufuji; Tuomas Virtanen
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DESCRIPTION:

    The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array format (MIC) and a first-order Ambisonics format (FOA). These recordings serve as the development dataset for the DCASE 2023 Sound Event Localization and Detection Task of the DCASE 2023 Challenge.

    The STARSS23 dataset is a continuation of the STARSS22 dataset. It extends the previous version with the following:

    • An additional 2.5hrs of recordings in the development set, from 5 new rooms distributed in 47 new recording clips.
    • Distance labels (in cm) for the spatially annotated sound events, apart from only the previous azimuth and elevation labels.
    • 360° videos spatially and temporally aligned to the audio recordings of the dataset (apart from 12 audio-only clips).
    • Additional new audio and video recordings will be added in the evaluation set of the dataset in a subsequent version.

    Contrary to the three previous datasets of synthetic spatial sound scenes (TAU Spatial Sound Events 2019 (development/evaluation), TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021) associated with previous iterations of the DCASE Challenge, the STARSS22-23 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic scene generation. Some key properties are:

    • annotations are based on a combination of human annotators for sound event activity and optical tracking for spatial positions,
    • the annotated target event classes are determined by the composition of the real scenes,
    • the density, polyphony, occurrences, and co-occurrences of events and sound classes are not random; they follow the actions and interactions of participants in the real scenes.

    The first round of recordings was collected between September 2021 and January 2022. A second round of recordings was collected between November 2022 and February 2023.

    Collection of data from the TAU side has received funding from Google.

    REPORT & REFERENCE:

    If you use this dataset you could cite this report on its design, capturing, and annotation process:

    Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.

    found here.

    A more detailed report on the properties of the new dataset and its audiovisual processing with a suitable baseline for DCASE2023 will be published soon.

    AIM:

    The STARSS22-23 dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.

    Specifically the STARSS23 allows additionally evaluation of audiovisual processing methods, such as audiovisual source localization.

    SPECIFICATIONS:

    General:

    • Recordings are taken in two different sites.
    • Each recording clip is part of a recording session happening in a unique room.
    • Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
    • To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sounds events, the scenes are loosely scripted.
    • 13 target classes are identified in the recordings and strongly annotated by humans.
    • Spatial annotations for those active events are captured by an optical tracking system.
    • Sound events out of the target classes are considered as interference.
    • Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.

    Volume, duration, and data split:

    • A total of 16 unique rooms captured in the recordings, 4 in Tokyo and 12 in Tampere (development set).
    • 70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, captured in Tokyo (development dataset).
    • 98 recording clips of 40 sec ~ 9 min durations, with a total time of ~5.5hrs, captured in Tampere (development dataset).
    • A training-testing split is provided for reporting results using the development dataset.
    • 40 recordings contributed by Sony for the training split, captured in 2 rooms (dev-train-sony).
    • 30 recordings contributed by Sony for the testing split, captured in 2 rooms (dev-test-sony).
    • 50 recordings contributed by TAU for the training split, captured in 7 rooms (dev-train-tau).
    • 48 recordings contributed by TAU for the testing split, captured in 5 rooms (dev-test-tau).
    • About ~3.5hrs of additional recordings from both sites, captured in different rooms from the development set, will be released later as the evaluation set.

    Audio:

    • Sampling rate: 24kHz.
    • Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).
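    A minimal loading sketch for one 4-channel clip, assuming the soundfile package; the exact subfolder inside foa_dev.zip may differ from the path used here, and the clip name is one of the files mentioned below:

    import soundfile as sf

    # Adjust the path to match your extraction of foa_dev.zip.
    audio, sr = sf.read("foa_dev/fold3_room21_mix001.wav")
    print(sr)           # 24000, per the sampling-rate specification above
    print(audio.shape)  # (num_samples, 4): the four FOA channels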

    Video:

    • Video 360° format: equirectangular
    • Video resolution: 1920x960
    • Video frames per second (fps): 29.97
    • All audio recordings are accompanied by synchronised video recordings, apart from 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav)

    More detailed information on the dataset can be found in the included README file.

    SOUND CLASSES:

    13 target sound event classes are annotated. The classes follow loosely the Audioset ontology.

    0. Female speech, woman speaking
    1. Male speech, man speaking
    2. Clapping
    3. Telephone
    4. Laughter
    5. Domestic sounds
    6. Walk, footsteps
    7. Door, open or close
    8. Music
    9. Musical instrument
    10. Water tap, faucet
    11. Bell
    12. Knock

    The content of some of these classes corresponds to events of a limited range of Audioset-related subclasses. For more information see the README file.

    EXAMPLE APPLICATION:

    An implementation of a trainable model of a convolutional recurrent neural network, performing joint SELD, trained and evaluated with this dataset is provided here. This implementation will serve as the baseline method for the audio-only track in the DCASE 2023 Sound Event Localization and Detection Task.

    A baseline for the audiovisual track of DCASE 2023 Sound Event Localization and Detection Task will be published soon and referenced here.

    DEVELOPMENT AND EVALUATION:

    The current version (Version 1.0) of the dataset includes only the 168 development audio/video recordings and labels, used by the participants of Task 3 of the DCASE2023 Challenge to train and validate their submitted systems. Version 1.1 will additionally include the evaluation audio and video recordings without labels, for the evaluation phase of DCASE2023.

    If researchers wish to compare their system against the submissions of DCASE2023 Challenge, they will have directly comparable results if they use the evaluation data as their testing set.

    DOWNLOAD INSTRUCTIONS:

    The file foa_dev.zip corresponds to the audio data of the FOA recording format.
    The file mic_dev.zip corresponds to the audio data of the MIC recording format.

    The file video_dev.zip corresponds to the video recordings of the development set.

  20. MSL Curiosity Rover Images with Science and Engineering Classes

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 17, 2020
    Cite
    Steven Lu (2020). MSL Curiosity Rover Images with Science and Engineering Classes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892023
    Explore at:
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Steven Lu
    Kiri L. Wagstaff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.

    Data Set Description

    The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover by three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes of science and engineering interest (see the "Classes" section for more information), and each image is assigned one class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.

    Directory Contents

    images - contains all 6,820 images

    class_map.csv - string-integer class mappings

    train-set-v2.1.txt - label file for the training set

    val-set-v2.1.txt - label file for the validation set

    test-set-v2.1.txt - label file for the test set

    The label files are formatted as below:

    "Image-file-name class_in_integer_representation"

    Labeling Process

    Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:

    If all three labels agree with each other, then use the label as the final label.

    If the three labels do not agree with each other, then we manually review the labels and decide the final label.

    We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.

    Classes

    There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from strings to integer representations. The class names, their per-split counts, and their integer mappings are shown below:

    Class name           Count (train)   Count (val)   Count (test)   Integer representation
    Arm cover            10              1             4              0
    Other rover part     190             11            10             1
    Artifact             680             62            132            2
    Nearby surface       1554            74            187            3
    Close-up rock        1422            50            84             4
    DRT                  8               4             6              5
    DRT spot             214             1             7              6
    Distant landscape    342             14            34             7
    Drill hole           252             5             12             8
    Night sky            40              3             4              9
    Float                190             5             1              10
    Layers               182             21            17             11
    Light-toned veins    42              4             27             12
    Mastcam cal target   122             12            29             13
    Sand                 228             19            16             14
    Sun                  182             5             19             15
    Wheel                212             5             5              16
    Wheel joint          62              1             5              17
    Wheel tracks         26              3             1              18

    Image Augmentation

    Only the training set contains augmented images: 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different augmentation methods. Images taken by the Mastcam left and right eye cameras were augmented using horizontal flipping only, while images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain the set of non-augmented images (a minimal filtering sketch follows the list below).

    90 degrees clockwise rotation (file name ends with -r90.jpg)

    180 degrees clockwise rotation (file name ends with -r180.jpg)

    270 degrees clockwise rotation (file name ends with -r270.jpg)

    Horizontal flip (file name ends with -fh.jpg)

    Vertical flip (file name ends with -fv.jpg)
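    As referenced above, this is a minimal sketch for recovering the non-augmented training images by filtering out the augmentation suffixes; the label file name train-set-v2.1.txt is taken from the directory listing above.

    AUG_SUFFIXES = ("-r90.jpg", "-r180.jpg", "-r270.jpg", "-fh.jpg", "-fv.jpg")

    with open("train-set-v2.1.txt") as f:
        rows = [line.rsplit(maxsplit=1) for line in f if line.strip()]

    original = [(name, label) for name, label in rows if not name.endswith(AUG_SUFFIXES)]
    print(len(rows), "total training images ->", len(original), "non-augmented")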

    Acknowledgment

    The authors would like to thank the volunteers (as listed in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for the continuous support of this work.
