Titanic Dataset Description

Overview

The data is divided into two groups:

- Training set (train.csv): Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to predict survival based on “features” like gender and class. Feature engineering can also be applied to create new features.
- Test set (test.csv): Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.
Additionally, gender_submission.csv is provided as an example submission file, containing predictions based on the assumption that all and only female passengers survive.
Data Dictionary

| Variable | Definition | Key |
|------------|------------------------------------------|-------------------------------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard the Titanic | |
| parch | # of parents/children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: Proxy for socio-economic status (SES):
1st = Upper
2nd = Middle
3rd = Lower
age:
Fractional if less than 1 year.
Estimated ages are represented in the form xx.5.
sibsp: Defines family relations as:
Sibling: Brother, sister, stepbrother, stepsister.
Spouse: Husband, wife (excluding mistresses and fiancés).
parch: Defines family relations as:
Parent: Mother, father.
Child: Daughter, son, stepdaughter, stepson.
Some children traveled only with a nanny, so parch = 0 for them.
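As an illustration of how these files fit together, the following is a minimal sketch using pandas and scikit-learn; it assumes train.csv and test.csv are in the working directory and uses the capitalized column names of the Kaggle CSV files (Survived, Pclass, Sex, PassengerId), which correspond to the lowercase names in the data dictionary above.

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex"]
X_train = pd.get_dummies(train[features])   # one-hot encode Sex; Pclass stays numeric
y_train = train["Survived"]
X_test = pd.get_dummies(test[features])

model = LogisticRegression().fit(X_train, y_train)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)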
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The REalistic Single Image DEhazing (RESIDE) dataset is a large-scale benchmark dataset for single image dehazing. It consists of 1000 hazy images and their corresponding ground truth clear images. The dataset is divided into a training set of 800 images and a test set of 200 images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File descriptions
Data fields
For more information about the CURE-OR dataset, please refer to the webpage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Dataset for "Exploring the viability of a machine learning based multimodel for quantitative precipitation forecast post-processing"
Description:
This dataset supports the study presented in the paper "Exploring the viability of a machine learning based multimodel for quantitative precipitation forecast post-processing". The work focuses on improving quantitative precipitation forecasts over the Piedmont and Aosta Valley regions in Italy by blending outputs from four Numerical Weather Prediction (NWP) models using machine learning architectures, namely Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) in the form of U-Net and Residual U-Net, with NWIOI observational data (Turco et al., 2013) used as reference.
Observational data from NWIOI serve as the ground truth for model training. The dataset contains 406 gridded precipitation events from 2018 to 2022.
Dataset contents:
- obs.zip: NWIOI observed precipitation data (.csv format, one file per event)
- subsets.zip: Event dates for 10 different training-validation-test sets, retrieved with 10-fold cross-validation (.csv format, one file per set and per split)
- domain_mask.csv: Binary mask (1 for grid points in the study area, 0 otherwise)
- allevents_dates_zenodo.csv: Summary statistics and classification of all events by intensity and nature, used for subset creation with 10-fold cross-validation

Citations:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original source from Kaggle: https://www.kaggle.com/c/trackml-particle-identification/data
The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN. The goal of the tracking machine learning challenge is to group the recorded measurements or hits for each event into tracks, sets of hits that belong to the same initial particle. A solution must uniquely associate each hit to one track. The training dataset contains the recorded hits, their ground truth counterpart and their association to particles, and the initial parameters of those particles. The test dataset contains only the recorded hits.
Once unzipped, the dataset is provided as a set of plain .csv files. Each event has four associated files that contain hits, hit cells, particles, and the ground truth association between them. The common prefix, e.g. event000000010, is always event followed by 9 digits.
event000000000-hits.csv
event000000000-cells.csv
event000000000-particles.csv
event000000000-truth.csv
event000000001-hits.csv
event000000001-cells.csv
event000000001-particles.csv
event000000001-truth.csv
Event hits
The hits file contains the following values for each hit/entry:
hit_id: numerical identifier of the hit inside the event.
x, y, z: measured x, y, z position (in millimeter) of the hit in global coordinates.
volume_id: numerical identifier of the detector group.
layer_id: numerical identifier of the detector layer inside the group.
module_id: numerical identifier of the detector module inside the layer.
The volume/layer/module id could in principle be deduced from x, y, z. They are given here to simplify detector-specific data handling.
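As a small illustration (not part of the official kit), the per-event files can be read with pandas; the file names follow the naming scheme above and are assumed to sit in the current directory.

import pandas as pd

prefix = "event000000000"   # any event prefix from the listing above
hits = pd.read_csv(f"{prefix}-hits.csv")
cells = pd.read_csv(f"{prefix}-cells.csv")
particles = pd.read_csv(f"{prefix}-particles.csv")
truth = pd.read_csv(f"{prefix}-truth.csv")

# Attach the true particle_id to each hit via hit_id
hits_with_truth = hits.merge(truth[["hit_id", "particle_id"]], on="hit_id")
print(hits_with_truth.head())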
Event truth
The truth file contains the mapping between hits and generating particles and the true particle state at each measured hit. Each entry maps one hit to one particle.
hit_id: numerical identifier of the hit as defined in the hits file.
particle_id: numerical identifier of the generating particle as defined in the particles file. A value of 0 means that the hit did not originate from a reconstructible particle, but e.g. from detector noise.
tx, ty, tz: true intersection point in global coordinates (in millimeters) between the particle trajectory and the sensitive surface.
tpx, tpy, tpz: true particle momentum (in GeV/c) in the global coordinate system at the intersection point. The corresponding vector is tangent to the particle trajectory at the intersection point.
weight: per-hit weight used for the scoring metric; the total sum of weights within one event equals one.
Event particles
The particles file contains the following values for each particle/entry:
particle_id: numerical identifier of the particle inside the event.
vx, vy, vz: initial position or vertex (in millimeters) in global coordinates.
px, py, pz: initial momentum (in GeV/c) along each global axis.
q: particle charge (as multiple of the absolute electron charge).
nhits: number of hits generated by this particle.
All entries contain the generated information or ground truth.
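For example, derived kinematic quantities such as the transverse momentum can be computed directly from these columns. A minimal sketch, assuming the particles file has been loaded into a pandas DataFrame as in the sketch above:

import numpy as np

# particles is the DataFrame read from <prefix>-particles.csv
particles["pt"] = np.hypot(particles["px"], particles["py"])      # transverse momentum (GeV/c)
particles["p"] = np.sqrt(particles["px"]**2 + particles["py"]**2 + particles["pz"]**2)
print(particles[["particle_id", "pt", "p", "q", "nhits"]].head())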
Event hit cells
The cells file contains the constituent active detector cells that comprise each hit. The cells can be used to refine the hit-to-track association. A cell is the smallest granularity inside each detector module, much like a pixel on a screen, except that depending on the volume_id a cell can be a square or a long rectangle. It is identified by two channel identifiers that are unique within each detector module and encode the position, much like the column/row numbers of a matrix. A cell can also provide signal information that the detector module has recorded in addition to the position. Depending on the detector type, only one of the channel identifiers may be valid (e.g. for the strip detectors), and the value might have a different resolution.
hit_id: numerical identifier of the hit as defined in the hits file.
ch0, ch1: channel identifier/coordinates unique within one module.
value: signal value information, e.g. how much charge a particle has deposited.
Additional detector geometry information
The detector is built from silicon slabs (or modules, rectangular or trapezoidal), arranged in cylinders and disks, which measure the position (or hits) of the particles that cross them. The detector modules are organized into detector groups or volumes identified by a volume id. Inside a volume they are further grouped into layers identified by a layer id. Each layer can contain an arbitrary number of detector modules, the smallest geometrically distinct detector objects, each identified by a module_id. Within each group, detector modules are of the same type and have, for example, the same granularity. All simulated detector modules are so-called semiconductor sensors built from thin silicon sensor chips. Each module can be represented by a two-dimensional, planar, bounded sensitive surface. These sensitive surfaces are subdivided into regular grids that define the detector cells, the smallest granularity within the detector.
Each module has a different position and orientation described in the detectors file. A local, right-handed coordinate system is defined on each sensitive surface such that the first two coordinates u and v are on the sensitive surface and the third coordinate w is normal to the surface. The orientation and position are defined by the following transformation
pos_xyz = rotation_matrix * pos_uvw + translation
which transforms a position described in local coordinates u, v, w into the equivalent position x, y, z in global coordinates using a rotation matrix and a translation vector (cx, cy, cz).
volume_id: numerical identifier of the detector group.
layer_id: numerical identifier of the detector layer inside the group.
module_id: numerical identifier of the detector module inside the layer.
cx, cy, cz: position of the local origin in the global coordinate system (in millimeter).
rot_xu, rot_xv, rot_xw, rot_yu, ...: components of the rotation matrix to rotate from local u,v,w to global x,y,z coordinates.
module_t: half thickness of the detector module (in millimeter).
module_minhu, module_maxhu: the minimum/maximum half-length of the module boundary along the local u direction (in millimeter).
module_hv: the half-length of the module boundary along the local v direction (in millimeter).
pitch_u, pitch_v: the size of detector cells along the local u and v direction (in millimeter).
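As an illustration of the transformation above, the following sketch maps a local position (u, v, w) on one module to global coordinates. It assumes the geometry file has been read into a pandas DataFrame and that the remaining rotation components (rot_zu, rot_zv, rot_zw) follow the naming pattern indicated above; the file name detectors.csv is an assumption and should be adjusted to the actual geometry file.

import numpy as np
import pandas as pd

detectors = pd.read_csv("detectors.csv")   # file name assumed

def local_to_global(row, u, v, w):
    """Map a local position (u, v, w) on one module to global (x, y, z)."""
    rotation = np.array([
        [row["rot_xu"], row["rot_xv"], row["rot_xw"]],
        [row["rot_yu"], row["rot_yv"], row["rot_yw"]],
        [row["rot_zu"], row["rot_zv"], row["rot_zw"]],
    ])
    translation = np.array([row["cx"], row["cy"], row["cz"]])
    return rotation @ np.array([u, v, w]) + translation

module = detectors.iloc[0]
print(local_to_global(module, 0.0, 0.0, 0.0))   # the local origin maps to (cx, cy, cz)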
There are two different module shapes in the detector, rectangular and trapezoidal. The pixel detector (volume_id = 7, 8, 9) is fully built from rectangular modules, and so are the cylindrical barrels in volume_id = 13, 17. The remaining layers are made out of disks that need trapezoidal shapes to cover the full disk.
The WIDER FACE dataset is a face detection benchmark dataset whose images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion, as depicted in the sample images. The WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% of the data as training, validation and testing sets. We adopt the same evaluation metric employed in the PASCAL VOC dataset. Similar to the MALF and Caltech datasets, we do not release bounding box ground truth for the test images. Users are required to submit final prediction files, which we shall proceed to evaluate.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wider_face', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wider_face-0.1.0.png
Overview
The RAD-ChestCT dataset is a large medical imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset. This dataset is of significant interest to the machine learning and medical imaging research communities.
Papers
The following published paper includes a description of how the RAD-ChestCT dataset was created: Draelos et al., "Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes," Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857 https://pubmed.ncbi.nlm.nih.gov/33129142/
Two additional papers leveraging the RAD-ChestCT dataset are available as preprints:
"Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks" (https://arxiv.org/abs/2011.08891)
"Explainable multiple abnormality classification of chest CT volumes with deep learning" (https://arxiv.org/abs/2111.12215)
Details about the files included in this data release
Metadata Files (4)
CT_Scan_Metadata_Complete_35747.csv: includes metadata about the whole dataset, with information extracted from DICOM headers.
Extrema_5747.csv: includes coordinates for lung bounding boxes for the whole dataset. Coordinates were derived computationally using a morphological image processing lung segmentation pipeline.
Indications_35747.csv: includes scan indications for the whole dataset. Indications were extracted from the free-text reports.
Summary_3630.csv: includes a listing of the 3,630 scans that are part of this repository.
Label Files (3)
The label files contain abnormality x location labels for the 3,630 shared CT volumes. Each CT volume is annotated with a matrix of 84 abnormality labels x 52 location labels. Labels were extracted from the free text reports using the Sentence Analysis for Radiology Label Extraction (SARLE) framework. For each CT scan, the label matrix has been flattened and the abnormalities and locations are separated by an asterisk in the CSV column headers (e.g. "mass*liver"). The labels can be used as the ground truth when training computer vision classifiers on the CT volumes. Label files include: imgtrain_Abnormality_and_Location_Labels.csv (for the training set)
imgvalid_Abnormality_and_Location_Labels.csv (for the validation set)
imgtest_Abnormality_and_Location_Labels.csv (for the test set)
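As an illustration of the flattened label layout described above, the asterisk-separated column headers can be unflattened back into an abnormality x location matrix per scan. This is a sketch only: the file name follows the listing above, and it assumes the first column of the CSV identifies the scan, which should be checked against the actual files.

import pandas as pd

labels = pd.read_csv("imgtrain_Abnormality_and_Location_Labels.csv", index_col=0)

# Split "abnormality*location" headers back into a two-level index,
# so each row can be viewed as an abnormality x location matrix.
labels.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("*")) for col in labels.columns],
    names=["abnormality", "location"],
)
first_scan = labels.iloc[0].unstack("location")   # 84 x 52 matrix for one CT volume
print(first_scan.shape)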
CT Volume Files (3,630)
Each CT scan is provided as a compressed 3D numpy array (npz format). The CT scans can be read using the Python package numpy, version 1.14.5 and above.
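For example, a single volume can be loaded as follows. This is a minimal sketch: the file name is hypothetical, and the key under which the array is stored is looked up from the archive rather than assumed.

import numpy as np

archive = np.load("example_scan.npz")      # hypothetical file name
key = archive.files[0]                     # the volume is stored under a single array key
volume = archive[key]                      # 3D CT volume array
print(volume.shape, volume.dtype)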
Related Code
Code related to RAD-ChestCT is publicly available on GitHub at https://github.com/rachellea.
Repositories of interest include:
https://github.com/rachellea/ct-net-models contains PyTorch code to load the RAD-ChestCT dataset and train convolutional neural network models for multiple abnormality prediction from whole CT volumes.
https://github.com/rachellea/ct-volume-preprocessing contains an end-to-end Python framework to convert CT scans from DICOM to numpy format. This code was used to prepare the RAD-ChestCT volumes.
https://github.com/rachellea/sarle-labeler contains the Python implementation of the SARLE label extraction framework used to generate the abnormality and location label matrix from the free text reports. SARLE has minimal dependencies and the abnormality and location vocabulary terms can be easily modified to adapt SARLE to different radiologic modalities, abnormalities, and anatomical locations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 14 February 2022.
--- Dataset description provided by original source is as follows ---
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
|------------|------------------------------------------|-------------------------------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
--- Original source retains full ownership of the source dataset ---
Original source from Codalab: https://competitions.codalab.org/competitions/20112
The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN. The file layout and field definitions are identical to those of the TrackML dataset described above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:
Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
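As an illustration of this fit/score/label structure, the following is a sketch only: the file names and array format are hypothetical, and scikit-learn's IsolationForest stands in for whatever outlier detector is actually used. A model is fit on the "fit" data, scores the "score" data, and is evaluated against the labels.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Hypothetical file names; each domain provides fit, score, and label data.
X_fit = np.load("fit.npy")
X_score = np.load("score.npy")
y_label = np.load("labels.npy")          # 1 = outlier, 0 = inlier

detector = IsolationForest(random_state=0).fit(X_fit)
# score_samples returns higher values for inliers, so negate it for an outlier score.
outlier_score = -detector.score_samples(X_score)
print("ROC AUC:", roc_auc_score(y_label, outlier_score))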
To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:
Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file.

Due to upload size limitations, images are stored in two files:
- HAM10000_images_part1.zip (5000 JPEG files)
- HAM10000_images_part2.zip (5015 JPEG files)

Additional data for evaluation purposes

The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3). The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the official validation set is available through the challenge website https://challenge2018.isic-archive.com/. The ISIC Archive also provides a "Live challenge" submission site for continuous evaluation of automated classifiers on the official validation and test sets.

Comparison to physicians

Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019.

Human-computer collaboration

The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020. The following corresponding metadata is available herein:
- ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (malignancy probability, multi-class probability, CBIR) or human-crowd multi-class probabilities. This data was collected for and analyzed in Tschandl P. et al., Nature Medicine 2020, therefore please refer to this publication when using the data.
- HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected or replaced via the free-hand selection tool in FIJI.
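For example, the metadata file can be used to group images by lesion before splitting the data, so that multiple images of the same lesion do not end up in both training and validation sets. This is a minimal sketch: the lesion_id column follows the description above, while the exact metadata file name and delimiter are assumptions to check against the release.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("HAM10000_metadata")     # may be HAM10000_metadata.csv depending on the release

# Group-aware split: all images of one lesion_id stay in the same partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(meta, groups=meta["lesion_id"]))
print(len(train_idx), "training images,", len(valid_idx), "validation images")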
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Due to the complicated and variable fundus status of highly myopic eyes, their visual benefit from cataract surgery remains hard to determine preoperatively. We therefore aimed to develop optical coherence tomography (OCT)-based deep learning algorithms to predict the postoperative visual acuity of highly myopic eyes after cataract surgery.

Materials and Methods: The internal dataset consisted of 1,415 highly myopic eyes having cataract surgeries in our hospital. Another external dataset consisted of 161 highly myopic eyes from Heping Eye Hospital. Preoperative macular OCT images were set as the only feature. The best corrected visual acuity (BCVA) at 4 weeks after surgery was set as the ground truth. Five different deep learning algorithms, namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and Inception-v3, were used to develop the model aiming at predicting the postoperative BCVA, and an ensemble learning model was further developed. The model was further evaluated in the internal and external test datasets.

Results: The ensemble learning model showed the lowest mean absolute error (MAE) of 0.1566 logMAR and the lowest root mean square error (RMSE) of 0.2433 logMAR in the validation dataset. Promising outcomes in the internal and external test datasets were revealed with MAEs of 0.1524 and 0.1602 logMAR and RMSEs of 0.2612 and 0.2020 logMAR, respectively. Considerable sensitivity and precision were achieved in the BCVA < 0.30 logMAR group, with 90.32% and 75.34% in the internal test dataset and 81.75% and 89.60% in the external test dataset, respectively. The percentages of prediction errors within ±0.30 logMAR were 89.01% in the internal and 88.82% in the external test dataset.

Conclusion: Promising prediction outcomes of postoperative BCVA were achieved by the novel OCT-trained deep learning model, which will be helpful for the surgical planning of highly myopic cataract patients.
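For reference, the two reported error metrics can be computed from predicted and ground-truth BCVA values (in logMAR) as in this minimal sketch; the arrays below are illustrative only, not the study's data.

import numpy as np

# Illustrative values only: predicted vs. ground-truth postoperative BCVA in logMAR.
y_true = np.array([0.10, 0.30, 0.52, 0.00, 0.22])
y_pred = np.array([0.15, 0.25, 0.60, 0.05, 0.30])

mae = np.mean(np.abs(y_pred - y_true))            # mean absolute error
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # root mean square error
print(f"MAE = {mae:.4f} logMAR, RMSE = {rmse:.4f} logMAR")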
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set is a test set generated to test the capabilities of engines for Optical Character Recognition and Handwritten Text Recognition.
The data set consists of extracts of the minutes of the Swiss Federal Council. The single lines have been randomly chosen from about 150'000 pages of handwritten minutes.
For each line, an image file is being provided by the Swiss Federal Archives/Schweizerisches Bundesarchiv [images.tar.gz]. Please cite the images as follows: Excerpts of BAR E1004.1#1000/9#1-215. The images are in the public domain.
A PageXML file [page.zip] accompanies every image file and indicates the transcription and coordinates of the line.
For PageXML see Pletschacher, S., & Antonacopoulos, A. (2010). The PAGE (Page Analysis and Ground-Truth Elements) Format Framework. 257–260. https://doi.org/10.1109/ICPR.2010.72.
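As an illustration, the transcription and line coordinates can be pulled out of a PageXML file with the Python standard library. This is a sketch only: the element names (TextLine, Coords, Unicode) follow the PAGE format, the file name is hypothetical, and namespaces are matched loosely since the schema version may vary.

import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace so elements can be matched by local name."""
    return tag.rsplit("}", 1)[-1]

tree = ET.parse("example_page.xml")        # hypothetical file from page.zip
for elem in tree.iter():
    if local(elem.tag) == "TextLine":
        coords = next((c.get("points") for c in elem if local(c.tag) == "Coords"), None)
        text = next((u.text for u in elem.iter() if local(u.tag) == "Unicode"), None)
        print(coords, text)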
Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
This project aims to predict bike rental demand using machine learning, specifically focusing on hourly predictions based on various environmental and temporal features. The dataset used for this analysis is the publicly available "Seoul Bike Sharing Demand" dataset, which includes factors like temperature, humidity, wind speed, and historical rental counts.
Key elements of the project:
Model: A trained XGBoost regression model that predicts bike rental counts for each hour, given the relevant environmental and temporal features (a minimal training sketch is shown after this list). This model is built to optimize fleet distribution for bike-sharing companies, helping them efficiently manage resources and reduce operational costs.
Visualization: A plot that visualizes the comparison between the ground truth (actual bike rentals) and the predictions made by the XGBoost model. The plot provides insights into how well the model captures patterns in bike rental demand and the accuracy of its forecasts.
Predictions (CSV): A CSV file containing the model's predictions for the test set. The CSV includes the predicted bike rental counts, along with relevant features such as date, hour, temperature, and humidity. This dataset is intended for evaluating the performance of the trained model and for further analysis.
CodeMeta: A metadata file that provides essential information about the project's code, ensuring it adheres to best practices for reproducibility and transparency in computational research.
FAIR4ML: The project follows the FAIR4ML principles to ensure that the machine learning models, datasets, and results are Findable, Accessible, Interoperable, and Reproducible. All code, models, and results are made publicly available for further research and re-use.
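A minimal sketch of the modeling step referenced above, assuming the Seoul Bike Sharing Demand CSV has been loaded with columns for hour, temperature, humidity, wind speed, and the rental count as target; the file name and column names are assumptions and may differ in the released data.

import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical file and column names; adjust to the actual Seoul Bike Sharing Demand file.
df = pd.read_csv("SeoulBikeData.csv")
features = ["Hour", "Temperature", "Humidity", "Wind speed"]
target = "Rented Bike Count"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0
)
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))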
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(1) Background:
Situated in the domain of urban sound scene classification by humans and machines, the research in this project will be a first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. The acoustic distinction between outdoor and indoor scenes is an active research field and can be automated with some success. A much subtler difference is the change in the indoor soundscape induced by an open window. Being able to determine this, however, would allow applications in warning systems and be a prerequisite for an app-based urban sound mapping project.
Acoustic detection requires neither line of sight nor sensors at the window frame, nor knowledge of the number of windows or their size. The task, however, varies substantially in difficulty with the amount of sound inside and outside. From the point of view of machine classification, the lack of specificity is the most problematic aspect: very few sounds, if any, can be assumed to originate exclusively from outside and to be present at all times to aid automatic detection. The required generalisation ability, however, can be assumed for humans, who might also use very subtle cues in the change of reverberations.
(2) Aims
The aims are
(a) to determine the degree of reliability with which an open window can be recognised by humans and machines under varying circumstances based only on acoustic cues;
(b) to investigate whether the findings for humans and machines can inform each other and can be used for further application-related research, e.g., window noise cancellation.
(3) Method:
(a) Dataset acquisition:
A recording kit consisting of a dedicated laptop and microphone will be given to volunteers. Custom-programmed software will remind the user to specify the window state (establishing the so-called ground truth).
(b) Perception experiments:
Thirty participants will judge whether in the recorded clips a window is open or closed. After an extended familiarisation phase, they will proceed through two testing phases: In the first phase, all clips will originate from the recording locations with which the participants have already familiarised themselves, in the second they will judge clips from locations they haven't been exposed to before (partial data sets used for the familiar/unfamiliar conditions will be counterbalanced across participants).
(c) Machine recognition:
We will develop a machine learning system using state-of-the-art deep learning methods (artificial neural networks with multiple layers). To encourage other researchers to also take up this research, we will organise a machine learning challenge. In the challenge, a training data set including correct labels (ground truth) and a test set without the labels are provided. Researchers from academia and industry across the world will develop their own systems and send their classification results on the test set to the organisers to evaluate and publish online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figure 6. Extending sharp-wave ripple detection to non-human primates. c) Significant differences between SWR recorded in mice and monkey. d) The best model of each architecture trained in mouse data, and the best filter configuration for mouse data, were applied to detect SWRs on the macaque data. We evaluated all models by computing F1-score against the ground truth (GT). Note relatively good results from non-retrained ML models and filter. e) Results of model re-training using macaque data. Data were split into a training and validation dataset (50% and 20% respectively), used to train the ML models; and a test set (30%), used to compute the F1 (left panel). Filter was not re-trained. f) F1-scores for the maximal performance of each model before and after re-training.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genetic variation on the non-recombining portion of the Y chromosome contains information about the ancestry of male lineages. Because of their low rate of mutation, single nucleotide polymorphisms (SNPs) are the markers of choice for unambiguously classifying Y chromosomes into related sets of lineages known as haplogroups, which tend to show geographic structure in many parts of the world. However, performing the large number of SNP genotyping tests needed to properly infer haplogroup status is expensive and time consuming. A novel alternative for assigning a sampled Y chromosome to a haplogroup is presented here. We show that by applying modern machine-learning algorithms we can infer with high accuracy the proper Y chromosome haplogroup of a sample by scoring a relatively small number of Y-linked short tandem repeats (STRs). Learning is based on a diverse ground-truth data set comprising pairs of SNP test results (haplogroup) and corresponding STR scores. We apply several independent machine-learning methods in tandem to learn formal classification functions. The result is an integrated high-throughput analysis system that automatically classifies large numbers of samples into haplogroups in a cost-effective and accurate manner.
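As an illustration of the described approach, a supervised classifier can be trained to map STR profiles to haplogroups. This is a sketch only, with entirely hypothetical data: a random table of Y-STR repeat counts paired with SNP-derived haplogroup labels stands in for the ground-truth data set, and a random forest stands in for the ensemble of methods used in the study.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical ground-truth data: STR repeat counts (features) and haplogroup labels.
X = rng.integers(8, 35, size=(200, 17))          # e.g. 17 Y-STR loci per sample
y = rng.choice(["R1b", "I1", "E1b"], size=200)   # SNP-derived haplogroup labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy estimate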
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Deep learning neural networks are especially potent at dealing with structured data, such as images and volumes. Both modified LiviaNET and HyperDense-Net performed well at a prior competition segmenting 6-month-old infant magnetic resonance images, but neonatal cerebral tissue type identification is challenging given its uniquely inverted tissue contrasts. The current study aims to evaluate the two architectures to segment neonatal brain tissue types at term equivalent age.

Methods: Both networks were retrained over 24 pairs of neonatal T1 and T2 data from the Developing Human Connectome Project public data set and validated on another eight pairs against ground truth. We then reported the best-performing model from training and its performance by computing the Dice similarity coefficient (DSC) for each tissue type against eight test subjects.

Results: During the testing phase, among the segmentation approaches tested, the dual-modality HyperDense-Net achieved the best test mean DSC values, with statistically significant differences, obtaining 0.94/0.95/0.92 for the tissue types, and took 80 h to train and 10 min to segment, including preprocessing. The single-modality LiviaNET was better at processing T2-weighted images than T1-weighted images across all tissue types, achieving mean DSC values of 0.90/0.90/0.88 for gray matter, white matter, and cerebrospinal fluid, respectively, while requiring 30 h to train and 8 min to segment each brain, including preprocessing.

Discussion: Our evaluation demonstrates that both neural networks can segment neonatal brains, achieving previously reported performance. Both networks will be continuously retrained over an increasingly larger repertoire of neonatal brain data and be made available through the Canadian Neonatal Brain Platform to better serve the neonatal brain imaging research community.
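For reference, the Dice similarity coefficient used above can be computed from a predicted and a ground-truth binary mask as in this minimal sketch; the arrays below are illustrative only.

import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

# Illustrative 3D masks (e.g. one tissue class in a segmented volume).
pred = np.zeros((4, 4, 4), dtype=np.uint8); pred[1:3, 1:3, 1:3] = 1
truth = np.zeros((4, 4, 4), dtype=np.uint8); truth[1:4, 1:3, 1:3] = 1
print(f"DSC = {dice_coefficient(pred, truth):.3f}")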
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Precise segmentation of coronary arteries in non-contrast Computed Tomography (CT) scans plays an important role in the assessment of coronary artery disease, where it is the key component for evaluating the Calcium Score (Agatston et al. 1990). In the paper by Bujny et al. (2024), a deep-learning approach for high-precision segmentation of coronary arteries in non-contrast CT was proposed along with a novel method for generating Ground Truth (GT) test data (test-GT) via manual registration of high-resolution coronary tree models, obtained from contrast CT, with the non-contrast CT scans. In this dataset, we present the inferences of the neural network model together with the corresponding test-GT samples, based on 6 CT scans from the openly available OrCaScore dataset (Wolterink et al. 2016). The geometrical models included in the dataset can be used both for inspection of the proposed deep learning model and for testing of new non-contrast coronary vessel segmentation approaches. This is a unique opportunity since, to the best of our knowledge, manual generation of GT for non-contrast coronary artery segmentation has not been addressed so far due to the very challenging character of this particular segmentation task.
Methods
Manual Generation of test-GT
The geometric models of coronary arteries used for the evaluation of the proposed neural network model were generated according to the manual mesh-to-image registration process as described by Bujny et al. (2024). In this approach, the high-resolution coronary artery masks obtained based on contrast CT scans are manually aligned with the corresponding non-contrast CT images using tools available in the open-source 3D computer graphics software, Blender (https://www.blender.org/). To ease the manual alignment process, specialized add-ons for medical image processing such as Cardiac add-on for Blender of Graylight Imaging (https://graylight-imaging.com/3d-modelling/) can be used, as well. The STL models in this dataset were manually generated by a medical expert with 4 years of experience.
Segmentation of Coronary Arteries using a Deep Learning Model
For each of the cases presented in this dataset, we run an inference of an nnU-Net (Isensee et al. 2021) model trained according to the process described in our paper (Bujny et al. 2024). Since we use a standard nnU-Net, which utilizes a sliding-window approach for processing the CT scan, the context information within a patch is limited, which can lead to some false-positive detections. To mitigate this problem, we additionally post-process the inferences by eliminating small vessel fragments of less than 50 mm^3 volume and structures outside of the pericardium, which we segment using another nnU-Net model, SegTHOR (Lambert et al. 2020). The resulting geometric models are stored using the STL format and presented as green masks in the HTML reports with an embedded viewer based on the K3D-jupyter library (https://k3d-jupyter.org/).
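As an illustration of the small-fragment post-processing step, connected components below the 50 mm^3 threshold can be removed with scipy.ndimage. This is a sketch only, not the authors' code; it assumes a binary vessel mask as a NumPy array and an isotropic voxel volume given in mm^3.

import numpy as np
from scipy import ndimage

def remove_small_fragments(mask, voxel_volume_mm3, min_volume_mm3=50.0):
    """Drop connected components of a binary mask smaller than min_volume_mm3."""
    labeled, n_components = ndimage.label(mask)
    # Volume of each component in mm^3 (label 0 is background and is skipped).
    sizes = ndimage.sum(mask, labeled, index=np.arange(1, n_components + 1)) * voxel_volume_mm3
    keep = np.zeros(n_components + 1, dtype=bool)
    keep[1:] = sizes >= min_volume_mm3
    return keep[labeled]

# Illustrative use with a random mask and 0.5 mm isotropic voxels.
mask = np.random.default_rng(0).random((64, 64, 64)) > 0.995
cleaned = remove_small_fragments(mask, voxel_volume_mm3=0.5 ** 3)
print(mask.sum(), "->", cleaned.sum(), "foreground voxels")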
Dataset organization
The root folder contains 6 folders whose names correspond to the CT scans from the OrCaScore dataset (Wolterink et al. 2016). In each of the folders, there are the following 4 files available:
‘manualGT_rater1.stl’ – high-resolution STL model of coronary arteries obtained via manual alignment of the geometric model segmented in contrast CT with the corresponding non-contrast CT scan by the first rater. A sample belonging to the test-GT set (Bujny et al. 2024).
‘manualGT_rater2.stl’ – corresponding test-GT sample by the second rater.
‘ML.stl’ – post-processed inference of the nnU-Net ML model in the STL format.
‘report.html’ – interactive HTML report consisting of a manually-aligned test-GT sample (red mask), the ML segmentation based on the non-contrast CT scan (green mask), and selected slices of the non-contrast CT scan. The reports contain the relevant information related to the scanning device and present the main segmentation quality metrics for the ML model inference.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a global resource for machine learning applications in mining area detection and semantic segmentation on satellite imagery. It contains Sentinel-2 satellite images and corresponding mining area masks + bounding boxes for 1,210 sites worldwide. Ground-truth masks are derived from Maus et al. (2022) and Tang et al. (2023), and validated through manual verification to ensure accurate alignment with Sentinel-2 imagery from specific timestamps.
The dataset includes three mask variants:
Each tile corresponds to a 2048x2048 pixel Sentinel-2 image, with metadata on mine type (surface, placer, underground, brine & evaporation) and scale (artisanal, industrial). For convenience, the preferred mask dataset is already split into training (75%), validation (15%), and test (10%) sets.
Furthermore, dataset quality was validated by re-validating test set tiles manually and correcting any mismatches between mining polygons and visually observed true mining area in the images, resulting in the following estimated quality metrics:
| Metric | Combined | Maus | Tang |
|-----------|----------|-------|-------|
| Accuracy | 99.78 | 99.74 | 99.83 |
| Precision | 99.22 | 99.20 | 99.24 |
| Recall | 95.71 | 96.34 | 95.10 |
Note that the dataset does not contain the Sentinel-2 images themselves but contains a reference to specific Sentinel-2 images. Thus, for any ML applications, the images must be persisted first. For example, Sentinel-2 imagery is available from Microsoft's Planetary Computer and filterable via STAC API: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a. Additionally, the temporal specificity of the data allows integration with other imagery sources from the indicated timestamp, such as Landsat or other high-resolution imagery.
Source code used to generate this dataset and to use it for ML model training is available at https://github.com/SimonJasansky/mine-segmentation. It includes useful Python scripts, e.g. to download Sentinel-2 images via STAC API, or to divide tile images (2048x2048px) into smaller chips (e.g. 512x512px).
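For example, dividing a 2048x2048 tile into 512x512 chips can be done with plain NumPy reshaping. This is a sketch only, independent of the scripts in the repository; it assumes a single-band image or mask loaded as a height x width array.

import numpy as np

def split_into_chips(tile, chip_size=512):
    """Split a (H, W) tile into non-overlapping (chip_size, chip_size) chips."""
    h, w = tile.shape
    assert h % chip_size == 0 and w % chip_size == 0
    return (
        tile.reshape(h // chip_size, chip_size, w // chip_size, chip_size)
        .swapaxes(1, 2)
        .reshape(-1, chip_size, chip_size)
    )

tile = np.zeros((2048, 2048), dtype=np.uint16)   # placeholder for a Sentinel-2 band or mask
print(split_into_chips(tile).shape)              # (16, 512, 512)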
A database schema, a schematic depiction of the dataset generation process, and a map of the global distribution of tiles are provided in the accompanying images.