Titanic Dataset Description

Overview

The data is divided into two groups:

- Training set (train.csv): Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to predict survival based on “features” like gender and class. Feature engineering can also be applied to create new features.
- Test set (test.csv): Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.
Additionally, gender_submission.csv is provided as an example submission file, containing predictions based on the assumption that all and only female passengers survive.
Data Dictionary

| Variable | Definition | Key |
|------------|------------------------------------------|-------------------------------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard the Titanic | |
| parch | # of parents/children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: Proxy for socio-economic status (SES):
1st = Upper
2nd = Middle
3rd = Lower
age:
Fractional if less than 1 year.
Estimated ages are represented in the form xx.5.
sibsp: Defines family relations as:
Sibling: Brother, sister, stepbrother, stepsister.
Spouse: Husband, wife (excluding mistresses and fiancés).
parch: Defines family relations as:
Parent: Mother, father.
Child: Daughter, son, stepdaughter, stepson.
Some children traveled only with a nanny, so parch = 0 for them.
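As an illustration of how these files fit together, the following is a minimal sketch using pandas and scikit-learn; it assumes train.csv and test.csv are in the working directory and uses the capitalized column names of the Kaggle CSV files (Survived, Pclass, Sex, PassengerId), which correspond to the lowercase names in the data dictionary above.

import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex"]
X_train = pd.get_dummies(train[features])   # one-hot encode Sex; Pclass stays numeric
y_train = train["Survived"]
X_test = pd.get_dummies(test[features])

model = LogisticRegression().fit(X_train, y_train)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)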
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The REalistic Single Image DEhazing (RESIDE) dataset is a large-scale benchmark dataset for single image dehazing. It consists of 1000 hazy images and their corresponding ground truth clear images. The dataset is divided into a training set of 800 images and a test set of 200 images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File descriptions
Data fields
For more information about the CURE-OR dataset, please refer to the webpage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Dataset for "Exploring the viability of a machine learning based multimodel for quantitative precipitation forecast post-processing"
Description:
This dataset supports the study presented in the paper "Exploring the viability of a machine learning based multimodel for quantitative precipitation forecast post-processing". The work focuses on improving quantitative precipitation forecasts over the Piedmont and Aosta Valley regions in Italy by blending outputs from four Numerical Weather Prediction (NWP) models using machine learning architectures, namely Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) in the form of U-Net and Residual U-Net, with NWIOI observational data (Turco et al., 2013) used as reference.
Observational data from NWIOI serve as the ground truth for model training. The dataset contains 406 gridded precipitation events from 2018 to 2022.
Dataset contents:
- obs.zip: NWIOI observed precipitation data (.csv format, one file per event)
- subsets.zip: Event dates for 10 different training-validation-test sets, retrieved with 10-fold cross-validation (.csv format, one file per set and per split)
- domain_mask.csv: Binary mask (1 for grid points in the study area, 0 otherwise)
- allevents_dates_zenodo.csv: Summary statistics and classification of all events by intensity and nature, used for subset creation with 10-fold cross-validation

Citations:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original source from Kaggle: https://www.kaggle.com/c/trackml-particle-identification/data
The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN. The goal of the tracking machine learning challenge is to group the recorded measurements or hits for each event into tracks, sets of hits that belong to the same initial particle. A solution must uniquely associate each hit to one track. The training dataset contains the recorded hits, their ground truth counterpart and their association to particles, and the initial parameters of those particles. The test dataset contains only the recorded hits.
Once unzipped, the dataset is provided as a set of plain .csv files. Each event has four associated files that contain hits, hit cells, particles, and the ground truth association between them. The common prefix, e.g. event000000010, is always event followed by 9 digits.
event000000000-hits.csv
event000000000-cells.csv
event000000000-particles.csv
event000000000-truth.csv
event000000001-hits.csv
event000000001-cells.csv
event000000001-particles.csv
event000000001-truth.csv
Event hits
The hits file contains the following values for each hit/entry:
hit_id: numerical identifier of the hit inside the event.
x, y, z: measured x, y, z position (in millimeter) of the hit in global coordinates.
volume_id: numerical identifier of the detector group.
layer_id: numerical identifier of the detector layer inside the group.
module_id: numerical identifier of the detector module inside the layer.
The volume/layer/module id could in principle be deduced from x, y, z. They are given here to simplify detector-specific data handling.
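As a small illustration (not part of the official kit), the per-event files can be read with pandas; the file names follow the naming scheme above and are assumed to sit in the current directory.

import pandas as pd

prefix = "event000000000"   # any event prefix from the listing above
hits = pd.read_csv(f"{prefix}-hits.csv")
cells = pd.read_csv(f"{prefix}-cells.csv")
particles = pd.read_csv(f"{prefix}-particles.csv")
truth = pd.read_csv(f"{prefix}-truth.csv")

# Attach the true particle_id to each hit via hit_id
hits_with_truth = hits.merge(truth[["hit_id", "particle_id"]], on="hit_id")
print(hits_with_truth.head())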
Event truth
The truth file contains the mapping between hits and generating particles and the true particle state at each measured hit. Each entry maps one hit to one particle.
hit_id: numerical identifier of the hit as defined in the hits file.
particle_id: numerical identifier of the generating particle as defined in the particles file. A value of 0 means that the hit did not originate from a reconstructible particle, but e.g. from detector noise.
tx, ty, tz: true intersection point in global coordinates (in millimeters) between the particle trajectory and the sensitive surface.
tpx, tpy, tpz: true particle momentum (in GeV/c) in the global coordinate system at the intersection point. The corresponding vector is tangent to the particle trajectory at the intersection point.
weight: per-hit weight used for the scoring metric; the total sum of weights within one event equals one.
Event particles
The particles file contains the following values for each particle/entry:
particle_id: numerical identifier of the particle inside the event.
vx, vy, vz: initial position or vertex (in millimeters) in global coordinates.
px, py, pz: initial momentum (in GeV/c) along each global axis.
q: particle charge (as multiple of the absolute electron charge).
nhits: number of hits generated by this particle.
All entries contain the generated information or ground truth.
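For example, derived kinematic quantities such as the transverse momentum can be computed directly from these columns. A minimal sketch, assuming the particles file has been loaded into a pandas DataFrame as in the sketch above:

import numpy as np

# particles is the DataFrame read from <prefix>-particles.csv
particles["pt"] = np.hypot(particles["px"], particles["py"])      # transverse momentum (GeV/c)
particles["p"] = np.sqrt(particles["px"]**2 + particles["py"]**2 + particles["pz"]**2)
print(particles[["particle_id", "pt", "p", "q", "nhits"]].head())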
Event hit cells
The cells file contains the constituent active detector cells that comprise each hit. The cells can be used to refine the hit-to-track association. A cell is the smallest granularity inside each detector module, much like a pixel on a screen, except that depending on the volume_id a cell can be a square or a long rectangle. It is identified by two channel identifiers that are unique within each detector module and encode the position, much like the column/row numbers of a matrix. A cell can also provide signal information that the detector module has recorded in addition to the position. Depending on the detector type, only one of the channel identifiers may be valid (e.g. for the strip detectors), and the value might have a different resolution.
hit_id: numerical identifier of the hit as defined in the hits file.
ch0, ch1: channel identifier/coordinates unique within one module.
value: signal value information, e.g. how much charge a particle has deposited.
Additional detector geometry information
The detector is built from silicon slabs (or modules, rectangular or trapezoidal), arranged in cylinders and disks, which measure the position (or hits) of the particles that cross them. The detector modules are organized into detector groups or volumes identified by a volume id. Inside a volume they are further grouped into layers identified by a layer id. Each layer can contain an arbitrary number of detector modules, the smallest geometrically distinct detector objects, each identified by a module_id. Within each group, detector modules are of the same type and have, for example, the same granularity. All simulated detector modules are so-called semiconductor sensors built from thin silicon sensor chips. Each module can be represented by a two-dimensional, planar, bounded sensitive surface. These sensitive surfaces are subdivided into regular grids that define the detector cells, the smallest granularity within the detector.
Each module has a different position and orientation described in the detectors file. A local, right-handed coordinate system is defined on each sensitive surface such that the first two coordinates u and v are on the sensitive surface and the third coordinate w is normal to the surface. The orientation and position are defined by the following transformation
pos_xyz = rotation_matrix * pos_uvw + translation
which transforms a position described in local coordinates u, v, w into the equivalent position x, y, z in global coordinates using a rotation matrix and a translation vector (cx, cy, cz).
volume_id: numerical identifier of the detector group.
layer_id: numerical identifier of the detector layer inside the group.
module_id: numerical identifier of the detector module inside the layer.
cx, cy, cz: position of the local origin in the global coordinate system (in millimeter).
rot_xu, rot_xv, rot_xw, rot_yu, ...: components of the rotation matrix to rotate from local u,v,w to global x,y,z coordinates.
module_t: half thickness of the detector module (in millimeter).
module_minhu, module_maxhu: the minimum/maximum half-length of the module boundary along the local u direction (in millimeter).
module_hv: the half-length of the module boundary along the local v direction (in millimeter).
pitch_u, pitch_v: the size of detector cells along the local u and v direction (in millimeter).
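As an illustration of the transformation above, the following sketch maps a local position (u, v, w) on one module to global coordinates. It assumes the geometry file has been read into a pandas DataFrame and that the remaining rotation components (rot_zu, rot_zv, rot_zw) follow the naming pattern indicated above; the file name detectors.csv is an assumption and should be adjusted to the actual geometry file.

import numpy as np
import pandas as pd

detectors = pd.read_csv("detectors.csv")   # file name assumed

def local_to_global(row, u, v, w):
    """Map a local position (u, v, w) on one module to global (x, y, z)."""
    rotation = np.array([
        [row["rot_xu"], row["rot_xv"], row["rot_xw"]],
        [row["rot_yu"], row["rot_yv"], row["rot_yw"]],
        [row["rot_zu"], row["rot_zv"], row["rot_zw"]],
    ])
    translation = np.array([row["cx"], row["cy"], row["cz"]])
    return rotation @ np.array([u, v, w]) + translation

module = detectors.iloc[0]
print(local_to_global(module, 0.0, 0.0, 0.0))   # the local origin maps to (cx, cy, cz)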
There are two different module shapes in the detector, rectangular and trapezoidal. The pixel detector (volume_id = 7, 8, 9) is fully built from rectangular modules, and so are the cylindrical barrels in volume_id = 13, 17. The remaining layers are made out of disks that need trapezoidal shapes to cover the full disk.
The WIDER FACE dataset is a face detection benchmark dataset whose images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion, as depicted in the sample images. The WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% of the data as training, validation and testing sets. We adopt the same evaluation metric employed in the PASCAL VOC dataset. Similar to the MALF and Caltech datasets, we do not release bounding box ground truth for the test images. Users are required to submit final prediction files, which we shall proceed to evaluate.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wider_face', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wider_face-0.1.0.png
Overview
The RAD-ChestCT dataset is a large medical imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset. This dataset is of significant interest to the machine learning and medical imaging research communities.
Papers
The following published paper includes a description of how the RAD-ChestCT dataset was created: Draelos et al., "Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes," Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857 https://pubmed.ncbi.nlm.nih.gov/33129142/
Two additional papers leveraging the RAD-ChestCT dataset are available as preprints:
"Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks" (https://arxiv.org/abs/2011.08891)
"Explainable multiple abnormality classification of chest CT volumes with deep learning" (https://arxiv.org/abs/2111.12215)
Details about the files included in this data release
Metadata Files (4)
CT_Scan_Metadata_Complete_35747.csv: includes metadata about the whole dataset, with information extracted from DICOM headers.
Extrema_5747.csv: includes coordinates for lung bounding boxes for the whole dataset. Coordinates were derived computationally using a morphological image processing lung segmentation pipeline.
Indications_35747.csv: includes scan indications for the whole dataset. Indications were extracted from the free-text reports.
Summary_3630.csv: includes a listing of the 3,630 scans that are part of this repository.
Label Files (3)
The label files contain abnormality x location labels for the 3,630 shared CT volumes. Each CT volume is annotated with a matrix of 84 abnormality labels x 52 location labels. Labels were extracted from the free text reports using the Sentence Analysis for Radiology Label Extraction (SARLE) framework. For each CT scan, the label matrix has been flattened and the abnormalities and locations are separated by an asterisk in the CSV column headers (e.g. "mass*liver"). The labels can be used as the ground truth when training computer vision classifiers on the CT volumes. Label files include: imgtrain_Abnormality_and_Location_Labels.csv (for the training set)
imgvalid_Abnormality_and_Location_Labels.csv (for the validation set)
imgtest_Abnormality_and_Location_Labels.csv (for the test set)
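As an illustration of the flattened label layout described above, the asterisk-separated column headers can be unflattened back into an abnormality x location matrix per scan. This is a sketch only: the file name follows the listing above, and it assumes the first column of the CSV identifies the scan, which should be checked against the actual files.

import pandas as pd

labels = pd.read_csv("imgtrain_Abnormality_and_Location_Labels.csv", index_col=0)

# Split "abnormality*location" headers back into a two-level index,
# so each row can be viewed as an abnormality x location matrix.
labels.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split("*")) for col in labels.columns],
    names=["abnormality", "location"],
)
first_scan = labels.iloc[0].unstack("location")   # 84 x 52 matrix for one CT volume
print(first_scan.shape)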
CT Volume Files (3,630)
Each CT scan is provided as a compressed 3D numpy array (npz format). The CT scans can be read using the Python package numpy, version 1.14.5 and above.
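For example, a single volume can be loaded as follows. This is a minimal sketch: the file name is hypothetical, and the key under which the array is stored is looked up from the archive rather than assumed.

import numpy as np

archive = np.load("example_scan.npz")      # hypothetical file name
key = archive.files[0]                     # the volume is stored under a single array key
volume = archive[key]                      # 3D CT volume array
print(volume.shape, volume.dtype)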
Related Code
Code related to RAD-ChestCT is publicly available on GitHub at https://github.com/rachellea.
Repositories of interest include:
https://github.com/rachellea/ct-net-models contains PyTorch code to load the RAD-ChestCT dataset and train convolutional neural network models for multiple abnormality prediction from whole CT volumes.
https://github.com/rachellea/ct-volume-preprocessing contains an end-to-end Python framework to convert CT scans from DICOM to numpy format. This code was used to prepare the RAD-ChestCT volumes.
https://github.com/rachellea/sarle-labeler contains the Python implementation of the SARLE label extraction framework used to generate the abnormality and location label matrix from the free text reports. SARLE has minimal dependencies and the abnormality and location vocabulary terms can be easily modified to adapt SARLE to different radiologic modalities, abnormalities, and anatomical locations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Solution for Beginner's Guide’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harunshimanto/titanic-solution-for-beginners-guide on 14 February 2022.
--- Dataset description provided by original source is as follows ---
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
|------------|------------------------------------------|-------------------------------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
--- Original source retains full ownership of the source dataset ---
Original source from Codalab: https://competitions.codalab.org/competitions/20112
The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN. The file layout and field definitions are identical to those of the TrackML dataset described above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multi-Domain Outlier Detection Dataset contains datasets for conducting outlier detection experiments for four different application domains:
Each dataset contains a "fit" dataset (used for fitting or training outlier detection models), a "score" dataset (used for scoring samples used to evaluate model performance, analogous to test set), and a label dataset (indicates whether samples in the score dataset are considered outliers or not in the domain of each dataset).
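As an illustration of this fit/score/label structure, the following is a sketch only: the file names and array format are hypothetical, and scikit-learn's IsolationForest stands in for whatever outlier detector is actually used. A model is fit on the "fit" data, scores the "score" data, and is evaluated against the labels.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Hypothetical file names; each domain provides fit, score, and label data.
X_fit = np.load("fit.npy")
X_score = np.load("score.npy")
y_label = np.load("labels.npy")          # 1 = outlier, 0 = inlier

detector = IsolationForest(random_state=0).fit(X_fit)
# score_samples returns higher values for inliers, so negate it for an outlier score.
outlier_score = -detector.score_samples(X_score)
print("ROC AUC:", roc_auc_score(y_label, outlier_score))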
To read more about the datasets and how they are used for outlier detection, or to cite this dataset in your own work, please see the following citation:
Kerner, H. R., Rebbapragada, U., Wagstaff, K. L., Lu, S., Dubayah, B., Huff, E., Lee, J., Raman, V., and Kulshrestha, S. (2022). Domain-agnostic Outlier Ranking Algorithms (DORA)-A Configurable Pipeline for Facilitating Outlier Detection in Scientific Datasets. Under review for Frontiers in Astronomy and Space Sciences.
Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). More than 50% of lesions are confirmed through histopathology (histo); the ground truth for the rest of the cases is either follow-up examination (follow_up), expert consensus (consensus), or confirmation by in-vivo confocal microscopy (confocal). The dataset includes lesions with multiple images, which can be tracked by the lesion_id column within the HAM10000_metadata file.

Due to upload size limitations, images are stored in two files:
- HAM10000_images_part1.zip (5000 JPEG files)
- HAM10000_images_part2.zip (5015 JPEG files)

Additional data for evaluation purposes

The HAM10000 dataset served as the training set for the ISIC 2018 challenge (Task 3). The test-set images are available herein as ISIC2018_Task3_Test_Images.zip (1511 images); the official validation set is available through the challenge website https://challenge2018.isic-archive.com/. The ISIC Archive also provides a "Live challenge" submission site for continuous evaluation of automated classifiers on the official validation and test sets.

Comparison to physicians

Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: Tschandl P. et al., Lancet Oncol 2019.

Human-computer collaboration

The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: Tschandl P. et al., Nature Medicine 2020. The following corresponding metadata is available herein:
- ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv: Human ratings for test images with and without interaction with a ResNet34 CNN (malignancy probability, multi-class probability, CBIR) or human-crowd multi-class probabilities. This data was collected for and analyzed in Tschandl P. et al., Nature Medicine 2020, therefore please refer to this publication when using the data.
- HAM10000_segmentations_lesion_tschandl.zip: To evaluate regions of CNN activations in Tschandl P. et al., Nature Medicine 2020 (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. Masks were initialized with the segmentation network described by Tschandl et al., Computers in Biology and Medicine 2019, and subsequently verified, corrected or replaced via the free-hand selection tool in FIJI.
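For example, the metadata file can be used to group images by lesion before splitting the data, so that multiple images of the same lesion do not end up in both training and validation sets. This is a minimal sketch: the lesion_id column follows the description above, while the exact metadata file name and delimiter are assumptions to check against the release.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

meta = pd.read_csv("HAM10000_metadata")     # may be HAM10000_metadata.csv depending on the release

# Group-aware split: all images of one lesion_id stay in the same partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(meta, groups=meta["lesion_id"]))
print(len(train_idx), "training images,", len(valid_idx), "validation images")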
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Due to the complicated and variable fundus status of highly myopic eyes, their visual benefit from cataract surgery remains hard to determine preoperatively. We therefore aimed to develop optical coherence tomography (OCT)-based deep learning algorithms to predict the postoperative visual acuity of highly myopic eyes after cataract surgery.

Materials and Methods: The internal dataset consisted of 1,415 highly myopic eyes having cataract surgeries in our hospital. Another external dataset consisted of 161 highly myopic eyes from Heping Eye Hospital. Preoperative macular OCT images were set as the only feature. The best corrected visual acuity (BCVA) at 4 weeks after surgery was set as the ground truth. Five different deep learning algorithms, namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and Inception-v3, were used to develop the model aiming at predicting the postoperative BCVA, and an ensemble learning model was further developed. The model was further evaluated in the internal and external test datasets.

Results: The ensemble learning model showed the lowest mean absolute error (MAE) of 0.1566 logMAR and the lowest root mean square error (RMSE) of 0.2433 logMAR in the validation dataset. Promising outcomes in the internal and external test datasets were revealed with MAEs of 0.1524 and 0.1602 logMAR and RMSEs of 0.2612 and 0.2020 logMAR, respectively. Considerable sensitivity and precision were achieved in the BCVA < 0.30 logMAR group, with 90.32% and 75.34% in the internal test dataset and 81.75% and 89.60% in the external test dataset, respectively. The percentages of prediction errors within ±0.30 logMAR were 89.01% in the internal and 88.82% in the external test dataset.

Conclusion: Promising prediction outcomes of postoperative BCVA were achieved by the novel OCT-trained deep learning model, which will be helpful for the surgical planning of highly myopic cataract patients.
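For reference, the two reported error metrics can be computed from predicted and ground-truth BCVA values (in logMAR) as in this minimal sketch; the arrays below are illustrative only, not the study's data.

import numpy as np

# Illustrative values only: predicted vs. ground-truth postoperative BCVA in logMAR.
y_true = np.array([0.10, 0.30, 0.52, 0.00, 0.22])
y_pred = np.array([0.15, 0.25, 0.60, 0.05, 0.30])

mae = np.mean(np.abs(y_pred - y_true))            # mean absolute error
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # root mean square error
print(f"MAE = {mae:.4f} logMAR, RMSE = {rmse:.4f} logMAR")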
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set is a test set generated to test the capabilities of engines for Optical Character Recognition and Handwritten Text Recognition.
The data set consists of extracts of the minutes of the Swiss Federal Council. The single lines have been randomly chosen from about 150'000 pages of handwritten minutes.
For each line, an image file is being provided by the Swiss Federal Archives/Schweizerisches Bundesarchiv [images.tar.gz]. Please cite the images as follows: Excerpts of BAR E1004.1#1000/9#1-215. The images are in the public domain.
A PageXML file [page.zip] accompanies every image file and indicates the transcription and coordinates of the line.
For PageXML see Pletschacher, S., & Antonacopoulos, A. (2010). The PAGE (Page Analysis and Ground-Truth Elements) Format Framework. 257–260. https://doi.org/10.1109/ICPR.2010.72.
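As an illustration, the transcription and line coordinates can be pulled out of a PageXML file with the Python standard library. This is a sketch only: the element names (TextLine, Coords, Unicode) follow the PAGE format, the file name is hypothetical, and namespaces are matched loosely since the schema version may vary.

import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace so elements can be matched by local name."""
    return tag.rsplit("}", 1)[-1]

tree = ET.parse("example_page.xml")        # hypothetical file from page.zip
for elem in tree.iter():
    if local(elem.tag) == "TextLine":
        coords = next((c.get("points") for c in elem if local(c.tag) == "Coords"), None)
        text = next((u.text for u in elem.iter() if local(u.tag) == "Unicode"), None)
        print(coords, text)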
Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
This project aims to predict bike rental demand using machine learning, specifically focusing on hourly predictions based on various environmental and temporal features. The dataset used for this analysis is the publicly available "Seoul Bike Sharing Demand" dataset, which includes factors like temperature, humidity, wind speed, and historical rental counts.
Key elements of the project:
Model: A trained XGBoost regression model that predicts bike rental counts for each hour, given the relevant environmental and temporal features (a minimal training sketch is shown after this list). This model is built to optimize fleet distribution for bike-sharing companies, helping them efficiently manage resources and reduce operational costs.
Visualization: A plot that visualizes the comparison between the ground truth (actual bike rentals) and the predictions made by the XGBoost model. The plot provides insights into how well the model captures patterns in bike rental demand and the accuracy of its forecasts.
Predictions (CSV): A CSV file containing the model's predictions for the test set. The CSV includes the predicted bike rental counts, along with relevant features such as date, hour, temperature, and humidity. This dataset is intended for evaluating the performance of the trained model and for further analysis.
CodeMeta: A metadata file that provides essential information about the project's code, ensuring it adheres to best practices for reproducibility and transparency in computational research.
FAIR4ML: The project follows the FAIR4ML principles to ensure that the machine learning models, datasets, and results are Findable, Accessible, Interoperable, and Reproducible. All code, models, and results are made publicly available for further research and re-use.
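A minimal sketch of the modeling step referenced above, assuming the Seoul Bike Sharing Demand CSV has been loaded with columns for hour, temperature, humidity, wind speed, and the rental count as target; the file name and column names are assumptions and may differ in the released data.

import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical file and column names; adjust to the actual Seoul Bike Sharing Demand file.
df = pd.read_csv("SeoulBikeData.csv")
features = ["Hour", "Temperature", "Humidity", "Wind speed"]
target = "Rented Bike Count"

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0
)
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))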
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(1) Background:
Situated in the domain of urban sound scene classification by humans and machines, the research in this project will be a first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. The acoustic distinction between outdoor and indoor scenes is an active research field and can be automated with some success. A much subtler difference is the change in the indoor soundscape induced by an open window. Being able to determine this, however, would allow applications in warning systems and be a prerequisite for an app-based urban sound mapping project.
Acoustic detection requires neither line of sight nor sensors at the window frame, nor knowledge of the number of windows or their size. The task, however, varies substantially in difficulty with the amount of sound inside and outside. From the point of view of machine classification, the lack of specificity is the most problematic aspect: very few sounds, if any, can be assumed to originate exclusively from outside and to be present at all times to aid automatic detection. The required generalisation ability, however, can be assumed for humans, who might also use very subtle cues in the change of reverberations.
(2) Aims
The aims are
(a) to determine the degree of reliability with which an open window can be recognised by humans and machines under varying circumstances based only on acoustic cues;
(b) to investigate whether the findings for humans and machines can inform each other and can be used for further application-related research, e.g., window noise cancellation.
(3) Method:
(a) Dataset acquisition:
A recording kit consisting of a dedicated laptop and microphone will be given to volunteers. Custom-programmed software will remind the user to specify the window state (establishing the so-called ground truth).
(b) Perception experiments:
Thirty participants will judge whether in the recorded clips a window is open or closed. After an extended familiarisation phase, they will proceed through two testing phases: In the first phase, all clips will originate from the recording locations with which the participants have already familiarised themselves, in the second they will judge clips from locations they haven't been exposed to before (partial data sets used for the familiar/unfamiliar conditions will be counterbalanced across participants).
(c) Machine recognition:
We will develop a machine learning system using state-of-the-art deep learning methods (artificial neural networks with multiple layers). To encourage other researchers to also take up this research, we will organise a machine learning challenge. In the challenge, a training data set including correct labels (ground truth) and a test set without the labels are provided. Researchers from academia and industry across the world will develop their own systems and send their classification results on the test set to the organisers to evaluate and publish online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figure 6. Extending sharp-wave ripple detection to non-human primates. c) Significant differences between SWR recorded in mice and monkey. d) The best model of each architecture trained in mouse data, and the best filter configuration for mouse data, were applied to detect SWRs on the macaque data. We evaluated all models by computing F1-score against the ground truth (GT). Note relatively good results from non-retrained ML models and filter. e) Results of model re-training using macaque data. Data were split into a training and validation dataset (50% and 20% respectively), used to train the ML models; and a test set (30%), used to compute the F1 (left panel). Filter was not re-trained. f) F1-scores for the maximal performance of each model before and after re-training.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genetic variation on the non-recombining portion of the Y chromosome contains information about the ancestry of male lineages. Because of their low rate of mutation, single nucleotide polymorphisms (SNPs) are the markers of choice for unambiguously classifying Y chromosomes into related sets of lineages known as haplogroups, which tend to show geographic structure in many parts of the world. However, performing the large number of SNP genotyping tests needed to properly infer haplogroup status is expensive and time consuming. A novel alternative for assigning a sampled Y chromosome to a haplogroup is presented here. We show that by applying modern machine-learning algorithms we can infer with high accuracy the proper Y chromosome haplogroup of a sample by scoring a relatively small number of Y-linked short tandem repeats (STRs). Learning is based on a diverse ground-truth data set comprising pairs of SNP test results (haplogroup) and corresponding STR scores. We apply several independent machine-learning methods in tandem to learn formal classification functions. The result is an integrated high-throughput analysis system that automatically classifies large numbers of samples into haplogroups in a cost-effective and accurate manner.
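As an illustration of the described approach, a supervised classifier can be trained to map STR profiles to haplogroups. This is a sketch only, with entirely hypothetical data: a random table of Y-STR repeat counts paired with SNP-derived haplogroup labels stands in for the ground-truth data set, and a random forest stands in for the ensemble of methods used in the study.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical ground-truth data: STR repeat counts (features) and haplogroup labels.
X = rng.integers(8, 35, size=(200, 17))          # e.g. 17 Y-STR loci per sample
y = rng.choice(["R1b", "I1", "E1b"], size=200)   # SNP-derived haplogroup labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy estimate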
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Deep learning neural networks are especially potent at dealing with structured data, such as images and volumes. Both modified LiviaNET and HyperDense-Net performed well at a prior competition segmenting 6-month-old infant magnetic resonance images, but neonatal cerebral tissue type identification is challenging given its uniquely inverted tissue contrasts. The current study aims to evaluate the two architectures to segment neonatal brain tissue types at term equivalent age.

Methods: Both networks were retrained over 24 pairs of neonatal T1 and T2 data from the Developing Human Connectome Project public data set and validated on another eight pairs against ground truth. We then reported the best-performing model from training and its performance by computing the Dice similarity coefficient (DSC) for each tissue type against eight test subjects.

Results: During the testing phase, among the segmentation approaches tested, the dual-modality HyperDense-Net achieved the best test mean DSC values, with statistically significant differences, obtaining 0.94/0.95/0.92 for the tissue types, and took 80 h to train and 10 min to segment, including preprocessing. The single-modality LiviaNET was better at processing T2-weighted images than T1-weighted images across all tissue types, achieving mean DSC values of 0.90/0.90/0.88 for gray matter, white matter, and cerebrospinal fluid, respectively, while requiring 30 h to train and 8 min to segment each brain, including preprocessing.

Discussion: Our evaluation demonstrates that both neural networks can segment neonatal brains, achieving previously reported performance. Both networks will be continuously retrained over an increasingly larger repertoire of neonatal brain data and be made available through the Canadian Neonatal Brain Platform to better serve the neonatal brain imaging research community.
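For reference, the Dice similarity coefficient used above can be computed from a predicted and a ground-truth binary mask as in this minimal sketch; the arrays below are illustrative only.

import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

# Illustrative 3D masks (e.g. one tissue class in a segmented volume).
pred = np.zeros((4, 4, 4), dtype=np.uint8); pred[1:3, 1:3, 1:3] = 1
truth = np.zeros((4, 4, 4), dtype=np.uint8); truth[1:4, 1:3, 1:3] = 1
print(f"DSC = {dice_coefficient(pred, truth):.3f}")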
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Precise segmentation of coronary arteries in non-contrast Computed Tomography (CT) scans plays an important role in the assessment of coronary artery disease, where it is the key component for evaluating the Calcium Score (Agatston et al. 1990). In the paper by Bujny et al. (2024), a deep-learning approach for high-precision segmentation of coronary arteries in non-contrast CT was proposed along with a novel method for generating Ground Truth (GT) test data (test-GT) via manual registration of high-resolution coronary tree models, obtained from contrast CT, with the non-contrast CT scans. In this dataset, we present the inferences of the neural network model together with the corresponding test-GT samples, based on 6 CT scans from the openly available OrCaScore dataset (Wolterink et al. 2016). The geometrical models included in the dataset can be used both for inspection of the proposed deep learning model and for testing of new non-contrast coronary vessel segmentation approaches. This is a unique opportunity since, to the best of our knowledge, manual generation of GT for non-contrast coronary artery segmentation has not been addressed so far due to the very challenging character of this particular segmentation task.
Methods
Manual Generation of test-GT
The geometric models of coronary arteries used for the evaluation of the proposed neural network model were generated according to the manual mesh-to-image registration process as described by Bujny et al. (2024). In this approach, the high-resolution coronary artery masks obtained based on contrast CT scans are manually aligned with the corresponding non-contrast CT images using tools available in the open-source 3D computer graphics software, Blender (https://www.blender.org/). To ease the manual alignment process, specialized add-ons for medical image processing such as Cardiac add-on for Blender of Graylight Imaging (https://graylight-imaging.com/3d-modelling/) can be used, as well. The STL models in this dataset were manually generated by a medical expert with 4 years of experience.
Segmentation of Coronary Arteries using a Deep Learning Model
For each of the cases presented in this dataset, we run an inference of an nnU-Net (Isensee et al. 2021) model trained according to the process described in our paper (Bujny et al. 2024). Since we use a standard nnU-Net, which utilizes a sliding-window approach for processing the CT scan, the context information within a patch is limited, which can lead to some false-positive detections. To mitigate this problem, we additionally post-process the inferences by eliminating small vessel fragments of less than 50 mm^3 volume and structures outside of the pericardium, which we segment using another nnU-Net model, SegTHOR (Lambert et al. 2020). The resulting geometric models are stored using the STL format and presented as green masks in the HTML reports with an embedded viewer based on the K3D-jupyter library (https://k3d-jupyter.org/).
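As an illustration of the small-fragment post-processing step, connected components below the 50 mm^3 threshold can be removed with scipy.ndimage. This is a sketch only, not the authors' code; it assumes a binary vessel mask as a NumPy array and an isotropic voxel volume given in mm^3.

import numpy as np
from scipy import ndimage

def remove_small_fragments(mask, voxel_volume_mm3, min_volume_mm3=50.0):
    """Drop connected components of a binary mask smaller than min_volume_mm3."""
    labeled, n_components = ndimage.label(mask)
    # Volume of each component in mm^3 (label 0 is background and is skipped).
    sizes = ndimage.sum(mask, labeled, index=np.arange(1, n_components + 1)) * voxel_volume_mm3
    keep = np.zeros(n_components + 1, dtype=bool)
    keep[1:] = sizes >= min_volume_mm3
    return keep[labeled]

# Illustrative use with a random mask and 0.5 mm isotropic voxels.
mask = np.random.default_rng(0).random((64, 64, 64)) > 0.995
cleaned = remove_small_fragments(mask, voxel_volume_mm3=0.5 ** 3)
print(mask.sum(), "->", cleaned.sum(), "foreground voxels")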
Dataset organization
The root folder contains 6 folders whose names correspond to the CT scans from the OrCaScore dataset (Wolterink et al. 2016). In each of the folders, there are the following 4 files available:
‘manualGT_rater1.stl’ – high-resolution STL model of coronary arteries obtained via manual alignment of the geometric model segmented in contrast CT with the corresponding non-contrast CT scan by the first rater. A sample belonging to the test-GT set (Bujny et al. 2024).
‘manualGT_rater2.stl’ – corresponding test-GT sample by the second rater.
‘ML.stl’ – post-processed inference of the nnU-Net ML model in the STL format.
‘report.html’ – interactive HTML report consisting of a manually-aligned test-GT sample (red mask), the ML segmentation based on the non-contrast CT scan (green mask), and selected slices of the non-contrast CT scan. The reports contain the relevant information related to the scanning device and present the main segmentation quality metrics for the ML model inference.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a global resource for machine learning applications in mining area detection and semantic segmentation on satellite imagery. It contains Sentinel-2 satellite images and corresponding mining area masks + bounding boxes for 1,210 sites worldwide. Ground-truth masks are derived from Maus et al. (2022) and Tang et al. (2023), and validated through manual verification to ensure accurate alignment with Sentinel-2 imagery from specific timestamps.
The dataset includes three mask variants:
Each tile corresponds to a 2048x2048 pixel Sentinel-2 image, with metadata on mine type (surface, placer, underground, brine & evaporation) and scale (artisanal, industrial). For convenience, the preferred mask dataset is already split into training (75%), validation (15%), and test (10%) sets.
Furthermore, dataset quality was validated by re-validating test set tiles manually and correcting any mismatches between mining polygons and visually observed true mining area in the images, resulting in the following estimated quality metrics:
| Metric | Combined | Maus | Tang |
|-----------|----------|-------|-------|
| Accuracy | 99.78 | 99.74 | 99.83 |
| Precision | 99.22 | 99.20 | 99.24 |
| Recall | 95.71 | 96.34 | 95.10 |
Note that the dataset does not contain the Sentinel-2 images themselves but contains a reference to specific Sentinel-2 images. Thus, for any ML applications, the images must be persisted first. For example, Sentinel-2 imagery is available from Microsoft's Planetary Computer and filterable via STAC API: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a. Additionally, the temporal specificity of the data allows integration with other imagery sources from the indicated timestamp, such as Landsat or other high-resolution imagery.
Source code used to generate this dataset and to use it for ML model training is available at https://github.com/SimonJasansky/mine-segmentation. It includes useful Python scripts, e.g. to download Sentinel-2 images via STAC API, or to divide tile images (2048x2048px) into smaller chips (e.g. 512x512px).
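For example, dividing a 2048x2048 tile into 512x512 chips can be done with plain NumPy reshaping. This is a sketch only, independent of the scripts in the repository; it assumes a single-band image or mask loaded as a height x width array.

import numpy as np

def split_into_chips(tile, chip_size=512):
    """Split a (H, W) tile into non-overlapping (chip_size, chip_size) chips."""
    h, w = tile.shape
    assert h % chip_size == 0 and w % chip_size == 0
    return (
        tile.reshape(h // chip_size, chip_size, w // chip_size, chip_size)
        .swapaxes(1, 2)
        .reshape(-1, chip_size, chip_size)
    )

tile = np.zeros((2048, 2048), dtype=np.uint16)   # placeholder for a Sentinel-2 band or mask
print(split_into_chips(tile).shape)              # (16, 512, 512)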
A database schema, a schematic depiction of the dataset generation process, and a map of the global distribution of tiles are provided in the accompanying images.