Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).
Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
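For orientation, a minimal sketch of loading the CSV files with pandas is shown below. The file name is a placeholder and the column headers are not documented here, so the sketch inspects them rather than assuming any particular names.

```python
# Minimal sketch: load one split and inspect its columns before use.
# "train.csv" is a placeholder file name; column names are not documented here.
import pandas as pd

train_df = pd.read_csv("train.csv")
print(train_df.columns.tolist())  # e.g., abbreviated-sentence vs. normalized-sentence columns
print(train_df.head())
```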
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Partitioning of the ABIDE I, ABIDE II, and ADHD200 datasets into training, validation and testing sets.
This deep learning model transforms incorrect and non-standard addresses into standardized addresses. Address standardization is the process of formatting and correcting addresses in accordance with global standards. A standardized address includes all the required address elements (i.e., street number, apartment number, street name, city, state, and postal code) and is used by the standard postal service.
An address can be termed non-standard because of incomplete details (missing street name or postal code), invalid information (incorrect address), incorrect information (typos, misspellings, formatting of abbreviations), or inaccurate information (wrong house number or street name). These errors make it difficult to locate a destination. A standardized address does not guarantee that the address is valid; standardization simply converts an address into the correct format. This deep learning model is trained on the address dataset provided by openaddresses.io and can be used to standardize addresses from 10 different countries.
Using the model
Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.
Fine-tuning the model
This model can be fine-tuned using the Train Deep Learning Model tool. Follow the guide to fine-tune this model.
Input
Text (non-standard address) on which address standardization will be performed.
Output
Text (standard address)
Supported countries
This model supports addresses from the following countries:
AT – Austria
AU – Australia
CA – Canada
CH – Switzerland
DK – Denmark
ES – Spain
FR – France
LU – Luxembourg
SI – Slovenia
US – United States
Model architecture
This model uses the T5-base architecture implemented in Hugging Face Transformers.
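As an illustration of the underlying text-to-text pattern only (the released model is consumed through ArcGIS tooling, not directly through this code), a T5 checkpoint can be driven with Hugging Face Transformers roughly as sketched below. The checkpoint name and the raw address string are placeholders, not the actual packaged model or its preprocessing.

```python
# Illustrative sketch of the T5 text-to-text pattern this model is built on.
# "t5-base" and the example address are placeholders; the packaged ArcGIS model
# applies its own tokenization and post-processing.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

raw_address = "380 newyork strret redlnds califrnia"
inputs = tokenizer(raw_address, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```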
Accuracy metrics
This model has an accuracy of 90.18 percent.
Training data
The model has been trained on openly licensed data from openaddresses.io.
Sample results
Here are a few results from the model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file geocenter_models.tar contains all models comprising the GeoCenter ensemble: 3 convolutional neural networks (CNNs), 3 isotonic-regression files (one for correcting each CNN’s mean estimate), and 3 more isotonic-regression files (one for correcting each CNN’s ensemble spread). Every model is found in a subdirectory whose name indicates which infrared (IR) wavelengths are used as input to the CNN. For example:
wavelengths-microns=3.900-7.340-13.300/model.weights.h5: An HDF5 file containing the trained CNN that uses data from bands 7, 10, 16 (corresponding to 3.9, 7.34, and 13.3 microns on the GOES ABI imager). The trained CNN can always be read by neural_net_utils.read_model() in the ml4tccf library (https://doi.org/10.5281/zenodo.15116854).
wavelengths-microns=3.900-7.340-13.300/model_metadata.p: A Pickle file containing metadata for the trained CNN. This file is needed to read the CNN itself with neural_net_utils.read_model(). Otherwise, you will probably never need to access this metafile directly.
wavelengths-microns=3.900-7.340-13.300/isotonic_regression/isotonic_regression.dill: A Dill file containing isotonic-regression models used to bias-correct the ensemble mean from the same CNN. The trained isotonic-regression models can always be read by scalar_isotonic_regression.read_file() in the ml4tccf library. Note that there are technically two isotonic-regression models for every CNN’s ensemble mean: one that bias-corrects the x-coordinate of the TC-center, another that bias-corrects the y-coordinate.
wavelengths-microns=3.900-7.340-13.300/uncertainty_calibration/uncertainty_calibration.dill: A Dill file containing isotonic-regression models used to bias-correct the ensemble spread from the same CNN. In the ml4tccf code, I make a distinction between “isotonic_regression” (correcting the ensemble mean) and “uncertainty_calibration” (correcting the ensemble spread), but note that both models are isotonic regression and use the sklearn.isotonic.IsotonicRegression class. The trained uncertainty-calibration models can always be read by scalar_uncertainty_calibration.read_file() in the ml4tccf library. Again, note that there are technically two uncertainty-calibration models per CNN: one for spread in the x-coordinate, one for spread in the y-coordinate.
As mentioned above, every trained CNN can be read by neural_net_utils.read_model(). Every trained CNN can also be applied to new data (inference mode) with neural_net_utils.apply_model(). The input argument model_object should be the object returned by neural_net_utils.read_model(), and I suggest setting num_examples_per_batch = 10 to avoid out-of-memory errors. The only other input argument is predictor_matrices, which is a list of two numpy arrays.

The first numpy array contains IR imagery centered at the first-guess TC center and should have dimensions S (number of TC samples) x 300 (grid rows) x 300 (grid columns) x 9 (lag times) x 3 (wavelengths). Lag times should be in the following order: 240, 210, 180, 150, 120, 90, 60, 30, 0 min ago. Wavelengths should be in the order indicated by the subdirectory name. The array itself should contain normalized brightness temperatures at the given lag times and wavelengths, following the grid specifications laid out in the journal paper (a plate carrée grid with 2-km spacing). The original IR data (brightness temperatures) must be normalized to z-scores using the same normalization parameters as in the journal paper, i.e., those based on the training data. See details below.

The second numpy array contains ATCF scalars and should have dimensions S (number of TC samples) x 9 (variables). The variables must be in the following order: absolute latitude, cosine of longitude, sine of longitude, TC intensity, minimum central pressure, tropical flag, subtropical flag, extratropical flag, disturbance flag. The journal paper contains details on all these variables in one table. These variables must come from A-deck files at the second-most-recent synoptic time. Like the IR data, these ATCF scalars must be normalized to z-scores using the same normalization parameters as in the journal paper. See details below.
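Putting the above together, a hedged sketch of inference with one CNN might look like the following. Only the function names and argument names come from the description above; the module import path is an assumption about the ml4tccf package layout, and the random arrays merely stand in for properly normalized predictors.

```python
# Hedged sketch of running one GeoCenter CNN on new data.
# Assumption: the import path of neural_net_utils within ml4tccf; file path is a placeholder.
import numpy as np
from ml4tccf.machine_learning import neural_net_utils  # assumed module path

model_object = neural_net_utils.read_model(
    "wavelengths-microns=3.900-7.340-13.300/model.weights.h5"
)

num_samples = 4
ir_matrix = np.random.normal(size=(num_samples, 300, 300, 9, 3)).astype("float32")  # z-scored brightness temps
atcf_matrix = np.random.normal(size=(num_samples, 9)).astype("float32")             # z-scored ATCF scalars

predictions = neural_net_utils.apply_model(
    model_object=model_object,
    predictor_matrices=[ir_matrix, atcf_matrix],
    num_examples_per_batch=10,
)
print(type(predictions))  # inspect the returned prediction object (format per the ml4tccf docs)
```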
Once you have predictions (estimated TC-center locations) from a CNN, you can bias-correct these predictions. To read the isotonic-regression model for the given CNN’s ensemble mean, use scalar_isotonic_regression.read_file() in the ml4tccf library. To apply the same model, use scalar_isotonic_regression.apply_models(). For the CNN’s ensemble spread, use scalar_uncertainty_calibration.read_file() and scalar_uncertainty_calibration.apply_models().
To normalize the IR data, you will need the file ir_satellite_normalization_params.tar included with this dataset. Within the tar file is a single zarr file. You can read the zarr file with normalization.read_file() in the ml4tccf library; then you can normalize new data with normalization.normalize_data().
To normalize the ATCF data, you will need the file a_deck_normalization_params.nc included with this dataset. This is a NetCDF file, containing the full set of training values for all 5 ATCF variables that are normalized (the binary storm-type flags are not normalized). You can read this file using any of the standard Python methods for reading NetCDF files, such as xarray.open_dataset(). To normalize new ATCF data, you can use the method normalization._normalize_one_variable(), where the argument actual_values_training is the list of training values from a_deck_normalization_params.nc for the given variable, while actual_values_new is the list of values to be normalized (currently in physical units, to be converted to z-score units).
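A hedged sketch of this ATCF normalization step follows. The function _normalize_one_variable() and its two argument names come from the description above; the module import path and the variable name inside the NetCDF file are assumptions, so the sketch lists the stored variables before picking one.

```python
# Hedged sketch: z-score new ATCF values against the training distribution in
# a_deck_normalization_params.nc. The import path and "tc_intensity" are assumptions.
import numpy as np
import xarray as xr
from ml4tccf.utils import normalization  # assumed module path

norm_table = xr.open_dataset("a_deck_normalization_params.nc")
print(list(norm_table.data_vars))  # see which 5 ATCF variables are stored

actual_values_training = norm_table["tc_intensity"].values  # hypothetical variable name
actual_values_new = np.array([35.0, 50.0, 65.0])            # values in physical units

z_scores = normalization._normalize_one_variable(
    actual_values_training=actual_values_training,
    actual_values_new=actual_values_new,
)
print(z_scores)
```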
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data repository provides the underlying data and neural network training scripts associated with the manuscript titled "A Transformer Network for High-Throughput Material Characterisation with X-ray Photoelectron Spectroscopy" by Simperl and Werner.
All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.
The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra, stored as HDF5 files in the zipped folder (h5_files.zip), which were generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.
The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.
The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.
The repository contains all the data necessary to replot the figures from the manuscript. These data are available in the form of .csv files or .h5 files for the spectra. In addition, the repository also contains a Python script (Plot_Data_Manuscript.ipynb) which is contained in the python_scripts.zip file.
The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.
The spectral dataset provided here represents the raw output from the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This normalization step is essential for effective training of the neural network.
The repository contains the Python scripts utilised to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5). In order to obtain guidance on the proper configuration of the Command Line Interface (CLI) tools required for SESSA, users are advised to consult the official SESSA manual, which is available at the following address: https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.
To run the neural network training, we provide the requirements_nn_training.txt file, which contains all the necessary Python packages and version numbers. All other Python scripts can be run locally with the Python libraries listed in requirements_data_analysis.txt.
HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type (all configurable within the SESSA software), as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 g/cm³ is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 g/cm³, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters in the following directory structure, e.g. for Ac_10.0.h5 we have SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be easily inspected with H5Web in Visual Studio Code, with h5py in Python, or with any other program that can read HDF5.
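A quick way to inspect one of these files with h5py is sketched below. The group path follows the example structure above; the names of the datasets stored inside the innermost group are listed rather than assumed.

```python
# Hedged sketch: inspect one augmented spectrum inside an HDF5 file with h5py.
import h5py

with h5py.File("Ac_10.0.h5", "r") as f:
    # Walk the augmentation hierarchy: SNR / WIDTH / SHIFT / peak type / material
    group = f["SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0"]
    print(list(group.keys()))       # datasets stored for this spectrum (energy axis, intensities, ...)
    for name, dataset in group.items():
        print(name, dataset.shape)  # shape of each stored array
```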
Session Files: The .ses files are SESSA-specific input files that can be directly loaded into SESSA to specify certain input parameters for the initialization (ini), the geometry (geo), and the simulation parameters (sim_para); they are required by the Python script Simulation_Script_VSC_json.py to run the simulation on the cluster.
Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.
csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".
npz files: The two .npz files (element_counts.npz, single_elements.npz) are NumPy archives needed by the Transformer_SimulatedSpectra.py script; they contain the count of each single element in the dataset and an array of every single element present, respectively.
There is one Python file that sets up the communication with SESSA:
Simulation_Script_VSC_json.py: This script uses the functions of the VSC_function.py script (and therefore needs to be placed in the same directory as that script) and can be called with the following command:
python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0
It simulates the spectrum for the material at index 0 in the .json file, using the corresponding parameters specified in that file.
Before running this script, the following paths need to be specified:
To run SESSA on a computing cluster, it is important to have a working Xvfb (virtual frame buffer) or a similar tool available to which any graphical output from SESSA can be written.
Before running the training script, it is important to normalize the data such that the squared integral of the spectrum is 1, as described in the manuscript and implemented in normalize_spectra.py.
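In spirit, this normalization amounts to scaling each spectrum so that the integral of its square over the energy axis equals 1. The sketch below illustrates the idea with dummy data; the variable names are illustrative, not those used in normalize_spectra.py.

```python
# Hedged sketch of the normalization described above: scale each spectrum so
# that the integral of intensity**2 over the energy axis is 1.
import numpy as np

def normalize_spectrum(energy_ev, intensity):
    """Scale the spectrum so that the integral of intensity**2 over energy equals 1."""
    d_e = energy_ev[1] - energy_ev[0]                # uniform energy spacing
    squared_integral = np.sum(intensity ** 2) * d_e  # simple Riemann-sum integral
    return intensity / np.sqrt(squared_integral)

energy_ev = np.linspace(0.0, 1500.0, 3001)               # example energy axis (eV)
intensity = np.exp(-((energy_ev - 500.0) ** 2) / 200.0)  # dummy single peak
normalized = normalize_spectrum(energy_ev, intensity)
print(np.sum(normalized ** 2) * (energy_ev[1] - energy_ev[0]))  # ≈ 1.0
```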
For the neural network training we use Transformer_SimulatedSpectra.py, whose external functions are specified in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning, and the wandb logging.
The models.zip folder contains the fully trained network presented in the manuscript (final_trained_model.ckpt), as well as the lists of training, validation, and testing materials (train_materials_list.pt, val_materials_list.pt, test_materials_list.pt) from which the corresponding spectra are extracted from the HDF5 files. The .ckpt and .pt files can be read with the PyTorch-specific load functions in Python, e.g.
torch.load("train_materials_list.pt")
normalize_spectra.py: To run this script properly, it is important to set up a Python environment with the necessary libraries specified in requirements_data_analysis.txt. Then it can be called with
python3 normalize_spectra.py
where it is important to specify the path to the .h5 files containing the unnormalized spectra.
Transformer_SimulatedSpectra.py: To run this script properly on the cluster, it is important to set up a Python environment with the necessary libraries specified in requirements_nn_training.txt. The script also relies on external_functions.py, single_elements.npz, and element_counts.npz (which should be placed in the same directory as the Python script); these are needed to create the datasets for training, validation, and testing and to ensure that every single element appears in the testing set. You can call this script (on the cluster) within a SLURM script to start the GPU training:
python3 Transformer_SimulatedSpectra.py
Before running this script, the following paths need to be specified:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gender breakdown and distribution of age and FIQ score for each dataset (training, validation, testing, testing 2 sets).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains high-resolution microscopic images of cancerous and non-cancerous cells. It is designed for deep learning-based cancer detection models, specifically for binary classification (Benign vs. Malignant).
The dataset is organized into two main folders:
📁 train/ – Labeled images for training:
- 0/ (Benign) → Non-cancerous cell images
- 1/ (Malignant) → Cancerous cell images
📁 test/ – Contains unlabeled images for model evaluation.
Image formats: .jpg / .png
✅ Cancer detection using Convolutional Neural Networks (CNNs)
✅ Image classification & feature extraction
✅ Transfer learning with VGG16, ResNet, etc.
✅ Medical AI research
✅ Data Augmentation to improve generalization
✅ Transfer Learning using pre-trained models
✅ Web App Deployment for real-time detection
📌 MIT License – Free to use, modify, and distribute with proper attribution.
1️⃣ Download the dataset from Kaggle.
2️⃣ Preprocess images (rescale, normalize).
3️⃣ Train a CNN using TensorFlow/Keras or PyTorch.
4️⃣ Evaluate the model using the test set.
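A minimal sketch of steps 2 and 3 with TensorFlow/Keras is shown below, assuming the train/0 and train/1 folder layout described above. The image size, batch size, and architecture are illustrative choices, not a prescribed pipeline.

```python
# Minimal sketch (illustrative, not the official pipeline): train a small CNN
# on the labeled train/0 (benign) and train/1 (malignant) folders.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "train",
    labels="inferred",
    label_mode="binary",      # 0 = benign, 1 = malignant
    image_size=(128, 128),
    batch_size=32,
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),
    tf.keras.layers.Rescaling(1.0 / 255),              # step 2: rescale/normalize
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)                           # step 3: train the CNN
```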
This dataset is inspired by medical AI research and deep learning applications. Special thanks to OpenAI, TensorFlow, and Kaggle for resources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets are Training, Validation, Testing (No Comorbidity), and Testing Set 2 (With Comorbidities).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in clinical cases in Spanish from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizens' associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.
MEDDOPROF has three different sub-tasks:
1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).
2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).
3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference codes list.
This is the complete Gold Standard. Annotations for the NER and CLASS sub-tracks are provided both separately and jointly (with each annotation level separated by a dash, e.g. PROFESION-PACIENTE). The normalized mentions are given as a tab-separated file (.tsv) with four columns: filename, mention text, span and code.
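A hedged sketch of loading the MEDDOPROF-NORM annotations with pandas is shown below. The file name is a placeholder, the column order follows the description above, and the absence of a header row is an assumption to verify against the actual file.

```python
# Hedged sketch: load the normalized mentions (.tsv) described above.
# "meddoprof_norm_gold.tsv" is a placeholder; check whether the file has a header row.
import pandas as pd

norm_df = pd.read_csv(
    "meddoprof_norm_gold.tsv",
    sep="\t",
    header=None,
    names=["filename", "mention_text", "span", "code"],
)
print(norm_df.head())
```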
Please cite if you use this resource:
Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.
@article{meddoprof,
title={NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
author={Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
journal = {Procesamiento del Lenguaje Natural},
volume = {67},
year={2021},
issn = {1989-7553},
url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
pages = {243--256}
}
Resources:
- Web
- Test set
- Codes Reference List (for MEDDOPROF-NORM)
MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es
MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 11 terrestrial laser scanning (TLS) tree point clouds (in .LAZ format v1.4) of 7 different species, which have been manually labeled into leaf and wood points. The labels are contained in the Classification field (0 = wood, 1 = leaf). The point clouds have additional attributes (Deviation, Reflectance, Amplitude, GpsTime, PointSourceId, NumberOfReturns, ReturnNumber). Before labeling, all point clouds were filtered by Deviation, discarding all points with a Deviation greater than 50. An ASCII file with tree species and tree positions (in ETRS89 / UTM zone 32N; EPSG:25832) is provided, which can be used to normalize and center the point clouds. This dataset is intended to be used for training and validation of algorithms for semantic segmentation (leaf-wood separation) of TLS tree point clouds, as done by Esmorís et al. 2023 (Related Publication). The point clouds are a subset of a larger dataset, which is available on PANGAEA (Weiser et al. 2022b, see Related Dataset). More details on data acquisition and processing, file formats, and quality assessments can be found in the corresponding data description paper (Weiser et al. 2022a, see Related Material).
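A hedged sketch of separating leaf and wood points from one of the .laz files with laspy is shown below. The file name is a placeholder, and LAZ support (e.g., the lazrs backend) is assumed to be installed.

```python
# Hedged sketch: split one labeled tree point cloud into wood and leaf points.
# Requires LAZ support, e.g. pip install "laspy[lazrs]"; file name is a placeholder.
import laspy
import numpy as np

las = laspy.read("tree_01.laz")
classification = np.asarray(las.classification)

wood_xyz = las.xyz[classification == 0]  # 0 = wood
leaf_xyz = las.xyz[classification == 1]  # 1 = leaf
print(f"wood points: {len(wood_xyz)}, leaf points: {len(leaf_xyz)}")
```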
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🚑 Clinical Field Mappings for Healthcare Systems
This synthetic dataset provides a wide variety of alternative names for clinical database fields, mapping them to standardized targets for healthcare data normalization.
Using LLMs, we generated and validated thousands of plausible variations, including misspellings, abbreviations, country-specific nuances, and common real-world typos.
This dataset is perfect for training models that need to standardize, clean, or map heterogeneous healthcare data schemas into unified, normalized formats.
✅ Applications include: - Data cleaning and ETL pipelines for clinical databases - Fine-tuning LLMs for schema matching - Clinical data interoperability projects - Zero-shot field matching research
The dataset is machine-generated and validated with LLM feedback loops to ensure high-quality mappings.
This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.
In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.
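The sketch below illustrates this kind of preprocessing: compute MFCCs from an audio clip and scale them with MinMax scaling. The file name is a placeholder and this is not the dataset's original extraction pipeline, just a generic example of the technique described.

```python
# Hedged sketch: MFCC extraction followed by MinMax feature scaling.
# "walking_clip.wav" is a placeholder audio file.
import librosa
from sklearn.preprocessing import MinMaxScaler

signal, sample_rate = librosa.load("walking_clip.wav", sr=22050)
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)  # shape: (13, frames)

# Scale each coefficient to [0, 1] across time frames.
scaler = MinMaxScaler()
mfccs_scaled = scaler.fit_transform(mfccs.T).T
print(mfccs_scaled.min(), mfccs_scaled.max())
```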
The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database for machine learning using clinical data at baseline. Used to predict the medium-term efficacy of biologic therapies in patients with Crohn's disease.
Types of data:
- Demographic information
- Clinical data (symptoms, disease severity, treatment history)
- Genetic data (SNPs, mutations)
- Lab results (CRP levels, fecal calprotectin)
- Imaging data (MRI, endoscopy)
- Lifestyle data (diet, smoking status)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparing accuracy scores between data collection sites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each row is for one region; each column is for one model and one combination of datasets considered (training + validation + testing 1 sets (no comorbidity), or all these sets + testing set 2 (containing subjects with comorbidities)); each cell gives the number of datasets where the region was important for predicting TN for the model considered. (CSV)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), Criminal Procedure Code (CRPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.
Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.
Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.
Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.
Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.
Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The Esophageal Cancer Dataset is a comprehensive clinical dataset designed to support advancements in the detection, prognosis, and treatment of esophageal cancer, one of the most aggressive and high-mortality cancers worldwide. Available on Kaggle, this dataset includes detailed patient demographics, clinical data, and cancer-specific attributes, offering valuable insights for developing AI models aimed at early detection and tailored treatment approaches.
Overview of Dataset Contents
The dataset serves as a resource for healthcare professionals and researchers focused on cancer detection and personalized treatment solutions. It includes essential data points, such as:
- Patient Demographics: Patient identifiers, age at diagnosis, gender, and consent status, which support studies on age and gender influences in disease incidence and outcomes.
- Medical and Clinical History: ICD-10 and ICD-O-3 codes for detailed tumor site and histology information, comorbidities like GERD, and smoking status to evaluate lifestyle impacts on cancer progression.
- Cancer-Specific Data: Tumor location, histology type, cancer stage, residual tumor status, and lymph node examination results. Additionally, records on radiation therapy and postoperative treatments provide context on treatment outcomes.
- Clinical Outcome Data: The patient's physical capabilities assessed with the Karnofsky Performance Score and the ECOG Performance Status, which are critical for tracking functional and health status during treatment.
Implementation Guide
To make optimal use of this dataset, the following steps are recommended (a minimal preprocessing-and-training sketch follows this list):
- Data Preprocessing: Clean and normalize data by handling missing values and ensuring consistency across entries, especially for variables such as age, lymph node count, and performance scores.
- Model Training: Employ machine learning frameworks like TensorFlow, PyTorch, or scikit-learn. Models such as Decision Trees, Random Forests, or Neural Networks can be trained depending on data complexity, with performance evaluated using accuracy, precision, recall, and F1-score.
- Deployment: Integrate trained models into decision-support tools for clinicians, enabling predictive insights to aid diagnosis and treatment planning. Continuous testing and feedback will improve the model's performance and adaptability.
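The sketch below illustrates the preprocessing and model-training steps above with scikit-learn. All file and column names are hypothetical placeholders, not the dataset's actual schema.

```python
# Minimal, hedged sketch of the preprocessing and training steps above.
# Column names ("age_at_diagnosis", ...) and the file name are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("esophageal_cancer.csv")  # placeholder file name

features = df[["age_at_diagnosis", "lymph_nodes_examined", "karnofsky_score"]]
target = df["residual_tumor_status"]

# Handle missing values, then split and train a simple baseline model.
features = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(features),
    columns=features.columns,
)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```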
Potential Applications
This dataset supports several key applications:
- Machine Learning Models: Development of algorithms for early detection, personalized treatment plans, and prognosis prediction in esophageal cancer.
- Healthcare Insights: Clinicians can use this data to optimize patient care strategies, improving the effectiveness of treatment protocols.
- Academic Research: Researchers can utilize the dataset for studies on esophageal cancer pathophysiology, risk assessment, and treatment efficacy, contributing to a deeper understanding of the disease.
Conclusion
The Esophageal Cancer Dataset is a high-quality, well-rounded clinical resource that empowers researchers and clinicians to drive innovation in esophageal cancer care. By leveraging this data, the medical community can work towards improved patient outcomes and a greater understanding of this challenging disease.
Team Contributors:
- Abhinaba Biswas: Aspiring Data Analyst and ML Developer
- Akash Nath: ML Developer
- Shreya Dutta: AI Enthusiast
All team members are students at JIS College of Engineering, Kalyani, West Bengal, India.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Each row is for one region; each column is for one model (R42 for ResNet50 trained for 42 epochs, D32 for DenseNet121 trained for 32 epochs, D70 for DenseNet121 trained for 70 epochs) and one combination of datasets considered (training + validation + testing 1 sets ("no comorb", no comorbidity), or all these sets + testing set 2 ("with comorb", containing subjects with comorbidities)); each cell gives the number of datasets where the region was important for predicting TP for the model considered. (CSV)