25 datasets found
  1. Data from: LVMED: Dataset of Latvian text normalisation samples for the medical domain

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
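
    As a quick sanity check, the splits can be inspected with pandas before being fed to a text-to-text model; a minimal sketch, assuming the training split is saved locally as lvmed_train.csv (the actual file name and column headers should be taken from the repository):

    import pandas as pd

    # File name is an assumption; use the actual CSV downloaded from repository.clarin.lv.
    train_df = pd.read_csv("lvmed_train.csv")
    print(train_df.shape)   # expected: 64,665 rows, one abbreviated/normalized sentence pair per row
    print(train_df.head())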

  2. Partitioning of the ABIDE I, ABIDE II, and ADHD200 datasets into training, validation and testing sets

    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Partitioning of the ABIDE I, ABIDE II, and ADHD200 datasets into training, validation and testing sets. [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s001
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Partitioning of the ABIDE I, ABIDE II, and ADHD200 datasets into training, validation and testing sets.

  3. Address Standardization

    • hub.arcgis.com
    Updated Jul 26, 2022
    Cite
    Esri (2022). Address Standardization [Dataset]. https://hub.arcgis.com/content/6c8e054fbdde4564b3b416eacaed3539
    Dataset updated
    Jul 26, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    This deep learning model is used to transform incorrect and non-standard addresses into standardized addresses. Address standardization is a process of formatting and correcting addresses in accordance with global standards. It includes all the required address elements (i.e., street number, apartment number, street name, city, state, and postal code) and is used by the standard postal service.

    An address can be termed non-standard because of incomplete details (missing street name or zip code), invalid information (incorrect address), incorrect information (typos, misspellings, formatting of abbreviations), or inaccurate information (wrong house number or street name). These errors make it difficult to locate a destination. A standardized address does not guarantee the address's validity; standardization simply converts an address into the correct format. This deep learning model is trained on an address dataset provided by openaddresses.io and can be used to standardize addresses from 10 different countries.
    
    
    
      Using the model
    
    
          Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.
    
    
    
        Fine-tuning the model
        This model can be fine-tuned using the Train Deep Learning Model tool. Follow the guide to fine-tune this model.

        Input
        Text (non-standard address) on which address standardization will be performed.
    
        Output
        Text (standard address)
    
        Supported countries
        This model supports addresses from the following countries:
    
          AT – Austria
          AU – Australia
          CA – Canada
          CH – Switzerland
          DK – Denmark
          ES – Spain
          FR – France
          LU – Luxembourg
          SI – Slovenia
          US – United States
    
        Model architecture
        This model uses the T5-base architecture implemented in Hugging Face Transformers.
        Accuracy metrics
        This model has an accuracy of 90.18 percent.
    
        Training data
        The model has been trained on openly licensed data from openaddresses.io.

        Sample results
        Here are a few results from the model.
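
    Outside of the ArcGIS tooling, the same text-to-text idea can be sketched with the Hugging Face Transformers API that the T5-base architecture comes from; the checkpoint path below is hypothetical, since the published model is packaged for the ArcGIS tools rather than the Hugging Face Hub:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Hypothetical fine-tuned checkpoint; the Esri model itself ships as an ArcGIS deep learning package.
    checkpoint = "path/to/fine-tuned-t5-address-standardization"
    tokenizer = T5Tokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)

    raw_address = "380 newyork str apt 3 san fransico CA"
    inputs = tokenizer(raw_address, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # standardized address text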
    
  4. Data and models for "Center-fixing of tropical cyclones using uncertainty-aware deep learning applied to high-temporal-resolution geostationary satellite imagery"

    • zenodo.org
    nc, tar
    Updated Apr 1, 2025
    + more versions
    Cite
    Ryan Lagerquist; Ryan Lagerquist (2025). Data and models for "Center-fixing of tropical cyclones using uncertainty-aware deep learning applied to high-temporal-resolution geostationary satellite imagery" by Lagerquist et al. [Dataset]. http://doi.org/10.5281/zenodo.15116855
    Available download formats: tar, nc
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan Lagerquist
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The file geocenter_models.tar contains all models comprising the GeoCenter ensemble: 3 convolutional neural networks (CNN), 3 isotonic-regression files (one for correcting each CNN's mean estimate), and 3 more isotonic-regression files (one for correcting each CNN's ensemble spread). Every model is found in a subdirectory whose name indicates which infrared (IR) wavelengths are used as input to the CNN. For example:

    • wavelengths-microns=3.900-7.340-13.300/model.weights.h5: An HDF5 file containing the trained CNN that uses data from bands 7, 10, 16 (corresponding to 3.9, 7.34, and 13.3 microns on the GOES ABI imager). The trained CNN can always be read by neural_net_utils.read_model() in the ml4tccf library (https://doi.org/10.5281/zenodo.15116854).

    • wavelengths-microns=3.900-7.340-13.300/model_metadata.p: A Pickle file containing metadata for the trained CNN. This file is needed to read the CNN itself with neural_net_utils.read_model(). Otherwise, you will probably never need to access this metafile directly.

    • wavelengths-microns=3.900-7.340-13.300/isotonic_regression/isotonic_regression.dill: A Dill file containing isotonic-regression models used to bias-correct the ensemble mean from the same CNN. The trained isotonic-regression models can always be read by scalar_isotonic_regression.read_file() in the ml4tccf library. Note that there are technically two isotonic-regression models for every CNN’s ensemble mean: one that bias-corrects the x-coordinate of the TC-center, another that bias-corrects the y-coordinate.

    • wavelengths-microns=3.900-7.340-13.300/uncertainty_calibration/uncertainty_calibration.dill: A Dill file containing isotonic-regression models used to bias-correct the ensemble spread from the same CNN. In the ml4tccf code, I make a distinction between “isotonic_regression” (correcting the ensemble mean) and “uncertainty_calibration” (correcting the ensemble spread), but note that both models are isotonic regression and use the sklearn.isotonic.IsotonicRegression class. The trained uncertainty-calibration models can always be read by scalar_uncertainty_calibration.read_file() in the ml4tccf library. Again, note that there are technically two uncertainty-calibration models per CNN: one for spread in the x-coordinate, one for spread in the y-coordinate.

    As mentioned above, every trained CNN can be read by neural_net_utils.read_model(). Also, every trained CNN can be applied to new data (inference mode) by neural_net_utils.apply_model(). The input argument model_object should be the object returned by neural_net_utils.read_model(), and I suggest setting num_examples_per_batch = 10 to avoid out-of-memory errors. The only other input argument is predictor_matrices, which is a list of two numpy arrays.

    The first numpy array contains IR imagery centered at the first-guess TC center. It should have dimensions S (number of TC samples) x 300 (grid rows) x 300 (grid columns) x 9 (lag times) x 3 (wavelengths). Lag times should be in the following order: 240, 210, 180, 150, 120, 90, 60, 30, 0 min ago. Wavelengths should be in the order indicated by the subdirectory name. The numpy array itself should contain normalized brightness temperatures at the given lag times and wavelengths, following the grid specifications laid out in the journal paper (a plate carrée grid with 2-km spacing). The original IR data (brightness temperatures) must be normalized to z-scores using the same normalization parameters as in the journal paper, i.e., those based on the training data. See details below.

    The second numpy array in predictor_matrices contains ATCF scalars and should have dimensions S (number of TC samples) x 9 (variables). The variables must be in the order: absolute latitude, cosine of longitude, sine of longitude, TC intensity, minimum central pressure, tropical flag, subtropical flag, extratropical flag, disturbance flag. The journal paper contains details on all these variables in one table. These variables must come from A-deck files at the second-most recent synoptic time. Like the IR data, these ATCF scalars must be normalized to z-scores using the same normalization parameters as in the journal paper. See details below.
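
    A minimal inference sketch following the description above; the import path and the read_model() argument are assumptions (check the ml4tccf documentation), while the array shapes and the apply_model() keyword arguments are taken from the text:

    import numpy
    from ml4tccf import neural_net_utils  # exact import path is an assumption; see the ml4tccf repository

    # Shapes follow the description above: S x 300 x 300 x 9 lag times x 3 wavelengths for the IR imagery,
    # and S x 9 for the ATCF scalars.  Both arrays must already be normalized to z-scores.
    num_samples = 4
    ir_matrix = numpy.random.normal(size=(num_samples, 300, 300, 9, 3)).astype("float32")
    atcf_matrix = numpy.random.normal(size=(num_samples, 9)).astype("float32")

    model_object = neural_net_utils.read_model(
        "wavelengths-microns=3.900-7.340-13.300/model.weights.h5"  # argument name/order assumed
    )
    prediction_matrix = neural_net_utils.apply_model(
        model_object=model_object,
        predictor_matrices=[ir_matrix, atcf_matrix],
        num_examples_per_batch=10  # suggested above to avoid out-of-memory errors
    )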

    Once you have predictions (estimated TC-center locations) from a CNN, you can bias-correct these predictions. To read the isotonic-regression model for the given CNN’s ensemble mean, use scalar_isotonic_regression.read_file() in the ml4tccf library. To apply the same model, use scalar_isotonic_regression.apply_models(). For the CNN’s ensemble spread, use scalar_uncertainty_calibration.read_file() and scalar_uncertainty_calibration.apply_models().

    To normalize the IR data, you will need the file ir_satellite_normalization_params.tar included with this dataset. Within the tar file is a single zarr file. You can read the zarr file with normalization.read_file() in the ml4tccf library; then you can normalize new data with normalization.normalize_data().

    To normalize the ATCF data, you will need the file a_deck_normalization_params.nc included with this dataset. This is a NetCDF file, containing the full set of training values for all 5 ATCF variables that are normalized (the binary storm-type flags are not normalized). You can read this file using any of the standard Python methods for reading NetCDF files, such as xarray.open_dataset(). To normalize new ATCF data, you can use the method normalization._normalize_one_variable(), where the argument actual_values_training is the list of training values from a_deck_normalization_params.nc for the given variable, while actual_values_new is the list of values to be normalized (currently in physical units, to be converted to z-score units).
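
    A sketch of the two normalization paths described above; the import path, the file name inside the tar archive, and the NetCDF variable name are assumptions, while the function names and the actual_values_* arguments are those given in the text:

    import xarray
    from ml4tccf import normalization  # exact import path is an assumption

    # IR data: read the zarr file extracted from ir_satellite_normalization_params.tar,
    # then pass new brightness temperatures through normalization.normalize_data().
    ir_norm_params = normalization.read_file("ir_satellite_normalization_params.zarr")

    # ATCF data: the NetCDF file holds the full set of training values for the 5 normalized variables.
    a_deck_params = xarray.open_dataset("a_deck_normalization_params.nc")
    training_values = a_deck_params["tc_intensity"].values  # variable name is hypothetical
    # z_scores = normalization._normalize_one_variable(
    #     actual_values_training=training_values, actual_values_new=new_physical_values)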

  5. Transformer Network trained on Simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds

    • researchdata.tuwien.at
    Updated Jul 1, 2025
    Cite
    Florian Simperl; Florian Simperl; Florian Simperl; Florian Simperl (2025). Transformer Network trained on Simulated X-ray photoelectron spectroscopy data for organic and inorganic compounds [Dataset]. http://doi.org/10.48436/mvrkc-dz146
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    TU Wien
    Authors
    Florian Simperl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This data repository provides the underlying data and neural network training scripts associated with the manuscript titled "A Transformer Network for High-Throughput Material Characterisation with X-ray Photoelectron Spectroscopy" by Simperl and Werner.

    All data files are released under the Creative Commons Attribution 4.0 International (CC-BY) license, while all code files are distributed under the MIT license.

    The repository contains simulated X-ray photoelectron spectroscopy (XPS) spectra stored as hdf5 files in the zipped (h5_files.zip) folder, which was generated using the software developed by the authors. The NIST Standard Reference Database 100 – Simulation of Electron Spectra for Surface Analysis (SESSA) is freely available at https://www.nist.gov/srd/nist-standard-reference-database-100.

    The neural network architecture is implemented using the PyTorch Lightning framework and is fully available within the attached materials as Transformer_SimulatedSpectra.py contained in the python_scripts.zip.

    The trained model and the list of materials for the train, test and validation sets are contained in the models.zip folder.

    The repository contains all the data necessary to replot the figures from the manuscript. These data are available in the form of .csv files or .h5 files for the spectra. In addition, the repository also contains a Python script (Plot_Data_Manuscript.ipynb) which is contained in the python_scripts.zip file.

    Context and methodology

    The dataset and accompanying Python code files included in this repository were used to train a transformer-based neural network capable of directly inferring chemical concentrations from simulated survey X-ray photoelectron spectroscopy (XPS) spectra of bulk compounds.

    The spectral dataset provided here represents the raw output from the SESSA software (version 2.2.2), prior to the normalization procedure described in the associated manuscript. This step of normalisation is of paramount importance for the effective training of the neural network.

    The repository contains the Python scripts utilised to execute the spectral simulations and the neural network training on the Vienna Scientific Cluster (VSC5). In order to obtain guidance on the proper configuration of the Command Line Interface (CLI) tools required for SESSA, users are advised to consult the official SESSA manual, which is available at the following address: https://nvlpubs.nist.gov/nistpubs/NSRDS/NIST.NSRDS.100-2024.pdf.

    To run the neural network training we provided the requirements_nn_training.txt file that contains all the necessary python packages and version numbers. All other python scripts can be run locally with the python libraries listed in requirements_data_analysis.txt.

    Data details

    HDF5 (in zip folder): As described in the manuscript, we simulate X-ray photoelectron spectra for each of the 7,587 inorganic [1] and organic [2] materials in our dataset. To reflect realistic experimental conditions, each simulated spectrum was augmented by systematically varying parameters such as peak width, peak shift, and peak type—all configurable within the SESSA software—as well as by applying statistical Poisson noise to simulate varying signal-to-noise ratios. These modifications account for experimentally observed and material-specific spectral broadening, peak shifts, and detector-induced noise. Each material is represented by an individual HDF5 (.h5) file, named according to its chemical formula and mass density (in g/cm³). For example, the file for SiO2 with a density of 2.196 gcm-3 is named SiO2_2.196.h5. For more complex chemical formulas, such as Co(ClO4)2 with a density of 3.33 gcm-3, the file is named Co_ClO4_2_3.33.h5. Within each HDF5 file, the metadata for each spectrum is stored alongside a fixed energy axis and the corresponding intensity values. The spectral data are organized hierarchically by augmentation parameters in the following directory structure, e.g. for Ac_10.0.h5 we have SNR_0/WIDTH_0.3/SHIFT_-3.0/PEAK_gauss/Ac_10.0/. These files can be easily inspected with H5Web in Visual Studio Code or using h5py in Python or any other h5 interpretable program.

    Session Files: The .ses files are SESSA-specific input files that can be directly loaded into SESSA to specify certain input parameters for the initialization (ini), the geometry (geo) and the simulation parameters (sim_para), and are required by the python script Simulation_Script_VSC_json.py to run the simulation on the cluster.

    Json Files: The two json files (MaterialsListVSC_gauss.json, MaterialsListVSC_lorentz.json) are used as the input files to the Python script Simulation_Script_VSC_json.py. These files contain all the material specific information for the SESSA simulation.

    csv files: The csv files are used to generate the plots from the manuscript described in the section "Plotting Scripts".

    npz files: The two .npz files (element_counts.npz, single_elements.npz) are python arrays that are needed by the Transformer_SimulatedSpectra.py script and contain the number of each single element in the dataset and an array of each single element present, respectively.

    SESSA Simulation Script

    There is one python file that sets the communication with SESSA:

    • Simulation_Script_VSC_json.py: This script is the heart of the simulation as it controls the communication through the CLI with SESSA, using the specified input parameters in the .json and .ses files together with external functions specified in VSC_function.py.

    Technical Details

    Simulation_Script_VSC_json.py: This script uses the functions of the VSC_function.py script (therefore needs to be placed in the same directory as this script) and can be called with the following command:

    python3 Simulation_Script_VSC_json.py MaterialsListVSC_gauss.json 0

    It simulates the spectrum for the material at index 0 in the .json file and with the corresponding parameters specified in the .json file.

    It is important that before running this script the following paths need to be specified:

    • sessa_path: The path to your SESSA installation.
    • folder_path: The path to your .ses session files. In this directory an output folder will be generated where all the output files, including the simulated spectra, are written.

    To run SESSA on a computing cluster it is important to have a working Xvfb (virtual frame buffer) or a similar tool available to which any graphical output from SESSA can be written.

    Neural Network Training Script

    Before running the training script it is important to normalize the data such that the squared integral of the spectrum is 1, as described in the manuscript and shown in the code: normalize_spectra.py

    For the neural network training we use the Transformer_SimulatedSpectra.py where the external functions used are specified in external_functions.py. This script contains the full description of the neural network architecture, the hyperparameter tuning and the Wandb logging.

    In the models.zip folder the fully trained network final_trained_model.ckpt presented in the manuscript is available as well as the list of training, validation and testing materials (test_materials_list.pt, train_materials_list.pt, val_materials_list.pt) where the corresponding spectra are extracted from the hdf5 files. The file types .ckpt and .pt can be read in by using the pytorch specific load functions in Python, e.g.

    torch.load("train_materials_list.pt")

    Technical Details

    normalize_spectra.py: To run this script properly it is important to set up a python environment with the necessary libraries specified in the requirements_data_analysis.txt file. Then it can be called with

    python3 normalize_spectra.py

    where it is important to specify the path to the .h5 files containing the unnormalized spectra.

    Transformer_SimulatedSpectra.py: To run this script properly on the cluster it is important to set up a python environment with the necessary libraries specified in the requirements_nn_training.txt file. This script also relies on the external_functions.py, single_elements.npz and element_counts.npz files, which should be placed in the same directory as the python script. These files are important for creating the datasets for training, validation and testing and ensure that all the single elements appear in the testing set. You can call this script (on the cluster) within a slurm script to start the GPU training.

    python3 Transformer_SimulatedSpectra.py

    It is important that before running this script the following paths need to be specified:

    • data_path: General path where all the data is stored
    • neural_network_data: The location where you keep your normalized hdf5 files
    • wandb_api_key: The api key to use wandb
    • ray_results: The location where you want to save your tuning results
    • checkpoints: The location where you want to save your Ray checkpoints

  6. Gender breakdown and distribution of age and FIQ score for each dataset (training, validation, testing, testing 2 sets)

    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Gender breakdown and distribution of age and FIQ score for each dataset (training, validation, testing, testing 2 sets). [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s002
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gender breakdown and distribution of age and FIQ score for each dataset (training, validation, testing, testing 2 sets).

  7. Cancer Detection dataset

    • kaggle.com
    Updated Feb 15, 2025
    Cite
    Manikandan (2025). Cancer Detection dataset [Dataset]. https://www.kaggle.com/datasets/mani11111111111/cancer-detection-dataset/data
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Manikandan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🩺 Cancer Cell Detection Dataset

    📌 Overview

    This dataset contains high-resolution microscopic images of cancerous and non-cancerous cells. It is designed for deep learning-based cancer detection models, specifically for binary classification (Benign vs. Malignant).

    📂 Dataset Structure

    The dataset is organized into two main folders:

    📁 train/ – Labeled images for training:
    - 0/ (Benign) → Non-cancerous cell images
    - 1/ (Malignant) → Cancerous cell images

    📁 test/ – Contains unlabeled images for model evaluation.

    📸 Image Details

    • Format: .jpg / .png
    • Resolution: 150x150 pixels (can be resized)
    • Color Mode: RGB (3-channel images)

    🔍 Use Cases

    ✅ Cancer detection using Convolutional Neural Networks (CNNs)
    ✅ Image classification & feature extraction
    ✅ Transfer learning with VGG16, ResNet, etc.
    ✅ Medical AI research

    📈 Model Performance Benchmark

    • Trained using a CNN model, achieving 92% accuracy on the validation set.
    • Data augmentation and advanced architectures can further improve performance.

    🚀 Future Enhancements

    Data Augmentation to improve generalization
    Transfer Learning using pre-trained models
    Web App Deployment for real-time detection

    📜 License

    📌 MIT License – Free to use, modify, and distribute with proper attribution.

    💡 How to Use?

    1️⃣ Download the dataset from Kaggle.
    2️⃣ Preprocess images (rescale, normalize).
    3️⃣ Train a CNN using TensorFlow/Keras or PyTorch.
    4️⃣ Evaluate the model using the test set.
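
    A minimal Keras sketch of steps 2 to 4, assuming the archive is extracted so that train/0 and train/1 hold the benign and malignant images described above (the hyperparameters are illustrative, not the settings behind the 92% benchmark):

    import tensorflow as tf

    # Labels are inferred from the subdirectory names: train/0 (benign) and train/1 (malignant).
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "train", image_size=(150, 150), batch_size=32, label_mode="binary",
        validation_split=0.2, subset="training", seed=42)
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "train", image_size=(150, 150), batch_size=32, label_mode="binary",
        validation_split=0.2, subset="validation", seed=42)

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255, input_shape=(150, 150, 3)),  # normalize pixel values
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output: benign vs. malignant
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=10)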

    📢 Acknowledgments

    This dataset is inspired by medical AI research and deep learning applications. Special thanks to OpenAI, TensorFlow, and Kaggle for resources.

  8. Sensitivity and specificity of each model across datasets

    • figshare.com
    • plos.figshare.com
    xls
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Sensitivity and specificity of each model across datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0276832.t001
    Available download formats: xls
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets are Training, Validation, Testing (No Comorbidity), and Testing Set 2 (With Comorbidities).

  9. MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish

    • zenodo.org
    zip
    Updated May 22, 2023
    + more versions
    Cite
    Salvador Lima-López; Salvador Lima-López; Eulàlia Farré-Maduell; Antonio Miranda-Escalada; Antonio Miranda-Escalada; Vicent Briva-Iglesias; Martin Krallinger; Martin Krallinger; Eulàlia Farré-Maduell; Vicent Briva-Iglesias (2023). MEDDOPROF corpus: complete gold standard annotations for occupation detection in medical documents in Spanish [Dataset]. http://doi.org/10.5281/zenodo.5070541
    Available download formats: zip
    Dataset updated
    May 22, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Salvador Lima-López; Eulàlia Farré-Maduell; Antonio Miranda-Escalada; Vicent Briva-Iglesias; Martin Krallinger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MEDDOPROF Shared Task tackles the detection of occupations and employment statuses in clinical cases in Spanish from different specialties. Systems capable of automatically processing clinical texts are of interest to the medical community, social workers, researchers, the pharmaceutical industry, computer engineers, AI developers, policy makers, citizen’s associations and patients. Additionally, other NLP tasks (such as anonymization) can also benefit from this type of data.

    MEDDOPROF has three different sub-tasks:

    1) MEDDOPROF-NER: Participants must find the beginning and end of occupation mentions and classify them as PROFESION (PROFESSION), SITUACION_LABORAL (WORKING_STATUS) or ACTIVIDAD (ACTIVITY).

    2) MEDDOPROF-CLASS: Participants must find the beginning and end of occupation mentions and classify them according to their referent (PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] or OTRO [other]).

    3) MEDDOPROF-NORM: Participants must find the beginning and end of occupation mentions and normalize them according to a reference codes list.

    This is the complete Gold Standard. Annotations for the NER and CLASS sub-tracks are provided both separately and jointly (with each annotation level separated by a dash, e.g. PROFESION-PACIENTE). The normalized mentions are given as a tab-separated file (.tsv) with four columns: filename, mention text, span and code.
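
    A minimal sketch for loading the MEDDOPROF-NORM annotations with pandas; the file name is an assumption, and the column names follow the four columns listed above:

    import pandas as pd

    # File name is an assumption; the gold standard ships inside the zip archive.
    norm_df = pd.read_csv("meddoprof_norm_gold.tsv", sep="\t", header=None,
                          names=["filename", "mention_text", "span", "code"])
    print(norm_df.head())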

    Please cite if you use this resource:

    Salvador Lima-López, Eulàlia Farré-Maduell, Antonio Miranda-Escalada, Vicent Brivá-Iglesias and Martin Krallinger. NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts. In Procesamiento del Lenguaje Natural, 67. 2021.

    @article{meddoprof,
      title = {NLP applied to occupational health: MEDDOPROF shared task at IberLEF 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts},
      author = {Lima-López, Salvador and Farré-Maduell, Eulàlia and Miranda-Escalada, Antonio and Brivá-Iglesias, Vicent and Krallinger, Martin},
      journal = {Procesamiento del Lenguaje Natural},
      volume = {67},
      year = {2021},
      issn = {1989-7553},
      url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6393},
      pages = {243--256}
    }

    Resources:

    - Web

    - Training Data

    - Test set

    - Codes Reference List (for MEDDOPROF-NORM)

    - Annotation Guidelines

    MEDDOPROF is part of the IberLEF 2021 workshop, which is co-located with the SEPLN 2021 conference. For further information, please visit https://temu.bsc.es/meddoprof/ or email us at encargo-pln-life@bsc.es

    MEDDOPROF is promoted by the Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL).

  10. Data from: Manually labeled terrestrial laser scanning point clouds of individual trees for leaf-wood separation

    • heidata.uni-heidelberg.de
    bin, tsv
    Updated Jan 18, 2024
    + more versions
    Cite
    Hannah Weiser; Hannah Weiser; Veit Ulrich; Veit Ulrich; Lukas Winiwarter; Lukas Winiwarter; Alberto M. Esmorís; Bernhard Höfle; Bernhard Höfle; Alberto M. Esmorís (2024). Manually labeled terrestrial laser scanning point clouds of individual trees for leaf-wood separation [Dataset]. http://doi.org/10.11588/DATA/UUMEDI
    Available download formats: bin(133507442), bin(6633033), bin(41550826), bin(32301812), bin(82029589), bin(71234946), tsv(826), bin(117636408), bin(55125478), bin(16529344), bin(37192727), bin(20689778)
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    heiDATA
    Authors
    Hannah Weiser; Veit Ulrich; Lukas Winiwarter; Alberto M. Esmorís; Bernhard Höfle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bretten municipal forest, Bretten, Baden-Württemberg, Germany; Hardtwald forest in Karlsruhe-Waldstadt, Karlsruhe, Baden-Württemberg, Germany
    Dataset funded by
    Deutsche Forschungsgemeinschaft (DFG)
    Description

    This dataset contains 11 terrestrial laser scanning (TLS) tree point clouds (in .LAZ format v1.4) of 7 different species, which have been manually labeled into leaf and wood points. The labels are contained in the Classification field (0 = wood, 1 = leaf). The point clouds have additional attributes (Deviation, Reflectance, Amplitude, GpsTime, PointSourceId, NumberOfReturns, ReturnNumber). Before labeling, all point clouds were filtered by Deviation, discarding all points with a Deviation greater than 50. An ASCII file with tree species and tree positions (in ETRS89 / UTM zone 32N; EPSG:25832) is provided, which can be used to normalize and center the point clouds. This dataset is intended to be used for training and validation of algorithms for semantic segmentation (leaf-wood separation) of TLS tree point clouds, as done by Esmorís et al. 2023 (Related Publication). The point clouds are a subset of a larger dataset, which is available on PANGAEA (Weiser et al. 2022b, see Related Dataset). More details on data acquisition and processing, file formats, and quality assessments can be found in the corresponding data description paper (Weiser et al. 2022a, see Related Material).
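
    A minimal sketch for separating the labeled points with laspy (reading .laz requires the lazrs or laszip backend); the file name is an example, and the 0 = wood / 1 = leaf convention is the one stated above:

    import laspy
    import numpy as np

    las = laspy.read("tree_01.laz")              # example file name
    xyz = np.vstack((las.x, las.y, las.z)).T
    labels = np.asarray(las.classification)      # 0 = wood, 1 = leaf

    wood_points = xyz[labels == 0]
    leaf_points = xyz[labels == 1]
    print(len(wood_points), "wood points;", len(leaf_points), "leaf points")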

  11. clinical-field-mappings

    • huggingface.co
    Updated May 8, 2025
    Cite
    Tiago Silva (2025). clinical-field-mappings [Dataset]. https://huggingface.co/datasets/tsilva/clinical-field-mappings
    Dataset updated
    May 8, 2025
    Authors
    Tiago Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🚑 Clinical Field Mappings for Healthcare Systems

    This synthetic dataset provides a wide variety of alternative names for clinical database fields, mapping them to standardized targets for healthcare data normalization.

    Using LLMs, we generated and validated thousands of plausible variations, including misspellings, abbreviations, country-specific nuances, and common real-world typos.

    This dataset is perfect for training models that need to standardize, clean, or map heterogeneous healthcare data schemas into unified, normalized formats.

    Applications include:
    - Data cleaning and ETL pipelines for clinical databases
    - Fine-tuning LLMs for schema matching
    - Clinical data interoperability projects
    - Zero-shot field matching research

    The dataset is machine-generated and validated with LLM feedback loops to ensure high-quality mappings.
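
    A minimal loading sketch with the Hugging Face datasets library, using the dataset ID from the citation above; the split name and record fields are assumptions, so inspect the loaded object first:

    from datasets import load_dataset

    ds = load_dataset("tsilva/clinical-field-mappings", split="train")  # split name is an assumption
    print(ds.column_names)
    print(ds[0])  # e.g. a raw field-name variant and its standardized target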

  12. MFCCs Feature Scaling Images for Multi-class Human Action Analysis: A Benchmark Dataset

    • researchdata.edu.au
    • data.mendeley.com
    Updated 2023
    Cite
    Naveed Akhtar; Syed Mohammed Shamsul Islam; Douglas Chai; Muhammad Bilal Shaikh; Computer Science and Software Engineering (2023). MFCCs Feature Scaling Images for Multi-class Human Action Analysis : A Benchmark Dataset [Dataset]. http://doi.org/10.17632/6D8V9JMVGM.1
    Dataset updated
    2023
    Dataset provided by
    Mendeley Ltd.
    The University of Western Australia
    Authors
    Naveed Akhtar; Syed Mohammed Shamsul Islam; Douglas Chai; Muhammad Bilal Shaikh; Computer Science and Software Engineering
    Description

    This dataset comprises an array of Mel Frequency Cepstral Coefficients (MFCCs) that have undergone feature scaling, representing a variety of human actions. Feature scaling, or data normalization, is a preprocessing technique used to standardize the range of features in the dataset. For MFCCs, this process helps ensure all coefficients contribute equally to the learning process, preventing features with larger scales from overshadowing those with smaller scales.

    In this dataset, the audio signals correspond to diverse human actions such as walking, running, jumping, and dancing. The MFCCs are calculated via a series of signal processing stages, which capture key characteristics of the audio signal in a manner that closely aligns with human auditory perception. The coefficients are then standardized or scaled using methods such as MinMax Scaling or Standardization, thereby normalizing their range. Each normalized MFCC vector corresponds to a segment of the audio signal.

    The dataset is meticulously designed for tasks including human action recognition, classification, segmentation, and detection based on auditory cues. It serves as an essential resource for training and evaluating machine learning models focused on interpreting human actions from audio signals. This dataset proves particularly beneficial for researchers and practitioners in fields such as signal processing, computer vision, and machine learning, who aim to craft algorithms for human action analysis leveraging audio signals.
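
    An illustrative recomputation of scaled MFCCs from an audio clip, in the spirit of the preprocessing described above; the file name, n_mfcc=13, and the choice of MinMax scaling are assumptions rather than the dataset's exact settings:

    import librosa
    from sklearn.preprocessing import MinMaxScaler

    signal, sr = librosa.load("walking_clip.wav", sr=None)     # hypothetical audio file
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

    # Min-max scale each coefficient across frames so all coefficients share a common range.
    scaler = MinMaxScaler()
    mfccs_scaled = scaler.fit_transform(mfccs.T).T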

  13. Crohn's Disease Treatment Prediction Model

    • data.mendeley.com
    Updated Jul 10, 2024
    Cite
    Henry Adams (2024). Crohn's Disease Treatment Prediction Model [Dataset]. http://doi.org/10.17632/y2hhsygy49.1
    Dataset updated
    Jul 10, 2024
    Authors
    Henry Adams
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database for machine learning using clinical data at baseline. Used to predict the medium-term efficacy of biologic therapies in patients with Crohn's Disease.

    Data Collection Sources
    - Electronic Health Records (EHR)
    - Clinical trials and studies
    - Genetic data
    - Patient-reported outcomes
    - Medical imaging

    Types of Data
    - Demographic information
    - Clinical data (symptoms, disease severity, treatment history)
    - Genetic data (SNPs, mutations)
    - Lab results (CRP levels, fecal calprotectin)
    - Imaging data (MRI, endoscopy)
    - Lifestyle data (diet, smoking status)

    Data Preprocessing Steps
    1. Data Cleaning: Handle missing values, remove duplicates, correct errors.
    2. Data Normalization/Standardization: Normalize lab results, standardize imaging data.
    3. Feature Engineering: Create new features from existing data, e.g., calculate disease activity scores.
    4. Encoding Categorical Data: Convert categorical variables to numerical ones using one-hot encoding or label encoding.
    5. Data Splitting: Split data into training, validation, and test sets (see the sketch below).
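
    A minimal sketch of steps 4 and 5, assuming a pandas DataFrame loaded from a hypothetical baseline CSV with a binary outcome column named "response" (all names are illustrative):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("crohns_baseline.csv")            # hypothetical file name
    X = pd.get_dummies(df.drop(columns=["response"]))  # one-hot encode categorical variables
    y = df["response"]

    # 70 / 15 / 15 split into training, validation, and test sets.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)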

  14. Comparing accuracy scores between data collection sites

    • figshare.com
    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Comparing accuracy scores between data collection sites. [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s020
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparing accuracy scores between data collection sites.

  15. Manually labeled terrestrial laser scanning point clouds of individual trees for leaf-wood separation

    • b2find.eudat.eu
    Updated Feb 13, 2024
    + more versions
    Cite
    (2024). Manually labeled terrestrial laser scanning point clouds of individual trees for leaf-wood separation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/5cb0d2f4-ae65-5258-89e5-e5ceab790ea8
    Dataset updated
    Feb 13, 2024
    Description

    This dataset contains 11 terrestrial laser scanning (TLS) tree point clouds (in .LAZ format v1.4) of 7 different species, which have been manually labeled into leaf and wood points. The labels are contained in the Classification field (0 = wood, 1 = leaf). The point clouds have additional attributes (Deviation, Reflectance, Amplitude, GpsTime, PointSourceId, NumberOfReturns, ReturnNumber). Before labeling, all point clouds were filtered by Deviation, discarding all points with a Deviation greater than 50. An ASCII file with tree species and tree positions (in ETRS89 / UTM zone 32N; EPSG:25832) is provided, which can be used to normalize and center the point clouds. This dataset is intended to be used for training and validation of algorithms for semantic segmentation (leaf-wood separation) of TLS tree point clouds, as done by Esmorís et al. 2023 (Related Publication). The point clouds are a subset of a larger dataset, which is available on PANGAEA (Weiser et al. 2022b, see Related Dataset). More details on data acquisition and processing, file formats, and quality assessments can be found in the corresponding data description paper (Weiser et al. 2022a, see Related Material).

  16. Best regions for predicting True Negatives (TN, i.e. no diagnosis of Autism)

    • figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Best regions for predicting True Negatives (TN, i.e. no diagnosis of Autism). [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s007
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each row is for one region, each column is for one model and one combination of datasets considered (training+validation+testing 1 sets (no comorbidity), or all these sets + testing set 2 (containing subjects with comorbidities)), each case returns the number of datasets where the region was important for predicting TN for the model considered. (CSV)

  17. Best regions for predicting False Positives (FP, i.e. prediction of Autism whereas no diagnosis of Autism)

    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Best regions for predicting False Positives (FP, i.e. prediction of Autism whereas no diagnosis Autism). [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s008
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each row is for one region, each column is for one model and one combination of datasets considered (training+validation+testing 1 sets (no comorbidity), or all these sets + testing set 2 (containing subjects with comorbidities)), each case returns the number of datasets where the region was important for predicting FP for the model considered. (CSV)

  18. LLM Fine Tuning Dataset of Indian Legal Texts

    • kaggle.com
    Updated Jul 30, 2024
    Cite
    Akshat Gupta (2024). LLM Fine Tuning Dataset of Indian Legal Texts [Dataset]. https://www.kaggle.com/datasets/akshatgupta7/llm-fine-tuning-dataset-of-indian-legal-texts/discussion
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshat Gupta
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), Criminal Procedure Code (CRPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.

    Dataset Details:

    • Sources: The questions and answers in this dataset are extracted from the Indian Constitution, Indian Penal Code (IPC), and the Code of Criminal Procedure (CrPC), ensuring relevance and accuracy in legal contexts.
    • Content: Each entry in the dataset contains a clear and concise question alongside its corresponding answer. The questions are designed to cover fundamental concepts, key provisions, and significant terms found within these legal documents.

    Use Cases:

    • Legal Research: A valuable tool for lawyers, legal researchers, and students seeking to understand legal terminology and principles as outlined in Indian law.
    • Natural Language Processing (NLP): This dataset is ideal for training AI models for question-answering systems that require a strong understanding of Indian legal texts.
    • Educational Resources: Useful for creating educational tools and materials for law students and legal practitioners.

    Note on Use and Limitations:

    • Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.

    • Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.

    • Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.

    • Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.

    • Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.

  19. Esophageal Cancer

    • kaggle.com
    Updated Nov 6, 2024
    Cite
    willian oliveira gibin (2024). Esophageal Cancer [Dataset]. http://doi.org/10.34740/kaggle/dsv/9828226
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Kaggle
    Authors
    willian oliveira gibin
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Esophageal Cancer Dataset is a comprehensive clinical dataset designed to support advancements in the detection, prognosis, and treatment of esophageal cancer, one of the most aggressive and high-mortality cancers worldwide. Available on Kaggle, this dataset includes detailed patient demographics, clinical data, and cancer-specific attributes, offering valuable insights for developing AI models aimed at early detection and tailored treatment approaches.

    Overview of Dataset Contents

    The dataset serves as a resource for healthcare professionals and researchers focused on cancer detection and personalized treatment solutions. It includes essential data points, such as:

    • Patient Demographics: These include patient identifiers, age at diagnosis, gender, and consent status, which support studies on age and gender influences in disease incidence and outcomes.
    • Medical and Clinical History: This section covers ICD-10 and ICD-O-3 codes for detailed tumor site and histology information, comorbidities like GERD, and smoking status to evaluate lifestyle impacts on cancer progression.
    • Cancer-Specific Data: Key attributes include tumor location, histology type, cancer stage, residual tumor status, and lymph node examination results. Additionally, records on radiation therapy and postoperative treatments provide context on treatment outcomes.
    • Clinical Outcome Data: This section assesses the patient's physical capabilities using the Karnofsky Performance Score and the ECOG Performance Status, which are critical for tracking functional and health status during treatment.

    Implementation Guide

    To make optimal use of this dataset, the following steps are recommended:

    • Data Preprocessing: Clean and normalize data by handling missing values and ensuring consistency across entries, especially for variables such as age, lymph node count, and performance scores.
    • Model Training: Employ machine learning frameworks like TensorFlow, PyTorch, or scikit-learn. Models such as Decision Trees, Random Forests, or Neural Networks can be trained depending on data complexity, with performance evaluated using accuracy, precision, recall, and F1-score.
    • Deployment: Integrate trained models into decision-support tools for clinicians, enabling predictive insights to aid diagnosis and treatment planning. Continuous testing and feedback will improve the model's performance and adaptability.
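
    A minimal scikit-learn sketch of the preprocessing and model-training steps above; the file name, the "outcome" target column, and the random-forest settings are assumptions to adapt to the actual CSV schema:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("esophageal_cancer.csv")          # hypothetical file name
    X = pd.get_dummies(df.drop(columns=["outcome"]))   # encode categorical variables
    X = X.fillna(X.median(numeric_only=True))          # simple imputation for missing values
    y = df["outcome"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))  # precision, recall, F1-score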

    Potential Applications

    This dataset supports several key applications:

    • Machine Learning Models: It enables the development of algorithms for early detection, personalized treatment plans, and prognosis prediction in esophageal cancer.
    • Healthcare Insights: By using this data, clinicians can optimize patient care strategies, improving the effectiveness of treatment protocols.
    • Academic Research: Researchers can utilize the dataset for studies on esophageal cancer pathophysiology, risk assessment, and treatment efficacy, contributing to a deeper understanding of the disease.

    Conclusion

    The Esophageal Cancer Dataset is a high-quality, well-rounded clinical resource that empowers researchers and clinicians to drive innovation in esophageal cancer care. By leveraging this data, the medical community can work towards improved patient outcomes and a greater understanding of this challenging disease.

    Team Contributors:

    • Abhinaba Biswas: Aspiring Data Analyst and ML Developer
    • Akash Nath: ML Developer
    • Shreya Dutta: AI Enthusiast

    All team members are students at JIS College of Engineering, Kalyani, West Bengal, India.

  20. Best regions for predicting True Positives (TP, i.e. true diagnosis of Autism)

    • plos.figshare.com
    csv
    Updated Oct 21, 2024
    Cite
    Mélanie Garcia; Clare Kelly (2024). Best regions for predicting True Positives (TP, i.e. true diagnosis of Autism). [Dataset]. http://doi.org/10.1371/journal.pone.0276832.s006
    Available download formats: csv
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mélanie Garcia; Clare Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Each row is for one region, each column is for one model (R42 for ResNet50 trained on 42 epochs, D32 for DenseNet121 trained on 32 epochs, D70 for DenseNet121 trained on 70 epochs) and one combination of datasets considered (training+validation+testing 1 sets (“no comorb” for no comorbidity), or all these sets + testing set 2 (“with comorb” for containing subjects with comorbidities), each case returns the number of datasets where the region was important for predicting TP for the model considered. (CSV)
