80 datasets found
  1. Data Pre-Processing : Data Integration

    • kaggle.com
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed, while the marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
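
    A minimal pandas sketch of the merge described above, assuming both CSV files sit in the working directory and that the shared key column is named Student_id as stated:

    import pandas as pd

    # Load the two source files (paths are assumptions)
    student = pd.read_csv("student.csv")   # Age, Gender, Grade, Employed, Student_id
    marks = pd.read_csv("marks.csv")       # Mark, City, Student_id

    # Join on the shared Student_id column
    merged = pd.merge(student, marks, on="Student_id", how="inner")
    print(merged.head())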

  2. Data from: A Python-based pipeline for preprocessing LC-MS data for...

    • data.niaid.nih.gov
    xml
    Updated Nov 21, 2020
    Cite
    NICOLAS ZABALEGUI (2020). A Python-based pipeline for preprocessing LC-MS data for untargeted metabolomics workflows [Dataset]. https://data.niaid.nih.gov/resources?id=mtbls1919
    Explore at:
    Available download formats: xml
    Dataset updated
    Nov 21, 2020
    Dataset provided by
    CIBION-CONICET
    Authors
    NICOLAS ZABALEGUI
    Variables measured
    Metabolomics
    Description

    Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of unwanted features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces a package for the Python programming language for preprocessing LC-MS data for quality control procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are showcased with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies, in addition to NIST SRM 1950 – Metabolites in Frozen Human Plasma. The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.
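
    The package itself is not named in this listing, so the following is only a generic pandas illustration of the kind of data curation described (dropping retention time/m/z features whose relative standard deviation across QC injections exceeds a threshold); it does not use the package's actual API, and the file and column layouts are hypothetical:

    import pandas as pd

    # Hypothetical feature matrix: rows = injections, columns = (rt, m/z) features
    features = pd.read_csv("feature_matrix.csv", index_col=0)
    qc_rows = features.index.str.startswith("QC")        # hypothetical QC labelling

    # Keep features whose relative standard deviation in QC injections is below 20 %
    rsd = features[qc_rows].std() / features[qc_rows].mean()
    curated = features.loc[:, rsd < 0.20]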

  3. Titanic data for Data Preprocessing

    • kaggle.com
    Updated Oct 28, 2021
    Cite
    Akshay Sehgal (2021). Titanic data for Data Preprocessing [Dataset]. https://www.kaggle.com/akshaysehgal/titanic-data-for-data-preprocessing/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 28, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshay Sehgal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Public "Titanic" dataset for data exploration, preprocessing and benchmarking basic classification/regression models.

    Columns

    • 'survived'
    • 'pclass'
    • 'sex'
    • 'age'
    • 'sibsp'
    • 'parch'
    • 'fare'
    • 'embarked'
    • 'class'
    • 'who'
    • 'adult_male'
    • 'deck'
    • 'embark_town'
    • 'alive'
    • 'alone'

    Acknowledgements

    Github: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv

    Inspiration

    Playground for visualizations, preprocessing, feature engineering, model pipelining, and more.
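
    A minimal sketch of loading the file and doing a first preprocessing pass on the columns listed above, assuming pandas is installed and that the raw-file counterpart of the GitHub link is used (an assumption about the hosting path):

    import pandas as pd

    URL = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
    df = pd.read_csv(URL)

    # Basic preprocessing: impute a numeric gap and encode a categorical column
    df["age"] = df["age"].fillna(df["age"].median())
    df["sex"] = df["sex"].map({"male": 0, "female": 1})
    X = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]]
    y = df["survived"]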

  4. Preprocessing Antarctic Weather Station (AWS) data in python - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Dec 27, 2023
    Cite
    (2023). Preprocessing Antarctic Weather Station (AWS) data in python - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/d93b6b2b-b08f-55a1-9fb0-68c2971701ae
    Explore at:
    Dataset updated
    Dec 27, 2023
    Area covered
    Antarctica
    Description

    Information about data sources is available. Some downloading scripts are included in the provided code; however, users should make sure to comply with the data providers' terms and conditions. Given the changing download options of the different institutions, the above links may not work permanently, and data may have to be retrieved by the user of this dataset. No quality control is applied in the provided preprocessing software; quality control is up to the user of the datasets. Some datasets are quality-controlled by the owner.

    Acknowledgements: We thank all the data providers for making the data publicly available or providing them upon request. Full acknowledgements can be found in Gerber et al., submitted.

    References: Amory, C. (2020). “Drifting-snow statistics from multiple-year autonomous measurements in Adélie Land, East Antarctica”. The Cryosphere, 1713–1725. doi: 10.5194/tc-14-1713-2020. Gerber, F., Sharma, V. and Lehning, M.: CRYOWRF - a validation and the effect of blowing snow on the Antarctic SMB, JGR - Atmospheres, submitted.

  5. Data from: COVID-19 and media dataset: Mining textual data according periods...

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Dec 21, 2020
    Cite
    Mathieu Roche; Mathieu Roche (2020). COVID-19 and media dataset: Mining textual data according periods and countries (UK, Spain, France) [Dataset]. http://doi.org/10.18167/DVN1/ZUA8MF
    Explore at:
    Available download formats: application/x-gzip (511157), application/x-gzip (97349), text/x-perl-script (4982), application/x-gzip (93110), application/x-gzip (23765310), application/x-gzip (107669)
    Dataset updated
    Dec 21, 2020
    Authors
    Mathieu Roche; Mathieu Roche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, Spain, United Kingdom
    Dataset funded by
    ANR (#DigitAg)
    Horizon 2020 - European Commission - (MOOD project)
    Description

    These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) Keywords (at least): COVID-19, ncov2019, cov2019, coronavirus; (2) Keywords (all words): masque (French), mask (English), máscara (Spanish); (3) Periods: March 2020, May 2020, July 2020; (4) Countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th, 20th of each month) are built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~ 900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~ 9000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].

  6. Data from: SleepEEGpy: a Python-based software integration package to...

    • zenodo.org
    bin, txt
    Updated Feb 24, 2025
    Cite
    Rotem Falach; Gennadiy Belonosov; Flavio Schmidig; Maya Aderka; Vladislav Zhelezniakov; Revital Shani-Hershkovich; Ella Bar; Yuval Nir; Rotem Falach; Gennadiy Belonosov; Flavio Schmidig; Maya Aderka; Vladislav Zhelezniakov; Revital Shani-Hershkovich; Ella Bar; Yuval Nir (2025). SleepEEGpy: a Python-based software integration package to organize preprocessing, analysis, and visualization of sleep EEG data [Dataset]. http://doi.org/10.5281/zenodo.14914456
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rotem Falach; Gennadiy Belonosov; Flavio Schmidig; Maya Aderka; Vladislav Zhelezniakov; Revital Shani-Hershkovich; Ella Bar; Yuval Nir; Rotem Falach; Gennadiy Belonosov; Flavio Schmidig; Maya Aderka; Vladislav Zhelezniakov; Revital Shani-Hershkovich; Ella Bar; Yuval Nir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes three high-density sleep EEG recordings of healthy participants, downsampled to 250 Hz and stored in FIF format:

    1. Nap recording of a young adult participant
    2. Overnight recording of a young adult participant
    3. Overnight recording of an older adult participant

    Additionally, the dataset includes three text files for each recording:

    • bad_channels.txt: Indexes of noisy channels
    • annotations.txt: Onset and duration of noisy temporal intervals
    • staging.txt: Sleep staging vector

    The corresponding package can be found on GitHub.

    For citation, please use: https://doi.org/10.1101/2023.12.17.572046
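
    A minimal sketch of reading one of the recordings with MNE-Python (not with the SleepEEGpy package itself) and marking the channels listed in bad_channels.txt; the recording file name is hypothetical, and a one-index-per-line format is assumed:

    import mne

    raw = mne.io.read_raw_fif("nap_young_adult.fif", preload=False)   # hypothetical file name

    # Mark noisy channels from bad_channels.txt (assumed format: one index per line)
    with open("bad_channels.txt") as f:
        bad_idx = [int(line) for line in f if line.strip()]
    raw.info["bads"] = [raw.ch_names[i] for i in bad_idx]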

  7. Preprocessing Antarctic Weather Station (AWS) data in python

    • envidat.ch
    .zip, not available
    Updated May 27, 2025
    + more versions
    Cite
    Franziska Gerber; Michael Lehning (2025). Preprocessing Antarctic Weather Station (AWS) data in python [Dataset]. http://doi.org/10.16904/envidat.340
    Explore at:
    Available download formats: .zip, not available
    Dataset updated
    May 27, 2025
    Dataset provided by
    WSL-SLF
    WSL Institute for Snow and Avalanche Research SLF
    Authors
    Franziska Gerber; Michael Lehning
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Switzerland
    Dataset funded by
    EPFL / WSL-SLF
    Description

    There are many sources providing atmospheric weather station data for the Antarctic continent. However, variable naming, timestamps and data types are highly variable between the different sources. The published python code intends to make processing of different AWS sources from Antarctica easier. For all datasets that are taken into account, variables are renamed in a consistent way. Data from different sources can then be handled in one consistent python dictionary. The following data sources are taken into account:

    • AAD: Australian Antarctic Division (https://data.aad.gov.au/aws)
    • ACECRC: Antarctic Climate and Ecosystems Cooperative Research Centre by the Australian Antarctic Division
    • AMRC: Antarctic Meteorological Research Center (ftp://amrc.ssec.wisc.edu/pub/aws/q1h/)
    • BAS: British Antarctic Survey (ftp://ftp.bas.ac.uk/src/ANTARCTIC_METEOROLOGICAL_DATA/AWS/; https://legacy.bas.ac.uk/met/READER/ANTARCTIC_METEOROLOGICAL_DATA/)
    • CLIMANTARTIDE: Antarctic Meteo-Climatological Observatory by the Italian National Programme of Antarctic Research (https://www.climantartide.it/dataaccess/index.php?lang=en)
    • IMAU: Institute for Marine and Atmospheric research Utrecht (Lazzara et al., 2012), https://www.projects.science.uu.nl/iceclimate/aws/antarctica.ph
    • JMA: Japan Meteorological Agency (https://www.data.jma.go.jp/antarctic/datareport/index-e.html)
    • NOAA: National Oceanic and Atmospheric Administration (https://gml.noaa.gov/aftp/data/meteorology/in-situ/spo/)
    • Other/AWS_PE: Princess Elisabeth (PE), KU Leuven, Prof. N. van Lipzig, personal communication
    • Other/DDU_transect: Stations D-17 and D-47 (in transect between Dumont d’Urville and Dome C, Amory, 2020)
    • PANGAEA: World Data Center (e.g. König-Langlo, 2012)

    Important notes

    • Information about data sources is available. Some downloading scripts are included in the provided code; however, users should make sure to comply with the data providers' terms and conditions.
    • Given the changing download options of the different institutions, the above links may not work permanently, and data may have to be retrieved by the user of this dataset.
    • No quality control is applied in the provided preprocessing software; quality control is up to the user of the datasets. Some datasets are quality-controlled by the owner.
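
    The published code is not reproduced in this listing; the sketch below only illustrates the consistent-renaming idea described above, with made-up source column names and file layouts:

    import pandas as pd

    # Hypothetical per-source renaming maps: raw column name -> shared variable name
    RENAME = {
        "AAD": {"air_temp": "TA", "wind_spd": "VW", "rel_hum": "RH"},
        "BAS": {"Temperature": "TA", "WindSpeed": "VW", "Humidity": "RH"},
    }

    def load_aws(path, source):
        """Read one AWS file and rename its variables to the shared convention."""
        df = pd.read_csv(path, parse_dates=["timestamp"])   # hypothetical timestamp column
        return df.rename(columns=RENAME[source])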

    Acknowledgements

    We thank all the data providers for making the data publicly available or providing them upon request. Full acknowledgements can be found in Gerber et al., submitted.

    References

    Amory, C. (2020). “Drifting-snow statistics from multiple-year autonomous measurements in Adélie Land, East Antarctica”. The Cryosphere, 1713–1725. doi: 10.5194/tc-14-1713-2020
    Gerber, F., Sharma, V. and Lehning, M.: CRYOWRF - a validation and the effect of blowing snow on the Antarctic SMB, JGR - Atmospheres, submitted.
    König-Langlo, G. (2012). “Continuous meteorological observations at Neumayer station (2011-01)”. Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, PANGAEA, doi: 10.1594/PANGAEA.775173

  8. VegeNet - Image datasets and Codes

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 27, 2022
    Cite
    Tan, Jo Yen (2022). VegeNet - Image datasets and Codes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7254507
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    Tan, Jo Yen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original: Images of vegetables captured manually in the data acquisition stage

    2. vege_cropped_renamed: Images in (1) cropped to remove background areas and image labels renamed

    3. non-vege images: Images of non-vegetable foods for the CNN network to recognize other-than-vegetable foods

    4. food_image_dataset: Complete set of vege (2) and non-vege (3) images for architecture building

    5. food_image_dataset_split: Image dataset (4) split into train and test sets

    6. process: Images created when cropping (pre-processing step) to create dataset (2)
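
    A minimal scikit-learn sketch of producing a stratified train/test split over image file paths, in the spirit of dataset (5) above; the image extension and the one-folder-per-class layout are assumptions:

    from pathlib import Path
    from sklearn.model_selection import train_test_split

    paths = sorted(Path("food_image_dataset").rglob("*.jpg"))   # folder name from the listing
    labels = [p.parent.name for p in paths]                     # assumes one folder per class
    train_paths, test_paths = train_test_split(
        paths, test_size=0.2, stratify=labels, random_state=42
    )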

  9. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  10. SleepEEGpy: a Python-based package for preprocessing, analysis, and visualization of sleep EEG data

    • explore.openaire.eu
    • zenodo.org
    Updated Dec 12, 2023
    Cite
    Gennadiy Belonosov; Rotem Falach; Flavio Schmidig; Maya Aderka; Vladislav Zhelezniakov; Revital Shani-Hershkovich; Ella Bar; Yuval Nir (2023). SleepEEGpy: a Python-based package for preprocessing, analysis, and visualization of sleep EEG data [Dataset]. http://doi.org/10.5281/zenodo.10362190
    Explore at:
    Dataset updated
    Dec 12, 2023
    Authors
    Gennadiy Belonosov; Rotem Falach; Flavio Schmidig; Maya Aderka; Vladislav Zhelezniakov; Revital Shani-Hershkovich; Ella Bar; Yuval Nir
    Description

    Overnight high-density EEG recording of a healthy young participant, downsampled to 250 Hz and stored in FIF format. In addition, two text files are included: bad_channels.txt contains the indexes of noisy channels, and annotations.txt contains the onset and duration of noisy temporal intervals. The package can be found on GitHub.

  11. Adult dataset preprocessed

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2024
    Cite
    Schuster, Verena (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Schuster, Verena
    Pustozerova, Anastasia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.

    The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include:

    • One-hot-encoding of categorical values
    • Imputation of missing values using knn-imputer with k=1
    • Standard scaling of ordinal attributes

    Note: we assume the scenario in which the test set is available before training (every attribute besides the target, "income"); therefore we combine the train and test sets before preprocessing.
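
    A minimal scikit-learn sketch of the three preprocessing steps listed above (one-hot encoding, KNN imputation with k=1, standard scaling); the raw file name and the exact column handling in the notebook are assumptions:

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler

    # Hypothetical raw file with the train and test portions already combined, as described
    df = pd.read_csv("adult_raw.csv")

    # One-hot-encode categorical attributes (the target "income" is left out)
    categorical = df.drop(columns=["income"]).select_dtypes(include="object").columns
    encoded = pd.get_dummies(df.drop(columns=["income"]), columns=list(categorical))

    # Impute missing values with a k=1 KNN imputer, then apply standard scaling
    imputed = KNNImputer(n_neighbors=1).fit_transform(encoded)
    scaled = StandardScaler().fit_transform(imputed)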

  12. warvan-ml-dataset

    • huggingface.co
    Cite
    warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
    Explore at:
    Authors
    warvan
    Description

    Dataset Name

    This dataset contains structured data for machine learning and analysis purposes.

      Contents
    

    • data/sample.csv: Sample dataset file.
    • data/train.csv: Training dataset.
    • data/test.csv: Testing dataset.
    • scripts/preprocess.py: Script for preprocessing the dataset.
    • scripts/analyze.py: Script for data analysis.

      Usage
    

    Load the dataset using Pandas:

    import pandas as pd
    df = pd.read_csv('data/sample.csv')

    Run preprocessing:

    python scripts/preprocess.py

    … See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

  13. pre-processing

    • catalog.eoxhub.fairicube.eu
    data, json
    Updated Apr 3, 2025
    Cite
    (2025). pre-processing [Dataset]. https://catalog.eoxhub.fairicube.eu/collections/no-ML%20collection/items/0BCEY5MNAO
    Explore at:
    Available download formats: data, json
    Dataset updated
    Apr 3, 2025
    License

    https://spdx.org/licenses/MIT.html

    Time period covered
    Apr 3, 2025
    Area covered
    Earth
    Description

    Pre-processing Python script to slice through a large genomics data set and separate/filter it according to certain criteria.
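
    The script itself is not shown in this listing; below is only a generic pandas illustration of slicing a large table in chunks and keeping rows that match a filter criterion, with hypothetical file and column names:

    import pandas as pd

    # Stream the large file in chunks and keep only rows matching a criterion
    chunks = pd.read_csv("genomics_data.tsv", sep="\t", chunksize=100_000)   # hypothetical file
    filtered = pd.concat(chunk[chunk["gene_type"] == "protein_coding"] for chunk in chunks)
    filtered.to_csv("genomics_filtered.tsv", sep="\t", index=False)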

  14. S1 Data -

    • plos.figshare.com
    zip
    Updated Oct 11, 2023
    Cite
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing customers' characteristics and giving early warning of customer churn based on machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save a lot of operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)), were used to analyze the customer churn data. The results show that the four models perform better in terms of recall rate, precision rate, F1 score and other indicators, and that the RF-Adaboost dual-ensemble model performs best. The recall rates of the BPNN, RF, Adaboost and RF-Adaboost dual-ensemble models on positive samples are respectively 79%, 90%, 89%, and 93%; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. The RF-Adaboost dual-ensemble model has the best performance, with the three indicators 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
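
    A minimal scikit-learn sketch of the RF-Adaboost dual-ensemble idea described above (AdaBoost with a Random Forest as the base learner); it is not the authors' code, and the file and label names are hypothetical:

    import pandas as pd
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("telecom_churn.csv")               # hypothetical file
    X, y = df.drop(columns=["churn"]), df["churn"]      # hypothetical label column
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = AdaBoostClassifier(
        # RF as the base learner (scikit-learn >= 1.2; older versions use base_estimator=)
        estimator=RandomForestClassifier(n_estimators=100),
        n_estimators=50,
    )
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))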

  15. Demo dataset for: SPACEc, a streamlined, interactive Python workflow for...

    • zenodo.org
    • datadryad.org
    bin, zip
    Updated Jul 8, 2024
    + more versions
    Cite
    Yuqi Tan; Yuqi Tan; Tim Kempchen; Tim Kempchen (2024). Demo dataset for: SPACEc, a streamlined, interactive Python workflow for multiplexed image processing and analysis [Dataset]. http://doi.org/10.5061/dryad.brv15dvj1
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yuqi Tan; Yuqi Tan; Tim Kempchen; Tim Kempchen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Measurement technique
    Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores, each with a diameter of 1 mm. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor-TMA block was sectioned at 3 µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described.

    CODEX multiplexed imaging and processing: To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained. Image stitching, drift compensation, deconvolution, and cycle concatenation are performed within the Akoya PhenoCycler software. The raw imaging data output (tiff, 377.442 nm per pixel for 20x CODEX) is first examined with QuPath software (https://qupath.github.io/) for inspection of staining quality. Any markers that produce unexpected patterns or low signal-to-noise ratios should be excluded from the ensuing analysis. The qptiff files must be converted into tiff files for input into SPACEc. Data preprocessing includes image stitching, drift compensation, deconvolution, and cycle concatenation performed using the Akoya PhenoCycler software. The raw imaging data (qptiff, 377.442 nm/pixel for 20x CODEX) files from the Akoya PhenoCycler technology were first examined with QuPath software (https://qupath.github.io/) to inspect staining qualities. Markers with untenable patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were converted into tiff files for tissue detection (watershed algorithm) and cell segmentation.
    Description

    Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution, that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scaled, spatial analysis, operated through a user-friendly and interactive interface.

    The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenocyclerFusion platform. The dataset can be used to test the workflow and establish it on a user's system or to familiarize oneself with the pipeline.

  16. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Maximilian Schall; Tamara Czinczoll; Tamara Czinczoll; Gerard de Melo; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schall; Maximilian Schall; Tamara Czinczoll; Tamara Czinczoll; Gerard de Melo; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible and privacy- and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We provide six programming languages: Java, Python, Go, JavaScript, PHP and Ruby. The commit messages in natural language are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were generated by using extensive quality-focused filtering techniques (e.g. excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to provide a number of programming languages to accommodate any changes in the task due to the degree of hardware-relatedness of a language. The dataset is provided as a large CSV file containing all samples. We provide the following fields: Diff, Commit Message, Hash, Project, Split.
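
    A minimal pandas sketch of working with the fields listed above; the CSV file name and the exact column spellings are assumptions:

    import pandas as pd

    df = pd.read_csv("commitbench.csv")          # hypothetical file name
    train = df[df["Split"] == "train"]           # fields as listed; spelling assumed
    pairs = train[["Diff", "Commit Message"]]    # (diff, message) pairs for training a model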

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

    Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. For the number of samples for the different programming languages, see the table below:

    Language      Number of Samples
    Java          153,119
    Ruby          233,710
    Go            137,998
    JavaScript    373,598
    Python        472,469
    PHP           294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

    While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down and someone with a software project in the dataset deletes their repository, there can be instances that are non-reproducible.

    LIMITATIONS

    While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples. Some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more) as well as our focus on English commit messages. There might be some people that only write commit messages in their respective languages, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax better aligns with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

    This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman and can be found at https://techpolicylab.uw.edu/data-statements/ and was updated from the community Version 1 Markdown template by Leon Dercyznski.

  17. Top Rated Movies

    • kaggle.com
    Updated Sep 10, 2024
    Cite
    Marium Masroor (2024). Top Rated Movies [Dataset]. https://www.kaggle.com/datasets/mariumfaheem666/top-rated-movies/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marium Masroor
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Beginner Friendly:

    It is a beginner-friendly dataset that is easy to understand and a good starting point for anyone's journey. Students of machine learning and data analytics can use it to learn the basic Python libraries.

    Preprocessing:

    To keep it simple, only four basic columns are included, giving some quick information about top-rated movies from all over the world. This data set can help build concepts of preprocessing and handling data before applying any mathematical models.

    Visualizations also give a comprehensive description of popular movies.

    Let's get started. Happy coding!

  18. Python Code and Dataset for An Empirical Study of ChatGPT-4o Use in...

    • zenodo.org
    bin
    Updated Apr 4, 2025
    Cite
    Lauren Genith Isaza Dominguez; Lauren Genith Isaza Dominguez (2025). Python Code and Dataset for An Empirical Study of ChatGPT-4o Use in Engineering Education: Prompting and Performance [Dataset]. http://doi.org/10.5281/zenodo.15148300
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lauren Genith Isaza Dominguez; Lauren Genith Isaza Dominguez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Repository Overview

    This repository contains the complete Python codebase for the study:

    An Empirical Study of ChatGPT-4o Use in Engineering Education: Prompting and Performance

    The code includes data preprocessing, metric calculations, and machine learning models used in the study.

    Dataset

    The repository also includes the original dataset used in the paper:

    An Empirical Research Study of ChatGPT-4o in Engineering Education.xlsx

    This dataset contains anonymized logs of AI interactions, written assignments, grades, and computed metrics used for analysis.
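
    A minimal sketch, assuming pandas and openpyxl are available, of loading the spreadsheet named above; its sheet names and column layout are not documented in this listing:

    import pandas as pd

    xlsx = "An Empirical Research Study of ChatGPT-4o in Engineering Education.xlsx"
    sheets = pd.read_excel(xlsx, sheet_name=None)   # dict of {sheet name: DataFrame}
    for name, frame in sheets.items():
        print(name, frame.shape)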

  19. Table_2_XCast: A python climate forecasting toolkit.docx

    • frontiersin.figshare.com
    docx
    Updated Jun 4, 2023
    + more versions
    Cite
    Kyle Joseph Chen Hall; Nachiketa Acharya (2023). Table_2_XCast: A python climate forecasting toolkit.docx [Dataset]. http://doi.org/10.3389/fclim.2022.953262.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Kyle Joseph Chen Hall; Nachiketa Acharya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Climate forecasts, both experimental and operational, are often made by calibrating Global Climate Model (GCM) outputs with observed climate variables using statistical and machine learning models. Often, machine learning techniques are applied to gridded data independently at each gridpoint. However, the implementation of these gridpoint-wise operations is a significant barrier to entry to climate data science. Unfortunately, there is a significant disconnect between the Python data science ecosystem and the gridded earth data ecosystem. Traditional Python data science tools are not designed to be used with gridded datasets, like those commonly used in climate forecasting. Heavy data preprocessing is needed: gridded data must be aggregated, reshaped, or reduced in dimensionality in order to fit the strict formatting requirements of Python's data science tools. Efficiently implementing this gridpoint-wise workflow is a time-consuming logistical burden which presents a high barrier to entry to earth data science. A set of high-performance, easy-to-use Python climate forecasting tools is needed to bridge the gap between Python's data science ecosystem and its gridded earth data ecosystem. XCast, an Xarray-based climate forecasting Python library developed by the authors, bridges this gap. XCast wraps underlying two-dimensional data science methods, like those of Scikit-Learn, with data structures that allow them to be applied to each gridpoint independently. XCast uses high-performance computing libraries to efficiently parallelize the gridpoint-wise application of data science utilities and make Python's traditional data science toolkits compatible with multidimensional gridded data. XCast also implements a diverse set of climate forecasting tools including traditional statistical methods, state-of-the-art machine learning approaches, preprocessing functionality (regridding, rescaling, smoothing), and postprocessing modules (cross validation, forecast verification, visualization). These tools are useful for producing and analyzing both experimental and operational climate forecasts. In this study, we describe the development of XCast, and present in-depth technical details on how XCast brings highly parallelized gridpoint-wise versions of traditional Python data science tools into Python's gridded earth data ecosystem. We also demonstrate a case study where XCast was used to generate experimental real-time deterministic and probabilistic forecasts for South Asian Summer Monsoon Rainfall in 2022 using different machine learning-based multi-model ensembles.
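
    XCast's own API is not shown in this listing, so the following is only a plain xarray/scikit-learn illustration of the gridpoint-wise pattern the description says XCast automates (an independent model fitted at every grid cell); the data and dimension names are synthetic:

    import numpy as np
    import xarray as xr
    from sklearn.linear_model import LinearRegression

    # Synthetic predictor/target fields with dimensions (time, lat, lon)
    time, lat, lon = 40, 10, 12
    X = xr.DataArray(np.random.rand(time, lat, lon), dims=("time", "lat", "lon"))
    y = xr.DataArray(np.random.rand(time, lat, lon), dims=("time", "lat", "lon"))

    # Fit one regression per gridpoint and store its prediction for the last time step
    pred = np.empty((lat, lon))
    for i in range(lat):
        for j in range(lon):
            model = LinearRegression().fit(X[:, i, j].values.reshape(-1, 1), y[:, i, j].values)
            pred[i, j] = model.predict(X[-1, i, j].values.reshape(1, 1))[0]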

  20. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
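
    A minimal scikit-learn sketch of the workflow described above (z-score standardization, a stratified 80/20 split, and the four lightweight classifiers); the CSV file name and column labels are assumptions, not taken from the archived script:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, f1_score

    df = pd.read_csv("wdbc.csv")                               # hypothetical file name
    y = df["diagnosis"].map({"M": 1, "B": 0})                  # 1 = malignant, 0 = benign
    X = StandardScaler().fit_transform(df.drop(columns=["id", "diagnosis"]))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

    models = {
        "Decision Tree": DecisionTreeClassifier(),
        "Naive Bayes": GaussianNB(),
        "Perceptron": Perceptron(),
        "KNN": KNeighborsClassifier(),
    }
    for name, clf in models.items():
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        print(name, accuracy_score(y_te, pred), f1_score(y_te, pred))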
