38 datasets found
  1. Mean AUC of 10x10-fold CV for different models and imputation methods.

    • plos.figshare.com
    xls
    Updated Jun 7, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Mean AUC of 10x10-fold CV for different models and imputation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t004
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mean AUC of 10x10-fold CV for different models and imputation methods.

  2. Model performance results based on random forest, gradient boosting,...

    • figshare.com
    xls
    Updated Mar 28, 2024
    Cite
    Junying Wang; David D. Wu; Christine DeLorenzo; Jie Yang (2024). Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking for EMBARC data as training set and APAT data as testing set after multiple imputation for 10 times. [Dataset]. http://doi.org/10.1371/journal.pone.0299625.t006
    Available download formats: xls
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Junying Wang; David D. Wu; Christine DeLorenzo; Jie Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking for EMBARC data as training set and APAT data as testing set after multiple imputation for 10 times.

  3. Life course multivariable linear regression between allostatic load and...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 17, 2021
    Cite
    Castagné, Raphaële; Delpierre, Cyrille; Kelly-Irving, Michelle; Lepage, Benoit; Joannès, Camille (2021). Life course multivariable linear regression between allostatic load and parental interest using data obtained from multiple imputation for women (n = 4 056). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000884555
    Dataset updated
    Jun 17, 2021
    Authors
    Castagné, Raphaële; Delpierre, Cyrille; Kelly-Irving, Michelle; Lepage, Benoit; Joannès, Camille
    Description

    Life course multivariable linear regression between allostatic load and parental interest using data obtained from multiple imputation for women (n = 4 056).

  4. Processed Datasets - Imputation in Well Log Data: A Benchmark

    • zenodo.org
    application/gzip
    Updated May 22, 2024
    Cite
    Pedro H. T. Gama; Jackson Faria; Jessica Sena; Francisco Neves; Vinícius R. Riffel; Lucas Perez; André Korenchendler; Matheus C. A. Sobreira; Alexei M. C. Machado (2024). Processed Datasets - Imputation in Well Log Data: A Benchmark [Dataset]. http://doi.org/10.5281/zenodo.10987946
    Available download formats: application/gzip
    Dataset updated
    May 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pedro H. T. Gama; Jackson Faria; Jessica Sena; Francisco Neves; Vinícius R. Riffel; Lucas Perez; André Korenchendler; Matheus C. A. Sobreira; Alexei M. C. Machado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 17, 2024
    Description

    Imputation of well log data is a common task in the field. However, a quick review of the literature reveals a lack of standardization in how methods are evaluated for this problem. The goal of the benchmark is to introduce a standard evaluation protocol for any imputation method for well log data.

    In the proposed benchmark, three public datasets are used:

    • Geolink: The Geolink Dataset is a public dataset of wells from the Norwegian offshore. The data is provided by the company of the same name, GEOLINK, and follows the NOLD 2.0 license.
      This dataset contains a total of 223 wells. It also has lithology labels for the wells, with a total of 36 lithology classes. [download original]
    • Taranaki Basin: The Taranaki Basin Dataset is a curated set of wells and a convenient option for experimentation, especially due to its ease of accessibility and use.
      This collection, under the CDLA-Sharing-1.0 license, contains well logs extracted from the New Zealand Petroleum & Minerals Online Exploration Database and Petlab.
      There are a total of 407 wells, of which 289 are onshore and 118 are offshore exploration and production wells. [download original]
    • Teapot Dome: The Teapot Dome dataset is provided by the Rocky Mountain Oilfield Testing Center (RMOTC) and the US Department of Energy.
      It contains different types of data related to the Teapot Dome oil field, such as 2D and 3D seismic data, well logs, and GIS data. The data is licensed under the Creative Commons 4.0 license.
      In total, the dataset has 1,179 wells with available logs. The number of available logs varies across wells. There are only 91 wells with the gamma ray, bulk density, and neutron porosity logs, while only three wells have the complete basic suite. [direct download]

    Here you can download all three datasets already preprocessed to be used with our implementation, found here.

    File Description:

    There are six files for each fold partition for each dataset.

    • datasetname_fold_k_well_log_metadata_train.json: JSON file with general information about the slices of the training partition of fold k. Contains the total number of slices and the number of slices per well.
    • datasetname_fold_k_well_log_metadata_val.json: JSON file with general information about the slices of the validation partition of fold k. Contains the total number of slices and the number of slices per well.
    • datasetname_fold_k_well_log_slices_train.npy: .npy (numpy) file, ready to be loaded, with the already-processed training slices of fold k. When loaded it should have shape (total_slices, 256, number_of_logs).
    • datasetname_fold_k_well_log_slices_val.npy: .npy (numpy) file, ready to be loaded, with the already-processed validation slices of fold k.
    • datasetname_fold_k_well_log_slices_meta_train.json: JSON file with the slice info for all slices in the training partition of fold k. For each slice, 7 data points are provided; the last four are discarded (they would contain other information that was not used). The first three are, in order: the origin well name, the starting position in that well, and the end position of the slice in that well.
    • datasetname_fold_k_well_log_slices_meta_val.json: JSON file with the slice info for all slices in the validation partition of fold k (a minimal loading sketch follows below).
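
    As a rough, hedged illustration of how the files described above might be loaded, the sketch below uses a placeholder dataset name ("geolink") and fold index; the list-of-lists layout assumed for the per-slice metadata is an inference from the description, not a documented guarantee.

```python
# Minimal loading sketch; "geolink" and fold 0 are hypothetical placeholders.
import json
import numpy as np

dataset, k = "geolink", 0

# Training slices: expected shape (total_slices, 256, number_of_logs)
slices = np.load(f"{dataset}_fold_{k}_well_log_slices_train.npy")

# General fold metadata: total number of slices and slices per well
with open(f"{dataset}_fold_{k}_well_log_metadata_train.json") as f:
    metadata = json.load(f)

# Per-slice info: the first three entries are well name, start position, end position
with open(f"{dataset}_fold_{k}_well_log_slices_meta_train.json") as f:
    slices_meta = json.load(f)

print(slices.shape)
well_name, start, end = slices_meta[0][:3]  # assumes one record per slice (list-of-lists layout)
```
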
  5. Monthly imputation of delays

    • data.europa.eu
    • gimi9.com
    • +2more
    csv, excel xlsx, json +5
    Updated Aug 6, 2024
    + more versions
    Cite
    Infrabel (2024). Monthly imputation of delays [Dataset]. https://data.europa.eu/data/datasets/https-opendata-infrabel-be-explore-dataset-toewijzingvertraging-/embed
    Available download formats: json-ld, csv, rdf xml, parquet, excel xlsx, json, n3, rdf turtle
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Infrabel
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The historical method used to calculate the statistics presented in this dataset takes into account all the minutes of delay caused by 'major incidents' (internally known as 'relazen') on the rail network as reported to the Railway Accident and Incident Investigation Body (OEAIF/OOIS) and the Railway Safety and Interoperability Service (NSA Rail Belgium) under the Royal Decree of 16 January 2007 laying down certain rules relating to investigations into railway accidents and incidents.

    The criteria defining 'major incidents' (internally known as 'relazen') are as follows:

    • 1 passenger train delayed by an incident for 20 minutes or more
    • Several passenger trains delayed by an incident for at least 40 minutes
    • Incidents leading to the cancellation (partial or total) of trains
    • Incidents with an impact on operational safety

    There is no unequivocal relationship between the minutes of delay in 'major incidents' and the punctuality rate because:

    • The minutes included in 'major incidents' do not necessarily have an actual impact on punctuality (a train can make up its delay as it goes along).
    • Some trains arrive at their terminus more than 6 minutes late (and therefore have an actual impact on punctuality), but are not included in the 'major incidents'.

    In order to provide an exhaustive overview of the causes and responsibilities for delays, a new dataset has been made available: Monthly causes of loss of punctuality.

    The data presented in this new dataset is as follows: for each train delayed by 6 minutes or more on arrival at a tracking point*, an analysis is made of the cause of all the minutes of delay along the route, and a proportional score is awarded for each responsibility identified.

    More info in the new dataset's description

  6. Between-day contrasts from imputed data.

    • plos.figshare.com
    xlsx
    Updated Mar 28, 2024
    + more versions
    Cite
    Andreas K. Winther; Ivan Baptista; Sigurd Pedersen; João Brito; Morten B. Randers; Dag Johansen; Svein Arne Pettersen (2024). Between-day contrasts from imputed data. [Dataset]. http://doi.org/10.1371/journal.pone.0299851.s002
    Available download formats: xlsx
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Andreas K. Winther; Ivan Baptista; Sigurd Pedersen; João Brito; Morten B. Randers; Dag Johansen; Svein Arne Pettersen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This observational study aimed to analyze external training load in highly trained female football players, comparing starters and non-starters across various cycle lengths and training days. Method: External training load [duration, total distance [TD], high-speed running distance [HSRD], sprint distance [SpD], and acceleration- and deceleration distance [AccDecdist] from 100 female football players (22.3 ± 3.7 years of age) in the Norwegian premier division were collected over two seasons using STATSports APEX. This resulted in a final dataset totaling 10498 observations after multiple imputation of missing data. Microcycle length was categorized based on the number of days between matches (2 to 7 days apart), while training days were categorized relative to match day (MD, MD+1, MD+2, MD-5, MD-4, MD-3, MD-2, MD-1). Linear mixed modeling was used to assess differences between days, and starters vs. non-starters. Results: In longer cycle lengths (5–7 days between matches), the middle of the week (usually MD-4 or MD-3) consistently exhibited the highest external training load (~21–79% of MD TD, MD HSRD, MD SpD, and MD AccDecdist); though, with the exception of duration (~108–120% of MD duration), it remained lower than MD. External training load was lowest on MD+2 and MD-1 (~1–37% of MD TD, MD HSRD, MD SpD, MD AccDecdist, and ~73–88% of MD peak speed). Non-starters displayed higher loads (~137–400% of starter TD, HSRD, SpD, AccDecdist) on MD+2 in cycles with 3 to 7 days between matches, with non-significant differences (~76–116%) on other training days. Conclusion: Loading patterns resemble a pyramid or skewed pyramid during longer cycle lengths (5–7 days), with higher training loads towards the middle compared to the start and the end of the cycle. Non-starters displayed slightly higher loads on MD+2, with no significant load differentiation from MD-5 onwards.

  7. Multiple year training vs single year training (AUC).

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Multiple year training vs single year training (AUC). [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t009
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple year training vs single year training (AUC).

  8. Quarterly Labour Force Survey Household Dataset, October - December, 2022

    • beta.ukdataservice.ac.uk
    Updated 2023
    + more versions
    Cite
    Office For National Statistics (2023). Quarterly Labour Force Survey Household Dataset, October - December, 2022 [Dataset]. http://doi.org/10.5255/ukda-sn-9064-2
    Dataset updated
    2023
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    datacite
    Authors
    Office For National Statistics
    Description
    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Household datasets
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

    Change to coding of missing values for household series
    From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS
    LFS User Guidance page before commencing analysis.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    End User Licence and Secure Access QLFS Household datasets
    Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

    Changes to variables in QLFS Household EUL datasets
    In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
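
    For users working with the data programmatically, a minimal sketch of that filter is shown below; the 'ioutcome' variable name comes from the guidance above, but the file format and exact variable casing in a given release are assumptions.

```python
# Minimal sketch, assuming a CSV extract of one household quarter; the file name is hypothetical.
import pandas as pd

lfs = pd.read_csv("lfs_household_2015q1.csv")
# Drop ioutcome=3 cases so personal-characteristic variables are treated consistently across periods
lfs = lfs[lfs["ioutcome"] != 3]
```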

    Occupation data for 2021 and 2022 data files

    The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

    Latest edition information

    For the second edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

  9. Data from: A comparison of genomic selection models across time in interior...

    • borealisdata.ca
    Updated May 19, 2021
    + more versions
    Cite
    Blaise Ratcliffe; Omnia Gamal El-Dien; Jaroslav Klápště; Ilga Porth; Charles Chen; Barry Jaquish; Yousry A. El-Kassaby (2021). Data from: A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods [Dataset]. http://doi.org/10.5683/SP2/I9BJI6
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 19, 2021
    Dataset provided by
    Borealis
    Authors
    Blaise Ratcliffe; Omnia Gamal El-Dien; Jaroslav Klápště; Ilga Porth; Charles Chen; Barry Jaquish; Yousry A. El-Kassaby
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    British Columbia
    Description

    Abstract: Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.

    Usage notes:
    • phenotype: phenotype and experimental design (phen mask dryad.txt)
    • SVD genotype imputation: marker matrix for the SVD imputation method (SVDimp dryad.txt)
    • Mean genotype imputation: marker matrix for the mean imputation method (MeanImp dryad.txt)
    • KNN genotype imputation: marker matrix for the KNN imputation method (KNNimp dryad.txt)

  10. Employment Of India CLeaned and Messy Data

    • kaggle.com
    Updated Apr 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SONIA SHINDE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:
    1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including:
    - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a minimal pandas sketch of these steps follows the list):
    - Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
    - Outliers: Detected and handled based on domain logic and distribution analysis.
    - Categorization: Converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
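
    As a rough, non-authoritative illustration of these steps in pandas, the sketch below assumes a CSV file; the file name and every column name except 'monthly_salary_(inr)' are assumptions, not part of the dataset's documentation.

```python
# Sketch of the cleaning steps described above; names other than 'monthly_salary_(inr)' are assumed.
import pandas as pd

df = pd.read_csv("employment_india_messy.csv")  # hypothetical file name

# Duplicate records: drop exact row duplicates
df = df.drop_duplicates()

# Inconsistent formatting: unify column naming
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

# Incorrect data types: salary from string/object to float
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing values: impute salary with the median, drop rows missing a critical field (assumed column)
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())
df = df.dropna(subset=["Employment Status"])

# Categorization: bin a numeric age column (assumed) into age groups (labels are illustrative)
df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 45, 120], labels=["Youth", "Adult", "Senior"])
```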

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand:
    - The impact of messy data on visualization and insights
    - How transformation steps can dramatically improve data interpretation
    - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for:
    - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  11. Ratio of training time of four models relative to RF.

    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Ratio of training time of four models relative to RF. [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t005
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ratio of training time of four models relative to RF.

  12. Synthetic Stroke Prediction Dataset

    • data.mendeley.com
    • kaggle.com
    Updated May 2, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. http://doi.org/10.17632/s2nh6fm925.1
    Dataset updated
    May 2, 2025
    Authors
    Mohammed Borhan Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
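
    A minimal sketch of how such a dataset could feed a preprocessing-plus-classification pipeline, assuming a CSV file and using only the 'stroke' target named above; all other handling (imputation strategies, model choice) is an illustrative assumption, not part of the dataset's documentation.

```python
# Sketch only: impute and encode mixed-type features, then fit a baseline classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("synthetic_stroke.csv")  # hypothetical file name
X, y = df.drop(columns=["stroke"]), df["stroke"]

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
print(model.fit(X_train, y_train).score(X_test, y_test))
```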

  13. Number of Landsat 7 ETM+ surface reflectance products acquired per year...

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Number of Landsat 7 ETM+ surface reflectance products acquired per year (January—December) per tile (cf. Fig 2 for location of tiles). [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t002
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of Landsat 7 ETM+ surface reflectance products acquired per year (January—December) per tile (cf. Fig 2 for location of tiles).

  14. Cleaned Driver License Dataset

    • kaggle.com
    Updated Jul 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Divyanshu_CODER (2025). Cleaned Driver License Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12605959
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Divyanshu_CODER
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🚗 Cleaned Drivers License Dataset

    "This dataset is designed for those who believe in building smart, ethical AI models that can assist in real-world decision-making like licensing, risk assessment, and more."

    📂 Dataset Overview

    This dataset is a cleaned and preprocessed version of a drivers license dataset containing details like:

    Age Group

    Gender

    Reaction Time

    Driving Skill Level

    Training Received

    License Qualification Status

    And more, for a total of 20 columns, so you can train a solid ML model with this dataset.

    The raw dataset contained missing values, categorical anomalies, and inconsistencies which have been fixed to make this dataset ML-ready.

    💡 Key Highlights

    ✅ Cleaned missing values with intelligent imputation

    🧠 Encoded categorical columns with appropriate techniques (OneHot, Ordinal)

    🔍 Removed outliers using statistical logic

    🔧 Feature engineered to preserve semantic meaning

    💾 Ready-to-use for classification tasks (e.g., Predicting who qualifies for a license)

    📊 Columns Description

    - Gender: Gender of the individual (Male, Female)
    - Age Group: Age segment (Teen, Adult, Senior)
    - Race: Ethnicity of the driver
    - Reactions: Reaction speed categorized (Fast, Average, Slow)
    - Training: Training received (Basic, Advanced)
    - Driving Skill: Skill level (Expert, Beginner, etc.)
    - Qualified: Whether the person qualified for a license (Yes, No)
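
    A small sketch of the encoding choices mentioned in the highlights (ordinal encoding for ordered categories, one-hot for unordered ones); the file name and exact category spellings are assumptions rather than documented values.

```python
# Sketch only: ordinal encoding preserves the ordering of ordered categories, one-hot does not.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("drivers_license_cleaned.csv")  # hypothetical file name

# Ordered columns: encode with an explicit category order (spellings assumed)
ordinal = OrdinalEncoder(categories=[["Slow", "Average", "Fast"], ["Basic", "Advanced"]])
df[["Reactions", "Training"]] = ordinal.fit_transform(df[["Reactions", "Training"]])

# Unordered columns: one-hot encode
df = pd.get_dummies(df, columns=["Gender", "Race"])
```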

    🤖 Perfect For

    📚 Machine Learning (Classification)

    📊 Exploratory Data Analysis (EDA)

    📉 Feature Engineering Practice

    🧪 Model Evaluation & Experimentation

    🚥 AI for Transport & Safety Projects

    🏷️ Tags

    #MLReady #LicensePrediction #ClassificationDataset #CleanedData #FeatureEngineering #DriversLicense #Transportation #AIProjects #Imputation #OrdinalEncoding

    📌 Author Notes

    This dataset is part of a data cleaning and feature engineering project. One column is intentionally left unprocessed for developers to test their own pipeline or transformation strategies 😎.

    🔗 Citation

    If you use this dataset in your projects, notebooks, or blog posts — feel free to tag me or credit the original dataset and this cleaned version.

    📚 Citation

    If you use this dataset in your work, please cite it as:

    Divyanshu_Coder, 2025. Cleaned Driver License Dataset. Kaggle.

  15. A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution...

    • dataverse.harvard.edu
    Updated Jan 19, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lianfa, Li; Jiajie, Wu (2021). A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China [Dataset]. http://doi.org/10.7910/DVN/RNSWRH
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Lianfa, Li; Jiajie, Wu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2015 - Dec 31, 2018
    Area covered
    China
    Description

    We share the complete aerosol optical depth dataset with high spatial (1x1km^2) and temporal (daily) resolution and the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth (MAIAC AOD) product (https://lpdaac.usgs.gov/products/mcd19a2v006/) with a similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large image of AOD covering the entire area of mainland China. Due to the conditions of clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage of each AOD image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD dataset product. We used the method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32) and an average test R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation using ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, which further illustrates the reliability of the method.

    This database contains four datasets:
    • Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
    • The table “CHN_AOD_INFO.xlsx”, describing the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum AOD that we predicted.
    • The table “Model_and_Accuracy_of_Meteorological_Elements.xlsx”, describing the statistics of performance metrics in the interpolation of the high-resolution meteorological dataset.
    • The table “Evaluation_Using_AERONET_AOD.xlsx”, showing the evaluation results against AERONET, including R^2, RMSE, and the monitoring information used in this study.

  16. Insurance Premium Data

    • kaggle.com
    Updated Apr 22, 2021
    Cite
    Prachi Gopalani (2021). Insurance Premium Data [Dataset]. https://www.kaggle.com/prachi13/insurance13m-persistency/code
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prachi Gopalani
    Description

    1. Problem Description:
    • Prepare a Machine Learning Model to predict the Persistency 13M Payment Behaviour at the New Business stage.

    2. Objective:
    • Using Machine Learning techniques, provide scores for each policy at the New Business stage for the likelihood of paying the 13M premium.
    • Identify the segments where the maximum number of non-payers are captured.

    3. Dataset:
    • “Training” & “Test” datasets with the raw input attributes and the 13M actual paid/not-paid flag.
    • “Out of Time” datasets would be provided with just the raw input attributes.

    4. Expected Steps:
      1. Conduct appropriate data treatments, e.g. missing value imputation, outlier treatment, etc.
      2. Conduct required feature engineering, e.g. binning, ratios, interactions, polynomials, etc.
      3. Use any machine learning algorithm, or combination of algorithms, you deem fit.
      4. Prepare your model on the Train data and evaluate its generalization capability using K-Fold Cross Validation, Leave-One-Out Cross Validation, or any other validation technique that you see appropriate (a minimal sketch follows below).
      5. Score the Test and Out of Time data and share it back along with the scored Train data for evaluation. Also share all the model code and documentation.
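
    Purely as an illustration of the cross-validation step, and under assumed file and column names that are not part of the provided materials, a K-fold evaluation might look like this:

```python
# Sketch only: 5-fold cross-validated AUC for one candidate model.
# Assumes features are already numeric/encoded; 'train.csv' and 'paid_13m' are hypothetical names.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["paid_13m"]), train["paid_13m"]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
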
  17. Within-year classification performance (AUC) for all the years.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Within-year classification performance (AUC) for all the years. [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t006
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Within-year classification performance (AUC) for all the years.

  18. Model hyperparameters, their ranges and optimal values obtained.

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 8, 2023
    + more versions
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Model hyperparameters, their ranges and optimal values obtained. [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t003
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model hyperparameters, their ranges and optimal values obtained.

  19. Ratio (%) of crop loss parcels for three groups with different areas and...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xls
    Updated Jun 7, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Ratio (%) of crop loss parcels for three groups with different areas and four years. [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t007
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ratio (%) of crop loss parcels for three groups with different areas and four years.

  20. Within-year AUC of three groups with different field parcel sizes.

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka (2023). Within-year AUC of three groups with different field parcel sizes. [Dataset]. http://doi.org/10.1371/journal.pone.0251952.t008
    Available download formats: xls
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Santosh Hiremath; Samantha Wittke; Taru Palosuo; Jere Kaivosoja; Fulu Tao; Maximilian Proll; Eetu Puttonen; Pirjo Peltonen-Sainio; Pekka Marttinen; Hiroshi Mamitsuka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Within-year AUC of three groups with different field parcel sizes.
