93 datasets found
  1. Comparison of missing values, ‘don’t know’ values and inconsistent values...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2018
    Cite
    Van Hal, Guido; Van der Heyden, Johan; Braekman, Elise; Charafeddine, Rana; Demarest, Stefaan; Gisle, Lydia; Tafforeau, Jean; Berete, Finaba; Molenberghs, Geert; Drieskens, Sabine (2018). Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000729296
    Authors
    Van Hal, Guido; Van der Heyden, Johan; Braekman, Elise; Charafeddine, Rana; Demarest, Stefaan; Gisle, Lydia; Tafforeau, Jean; Berete, Finaba; Molenberghs, Geert; Drieskens, Sabine
    Description

    Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149).

  2. Student Admission Records

    • kaggle.com
    zip
    Updated Nov 8, 2024
    Cite
    Zeeshan Ahmad (2024). Student Admission Records [Dataset]. https://www.kaggle.com/datasets/zeeshier/student-admission-records/code
    Available download formats: zip (2107 bytes)
    Authors
    Zeeshan Ahmad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains 157 rows of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.

    The dataset simulates a university admission record system, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw data, offering hands-on experience in data wrangling.

    The dataset contains the following columns:

    • Name: Student's first name (Pakistani names).
    • Age: Age of the student (some outliers and missing values).
    • Gender: Gender (Male/Female).
    • Admission Test Score: Score obtained in the admission test (includes outliers and missing values).
    • High School Percentage: Student's high school final score percentage (includes outliers and missing values).
    • City: City of residence in Pakistan.
    • Admission Status: Whether the student was accepted or rejected.
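
    The preparation steps this description mentions (removing duplicate rows, blanking unrealistic values, filling missing values) could be sketched as follows. This is a minimal illustration, not the dataset author's method: the sample rows, the valid ranges in VALID_RANGES, and the choice of median imputation are all assumptions made for the example.

```python
from statistics import median

# Hypothetical sample rows; only the column names come from the description.
rows = [
    {"Name": "Ali", "Age": 19, "Admission Test Score": 82, "High School Percentage": 88.0},
    {"Name": "Ali", "Age": 19, "Admission Test Score": 82, "High School Percentage": 88.0},
    {"Name": "Sara", "Age": None, "Admission Test Score": 150, "High School Percentage": 91.5},
    {"Name": "Omar", "Age": 21, "Admission Test Score": 74, "High School Percentage": -5.0},
]

# Assumed plausible ranges; the dataset page does not define them.
VALID_RANGES = {
    "Age": (15, 40),
    "Admission Test Score": (0, 100),
    "High School Percentage": (0, 100),
}


def drop_duplicates(records):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def blank_out_of_range(records):
    """Replace values outside the assumed ranges with None."""
    cleaned = []
    for record in records:
        record = dict(record)
        for col, (low, high) in VALID_RANGES.items():
            value = record.get(col)
            if value is not None and not (low <= value <= high):
                record[col] = None
        cleaned.append(record)
    return cleaned


def impute_median(records, col):
    """Fill missing values in one column with the median of the observed ones."""
    observed = [r[col] for r in records if r[col] is not None]
    if not observed:
        return records
    fill = median(observed)
    return [{**r, col: fill if r[col] is None else r[col]} for r in records]


cleaned = blank_out_of_range(drop_duplicates(rows))
for column in VALID_RANGES:
    cleaned = impute_median(cleaned, column)
```

    Blanking before imputing keeps the outliers from distorting the median used as the fill value.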

  3. Data from: Multiple Imputation of Missing or Faulty Values Under Linear...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 1, 2023
    Cite
    Hang J. Kim; Jerome P. Reiter; Quanli Wang; Lawrence H. Cox; Alan F. Karr (2023). Multiple Imputation of Missing or Faulty Values Under Linear Constraints [Dataset]. http://doi.org/10.6084/m9.figshare.1119364.v2
    Available download formats: pdf
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Hang J. Kim; Jerome P. Reiter; Quanli Wang; Lawrence H. Cox; Alan F. Karr
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.

  4. Data from: Integrated In Silico Models for the Prediction of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Oct 22, 2020
    Cite
    Toropova, Alla; Gadaleta, Domenico; Marzo, Marco; Benfenati, Emilio; Dorne, Jean Lou C. M.; Escher, Sylvia E.; Lavado, Giovanna J.; Toropov, Andrey (2020). Integrated In Silico Models for the Prediction of No-Observed-(Adverse)-Effect Levels and Lowest-Observed-(Adverse)-Effect Levels in Rats for Sub-chronic Repeated-Dose Toxicity [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000484204
    Authors
    Toropova, Alla; Gadaleta, Domenico; Marzo, Marco; Benfenati, Emilio; Dorne, Jean Lou C. M.; Escher, Sylvia E.; Lavado, Giovanna J.; Toropov, Andrey
    Description

    Repeated-dose toxicity (RDT) is a critical endpoint for hazard characterization of chemicals and is assessed to derive safe levels of exposure for human health. Here we present the first attempt to model simultaneously no-observed-(adverse)-effect level (NO(A)EL) and lowest-observed-(adverse)-effect level (LO(A)EL). Classification and regression models were derived based on rat sub-chronic repeated dose toxicity data for 327 compounds from the Fraunhofer RepDose database. Multi-category classification models were built for both NO(A)EL and LO(A)EL through a consensus of statistics- and fragment-based algorithms, while regression models were based on quantitative relationships between the endpoints and SMILES-based attributes. NO(A)EL and LO(A)EL models were integrated, and predictions were compared to exclude inconsistent values. This strategy improved the performance of single models, leading to R² greater than 0.70, root-mean-square error (RMSE) lower than 0.60 (for regression models), and accuracy of 0.61–0.73 (for classification models) on the validation set, based on the endpoint and the threshold applied for selecting predictions. This study confirms the effectiveness of the modeling strategy presented here for assessing RDT of chemicals using in silico models.

  5. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Available download formats: zip (226740 bytes)
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
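
    Because Total Spent is defined as Quantity * Price Per Unit, a single missing value among the three can in principle be recovered from the other two. The sketch below illustrates that idea only; the function name fill_transaction and the use of Python None for a missing cell are assumptions, not part of the dataset.

```python
def fill_transaction(price, quantity, total):
    """Recover one missing value (None) from Total Spent = Quantity * Price Per Unit.

    Returns the triple unchanged when more than one value is missing
    or when a division by zero would be needed.
    """
    if total is None and price is not None and quantity is not None:
        total = price * quantity
    elif price is None and total is not None and quantity:
        price = total / quantity
    elif quantity is None and total is not None and price:
        # Quantities are whole units, so round the quotient.
        quantity = round(total / price)
    return price, quantity, total
```

    For example, fill_transaction(4.0, 2, None) fills in a total of 8.0, while a row with two missing values is returned as-is and needs another source of information.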

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0
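
    In the table above, the price rises by 1.5 with each item number, starting at 5.0 for item 1. If that schedule holds, a missing Price Per Unit could be looked up from the item code alone. The helper below is a hypothetical sketch: price_from_item_code is not part of the dataset, and applying the same schedule to categories other than the one listed is an assumption.

```python
import re


def price_from_item_code(code):
    """Return the static price implied by an item code such as 'Item_10_EHE'.

    Assumes price = 5.0 + 1.5 * (item number - 1), the pattern visible in the
    Electric Household Essentials table. Returns None for unparseable codes.
    """
    match = re.fullmatch(r"Item_(\d+)_([A-Z]+)", code)
    if match is None:
        return None
    number = int(match.group(1))
    if not 1 <= number <= 25:  # each category lists 25 items
        return None
    return 5.0 + 1.5 * (number - 1)
```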

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  6. Antimalarial drugs: inconsistent studies of pregnancy-associated...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1 more
    Updated Nov 2, 2016
    Cite
    Ito, Shinya; Carls, Alexandra; Koren, Gideon; Leibson, Tom; Adams-Webber, Thomasin; Pariente, Gali (2016). Antimalarial drugs: inconsistent studies of pregnancy-associated pharmacokinetic changes (percent calculated as pregnant/nonpregnant values). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001563382
    Authors
    Ito, Shinya; Carls, Alexandra; Koren, Gideon; Leibson, Tom; Adams-Webber, Thomasin; Pariente, Gali
    Description

    Antimalarial drugs: inconsistent studies of pregnancy-associated pharmacokinetic changes (percent calculated as pregnant/nonpregnant values).

  7. Data from: Hcropland30: A hybrid 30-m global cropland map by leveraging...

    • zenodo.org
    • data.niaid.nih.gov
    bin, jpeg, zip
    Updated Aug 3, 2024
    Cite
    Qiong Hu; Zhiwen Cai; Liangzhi You; Steffen Fritz; Xinyu Zhang; He Yin; Haodong Wei; Jingya Yang; Zexuan Li; Hao Wu; Baodong Xu; Wenbin Wu (2024). Hcropland30: A hybrid 30-m global cropland map by leveraging global land cover products and Landsat data based on a deep learning model [Dataset]. http://doi.org/10.5281/zenodo.13169748
    Available download formats: zip, bin, jpeg
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Qiong Hu; Zhiwen Cai; Liangzhi You; Steffen Fritz; Xinyu Zhang; He Yin; Haodong Wei; Jingya Yang; Zexuan Li; Hao Wu; Baodong Xu; Wenbin Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hcropland30: A 30-m global cropland map by leveraging global land cover products and Landsat data based on a deep learning model

    ***Please note this dataset is undergoing peer review***

    Version: 1.0

    Authors: Qiong Hu a, 1, Zhiwen Cai b, 1, Liangzhi You c, d, Steffen Fritz e, Xinyu Zhang c, He Yin f, Haodong Wei c, Jingya Yang g, Zexuan Li a, Qiangyi Yu g, Hao Wu a, Baodong Xu b, *, Wenbin Wu g, *

    a Key Laboratory for Geographical Process Analysis & Simulation of Hubei Province/College of Urban and Environmental Sciences, Central China Normal University, Wuhan 430079, China

    b College of Resources and Environment, Huazhong Agricultural University, Wuhan 430070, China

    c Macro Agriculture Research Institute, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China

    d International Food Policy Research Institute, 1201 I Street, NW, Washington, DC 20005, USA

    e Novel Data Ecosystems for sustainability Research Group, International Institute for Applied Systems Analysis (IIASA), Schlossplatz 1, Laxenburg A-2361, Austria

    f Department of Geography, Kent State University, 325 S. Lincoln Street, Kent, OH 44242, USA

    g State Key Laboratory of Efficient Utilization of Arid and Semi-arid Arable Land in Northern China, the Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100081, China

    Introduction

    We are pleased to introduce a comprehensive global cropland mapping dataset for 2020 (named Hcropland30), curated to support a wide range of research and analysis applications related to agricultural land and environmental assessment. This dataset encompasses the entire globe, divided into 16,284 grids, each measuring 1°×1°. Hcropland30 was produced by leveraging global land cover products and Landsat data based on a deep learning model. Initially, we established a hierarchical sampling strategy that used the simulated annealing method to identify the representative 1°×1° grids globally and the sparse point-level samples within these selected 1°×1° grids. Subsequently, we employed an ensemble learning technique to expand these sparse point-level samples into dense pixel-wise labels, creating the area-level 1°×1° cropland labels. These area-level labels were then used to train a U-Net model for predicting global cropland distribution, followed by a comprehensive evaluation of the mapping accuracy.

    Dataset

    1. Hcropland30: A hybrid 30-m global cropland map in 2020

    • Data format: GeoTiff
    • Spatial resolution: 30 m
    • Projection: EPSG:4326 (WGS84)
    • Values: 1 denotes cropland and 0 denotes non-cropland

    The dataset has been uploaded in 16,284 tiles. The extent of each tile can be found in the file “Grids.shp”. Each file is named according to the grid’s Id number. For example, “000015.tif” corresponds to the cropland mapping result for the 15th 1°×1° grid. This systematic naming convention ensures easy identification and retrieval of the specific grid data.
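
    The naming convention can be turned into a small lookup helper. The sketch below assumes that every tile name is the grid Id zero-padded to six digits plus ".tif", which is inferred from the single “000015.tif” example; the function names are hypothetical.

```python
def tile_filename(grid_id):
    """Build the tile file name for a grid Id, e.g. 15 -> '000015.tif'.

    Assumes six-digit zero padding, as in the one example given.
    """
    if grid_id < 0:
        raise ValueError("grid_id must be non-negative")
    return f"{grid_id:06d}.tif"


def grid_id_from_filename(name):
    """Recover the grid Id from a tile file name such as '000015.tif'."""
    stem, _, extension = name.rpartition(".")
    if extension.lower() != "tif" or not stem.isdigit():
        raise ValueError(f"unexpected tile name: {name!r}")
    return int(stem)
```

    The Id recovered this way can then be matched against the Id attribute in “Grids.shp” to find the extent of the tile.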

    2. 1°×1° Grids: This file contains all 16,284 1°×1° grids used in the dataset. The vector file includes 18 attribute fields, providing comprehensive metadata for each grid. These attributes are essential for users who need detailed information about each grid’s characteristics.

    • Data format: ESRI shapefile
    • Projection: EPSG:4326 (WGS84)
    • Attribute Fields:

    Id: The grid’s ID number.

    area: The area of the grid.

    mode: Indicates the representative sample grid.

    climate: The climate type the grid belongs to.

    dem: Average DEM value of the grid.

    ndvi_s1 to ndvi_s4: Average NDVI values for four seasons within the grid.

    esa, esri, fcs30, fromglc, glad, globeland30: Proportion of cropland pixels of different publicly available cropland products.

    inconsistent: Proportion of inconsistent pixels within the grid according to different public cropland products.

    hcropland30: Proportion of cropland pixels of our Hcropland30 dataset.

    3. Samples: The selected representative pixel-level samples, including 32,343 cropland and 67,657 non-cropland samples. The category of each sample was determined by visual interpretation of Google Earth imagery and three-year NDVI time series curves from 2019 to 2021.

    • Data format: ESRI shapefile
    • Projection: EPSG:4326 (WGS84)
    • Attribute Fields:

    type: 1 denotes cropland sample and 0 denotes non-cropland sample.

    Citation

    If you use this dataset, please cite the following paper:

    Hu, Q., Cai, Z., You, L., Fritz, S., Zhang, X., Yin, H., Wei, H., Yang, J., Li, Z., Yu, Q., Wu, H., Xu, B., Wu, W. (2024). Hcropland30: A 30-m global cropland map by leveraging global land cover products and Landsat data based on a deep learning model, Remote Sensing of Environment, submitted.

    License

    The data is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

    Disclaimer

    This dataset is provided as-is, without any warranty, express or implied. The dataset author is not responsible for any errors or omissions in the data, or for any consequences arising from the use of the data.

    Contact

    If you have any questions or feedback regarding the dataset, please contact the dataset author, Qiong Hu (huqiong@ccnu.edu.cn).

  8. Performance of ORFcor run on simulated inconsistency-containing data in...

    • figshare.com
    xls
    Updated Oct 31, 2016
    Cite
    Jonathan L. Klassen; Cameron R. Currie (2016). Performance of ORFcor run on simulated inconsistency-containing data in comparison to known values using the parameters: a = 5; b = 10; d = 0.75 or 0.90; f = 10; g = 30; l = k = 1000. [Dataset]. http://doi.org/10.1371/journal.pone.0058387.t001
    Available download formats: xls
    Dataset provided by
    PLOS ONE
    Authors
    Jonathan L. Klassen; Cameron R. Currie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of ORFcor run on simulated inconsistency-containing data in comparison to known values using the parameters: a = 5; b = 10; d = 0.75 or 0.90; f = 10; g = 30; l = k = 1000.

  9. Antibiotics: inconsistent studies of pregnancy-associated pharmacokinetic...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Nov 2, 2016
    Cite
    Koren, Gideon; Adams-Webber, Thomasin; Leibson, Tom; Carls, Alexandra; Pariente, Gali; Ito, Shinya (2016). Antibiotics: inconsistent studies of pregnancy-associated pharmacokinetic changes (percent calculated as pregnant/non-pregnant values). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001563318
    Authors
    Koren, Gideon; Adams-Webber, Thomasin; Leibson, Tom; Carls, Alexandra; Pariente, Gali; Ito, Shinya
    Description

    Antibiotics: inconsistent studies of pregnancy-associated pharmacokinetic changes (percent calculated as pregnant/non-pregnant values).

  10. EDADataset-RetailStoreSales-Dirty

    • huggingface.co
    Updated Nov 14, 2025
    Cite
    bar haim (2025). EDADataset-RetailStoreSales-Dirty [Dataset]. https://huggingface.co/datasets/Reut1/EDADataset-RetailStoreSales-Dirty
    Authors
    bar haim
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Retail Store Sales EDA Project. The dataset is publicly available on Kaggle. Background: The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. The analysis includes data cleaning, descriptive… See the full description on the dataset page: https://huggingface.co/datasets/Reut1/EDADataset-RetailStoreSales-Dirty.

  11. Data from: US federal resource allocations are inconsistent with...

    • datadryad.org
    • search.dataone.org
    • +1 more
    zip
    Updated Sep 17, 2024
    Cite
    Peter Heller; Christopher Knittel; Tim Schittekatte; Carlos Batlle (2024). US federal resource allocations are inconsistent with concentrations of energy poverty [Dataset]. http://doi.org/10.5061/dryad.9kd51c5rj
    Available download formats: zip
    Dataset provided by
    Dryad
    Authors
    Peter Heller; Christopher Knittel; Tim Schittekatte; Carlos Batlle
    Time period covered
    Mar 29, 2024
    Area covered
    United States
    Description

    US federal resource allocations are inconsistent with concentrations of energy poverty

    https://doi.org/10.5061/dryad.9kd51c5rj

    This dataset contains the R scripts and data files needed to replicate the results of this analysis. All analysis is completed in R, and an internet connection is required because the RECS input files are loaded directly from the US Energy Information Administration's website for the most up-to-date information.

    Description of the data and file structure

    “Analysis” Folder

    The folder titled "Analysis" contains all of the results presented in this paper. The "Coeffs" subfolder contains the .csv files of model coefficients for both 2015 and 2020.

    • 2015_coeffs.csv
    • 2020_coeffs.csv

    The "Figures" subfolder contains all of the maps, graphs, and performance output from the R scripts.

    • Graphs: Histograms of tract average energy burdens for 2015, 2020, and the comparison of 2015 and 2020. Subfolder "2020" also conta...
  12. Cleaned Retail Customer Dataset (SQL-based ETL)

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Rizwan Bin Akbar (2025). Cleaned Retail Customer Dataset (SQL-based ETL) [Dataset]. https://www.kaggle.com/datasets/rizwanbinakbar/cleaned-retail-customer-dataset-sql-based-etl
    Available download formats: zip (1249509 bytes)
    Authors
    Rizwan Bin Akbar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:

    1. Customer Information (s_crm_cust_info)

    This table contains information about customers, including their unique identifiers and demographic details.

    Columns:
    
      cst_id: Customer ID (Primary Key)
    
      cst_gndr: Gender
    
      cst_marital_status: Marital status
    
      cst_create_date: Customer account creation date
    
    Cleaning Steps:
    
      Removed duplicates and handled missing or null cst_id values.
    
      Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
    
      Standardized gender values and identified inconsistencies in marital status.
    
    2. Product Information (s_crm_prd_info / b_crm_prd_info)

    This table contains information about products, including product identifiers, names, costs, and lifecycle dates.

    Columns:
    
      prd_id: Product ID
    
      prd_key: Product key
    
      prd_nm: Product name
    
      prd_cost: Product cost
    
      prd_start_dt: Product start date
    
      prd_end_dt: Product end date
    
    Cleaning Steps:
    
      Checked for duplicates and null values in the prd_key column.
    
      Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
    
      Corrected product costs to remove invalid entries (e.g., negative values).
    
    3. Sales Details (s_crm_sales_details / b_crm_sales_details)

    This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.

    Columns:
    
      sls_order_dt: Sales order date
    
      sls_due_dt: Sales due date
    
      sls_sales: Total sales amount
    
      sls_quantity: Number of products sold
    
      sls_price: Product unit price
    
    Cleaning Steps:
    
      Validated sales order dates and corrected invalid entries.
    
      Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
    
      Removed null and negative values from sls_sales, sls_quantity, and sls_price.
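
    The sales integrity check described above could look roughly like the following. The dataset itself was cleaned with SQL-based ETL, so this Python version is only an illustrative sketch: the function name check_sales_row, the dict-per-row representation, and the decision to trust price and quantity over the stored sales amount are assumptions.

```python
import math


def check_sales_row(row):
    """Validate one sales row against sls_sales = sls_price * sls_quantity.

    Returns None when price or quantity is null or negative (the row cannot
    be repaired from its own values); otherwise returns the row with
    sls_sales corrected if it was null, negative, or inconsistent.
    """
    price = row.get("sls_price")
    quantity = row.get("sls_quantity")
    sales = row.get("sls_sales")
    if price is None or quantity is None or price < 0 or quantity < 0:
        return None
    expected = price * quantity
    if sales is None or sales < 0 or not math.isclose(sales, expected):
        return {**row, "sls_sales": expected}
    return row
```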
    
    4. ERP Customer Data (b_erp_cust_az12, s_erp_cust_az12)

    This table contains additional customer demographic data, including gender and birthdate.

    Columns:
    
      cid: Customer ID
    
      gen: Gender
    
      bdate: Birthdate
    
    Cleaning Steps:
    
      Checked for missing or null gender values and standardized inconsistent entries.
    
      Removed leading/trailing spaces from gen and bdate.
    
      Validated birthdates to ensure they were within a realistic range.
    
    5. Location Information (b_erp_loc_a101)

    This table contains country information related to the customers' locations.

    Columns:
    
      cntry: Country
    
    Cleaning Steps:
    
      Standardized country names (e.g., "US" and "USA" were mapped to "United States").
    
      Removed special characters (e.g., carriage returns) and trimmed whitespace.
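
    A minimal sketch of this kind of standardization is shown below. Only the "US"/"USA" to "United States" mapping comes from the description; the alias table, the function name standardize_country, and the case-insensitive matching are assumptions.

```python
# Alias table; only the two entries named in the description are included.
COUNTRY_ALIASES = {
    "us": "United States",
    "usa": "United States",
}


def standardize_country(raw):
    """Strip carriage returns, newlines, and surrounding whitespace, then map known aliases."""
    if raw is None:
        return None
    cleaned = raw.replace("\r", "").replace("\n", "").strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)
```

    Values that are not in the alias table pass through with only the whitespace cleanup applied, so unexpected spellings stay visible for review.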
    
    6. Product Category (b_erp_px_cat_g1v2)

    This table contains product category information.

    Columns:
    
      Product category data (no significant cleaning required).
    

    Key Features:

    Customer demographics, including gender and marital status
    
    Product details such as cost, start date, and end date
    
    Sales data with order dates, quantities, and sales amounts
    
    ERP-specific customer and location data
    

    Data Cleaning Process:

    This dataset underwent extensive cleaning and validation, including:

    Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
    
    Date Validations: Ensuring correct date ranges and chronological consistency.
    
    Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
    
    Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
    

    This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.

  13. Data from: The problem of inconsistency between thermal maturity indicators...

    • data.wu.ac.at
    pdf
    Updated Jun 27, 2018
    Cite
    Corp (2018). The problem of inconsistency between thermal maturity indicators used for petroleum exploration in Australian basins [Dataset]. https://data.wu.ac.at/schema/data_gov_au/ZjU0MzM4ZDYtMGFmOS00MzUwLWFjODItNGU5MjY1ZGRhODVh
    Available download formats: pdf
    Dataset provided by
    Corp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A major frustration in thermal maturation modelling for petroleum exploration in Australian sedimentary basins is the inconsistency between the values of different thermal maturity indicators. Vitrinite reflectance (VR), Rock-Eval Tmax, spore colouration index (SCI) and fluorescence alteration of multiple macerals (FAMM) for wells from three Australian basins show inconsistencies due to technical, methodological and conceptual problems inherent in each technique. When the differences between the concepts of rank and thermal maturity are considered, it can be shown that some inconsistencies are more apparent than real. It is important to consider this distinction when selecting data against which to model burial and thermal histories.

  14. Statistical Reporting Errors and Collaboration on Statistical Analyses in...

    • plos.figshare.com
    tiff
    Updated May 31, 2023
    Cite
    Coosje L. S. Veldkamp; Michèle B. Nuijten; Linda Dominguez-Alvarez; Marcel A. L. M. van Assen; Jelte M. Wicherts (2023). Statistical Reporting Errors and Collaboration on Statistical Analyses in Psychological Science [Dataset]. http://doi.org/10.1371/journal.pone.0114876
    Available download formats: tiff
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Coosje L. S. Veldkamp; Michèle B. Nuijten; Linda Dominguez-Alvarez; Marcel A. L. M. van Assen; Jelte M. Wicherts
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical analysis is error prone. A best practice for researchers using statistics would therefore be to share data among co-authors, allowing double-checking of executed tasks just as co-pilots do in aviation. To document the extent to which this ‘co-piloting’ currently occurs in psychology, we surveyed the authors of 697 articles published in six top psychology journals and asked them whether they had collaborated on four aspects of analyzing data and reporting results, and whether the described data had been shared between the authors. We acquired responses for 49.6% of the articles and found that co-piloting on statistical analysis and reporting results is quite uncommon among psychologists, while data sharing among co-authors seems reasonably but not completely standard. We then used an automated procedure to study the prevalence of statistical reporting errors in the articles in our sample and examined the relationship between reporting errors and co-piloting. Overall, 63% of the articles contained at least one p-value that was inconsistent with the reported test statistic and the accompanying degrees of freedom, and 20% of the articles contained at least one p-value that was inconsistent to such a degree that it may have affected decisions about statistical significance. Overall, the probability that a given p-value was inconsistent was over 10%. Co-piloting was not found to be associated with reporting errors.
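
    To make the idea of an "inconsistent" p-value concrete, the sketch below recomputes a two-sided p-value from a reported z statistic and compares it with the reported p-value at its reported precision. This is not the automated procedure the authors used, which also handles test statistics with degrees of freedom; the z-only case, the function names, and the rounding tolerance are simplifying assumptions.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def is_consistent(z, reported_p, decimals=2):
    """Check whether a reported p-value matches the one implied by z.

    The reported value counts as consistent when the recomputed p-value
    lies within half a unit of the last reported decimal place.
    """
    tolerance = 0.5 * 10 ** (-decimals)
    return abs(two_sided_p_from_z(z) - reported_p) <= tolerance
```

    Under these assumptions, z = 1.96 reported with p = .05 is consistent, whereas z = 1.50 reported with p = .04 is not, because the recomputed value is roughly .13.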

  15. Data from: Weak and inconsistent associations between melanic darkness and...

    • datadryad.org
    • dataone.org
    zip
    Updated Oct 23, 2018
    Cite
    Siiri-Lii Sandre; Tanel Kaart; Nathan Morehouse; Toomas Tammaru (2018). Weak and inconsistent associations between melanic darkness and fitness related traits in an insect [Dataset]. http://doi.org/10.5061/dryad.kr8vc17
    Explore at:
    zip
    Available download formats
    Dataset updated
    Oct 23, 2018
    Dataset provided by
    Dryad
    Authors
    Siiri-Lii Sandre; Tanel Kaart; Nathan Morehouse; Toomas Tammaru
    Time period covered
    Oct 22, 2018
    Area covered
    Estonia
    Description

    Ematurga data for quantitative genetic analyses

    An Excel file with three sheets

    Sheet 1: Pedigree data presenting the relatedness structure

    id Individual identification number (including also individuals without phenotype data)

    sire Sire identification number (zero, if unknown)

    dam Dam identification number (zero, if unknown)

    Sheets 2 and 3: Heather.data & Bilberry.data: individual-based values of the traits being analysed

    gen Generation number (1 - F1, 2 - F2)

    plant Plant (1 - heather, 2 - bilberry)

    sex Sex (1 - male, 2 - female)

    h_rgr & b_rgr Growth ratio in 5th instar on heather and on bilberry, respectively

    h_pupw & b_pupw Pupal weight (mg) on heather and on bilberry, respectively

    h_fifth & b_fifth Duration of the 5th instar (days) on heather and on bilberry, respectively

    h_dscore & b_dscore Melanic darkness MCA score on heather and on bilberry, respectively

    dryaddata.xlsx

  16. Sirene database of companies and their establishments (consolidated version...

    • ckan.mobidatalab.eu
    Updated Jun 20, 2023
    Cite
    INSEE (2023). Sirene database of companies and their establishments (consolidated version v3) - Île-de-France [Dataset]. https://ckan.mobidatalab.eu/dataset/sirene-base-of-companies-and-their-establishments-consolidated-version-v3-ile-de-france
    Explore at:
    https://www.iana.org/assignments/media-types/text/csv, https://www.iana.org/assignments/media-types/application/zip, https://www.iana.org/assignments/media-types/application/json
    Available download formats
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    INSEE
    License

    Licence Ouverte / Open Licence 2.0: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
    License information was derived automatically

    Area covered
    France, Île-de-France
    Description

    Find all the companies and their establishments. The Sirene® database is updated every day and includes approximately 30 million establishments, whether active or not.

    IMPORTANT

    As the Sirene database contains personal data, INSEE draws your attention to the resulting legal obligations:

    Indeed, Article A123-96 of the Commercial Code provides that:

    "Any natural person may request either directly during their creation or modification formalities, or by letter addressed to the Managing Director of the National Institute of Statistics and Economic Studies, that the information in the directory concerning it may not be used by third parties other than the bodies authorized under Article R. 123-224 or the administrations, for the purposes of prospecting, particularly commercial."

    SIRENE BY ODS

    ODS presents a database of establishments consolidated with the data of its associated legal unit.

    Enrichments

    • addition of the descriptions of the NAF codes and legal categories;
    • addition of the descriptions of legal ranges and route types;
    • addition of administrative hierarchies (reg/arr/dep/epci);
    • addition of geolocation of establishments via BAN geocoding;
    • replacement of certain abbreviations (F/M, O/N, A/C) with the corresponding wording (legalunit gender, legalunit administrative status, legalunit employer character, establishmentadministrativestate, establishmentheadquarters, employercharacterestablishment)
    • addition of a "first line of address" field with the title, surname, and first name of the legal unit's natural person
    • addition of an establishment address field (concatenation of number + type + route)

    Notes

    • "establishment start date" values were reconstructed from an older version for about 100 records with inconsistent values.

  17. Clean cafe sales dataset

    • kaggle.com
    Updated Sep 1, 2025
    Cite
    Majeedat Babalola (2025). Clean cafe sales dataset [Dataset]. https://www.kaggle.com/datasets/majeedatbabalola/clean-cafe-sales-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Majeedat Babalola
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains sales records from a café. Initially, it was messy, with missing values represented as NaN, UNKNOWN, and ERROR. The following cleaning steps were applied:

    1. Handling Missing Values: replaced missing values with appropriate statistics:
       i. Mode for categorical columns (Item, Payment Method, and Location).
       ii. Mean for numerical columns (Quantity).
       iii. Median for temporal data (Transaction Date).

    2. Price Standardization: inconsistent values in the Price per Unit column were corrected by filling them with the appropriate consistent price from the dataset.

    3. Data Type Conversion: converted all columns to their appropriate data types (e.g., datetime for transaction dates, numeric for quantities and prices, categorical for items, payment methods, and locations).
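The three cleaning steps listed in this description could be sketched roughly as follows. This is an illustrative reconstruction using only the Python standard library, not the dataset author's actual code. The column names come from the description, but the sample rows, the helper names, the ISO date format, and the choice of "most common price per item" as the consistent price are all assumptions.

```python
from datetime import date
from statistics import mean, median, mode

# Missing-value markers named in the description (None and "" added as assumptions).
MISSING = {None, "", "NaN", "UNKNOWN", "ERROR"}

# Assumed raw-string formats: plain numbers and ISO dates.
CONVERTERS = {"Quantity": float, "Price per Unit": float,
              "Transaction Date": date.fromisoformat}


def present(rows, column):
    """Values of `column` that are not missing markers."""
    return [r[column] for r in rows if r[column] not in MISSING]


def fill(rows, column, value):
    """Replace every missing marker in `column` with `value`."""
    for r in rows:
        if r[column] in MISSING:
            r[column] = value


def clean(rows):
    rows = [dict(r) for r in rows]  # copy so the input stays untouched

    # Step 3 is done first so the statistics below operate on typed values.
    for r in rows:
        for col, convert in CONVERTERS.items():
            if r[col] not in MISSING:
                r[col] = convert(r[col])

    # Step 1: mode for categorical, mean for numeric, median for dates.
    for col in ("Item", "Payment Method", "Location"):
        fill(rows, col, mode(present(rows, col)))
    fill(rows, "Quantity", mean(present(rows, "Quantity")))
    middle = median(d.toordinal() for d in present(rows, "Transaction Date"))
    fill(rows, "Transaction Date", date.fromordinal(int(middle)))  # truncates a half day

    # Step 2: use each item's most common observed price (assumed rule).
    prices = {}
    for r in rows:
        if r["Price per Unit"] not in MISSING:
            prices.setdefault(r["Item"], []).append(r["Price per Unit"])
    for r in rows:
        if r["Item"] in prices:
            r["Price per Unit"] = mode(prices[r["Item"]])
    return rows


# Hypothetical sample rows, not taken from the dataset.
rows = [
    {"Item": "Coffee", "Payment Method": "Cash", "Location": "In-store",
     "Quantity": "2", "Price per Unit": "2.0", "Transaction Date": "2023-01-01"},
    {"Item": "Coffee", "Payment Method": "UNKNOWN", "Location": "In-store",
     "Quantity": "ERROR", "Price per Unit": "2.0", "Transaction Date": "2023-01-03"},
    {"Item": "NaN", "Payment Method": "Cash", "Location": "ERROR",
     "Quantity": "4", "Price per Unit": "ERROR", "Transaction Date": "UNKNOWN"},
]
cleaned = clean(rows)
```

Converting types before imputing is a design choice of this sketch: the mean and median need numeric and date values rather than raw strings, so the order differs from the numbering in the description.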

  18. Companies active in the territory

    • datasets.ai
    23, 57, 8
    Updated Oct 5, 2021
    Cite
    Plateforme ouverte des données publiques françaises (2021). Companies active in the territory [Dataset]. https://datasets.ai/datasets/https-opendata-cc-coeurdefrance-fr-explore-dataset-base-sirene-v3-consolidee-des-etablissements-actifs-sur-le-territoire-de-coeur-d-
    Explore at:
    57, 23, 8
    Available download formats
    Dataset updated
    Oct 5, 2021
    Dataset authored and provided by
    Plateforme ouverte des données publiques françaises
    Description

    Find all the companies and their establishments. The Sirene® database is updated every day and includes about 30 million establishments, whether in operation or not.

    IMPORTANT

    Since the Sirene database contains personal data, INSEE draws your attention to the legal obligations arising therefrom:

    • The processing of these data falls within the reporting obligations of Law 78-17 of 6 January 1978, as amended, known as the CNIL Law: https://www.cnil.fr/en/loi-78-17-du-6-janvier-1978-modifiee
    • Depending on your use of the dataset, it is your responsibility to take into account the most recent dissemination status of each individual.

    Article A123-96 of the Commercial Code provides that:

    "Any natural person may request, either directly at the time of his creation or modification formalities, or by letter addressed to the Director-General of the National Institute of Statistics and Economic Studies, that the information in the directory concerning him may not be used by third parties other than bodies authorized under Article R. 123-224 or administrations, for prospecting purposes, particularly commercial."

    SIRENE BY ODS

    ODS presents a consolidated institution database with data from its associated legal unit.

    Enrichments

    • addition of the wording of the NAF codes and legal categories;
    • addition of legal slice wordings and track types;
    • addition of administrative hierarchies (reg/arr/dep/epci);
    • addition of geolocation of establishments via BAN geocoding;
    • replacement of some abbreviations (F/M, O/N, A/C) with the corresponding wording (sexeunitelegale, etatadministratifunitelegale, caractereemployeurunitelegale, etatadministratifetablissement, etablissementsiege, caractereemployeuretablissement)
    • addition of a "first line of address" field with the title, surname, and first name of the legal unit's natural person
    • addition of an establishment address field (concatenation of number + type + route)

    Notes

    • the "start date of establishment" values have been reconstructed from an old version for around 100 records with inconsistent values.

  19. Dataset for collaborative prediction of web service quality based on user...

    • search.dataone.org
    • datadryad.org
    Updated May 4, 2025
    Cite
    Yang Song (2025). Dataset for collaborative prediction of web service quality based on user preferences and services [Dataset]. http://doi.org/10.5061/dryad.5dv41ns4s
    Explore at:
    Dataset updated
    May 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Yang Song
    Time period covered
    Jan 1, 2020
    Description

    The prediction of web service quality plays an important role in improving user services; it has been one of the most popular topics in the field of Internet services. In traditional collaborative filtering methods, differences in the personalization and preferences of different users have been ignored. In this paper, we propose a prediction method for web service quality based on different types of quality of service (QoS) attributes. Different extraction rules are applied to extract the user preference matrices from the original web data, and the negative value filtering-based top-K method is used to merge the optimization results into the collaborative prediction method. Thus, the individualized differences are fully exploited, and the problem of inconsistent QoS values is resolved. The experimental results demonstrate the validity of the proposed method. Compared with other methods, the proposed method performs better, and the results are closer to the real values.

  20. Replication Data for: Machine Learning Predictions as Regression Covariates

    • dataverse.harvard.edu
    Updated Sep 28, 2022
    Cite
    Christian Fong; Matthew Tyler (2022). Replication Data for: Machine Learning Predictions as Regression Covariates [Dataset]. http://doi.org/10.7910/DVN/QQHBHY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Christian Fong; Matthew Tyler
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/QQHBHY

    Description

    In text, images, merged surveys, voter files, and elsewhere, data sets are often missing important covariates, either because they are latent features of observations (such as sentiment in text) or because they are not collected (such as race in voter files). One promising approach for coping with this missing data is to find the true values of the missing covariates for a subset of the observations and then train a machine learning algorithm to predict the values of those covariates for the rest. However, plugging in these predictions without regard for prediction error renders regression analyses biased, inconsistent, and overconfident. We characterize the severity of the problem posed by prediction error, describe a procedure to avoid these inconsistencies under comparatively general assumptions, and demonstrate the performance of our estimators through simulations and a study of hostile political dialogue on the Internet. We provide software implementing our approach.
