Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based modes, and the number of data entry mistakes in the paper-and-pencil mode (n = 149).
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
This dataset is crafted for beginners to practice data cleaning and preprocessing techniques in machine learning. It contains 157 rows of student admission records, including duplicate rows, missing values, and some data inconsistencies (e.g., outliers, unrealistic values). It’s ideal for practicing common data preparation steps before applying machine learning algorithms.
The dataset simulates a university admission record system, where each student’s admission profile includes test scores, high school percentages, and admission status. The data contains realistic flaws often encountered in raw data, offering hands-on experience in data wrangling.
The dataset contains the following columns:
Name: Student's first name (Pakistani names).
Age: Age of the student (some outliers and missing values).
Gender: Gender (Male/Female).
Admission Test Score: Score obtained in the admission test (includes outliers and missing values).
High School Percentage: Student's high school final score percentage (includes outliers and missing values).
City: City of residence in Pakistan.
Admission Status: Whether the student was accepted or rejected.
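As a quick orientation, here is a minimal pandas sketch of the first cleaning pass such a practice dataset invites; the file name and the outlier thresholds are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Load the admission records (file name is an assumption; adjust to the
# actual download, e.g. "student_admission_records.csv").
df = pd.read_csv("student_admission_records.csv")

# Duplicate rows are present by design; drop exact duplicates.
df = df.drop_duplicates()

# Inspect missing values in the numeric columns described above.
print(df[["Age", "Admission Test Score", "High School Percentage"]].isna().sum())

# Flag unrealistic values: percentages must lie in [0, 100], and an age
# outside, say, 15-60 is treated here as an outlier (the thresholds are
# illustrative choices, not documented rules).
bad_pct = ~df["High School Percentage"].between(0, 100) & df["High School Percentage"].notna()
bad_age = ~df["Age"].between(15, 60) & df["Age"].notna()
print(f"{bad_pct.sum()} unrealistic percentages, {bad_age.sum()} age outliers")
```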
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.
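To make the Hit-and-Run idea concrete, here is a minimal sketch of a single move that stays inside a set of linear constraints {x : Ax ≤ b}. It targets the uniform distribution on the polytope; the paper's sampler draws from a posterior instead, but the chord computation that guarantees every move satisfies the constraints is the same.

```python
import numpy as np

def hit_and_run_step(x, A, b, rng):
    """One Hit-and-Run move inside the polytope {x : A @ x <= b}.

    x must be strictly feasible; returns a new feasible point.
    """
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)          # random direction on the unit sphere

    # Feasible step sizes t satisfy (A @ d) * t <= b - A @ x row-wise.
    Ad = A @ d
    slack = b - A @ x
    t_hi = np.min(slack[Ad > 0] / Ad[Ad > 0]) if np.any(Ad > 0) else np.inf
    t_lo = np.max(slack[Ad < 0] / Ad[Ad < 0]) if np.any(Ad < 0) else -np.inf

    t = rng.uniform(t_lo, t_hi)     # uniform draw along the feasible chord
    return x + t * d

rng = np.random.default_rng(0)
# Toy constraints: the unit square 0 <= x1, x2 <= 1.
A = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1]])
b = np.array([1.0, 0, 1, 0])
x = np.array([0.5, 0.5])
for _ in range(5):
    x = hit_and_run_step(x, A, b, rng)
    assert np.all(A @ x <= b)       # every move stays inside the constraints
```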
Repeated-dose toxicity (RDT) is a critical endpoint for hazard characterization of chemicals and is assessed to derive safe levels of exposure for human health. Here we present the first attempt to model simultaneously the no-observed-(adverse)-effect level (NO(A)EL) and the lowest-observed-(adverse)-effect level (LO(A)EL). Classification and regression models were derived from rat sub-chronic repeated-dose toxicity data for 327 compounds from the Fraunhofer RepDose database. Multi-category classification models were built for both NO(A)EL and LO(A)EL through a consensus of statistics- and fragment-based algorithms, while regression models were based on quantitative relationships between the endpoints and SMILES-based attributes. NO(A)EL and LO(A)EL models were integrated, and predictions were compared to exclude inconsistent values. This strategy improved the performance of the single models, leading to R2 greater than 0.70, root-mean-square error (RMSE) lower than 0.60 (for regression models), and accuracy of 0.61–0.73 (for classification models) on the validation set, depending on the endpoint and the threshold applied for selecting predictions. This study confirms the effectiveness of the modeling strategy presented here for assessing the RDT of chemicals using in silico models.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.
retail_store_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01 |
| Category | The category of the purchased item. | Food, Furniture |
| Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None |
| Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None |
| Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None |
| Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online |
| Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15 |
| Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None |
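Using the column names above, a first cleaning pass typically verifies the Total Spent identity; treating the literal string "None" as missing is an assumption about how gaps are encoded in the file.

```python
import pandas as pd

# "None" sometimes appears as a literal string; treating it as missing
# is an assumption about the file's encoding of gaps.
df = pd.read_csv("retail_store_sales.csv", na_values=["None"])

num_cols = ["Price Per Unit", "Quantity", "Total Spent"]
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

# Total Spent is defined as Quantity * Price Per Unit; rows where the
# recomputed value disagrees (beyond rounding) need attention.
expected = df["Quantity"] * df["Price Per Unit"]
mismatch = (df["Total Spent"] - expected).abs() > 0.01
print(f"{mismatch.sum()} rows with an inconsistent Total Spent")

# Where Quantity and Price Per Unit are present, a missing total can be recovered.
df.loc[df["Total Spent"].isna(), "Total Spent"] = expected
```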
The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_EHE | Blender | 5.0 |
| Item_2_EHE | Microwave | 6.5 |
| Item_3_EHE | Toaster | 8.0 |
| Item_4_EHE | Vacuum Cleaner | 9.5 |
| Item_5_EHE | Air Purifier | 11.0 |
| Item_6_EHE | Electric Kettle | 12.5 |
| Item_7_EHE | Rice Cooker | 14.0 |
| Item_8_EHE | Iron | 15.5 |
| Item_9_EHE | Ceiling Fan | 17.0 |
| Item_10_EHE | Table Fan | 18.5 |
| Item_11_EHE | Hair Dryer | 20.0 |
| Item_12_EHE | Heater | 21.5 |
| Item_13_EHE | Humidifier | 23.0 |
| Item_14_EHE | Dehumidifier | 24.5 |
| Item_15_EHE | Coffee Maker | 26.0 |
| Item_16_EHE | Portable AC | 27.5 |
| Item_17_EHE | Electric Stove | 29.0 |
| Item_18_EHE | Pressure Cooker | 30.5 |
| Item_19_EHE | Induction Cooktop | 32.0 |
| Item_20_EHE | Water Dispenser | 33.5 |
| Item_21_EHE | Hand Blender | 35.0 |
| Item_22_EHE | Mixer Grinder | 36.5 |
| Item_23_EHE | Sandwich Maker | 38.0 |
| Item_24_EHE | Air Fryer | 39.5 |
| Item_25_EHE | Juicer | 41.0 |
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_FUR | Office Chair | 5.0 |
| Item_2_FUR | Sofa | 6.5 |
| Item_3_FUR | Coffee Table | 8.0 |
| Item_4_FUR | Dining Table | 9.5 |
| Item_5_FUR | Bookshelf | 11.0 |
| Item_6_FUR | Bed F... |
Antimalarial drugs: inconsistent studies of pregnancy-associated pharmacokinetic changes (percent calculated as pregnant/nonpregnant values).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hcropland30: A 30-m global cropland map by leveraging global land cover products and Landsat data based on a deep learning model
***Please note this dataset is undergoing peer review***
Version: 1.0
Authors: Qiong Hu a,1, Zhiwen Cai b,1, Liangzhi You c,d, Steffen Fritz e, Xinyu Zhang c, He Yin f, Haodong Wei c, Jingya Yang g, Zexuan Li a, Qiangyi Yu g, Hao Wu a, Baodong Xu b,*, Wenbin Wu g,*
a Key Laboratory for Geographical Process Analysis & Simulation of Hubei Province/College of Urban and Environmental Sciences, Central China Normal University, Wuhan 430079, China
b College of Resources and Environment, Huazhong Agricultural University, Wuhan 430070, China
c Macro Agriculture Research Institute, College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, China
d International Food Policy Research Institute, 1201 I Street, NW, Washington, DC 20005, USA
e Novel Data Ecosystems for sustainability Research Group, International Institute for Applied Systems Analysis (IIASA), Schlossplatz 1, Laxenburg A-2361, Austria
f Department of Geography, Kent State University, 325 S. Lincoln Street, Kent, OH 44242, USA
g State Key Laboratory of Efficient Utilization of Arid and Semi-arid Arable Land in Northern China, the Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100081, China
Introduction
We are pleased to introduce a comprehensive global cropland mapping dataset for 2020 (named Hcropland30), meticulously curated to support a wide range of research and analysis applications related to agricultural land and environmental assessment. This dataset encompasses the entire globe, divided into 16,284 grids, each measuring 1°×1°. Hcropland30 was produced by leveraging global land cover products and Landsat data with a deep learning model. Initially, we established a hierarchical sampling strategy that used the simulated annealing method to identify representative 1°×1° grids globally and sparse point-level samples within these selected grids. Subsequently, we employed an ensemble learning technique to expand these sparse point-level samples into dense pixel-wise labels, creating area-level 1°×1° cropland labels. These area-level labels were then used to train a U-Net model for predicting global cropland distribution, followed by a comprehensive evaluation of the mapping accuracy.
Dataset
1. Hcropland30: A hybrid 30-m global cropland map in 2020
**Data format:** GeoTIFF
**Spatial resolution:** 30 m
**Projection:** EPSG:4326 (WGS84)
**Values:** 1 denotes cropland and 0 denotes non-cropland
The dataset has been uploaded in 16,284 tiles. The extent of each tile can be found in the file “Grids.shp”. Each file is named according to the grid’s Id number; for example, “000015.tif” corresponds to the cropland mapping result for the 15th 1°×1° grid. This systematic naming convention ensures easy identification and retrieval of specific grid data.
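A short sketch of how a user might locate and read one tile, assuming the GeoTIFFs sit in a local folder and rasterio is installed:

```python
import rasterio  # pip install rasterio

def tile_path(grid_id: int, folder: str = ".") -> str:
    # Grid Ids are zero-padded to six digits, e.g. 15 -> "000015.tif".
    return f"{folder}/{grid_id:06d}.tif"

with rasterio.open(tile_path(15)) as src:
    cropland = src.read(1)            # 1 = cropland, 0 = non-cropland
    # Fraction of cropland pixels in this 1°x1° grid.
    print("cropland share:", (cropland == 1).mean())
```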
2. 1°×1° Grids: This file contains all 16,284 1°×1° grids used in the dataset. The vector file includes 18 attribute fields, providing comprehensive metadata for each grid. These attributes are essential for users who need detailed information about each grid’s characteristics.
**Data format:** ESRI shapefile
**Projection:** EPSG:4326 (WGS84)
**Attribute Fields:**
Id: The grid’s ID number.
area: The area of the grid.
mode: Indicates the representative sample grid.
climate: The climate type the grid belongs to.
dem: Average DEM value of the grid.
ndvi_s1 to ndvi_s4: Average NDVI values for four seasons within the grid.
esa, esri, fcs30, fromglc, glad, globeland30: Proportion of cropland pixels according to each of the publicly available cropland products.
inconsistent: Proportion of pixels within the grid that are inconsistent across the public cropland products.
hcropland30: Proportion of cropland pixels in our Hcropland30 dataset.
3. Samples: The selected representative pixel-level samples, including 32,343 cropland and 67,657 non-cropland samples. The category of each sample was determined by visual interpretation of Google Earth imagery and three-year NDVI time-series curves from 2019–2021.
**Data format:** ESRI shapefile
**Projection:** EPSG:4326 (WGS84)
**Attribute Fields:**
type: 1 denotes cropland sample and 0 denotes non-cropland sample.
Citation
If you use this dataset, please cite the following paper:
Hu, Q., Cai, Z., You, L., Fritz, S., Zhang, X., Yin, H., Wei, H., Yang, J., Li, Z., Yu, Q., Wu, H., Xu, B., Wu, W. (2024). Hcropland30: A 30-m global cropland map by leveraging global land cover products and Landsat data based on a deep learning model, Remote Sensing of Environment, submitted.
License
The data is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
Disclaimer
This dataset is provided as-is, without any warranty, express or implied. The dataset author is not responsible for any errors or omissions in the data, or for any consequences arising from the use of the data.
Contact
If you have any questions or feedback regarding the dataset, please contact the dataset author, Qiong Hu (huqiong@ccnu.edu.cn).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of ORFcor run on simulated inconsistency-containing data in comparison to known values using the parameters: a = 5; b = 10; d = 0.75 or 0.90; f = 10; g = 30; l = k = 1000.
Antibiotics: inconsistent studies of pregnancy-associated pharmacokinetic changes (percent calculated as pregnant/non-pregnant values).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Retail Store Sales EDA Project
The dataset is publicly available on Kaggle. Background: The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. The analysis includes data cleaning, descriptive… See the full description on the dataset page: https://huggingface.co/datasets/Reut1/EDADataset-RetailStoreSales-Dirty.
https://doi.org/10.5061/dryad.9kd51c5rj
This dataset contains the necessary R scripts and data files to replicate the results of this analysis. All analysis is completed in R, and an internet connection is required, as the RECS input files are loaded directly from the US Energy Information Administration's website for the most up-to-date information.
The folder titled "Analysis" contains all of the results presented in this paper. The "Coeffs" subfolder contains the .csv files of model coefficients for both 2015 and 2020.
The "Figures" subfolder contains all of the maps, graphs, and performance output from the R scripts.
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:
1. Customer Information (s_crm_cust_info)
This table contains information about customers, including their unique identifiers and demographic details.
Columns:
cst_id: Customer ID (Primary Key)
cst_gndr: Gender
cst_marital_status: Marital status
cst_create_date: Customer account creation date
Cleaning Steps:
Removed duplicates and handled missing or null cst_id values.
Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
Standardized gender values and identified inconsistencies in marital status.
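A minimal pandas sketch of the customer-table steps above; the source file name and the raw gender codes ("M"/"F") are assumptions rather than documented values.

```python
import pandas as pd

cust = pd.read_csv("s_crm_cust_info.csv")   # source file name is assumed

# Remove duplicates and rows with a missing primary key.
cust = cust.dropna(subset=["cst_id"]).drop_duplicates(subset=["cst_id"])

# Trim leading and trailing whitespace in the categorical fields.
for col in ["cst_gndr", "cst_marital_status"]:
    cust[col] = cust[col].str.strip()

# Standardize gender values; the raw codes ("M"/"F") are an assumption.
cust["cst_gndr"] = cust["cst_gndr"].replace({"M": "Male", "F": "Female"})
```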
2. Product Information
This table contains information about products, including product identifiers, names, costs, and lifecycle dates.
Columns:
prd_id: Product ID
prd_key: Product key
prd_nm: Product name
prd_cost: Product cost
prd_start_dt: Product start date
prd_end_dt: Product end date
Cleaning Steps:
Checked for duplicates and null values in the prd_key column.
Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
Corrected product costs to remove invalid entries (e.g., negative values).
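The product-table checks above, sketched in pandas; the file name is assumed, and nulling out negative costs is one illustrative policy.

```python
import pandas as pd

prd = pd.read_csv("s_crm_prd_info.csv")   # source file name is assumed

# Duplicate and null checks on the product key.
print("null keys:", prd["prd_key"].isna().sum())
print("duplicate keys:", prd["prd_key"].duplicated().sum())

# A product's start date must precede its end date.
prd["prd_start_dt"] = pd.to_datetime(prd["prd_start_dt"])
prd["prd_end_dt"] = pd.to_datetime(prd["prd_end_dt"])
print("bad date ranges:", (prd["prd_start_dt"] >= prd["prd_end_dt"]).sum())

# Negative costs are invalid; nulling them for later imputation is one
# illustrative policy, not necessarily the one used here.
prd.loc[prd["prd_cost"] < 0, "prd_cost"] = pd.NA
```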
3. Sales Transactions
This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.
Columns:
sls_order_dt: Sales order date
sls_due_dt: Sales due date
sls_sales: Total sales amount
sls_quantity: Number of products sold
sls_price: Product unit price
Cleaning Steps:
Validated sales order dates and corrected invalid entries.
Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
Removed null and negative values from sls_sales, sls_quantity, and sls_price.
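And the sales-integrity checks in the same style; again the file name is an assumption.

```python
import pandas as pd

sales = pd.read_csv("s_crm_sales_details.csv")   # file name is assumed

# Remove null and negative values from the three amount columns.
cols = ["sls_sales", "sls_quantity", "sls_price"]
sales = sales.dropna(subset=cols)
sales = sales[(sales[cols] > 0).all(axis=1)]

# Where the stored total disagrees with price * quantity, recompute it.
expected = sales["sls_price"] * sales["sls_quantity"]
off = (sales["sls_sales"] - expected).abs() > 0.01
sales.loc[off, "sls_sales"] = expected
```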
4. Customer Demographics (ERP)
This table contains additional customer demographic data, including gender and birthdate.
Columns:
cid: Customer ID
gen: Gender
bdate: Birthdate
Cleaning Steps:
Checked for missing or null gender values and standardized inconsistent entries.
Removed leading/trailing spaces from gen and bdate.
Validated birthdates to ensure they were within a realistic range.
5. Customer Location (ERP)
This table contains country information related to the customers' locations.
Columns:
cntry: Country
Cleaning Steps:
Standardized country names (e.g., "US" and "USA" were mapped to "United States").
Removed special characters (e.g., carriage returns) and trimmed whitespace.
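The location fixes sketched the same way; the mapping shown covers only the example given above, and the file name is assumed.

```python
import pandas as pd

loc = pd.read_csv("s_erp_loc_info.csv")    # file name is assumed

# Strip carriage returns and surrounding whitespace, then map known
# variants onto one canonical name ("US"/"USA" -> "United States").
loc["cntry"] = (loc["cntry"]
                .str.replace("\r", "", regex=False)
                .str.strip()
                .replace({"US": "United States", "USA": "United States"}))
```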
6. Product Category
This table contains product category information.
Columns:
Product category data (no significant cleaning required).
Key Features:
Customer demographics, including gender and marital status
Product details such as cost, start date, and end date
Sales data with order dates, quantities, and sales amounts
ERP-specific customer and location data
Data Cleaning Process:
This dataset underwent extensive cleaning and validation, including:
Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
Date Validations: Ensuring correct date ranges and chronological consistency.
Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A major frustration in thermal maturation modelling for petroleum exploration in Australian sedimentary basins is the inconsistency between the values of different thermal maturity indicators. Vitrinite reflectance (VR), Rock-Eval Tmax, spore colouration index (SCI) and fluorescence alteration of multiple macerals (FAMM) for wells from three Australian basins show inconsistencies due to technical, methodological and conceptual problems inherent in each technique. When the differences between the concepts of rank and thermal maturity are considered, it can be shown that some inconsistencies are more apparent than real. It is important to consider this distinction when selecting data against which to model burial and thermal histories.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistical analysis is error prone. A best practice for researchers using statistics would therefore be to share data among co-authors, allowing double-checking of executed tasks just as co-pilots do in aviation. To document the extent to which this ‘co-piloting’ currently occurs in psychology, we surveyed the authors of 697 articles published in six top psychology journals and asked them whether they had collaborated on four aspects of analyzing data and reporting results, and whether the described data had been shared between the authors. We acquired responses for 49.6% of the articles and found that co-piloting on statistical analysis and reporting results is quite uncommon among psychologists, while data sharing among co-authors seems reasonably but not completely standard. We then used an automated procedure to study the prevalence of statistical reporting errors in the articles in our sample and examined the relationship between reporting errors and co-piloting. Overall, 63% of the articles contained at least one p-value that was inconsistent with the reported test statistic and the accompanying degrees of freedom, and 20% of the articles contained at least one p-value that was inconsistent to such a degree that it may have affected decisions about statistical significance. Overall, the probability that a given p-value was inconsistent was over 10%. Co-piloting was not found to be associated with reporting errors.
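The consistency check described here amounts to recomputing a p-value from the reported test statistic and degrees of freedom. Below is a sketch for a two-sided t-test; the tolerance is an illustrative choice, not the authors' exact rule.

```python
from scipy import stats

def check_t_report(t_value: float, df: int, reported_p: float,
                   tol: float = 0.0005) -> str:
    """Recompute a two-sided p-value from a reported t statistic."""
    recomputed = 2 * stats.t.sf(abs(t_value), df)
    if abs(recomputed - reported_p) <= tol:
        return "consistent"
    # A "gross" inconsistency flips the usual significance decision.
    if (recomputed < 0.05) != (reported_p < 0.05):
        return f"gross inconsistency (recomputed p = {recomputed:.4f})"
    return f"inconsistency (recomputed p = {recomputed:.4f})"

# Example: t(28) = 2.20 actually gives p ~ .036, so reporting p = .06
# would be flagged as a decision-changing error.
print(check_t_report(2.20, 28, 0.06))
```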
Ematurga data for quantitative genetic analyses
An Excel file with three sheets:
Sheet 1: Pedigree data presenting the relatedness structure
id Individual identification number (also including individuals without phenotype data)
sire Sire identification number (zero, if unknown)
dam Dam identification number (zero, if unknown)
Sheets 2 and 3: Heather.data & Bilberry.data: individual-based values of the traits being analysed
gen Generation number (1 - F1, 2 - F2)
plant Plant (1 - heather, 2 - bilberry)
sex Sex (1 - male, 2 - female)
h_rgr & b_rgr Growth ratio in 5th instar on heather and on bilberry, respectively
h_pupw & b_pupw Pupal weight (mg) on heather and on bilberry, respectively
h_fifth & b_fifth Duration of the 5th instar (days) on heather and on bilberry, respectively
h_dscore & b_dscore Melanic darkness MCA score on heather and on bilberry, respectively
dryaddata.xlsx
Licence Ouverte / Open Licence 2.0 https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
License information was derived automatically
Find all the companies and their establishments. The Sirene® database is updated every day; it includes approximately 30 million establishments, active or not.
As the Sirene database contains personal data, INSEE draws your attention to the resulting legal obligations:
Indeed, Article A123-96 of the Commercial Code provides that:
"Any natural person may request either directly during their creation or modification formalities, or by letter addressed to the Managing Director of the National Institute of Statistics and Economic Studies, that the information in the directory concerning it may not be used by third parties other than the bodies authorized under Article R. 123-224 or the administrations, for the purposes of prospecting, particularly commercial."
ODS presents a database of establishments, each consolidated with the data of its associated legal unit.
Public Domain (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains sales records from a café. Initially, it was messy, with missing values represented as NaN, UNKNOWN, and ERROR. The following cleaning steps were applied:
1. Handling Missing Values
Replaced missing values with appropriate statistics:
i. Mode for categorical columns (Item, Payment Method, and Location).
ii. Mean for numerical columns (Quantity).
iii. Median for temporal data (Transaction Date).
2. Price Standardization
Inconsistent values in the Price per Unit column were corrected by filling them with the appropriate consistent price from the dataset.
3. Data Type Conversion
Converted all columns to their appropriate data types (e.g., datetime for transaction dates, numeric for quantities and prices, categorical for items, payment methods, and locations).
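A pandas sketch of these three steps, assuming the file is named cafe_sales.csv and that UNKNOWN and ERROR appear as literal strings:

```python
import pandas as pd

# "UNKNOWN" and "ERROR" are treated as missing on load.
df = pd.read_csv("cafe_sales.csv",             # file name is assumed
                 na_values=["UNKNOWN", "ERROR"])

# 1. Missing values: mode for categoricals, mean/median elsewhere.
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna(df[col].mode()[0])
df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Quantity"] = df["Quantity"].fillna(df["Quantity"].mean())
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
df["Transaction Date"] = df["Transaction Date"].fillna(df["Transaction Date"].median())

# 3. Type conversion for the remaining columns.
df["Price per Unit"] = pd.to_numeric(df["Price per Unit"], errors="coerce")
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].astype("category")
```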
Find all the companies and their establishments. The Sirene® database is updated every day; it includes about 30 million establishments, in operation or not.
Since the Sirene database contains personal data, INSEE draws your attention to the legal obligations arising therefrom:
Article A123-96 of the Commercial Code provides that:
"Any natural person may request, either directly at the time of his creation or modification formalities, or by letter addressed to the Director-General of the National Institute of Statistics and Economic Studies, that the information in the directory concerning him may not be used by third parties other than bodies authorized under Article R. 123-224 or administrations, for prospecting purposes, particularly commercial."
ODS presents a consolidated establishment database with data from its associated legal unit.
The prediction of web service quality plays an important role in improving user services; it has been one of the most popular topics in the field of Internet services. In traditional collaborative filtering methods, differences in the personalization and preferences of different users have been ignored. In this paper, we propose a prediction method for web service quality based on different types of quality-of-service (QoS) attributes. Different extraction rules are applied to extract the user preference matrices from the original web data, and the negative-value-filtering-based top-K method is used to merge the optimization results into the collaborative prediction method. Thus, the individualized differences are fully exploited, and the problem of inconsistent QoS values is resolved. The experimental results demonstrate the validity of the proposed method. Compared with other methods, the proposed method performs better, and the results are closer to the real values.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/QQHBHY
In text, images, merged surveys, voter files, and elsewhere, data sets are often missing important covariates, either because they are latent features of observations (such as sentiment in text) or because they are not collected (such as race in voter files). One promising approach for coping with this missing data is to find the true values of the missing covariates for a subset of the observations and then train a machine learning algorithm to predict the values of those covariates for the rest. However, plugging in these predictions without regard for prediction error renders regression analyses biased, inconsistent, and overconfident. We characterize the severity of the problem posed by prediction error, describe a procedure to avoid these inconsistencies under comparatively general assumptions, and demonstrate the performance of our estimators through simulations and a study of hostile political dialogue on the Internet. We provide software implementing our approach.
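A ten-line simulation makes the problem concrete: regressing an outcome on an error-prone predicted covariate attenuates the coefficient. The numbers here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)                  # true latent covariate
y = 2.0 * x + rng.standard_normal(n)       # outcome with beta = 2

# A noisy classifier: flips the covariate 20% of the time.
flip = rng.random(n) < 0.2
x_hat = np.where(flip, 1 - x, x)

def ols_slope(a, b):
    return np.cov(a, b)[0, 1] / np.var(a)

print("beta with true x:     ", round(ols_slope(x, y), 2))      # ~2.0
print("beta with predicted x:", round(ols_slope(x_hat, y), 2))  # attenuated, ~1.2
```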