90 datasets found
  1. Messy Spreadsheet Example for Instruction

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 28, 2024
    Cite
    Curty, Renata Gonçalves (2024). Messy Spreadsheet Example for Instruction [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_12586562
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    University of California, Santa Barbara
    Authors
    Curty, Renata Gonçalves
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A disorganized toy spreadsheet used for teaching good data organization. Learners are tasked with identifying as many errors as possible before creating a data dictionary and reconstructing the spreadsheet according to best practices.

  2. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  3. Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Dataset updated
    Jul 16, 2020
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  4. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226,740 bytes)
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
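
    Since Total Spent is documented as Quantity * Price Per Unit, a quick first check after loading is to recompute it and flag disagreements. A minimal pandas sketch (file name from the listing above; the numeric coercion is there because the documented None placeholders may arrive as text):

```python
import pandas as pd

df = pd.read_csv("retail_store_sales.csv")

# Gauge the "dirtiness": missing values per column
print(df.isna().sum())

# Coerce numeric columns; None placeholders may otherwise load as strings
qty = pd.to_numeric(df["Quantity"], errors="coerce")
price = pd.to_numeric(df["Price Per Unit"], errors="coerce")
total = pd.to_numeric(df["Total Spent"], errors="coerce")

# Flag rows where the stated total disagrees with Quantity * Price Per Unit
expected = qty * price
mismatch = total.notna() & expected.notna() & ((total - expected).abs() > 0.01)
print(f"{mismatch.sum()} rows with an inconsistent Total Spent")
```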

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  5. Data Cleaning Project

    • kaggle.com
    zip
    Updated Aug 19, 2024
    Cite
    Mohanad Hazem Qabil (2024). Data Cleaning Project [Dataset]. https://www.kaggle.com/datasets/muhannadhazemqabil/data-cleaning-project
    Explore at:
    zip (79,166 bytes)
    Dataset updated
    Aug 19, 2024
    Authors
    Mohanad Hazem Qabil
    Description

    Dataset

    This dataset was created by Mohanad Hazem Qabil


  6. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Synthetic sample data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
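
    The R script that actually drew the sample is provided with the dataset as an external resource; purely as an illustration of the described two-stage design, here is a Python/pandas sketch (the frame file and the column names geo_1, urban, ea_id are hypothetical):

```python
import pandas as pd

# Hypothetical sampling frame: one row per household
frame = pd.read_csv("household_frame.csv")  # columns: hid, geo_1, urban, ea_id

n_ea_total = 320   # 320 EAs x 25 households = 8,000 households
per_ea = 25

# Stage 1: allocate EAs across strata (geo_1 x urban/rural) proportionally
# to stratum size, then draw that many EAs at random within each stratum
hh_per_stratum = frame.groupby(["geo_1", "urban"]).size()
alloc = (hh_per_stratum / hh_per_stratum.sum() * n_ea_total).round().astype(int)

ea = frame[["geo_1", "urban", "ea_id"]].drop_duplicates()
chosen = []
for (g, u), k in alloc.items():
    pool = ea[(ea["geo_1"] == g) & (ea["urban"] == u)]
    chosen.append(pool.sample(n=min(k, len(pool)), random_state=1))
chosen_eas = pd.concat(chosen)["ea_id"]

# Stage 2: draw 25 households at random within each selected EA
sample = (frame[frame["ea_id"].isin(chosen_eas)]
          .groupby("ea_id", group_keys=False)
          .apply(lambda g: g.sample(n=min(per_ea, len(g)), random_state=1)))
print(len(sample))  # ~8,000 households
```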

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was then applied to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  7. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling.

    Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
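
    As a concrete illustration of the imputation strategies the guide covers (not code from the document itself), here is a small Python sketch comparing mean, regression, and stochastic regression imputation on toy MCAR data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: y depends on x; 30% of y is knocked out completely at random (MCAR)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
y[rng.random(n) < 0.3] = np.nan
df = pd.DataFrame({"x": x, "y": y})

# 1) Mean imputation: simple, but shrinks variance and weakens the x-y relation
df["y_mean"] = df["y"].fillna(df["y"].mean())

# 2) Regression imputation: predict missing y from x using the observed rows
obs = df.dropna(subset=["y"])
slope, intercept = np.polyfit(obs["x"], obs["y"], 1)
pred = intercept + slope * df["x"]
df["y_reg"] = df["y"].fillna(pred)

# 3) Stochastic regression imputation: add residual noise to restore variance
resid_sd = (obs["y"] - (intercept + slope * obs["x"])).std()
df["y_stoch"] = df["y"].fillna(pred + rng.normal(scale=resid_sd, size=n))

print(df[["y_mean", "y_reg", "y_stoch"]].var())  # mean imputation shrinks variance most
```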

  8. Is it time to stop sweeping data cleaning under the carpet? A novel...

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data [Dataset]. http://doi.org/10.1371/journal.pone.0228154
    Explore at:
    docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
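
    The published algorithm is considerably more elaborate (it also corrects values and removes duplicates, and handles first-entry and consecutive errors), but the core idea of combining pre-defined cut-offs with logic rules can be sketched as follows; column names and thresholds here are hypothetical:

```python
import pandas as pd

def flag_implausible(df, lo=0.5, hi=120.0, max_rel_jump=0.5):
    """Flag longitudinal weights failing absolute cut-offs or a bounce rule."""
    df = df.sort_values(["animal_id", "age_days"]).copy()

    # Pre-defined cut-offs: anything outside the plausible range is flagged
    df["flag"] = ~df["weight_kg"].between(lo, hi)

    def bounce(g):
        w = g["weight_kg"]
        prev, nxt = w.shift(1), w.shift(-1)
        # Logic rule: a value far from BOTH neighbours within one individual's
        # trajectory is likely an error (e.g. a mistyped measurement). Unlike
        # the published algorithm, this naive rule cannot flag first or last
        # entries, where a neighbour is missing.
        return (((w - prev).abs() / prev > max_rel_jump)
                & ((w - nxt).abs() / nxt > max_rel_jump))

    df["flag"] |= df.groupby("animal_id", group_keys=False).apply(bounce)
    return df
```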

  9. Data cleaning using unstructured data

    • zenodo.org
    zip
    Updated Jul 30, 2024
    Cite
    Rihem Nasfi; Rihem Nasfi; Antoon Bronselaer; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
    Explore at:
    zip
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rihem Nasfi; Rihem Nasfi; Antoon Bronselaer; Antoon Bronselaer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this project, we work on repairing three datasets:

    • Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.
    • Trials population: This dataset delineates the demographic origins of participants in clinical trials primarily conducted across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group, or healthy volunteers; each category is labeled ('1') or ('0') to denote whether it is included in the trial. Note that the population category should remain consistent across all countries conducting the same clinical trial, identified by its eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.
    • Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), 'Open Food Facts' (a free database of food products from around the world), and the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code; samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by ('2') if present, ('1') if there are traces of it, and ('0') if absent from a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

    N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets (a loading sketch follows the list):

    • "{dataset_name}_train.csv": samples used for the ML-model training. (e.g "allergens_train.csv")
    • "{dataset_name}_test.csv": samples used to test the the ML-model performance. (e.g "allergens_test.csv")
    • "{dataset_name}_golden_standard.csv": samples represent the ground truth of the test samples. (e.g "allergens_golden_standard.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used for the ML-model training. (e.g "allergens_parker_train.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used to test the the ML-model performance. (e.g "allergens_parker_test.csv")
  10. Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning

    • databank.illinois.edu
    Updated Jul 12, 2023
    Cite
    Nikolaus Parulian (2023). Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning [Dataset]. http://doi.org/10.13012/B2IDB-6827044_V1
    Dataset updated
    Jul 12, 2023
    Authors
    Nikolaus Parulian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dissertation_demo.zip contains the base code and demonstration materials for the dissertation: A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning. Each chapter has a demo folder for demonstrating provenance queries or tools. The Airbnb dataset for demonstration and simulation is not included in this demo but is available to access directly from the reference website. Any updates on demonstration and examples can be found online at: https://github.com/nikolausn/dissertation_demo

  11. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    zip (113,510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich
    Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN
    Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00
    Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway
    Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    Item | Price ($)
    Coffee | 2
    Tea | 1.5
    Sandwich | 4
    Salad | 5
    Cake | 3
    Cookie | 1
    Smoothie | 4
    Juice | 3

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps (a pandas sketch follows the list):

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
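
    A minimal pandas sketch of those steps (the file name comes from the listing above; the specific fill choices are illustrative):

```python
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# 1. Treat placeholder strings as missing
df = df.replace(["ERROR", "UNKNOWN", "None", ""], pd.NA)

# 2. Coerce numeric columns and fill gaps with the median
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())

# 3. Fill categorical gaps with an explicit "Unknown" (or the column mode)
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# 4. Normalize dates, then engineer simple features
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
df["Day of Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
```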

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  12. Soundscape Attributes Translation Project (SATP) Dataset

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jul 6, 2024
    Cite
    Oberman, Tin; Mitchell, Andrew; Aletta, Francesco; Almagro Pastor, José Antonio; Jambrošić, Kristian; Kang, Jian (2024). Soundscape Attributes Translation Project (SATP) Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6914433
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    University of Zagreb
    University College London
    Universidad de Granada
    Authors
    Oberman, Tin; Mitchell, Andrew; Aletta, Francesco; Almagro Pastor, José Antonio; Jambrošić, Kristian; Kang, Jian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data and audio included here were collected for the Soundscape Attributes Translation Project (SATP). First introduced in Aletta et al. (2020), the SATP is an attempt to provide validated translations of soundscape attributes in languages other than English. The recordings were used for headphone-based listening experiments.

    The data are provided to accompany publications resulting from this project and to provide a unique dataset of 1000s of perceptual responses to a standardised set of urban soundscape recordings. This dataset is the result of efforts from hundreds of researchers, students, assistants, PIs, and participants from institutions around the world. We have made an attempt to list every contributor to this Zenodo repo; if you feel you should be included, please get in touch.

    Citation: If you use the SATP dataset or part of it, please cite our paper describing the data collection and this dataset itself.

    Overview: The SATP dataset consists of 27 30-sec binaural audio recordings made in urban public spaces in London and one 60-sec stereo calibration signal.

    The recordings were made at locations as reported in Table 1 of the README.md (Recording locations), at various times of day by an operator wearing a binaural kit consisting of BHS II microphones and a SQobold (HEAD acoustics) device. Recordings were then exported to WAV via the ArtemiS SUITE software, using the original dynamic range from HDF. The listening experiment and the calibration procedure were intended for a headphone playback system (Sennheiser HD650 or similar open-back headphones recommended).

    The recordings were selected from an initial set of 80 recordings through a pilot study to ensure the test set had an even coverage of the soundscape circumplex space. These recordings were sent to the partner institutions (see Table 2 of the README.md) and assessed by approximately 30 participants in the institution's target language. The questionnaire used in each assessment is a translation of Method A Questionnaire, ISO 12913-2:2018. Each institution carried out their own lab experiment to collect data, then submitted their data to the team at UCL to compile into a single dataset. Some institutions included additional questions or translation options; the combined dataset (SATP Dataset v1.x.xlsx) includes only the base set of questions, the extended set of questions from each institution is included in the Institution Datasets folder.

    In all, SATP Dataset v1.4 contains 19,089 samples, including 707 participants, for 27 recordings, in 18 languages with contributions from 29 institutions.

    Descriptions of the recordings, including GPS coordinates and sound sources, can be found in the README.md file.

    Format: The audio recordings are provided as 24 bit, 48 kHz, stereo WAV files. The combined dataset and Institutional datasets are provided as long tidy data tables in .xlsx files.

    Calibration: The recommended calibration approach was based on the open-circuit voltage (OCV) procedure, which was considered most accessible, but other calibration procedures are also possible (Lam et al. (2022)). The provided calibration file is a computer-generated sine wave at 1 kHz, matching a sine wave recorded using the exact same setup at an SPL of 94 dB. If the calibration signal playback level is set to match an SPL of 94 dB at the eardrum, all 27 samples will be reproduced at realistic loudness. More details on the OCV calibration procedure and other options can be found in Lam et al. (2022) and the attached documentation. PLEASE DO NOT EXPOSE YOURSELF NOR THE PARTICIPANTS TO THE CALIBRATION SIGNAL SET AT THE REALISTIC LEVEL AS IT CAN CAUSE HARM.
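
    For illustration, a signal with the described parameters (1 kHz sine, 60 s, 48 kHz, stereo) can be generated with Python's standard library; this sketch writes 16-bit PCM rather than the 24-bit original, and at a deliberately low amplitude given the warning above:

```python
import numpy as np
import wave

sr, dur, freq = 48000, 60, 1000.0
t = np.arange(dur * sr) / sr
tone = 0.05 * np.sin(2 * np.pi * freq * t)    # low, safe amplitude
pcm = (tone * 32767).astype(np.int16)
stereo = np.column_stack([pcm, pcm]).ravel()  # interleave L/R samples

with wave.open("calibration_1khz_sketch.wav", "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)   # 16-bit; the distributed calibration file is 24-bit
    w.setframerate(sr)
    w.writeframes(stereo.tobytes())
```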

    License and reuse: All SATP recordings are provided under the Creative Commons Attribution 4.0 International (CC BY 4.0) License and are free to use. We encourage other researchers to replicate the SATP protocol and contribute new languages to the dataset. We also encourage the use of these recordings and the perceptual data for further soundscape research purposes. Please provide the proper attribution and get in touch with the authors if you would like to contribute a new translation or for any other collaborations.

  13. Tidy coverage data for all 9943 CrossRef members

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Chris HJ Hartgerink (2020). Tidy coverage data for all 9943 CrossRef members [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1208404
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Mozilla Science Lab
    Authors
    Chris HJ Hartgerink
    Description

    Included in this deposit are the scripts and data collected on the coverage of metadata by the CrossRef members. Collection date is March 27, 2018. The data is available in long, tidy format. Each row depicts the coverage of one specific aspect, indicating which member and how many DOIs that member deposited. This allows for easy parsing in, for example, ggplot2 or dplyr.
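
    Because the deposit is long/tidy, per-member coverage can be reshaped to wide form in one step. A pandas sketch (file and column names here are assumptions; the actual names are in the deposit):

```python
import pandas as pd

cov = pd.read_csv("crossref_coverage.csv")  # e.g. columns: member, aspect, dois

# Long-to-wide: one row per member, one column per metadata aspect
wide = cov.pivot_table(index="member", columns="aspect", values="dois")
print(wide.head())
```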

  14. Household Expenditure and Income Survey 2008, Economic Research Forum (ERF) Harmonization Data - Jordan

    • catalog.ihsn.org
    Updated Jan 12, 2022
    Cite
    Department of Statistics (2022). Household Expenditure and Income Survey 2008, Economic Research Forum (ERF) Harmonization Data - Jordan [Dataset]. https://catalog.ihsn.org/index.php/catalog/7661
    Dataset updated
    Jan 12, 2022
    Dataset authored and provided by
    Department of Statistics
    Time period covered
    2008 - 2009
    Area covered
    Jordan
    Description

    Abstract

    The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.

    Data collected through the survey helped in achieving the following objectives:
    1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
    2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
    3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
    4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
    5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
    6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
    7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty

    Geographic coverage

    National

    Analysis unit

    • Household/families
    • Individuals

    Universe

    The survey covered a national sample of households and all individuals permanently residing in surveyed households.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2008 Household Expenditure and Income Survey sample was designed using a two-stage cluster stratified sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn with probability proportionate to size, taking the number of households in each block as the block size. The second stage included drawing the household sample (8 households from each PSU) using the systematic sampling method. Four substitute households from each PSU were drawn, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.

    To estimate the sample size, the coefficient of variation and design effect in each subdistrict were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum number of clusters of no less than 6 at the district level, to ensure good cluster representation in the administrative areas and enable drawing poverty pockets.

    It is worth mentioning that the expected non-response in addition to areas where poor families are concentrated in the major cities were taken into consideration in designing the sample. Therefore, a larger sample size was taken from these areas compared to other ones, in order to help in reaching the poverty pockets and covering them.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form

    Cleaning operations

    Raw Data
    The design and implementation of the survey procedures were:
    1. Sample design and selection
    2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparing instruction manuals
    3. Design of the tables template to be used for the dissemination of the survey results
    4. Preparation of the fieldwork phase including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
    5. Selection and training of survey staff to collect data and run required data checks
    6. Preparation and implementation of the pretest phase for the survey, designed to test and develop forms/questionnaires, instructions and software programs required for data processing and production of survey results
    7. Data collection
    8. Data checking and coding
    9. Data entry
    10. Data cleaning using data validation programs
    11. Data accuracy and consistency checks
    12. Data tabulation and preliminary results
    13. Preparation of the final report and dissemination of final results

    Harmonized Data
    • The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets
    • The harmonization process started with cleaning all raw data files received from the Statistical Office
    • Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
    • A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
    • A post-harmonization cleaning process was run on the data
    • Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format

  15. Full Dataset prior to Cleaning

    • figshare.com
    zip
    Updated Mar 31, 2023
    Cite
    Paige Chesshire (2023). Full Dataset prior to Cleaning [Dataset]. http://doi.org/10.6084/m9.figshare.22455616.v1
    Explore at:
    zip
    Dataset updated
    Mar 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Paige Chesshire
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes all of the data downloaded from GBIF (DOIs provided in the README.md as well as below; downloaded Feb 2021) as well as data downloaded from SCAN. This dataset has 2,808,432 records and can be used as a reference to the verbatim data before it underwent the cleaning process. The only modifications made to this dataset after direct download from the data portals are the following (sketched in pandas after the list):

    1) For GBIF records, I renamed the countryCode column to "country" so that the column title is consistent across both GBIF and SCAN
    2) A source column was added where I specify if the record came from GBIF or SCAN
    3) Duplicate records across SCAN and GBIF were removed by identifying identical instances of "catalogNumber" and "institutionCode"
    4) Only the Darwin Core (DwC) columns that were shared across downloaded datasets were retained. GBIF contained ~249 DwC variables, and SCAN data contained fewer, so this combined dataset only includes the ~80 columns shared between the two datasets
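
    Those four modifications can be sketched in pandas (input file names are hypothetical; the deposit ships the already-combined result):

```python
import pandas as pd

gbif = pd.read_csv("gbif_records.csv", low_memory=False)
scan = pd.read_csv("scan_records.csv", low_memory=False)

gbif = gbif.rename(columns={"countryCode": "country"})  # 1) harmonize the name
gbif["source"] = "GBIF"                                 # 2) record provenance
scan["source"] = "SCAN"

# 4) keep only the Darwin Core columns present in both sources
shared = [c for c in gbif.columns if c in scan.columns]
combined = pd.concat([gbif[shared], scan[shared]], ignore_index=True)

# 3) drop cross-portal duplicates on catalogNumber + institutionCode
combined = combined.drop_duplicates(subset=["catalogNumber", "institutionCode"])
```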

    For GBIF, we downloaded the data in three separate chunks, therefore there are three DOIs. See below:

    GBIF.org (3 February 2021) GBIF Occurrence Download: https://doi.org/10.15468/dl.6cxfsw
    GBIF.org (3 February 2021) GBIF Occurrence Download: https://doi.org/10.15468/dl.b9rfa7
    GBIF.org (3 February 2021) GBIF Occurrence Download: https://doi.org/10.15468/dl.w2nndm

  16. Household Survey on Information and Communications Technology – 2019 - West Bank and Gaza

    • pcbs.gov.ps
    Updated Mar 16, 2020
    Cite
    Palestinian Central Bureau of Statistics (2020). Household Survey on Information and Communications Technology– 2019 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/489
    Dataset updated
    Mar 16, 2020
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (https://pcbs.gov/)
    Time period covered
    2019
    Area covered
    West Bank, Gaza, Gaza Strip
    Description

    Abstract

    The Palestinian society's access to information and communication technology tools is one of the main inputs for achieving social development and economic change, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics in providing official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey on information and communications technology for the year 2019. The main objective of this report is to present the trends of accessing and using information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.

    Geographic coverage

    Palestine, West Bank, Gaza strip

    Analysis unit

    Household, Individual

    Universe

    All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling Frame The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units, with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.

    Sample size The estimated sample size is 8,040 households.

    Sample Design The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages: Stage (1): selection of a stratified sample of 536 enumeration areas with the PPS method. Stage (2): selection of a stratified random sample of 15 households from each enumeration area selected in the first stage. Stage (3): selection of one person in the 10-years-and-above age group at random using Kish tables.

    Sample Strata The population was divided by: 1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas) 2- Type of Locality (urban, rural, refugee camps).

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.

    Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

    Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.

    Cleaning operations

    Programming Consistency Check The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once before the training course was conducted; notes and modifications were reflected in the program by the Data Processing Department, ensuring it was free of errors before going to the field.

    Using PC-tablet devices reduced the data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.

    In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and the same database used for the PC-tablet devices.

    Data Cleaning After the completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS, extracting and correcting errors and discrepancies to prepare clean and accurate data ready for tabulation and publishing.

    Tabulation After the data were checked and cleaned of any errors, tables were extracted according to the prepared list of tables.

    Response rate

    The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.

    Sampling error estimates

    Sampling Errors The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration, so certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators; results can be disseminated at the national level and at the level of the West Bank and Gaza Strip.

    Non-Sampling Errors Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively: they were trained on how to carry out the interview, what to discuss and what to avoid, with practical and theoretical training during the training course.

    The implementation of the survey encountered non-response, with the case of the household not being present at home during the fieldwork visit accounting for the largest share of non-response cases. The total non-response rate reached 17.5%. The refusal rate reached 2.9%, which is relatively low compared to the household surveys conducted by PCBS; a likely reason is that the survey questionnaire is clear.

  17. Additional file 2 of Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes

    • figshare.com
    txt
    Updated Jun 14, 2023
    Cite
    Igor V. Deyneko; Orkhan N. Mustafaev; Alexander А. Tyurin; Ksenya V. Zhukova; Alexander Varzari; Irina V. Goldenkova-Pavlova (2023). Additional file 2 of Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes [Dataset]. http://doi.org/10.6084/m9.figshare.21582224.v1
    Explore at:
    txt
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Igor V. Deyneko; Orkhan N. Mustafaev; Alexander А. Tyurin; Ksenya V. Zhukova; Alexander Varzari; Irina V. Goldenkova-Pavlova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. R program and examples: the program code of RNAdeNoise in the R language, with examples of its use.

  18. Data from: The Regressinator: A Simulation Tool for Teaching Regression Assumptions and Diagnostics in R

    • tandf.figshare.com
    txt
    Updated Aug 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alex Reinhart (2025). The Regressinator: A Simulation Tool for Teaching Regression Assumptions and Diagnostics in R [Dataset]. http://doi.org/10.6084/m9.figshare.29361136.v2
    Explore at:
    txt
    Dataset updated
    Aug 6, 2025
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Alex Reinhart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When students learn linear regression, they must learn to use diagnostics to check and improve their models. Model-building is an expert skill requiring the interpretation of diagnostic plots, an understanding of model assumptions, the selection of appropriate changes to remedy problems, and an intuition for how potential problems may affect results. Simulation offers opportunities to practice these skills, and is already widely used to teach important concepts in sampling, probability, and statistical inference. Visual inference, which uses simulation, has also recently been applied to regression instruction. This article presents the regressinator, an R package designed to facilitate simulation and visual inference in regression settings. Simulated regression problems can be easily defined with minimal programming, using the same modeling and plotting code students may already learn. The simulated data can then be used for model diagnostics, visual inference, and other activities, with the package providing functions to facilitate common tasks with a minimum of programming. Example activities covering model diagnostics, statistical power, and model selection are shown for both advanced undergraduate and Ph.D.-level regression courses.

  19. Resilience Index Measurement and Analysis 2014 - Mali

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Feb 6, 2023
    Cite
    Food and Agriculture Organization (2023). Resilience Index Measurement and Analysis 2014 - Mali [Dataset]. https://microdata.worldbank.org/index.php/catalog/5674
    Dataset updated
    Feb 6, 2023
    Dataset authored and provided by
    Food and Agriculture Organization (http://fao.org/)
    Time period covered
    2014 - 2015
    Area covered
    Mali
    Description

    Abstract

    Mali is a Sahelian country, landlocked and structurally vulnerable to food insecurity and malnutrition. The economy is heavily dependent on the primary sector: agriculture, livestock, fishing and forestry account for 68.0% of the active population. This sector is itself dependent on exogenous factors, mainly climatic, such as recurrent droughts. In 2018, the prevalence of food insecurity at the national level was 19.1%, of which 2.6% was severely food insecure. The most affected regions were Kidal, Gao, Timbuktu, Mopti and Kayes. The Global Food Crisis Network Partnership Programme baseline studies are designed to feed into the overall monitoring, evaluation, accountability and learning programme of each project. In this regard, the baseline study has short, medium and long-term objectives.

    Geographic coverage

    Regional coverage

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The EAC-I 2014 has been designed to have national coverage, including both urban and rural areas in all the regions of the country except Kidal. The domains were defined as the entire country, district of Bamako, other urban areas, and rural areas; and in the rural areas: agricultural zones, agro-pastoral zones and pastoral zones. Taking this into account, 51 explicit sample strata were selected. The target population was drawn from households in all regions of Mali except Kidal, which was not accessible for security reasons and also has very low population density. The sample was chosen through a random two-stage process:
    - In the first stage, 1070 enumeration areas (EAs) were selected with Probability Proportional to Size (PPS), using the 2009 Census of Population as the base for the sample and the number of households as a measure of size.
    - In the second stage:
      o 3 households were selected with equal probability in each of the rural EAs
      o 9 households were selected with equal probability in each of the urban EAs
    The total estimated size of the sample for the survey was 4,218.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Please refer to the Questionnaires for the value labels of the variables.

    Cleaning operations

    Data Entry & Data cleaning Data entry was performed at the CPS/SDR using a CSPro application designed by an international consultant recruited by the LSMS team. The data entry program allows three types of data checks: (1) range checks; (2) intra-record check to verify inconsistencies pertinent to the particular module of the questionnaire; and (3) inter-record checks to determine inconsistencies pertinent between the different modules of the questionnaire. Data entry for the first visit was done from August 11th, 2014 to November 30th, 2014 and, from February 9th 2015 to March 27th, 2015 for the second visit. Data cleaning was done from May 2015 to July 2015. Data cleaning was done in a CSPro application. The data cleaning focused on more intra-record and inter-record checks.

  20. Data Carpentry Genomics Curriculum Example Data

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    application/gzip
    Updated May 31, 2023
    Cite
    Olivier Tenaillon; Jeffrey E Barrick; Noah Ribeck; Daniel E. Deatherage; Jeffrey L. Blanchard; Aurko Dasgupta; Gabriel C. Wu; Sébastien Wielgoss; Stéphane Cruvellier; Claudine Medigue; Dominique Schneider; Richard E. Lenski; Taylor Reiter; Jessica Mizzi; Fotis Psomopoulos; Ryan Peek; Jason Williams (2023). Data Carpentry Genomics Curriculum Example Data [Dataset]. http://doi.org/10.6084/m9.figshare.7726454.v3
    Explore at:
    application/gzip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Olivier Tenaillon; Jeffrey E Barrick; Noah Ribeck; Daniel E. Deatherage; Jeffrey L. Blanchard; Aurko Dasgupta; Gabriel C. Wu; Sébastien Wielgoss; Stéphane Cruvellier; Claudine Medigue; Dominique Schneider; Richard E. Lenski; Taylor Reiter; Jessica Mizzi; Fotis Psomopoulos; Ryan Peek; Jason Williams
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.

    This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.

    This curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).

    backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson (https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in the ~/.backup directory. 1.3Gb in size.

    Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.

    shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).

    sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109Mb in size.

    solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including fastqc output, sam, bam, bcf, and vcf files.

    vcf_clean_script.R: converts the vcf output in ./solutions/wrangling_solutions/variant_calling_auto to a single tidy data frame.

    combined_tidy_vcf.csv: output of vcf_clean_script.R
