38 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Full Dataset prior to Cleaning

    • figshare.com
    zip
    Updated Mar 31, 2023
    Cite
    Paige Chesshire (2023). Full Dataset prior to Cleaning [Dataset]. http://doi.org/10.6084/m9.figshare.22455616.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 31, 2023
    Dataset provided by
    figshare
    Authors
    Paige Chesshire
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset includes all of the data downloaded from GBIF (DOIs provided in README.md as well as below, downloaded Feb 2021) as well as data downloaded from SCAN. This dataset has 2,808,432 records and can be used as a reference to the verbatim data before it underwent the cleaning process. The only modifications made to this dataset after direct download from the data portals are the following:

    1) For GBIF records, I renamed the countryCode column to "country" so that the column title is consistent across both GBIF and SCAN.
    2) A source column was added to specify whether each record came from GBIF or SCAN.
    3) Duplicate records across SCAN and GBIF were removed by identifying records with identical "catalogNumber" and "institutionCode" values.
    4) Only the Darwin Core (DwC) columns shared across the downloaded datasets were retained. GBIF contained ~249 DwC variables, and SCAN data contained fewer, so this combined dataset only includes the ~80 columns shared between the two datasets.

    For GBIF, we downloaded the data in three separate chunks, therefore there are three DOIs. See below:

    GBIF.org (3 February 2021) GBIF Occurrence Download, https://doi.org/10.15468/dl.6cxfsw
    GBIF.org (3 February 2021) GBIF Occurrence Download, https://doi.org/10.15468/dl.b9rfa7
    GBIF.org (3 February 2021) GBIF Occurrence Download, https://doi.org/10.15468/dl.w2nndm
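
    The renaming, source tagging, de-duplication, and shared-column subsetting described above can be reproduced with a few pandas operations. The sketch below is illustrative only, not the author's original script; the file names and the exact set of shared columns are assumptions.

    import pandas as pd

    # Raw portal downloads (file names assumed for illustration).
    gbif = pd.read_csv("gbif_raw.csv", low_memory=False)
    scan = pd.read_csv("scan_raw.csv", low_memory=False)

    # 1) Make the country column name consistent across both sources.
    gbif = gbif.rename(columns={"countryCode": "country"})

    # 2) Record which portal each occurrence record came from.
    gbif["source"] = "GBIF"
    scan["source"] = "SCAN"

    # 4) Keep only the Darwin Core columns shared by both downloads (~80 columns).
    shared_cols = [c for c in gbif.columns if c in scan.columns]
    combined = pd.concat([gbif[shared_cols], scan[shared_cols]], ignore_index=True)

    # 3) Drop records that appear in both portals, identified by identical
    #    catalogNumber + institutionCode pairs.
    combined = combined.drop_duplicates(subset=["catalogNumber", "institutionCode"])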

  3. Semi-supervised data cleaning

    • resodate.org
    Updated Dec 4, 2020
    Cite
    Mohammad Mahdavi Lahijani (2020). Semi-supervised data cleaning [Dataset]. http://doi.org/10.14279/depositonce-10928
    Explore at:
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    DepositOnce
    Technische Universität Berlin
    Authors
    Mohammad Mahdavi Lahijani
    Description

    Data cleaning is one of the most important but time-consuming tasks for data scientists. The data cleaning task consists of two major steps: (1) error detection and (2) error correction. The goal of error detection is to identify wrong data values; the goal of error correction is to fix them. Data cleaning is challenging because of the trade-off among correctness, completeness, and automation: detecting/correcting all data errors accurately without any user involvement is not possible for every dataset.

    We propose a novel data cleaning approach that detects/corrects data errors with a novel two-step task formulation. The intuition is that, by collecting a set of base error detectors/correctors that can independently mark/fix data errors, we can learn to combine them into a final set of data errors/corrections using a few informative user labels. First, each base error detector/corrector generates an initial set of potential data errors/corrections. Then, the approach ensembles the output of these base error detectors/correctors into one final set of data errors/corrections in a semi-supervised manner: it iteratively asks the user to annotate a tuple, i.e., to mark/fix a few data errors, and learns to generalize these user-provided error detection/correction examples to the rest of the dataset.

    Our two-step formulation of the error detection/correction task has four benefits. First, the approach is configuration free and does not need any user-provided rules or parameters; it treats the base error detectors/correctors as black-box algorithms that are not necessarily correct or complete. Second, the approach is effective in the error detection/correction task, as its first and second steps maximize recall and precision, respectively. Third, the approach minimizes human involvement, as it samples the most informative tuples of the dataset for user labeling. Fourth, the task formulation allows us to leverage previous data cleaning efforts to optimize the current data cleaning task.

    We design an end-to-end data cleaning pipeline according to this approach that takes a dirty dataset as input and outputs a cleaned dataset. Our pipeline leverages user feedback, a set of data cleaning algorithms, and a set of previously cleaned datasets, if available. Internally, it consists of an error detection system (named Raha), an error correction system (named Baran), and a transfer learning engine. As our extensive experiments show, our data cleaning systems are effective and efficient, and involve the user minimally. Raha and Baran significantly outperform existing data cleaning approaches in terms of effectiveness and human involvement on multiple well-known datasets.
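
    The two-step formulation (run a set of black-box base detectors, then learn from a few user-labeled tuples how to combine their votes) can be illustrated with a small scikit-learn sketch. This is only a toy illustration of the idea, not the Raha/Baran implementation; the detector votes and labels below are made up.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Step 1: each base detector marks the cells it considers erroneous.
    # detector_votes has shape (n_cells, n_detectors); 1 means "flagged as error".
    detector_votes = np.array([
        [1, 0, 1],
        [0, 0, 0],
        [1, 1, 1],
        [0, 1, 0],
    ])

    # A few user-provided labels on sampled cells (1 = actual error, 0 = clean).
    labeled_idx = [0, 1]
    labels = np.array([1, 0])

    # Step 2: learn to combine the detectors' outputs from the labeled examples
    # and generalize to the remaining, unlabeled cells.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(detector_votes[labeled_idx], labels)
    print(clf.predict(detector_votes))  # final ensemble decision for every cell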

  4. TESA example data

    • researchdata.edu.au
    • bridges.monash.edu
    Updated Mar 20, 2017
    Cite
    Nigel Rogasch (2017). TESA example data [Dataset]. http://doi.org/10.4225/03/5719cebc59438
    Explore at:
    Dataset updated
    Mar 20, 2017
    Dataset provided by
    Monash University
    Authors
    Nigel Rogasch
    Description

    The TMS-EEG signal analyser (TESA) is an open source extension for EEGLAB that includes functions necessary for cleaning and analysing TMS-EEG data. Both EEGLAB and TESA run in Matlab (r2015b or later). The attached files are example data files which can be used with TESA.

    To download TESA, visit here:

    http://nigelrogasch.github.io/TESA/

    To read the TESA user manual, visit here:

    https://www.gitbook.com/book/nigelrogasch/tesa-user-manual/details


    File info:

    example_data.set

    WARNING: file size = 1.1 GB. A raw data set for trialling TESA. Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.

    example_data_epoch_demean.set

    File size = 340 MB. A partially processed data file of smaller size corresponding to step 8 of the analysis pipeline in the TESA user manual. Channel locations were loaded, unused electrodes removed, bad electrodes removed, the data epoched (-1000 to 1000 ms) and demeaned (baseline corrected over -1000 to 1000 ms). Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.

    example_data_epoch_demean_cut_int_ds.set

    File size = 69 MB. A further processed, even smaller data file corresponding to step 11 of the analysis pipeline in the TESA user manual. In addition to the above steps, data around the TMS pulse artifact was removed (-2 to 10 ms), replaced using linear interpolation, and the data downsampled to 1,000 Hz. Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
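
    If you prefer Python to EEGLAB/Matlab, the .set/.fdt pairs can also be read with MNE-Python, which supports EEGLAB files. This is an alternative workflow, not part of TESA itself, and it assumes the files sit in the working directory.

    import mne

    # Continuous (raw) example data: both example_data.set and example_data.fdt
    # must be in the same directory, mirroring the EEGLAB requirement noted above.
    raw = mne.io.read_raw_eeglab("example_data.set", preload=False)
    print(raw.info)

    # The epoched files are loaded as epochs instead.
    epochs = mne.read_epochs_eeglab("example_data_epoch_demean.set")
    print(epochs)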


    Example data info:

    Monophasic TMS pulses (current flow = posterior-anterior in brain) were given through a figure-of-eight coil (external diameter = 90 mm) connected to a Magstim 2002 unit (Magstim company, UK). 150 TMS pulses were delivered over the left superior parietal cortex (MNI coordinates: -20, -65, 65) at a rate of 0.2 Hz ± 25% jitter. TMS coil position was determined using frameless stereotaxic neuronavigation (Localite TMS Navigator, Localite, Germany) and intensity was set at resting motor threshold of the first dorsal interosseous muscle (68% maximum stimulator output). EEG was recorded from 62 TMS-specialised, c-ring slit electrodes (EASYCAP, Germany) using a TMS-compatible EEG amplifier (BrainAmp DC, BrainProducts GmbH, Germany). Data from all channels were referenced to the FCz electrode online with the AFz electrode serving as the common ground. EEG signals were digitised at 5 kHz (filtering: DC-1000 Hz) and EEG electrode impedance was kept below 5 kΩ.


  5. Dataset - A Computational Simulator for Incineration of Wastes Generated...

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 16, 2023
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Dataset - A Computational Simulator for Incineration of Wastes Generated from Cleanup of Chemical and Biological Contamination Incidents [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/dataset-a-computational-simulator-for-incineration-of-wastes-generated-from-cleanup-of-che
    Explore at:
    Dataset updated
    Jul 16, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The figures in the paper are outputs from the model based on example run conditions. Figures 4 and 5 show comparisons between the model predictions and measured data from the EPA Lab Kiln module. Figure 6 shows the difference between using the Z-Value and the Arrhenius approach to the kinetics of biological agent destruction. Figures 7 and 8 are 3-D heat maps of the temperature and oxygen concentration, respectively, in the commercial rotary kiln module. They are raster pictures, so they are not the traditional x-y coordinate graphs. Figure 9 shows streamlines within the primary combustion chamber of the commercial rotary kiln and predicted destruction of the GB nerve agent along those streamlines. Figure 10 shows predicted gas temperature along a streamline in the commercial rotary kiln module. Figure 11 shows example predictions of the mole fraction of 3 chemical warfare agents along streamlines in the commercial rotary kiln module. Figures 12 and 13 show predicted destruction and waste "piece" temperature of the biological agent Bacillus anthracis in bundles of carpet in the commercial rotary kiln.

    This dataset is associated with the following publication: Lemieux, P., T. Boe, A. Tschursin, M. Denison, K. Davis, and D. Swenson. Computational simulation of incineration of chemically and biologically contaminated wastes. JOURNAL OF THE AIR & WASTE MANAGEMENT ASSOCIATION. Air & Waste Management Association, Pittsburgh, PA, USA, 71(4): 462-476, (2021).

  6. A set of generated Instagram Data Download Packages (DDPs) to investigate...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 28, 2021
    Cite
    Laura Boeschoten (2021). A set of generated Instagram Data Download Packages (DDPs) to investigate their structure and content [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4472605
    Explore at:
    Dataset updated
    Jan 28, 2021
    Dataset provided by
    Daniel Oberski
    Ruben van den Goorbergh
    Laura Boeschoten
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Instagram data-download example dataset

    In this repository you can find a data-set consisting of 11 personal Instagram archives, or Data-Download Packages (DDPs).

    How the data was generated

    These Instagram accounts were all newly created by a group of researchers who wanted to investigate in detail the structure, and the variation in structure, of these Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week. The participants also communicated intensively with each other so that the data can be used as an example of a network.

    The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses and URLs. In addition, the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties. However, only content of so-called 'professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.

    Obtaining your Instagram DDP

    After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.

    1. Go to www.instagram.com and log in
    2. Click on your profile picture, go to Settings and Privacy and Security
    3. Scroll to Data download and click Request download
    4. Enter your email address and click Next
    5. Enter your password and click Request download

    Instagram then delivered the data in a compressed zip folder with the format username_YYYYMMDD.zip (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.

    Data cleaning

    To comply with the Instagram user agreement, participants shared their full name, phone number and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements so that the DDPs shared here do not contain any personal data of the participants.
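
    As a rough illustration of this kind of replacement (not the de-identification software evaluated in the study), a few regular expressions can swap e-mail addresses, phone numbers and IP addresses for fixed placeholders; the patterns and replacement values below are illustrative assumptions.

    import re

    def deidentify(text: str) -> str:
        """Replace e-mail addresses, phone numbers, and IPv4 addresses with placeholders."""
        text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "user@example.com", text)
        text = re.sub(r"\+?\d[\d \-]{7,}\d", "0600000000", text)
        text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", "0.0.0.0", text)
        return text

    print(deidentify("Contact jane@doe.nl or +31 6 1234 5678 from 192.168.1.10"))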

    How this data-set can be used

    This data-set was generated with the intention to evaluate the performance of de-identification software. We invite other researchers to use this data-set, for example, to investigate what type of data can be found in Instagram DDPs or to investigate the structure of Instagram DDPs. The packages can also be used for example data-analyses, although no substantive research questions can be answered using this data, as the data does not reflect how research subjects behave 'in the wild'.

    Authors

    The data collection is executed by Laura Boeschoten, Ruben van den Goorbergh and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.

    Acknowledgments

    The researchers would like to thank everyone who participated in this data-generation project.

  7. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    + more versions
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In the first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource; a simplified Python sketch of the same design is shown below.
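
    The sketch below mimics the two-stage design in pandas (the official sampling script distributed with the dataset is in R); the frame file and the column names other than geo_1 are assumptions, and the proportional allocation is simplified.

    import pandas as pd

    households = pd.read_csv("household_frame.csv")  # assumed frame: geo_1, urban_rural, ea_id, hh_id
    TOTAL_EAS = 320                                  # 320 EAs x 25 households = 8,000 households

    # Stage 1: allocate EAs to strata proportionally to stratum size, then sample EAs.
    strata = households.groupby(["geo_1", "urban_rural"])["ea_id"].nunique()
    alloc = (strata / strata.sum() * TOTAL_EAS).round().astype(int)  # rounding may need minor adjustment

    sampled_eas = []
    for (geo, urb), n_ea in alloc.items():
        eas = households.loc[(households.geo_1 == geo) & (households.urban_rural == urb), "ea_id"].unique()
        sampled_eas.extend(pd.Series(eas).sample(n=n_ea, random_state=1))

    # Stage 2: select 25 households at random within each sampled EA.
    sample = (households[households.ea_id.isin(sampled_eas)]
              .groupby("ea_id", group_keys=False)
              .apply(lambda g: g.sample(n=25, random_state=1)))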

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  8. CNVVE Dataset clean audio samples

    • darus.uni-stuttgart.de
    Updated Feb 13, 2024
    + more versions
    Cite
    Ramin Hedeshy; Raphael Menges; Steffen Staab (2024). CNVVE Dataset clean audio samples [Dataset]. http://doi.org/10.18419/DARUS-3898
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    DaRUS
    Authors
    Ramin Hedeshy; Raphael Menges; Steffen Staab
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Dataset funded by
    BMBF
    BMWK/ESF
    Description

    This CNVVE Dataset contains clean audio samples encompassing six distinct classes of voice expressions, namely “Uh-huh” or “mm-hmm”, “Uh-uh” or “mm-mm”, “Hush” or “Shh”, “Psst”, “Ahem”, and Continuous humming, e.g., “hmmm.” Audio samples of each class are found in the respective folders. These audio samples have undergone a thorough cleaning process. The raw samples are published in https://doi.org/10.18419/darus-3897. Initially, we applied the Google WebRTC voice activity detection (VAD) algorithm on the given audio files to remove noise or silence from the collected voice signals. The intensity was set to "2", which could be a value between "1" and "3". However, because of variations in the data, some files required additional manual cleaning. These outliers, characterized by sharp click sounds (such as those occurring at the end of recordings), were addressed. The samples are recorded through a dedicated website for data collection that defines the purpose and type of voice data by providing example recordings to participants as well as the expressions’ written equivalent, e.g., “Uh-huh”. Audio recordings were automatically saved in the .wav format and kept anonymous, with a sampling rate of 48 kHz and a bit depth of 32 bits. For more info, please check the paper or feel free to contact the authors for any inquiries.
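
    A rough sketch of the VAD-based trimming step, using the py-webrtcvad bindings for the Google WebRTC VAD at aggressiveness 2. It approximates the described pipeline rather than reproducing the authors' code, and it assumes the audio has first been converted to 16-bit mono PCM (the VAD does not accept 32-bit samples).

    import wave
    import webrtcvad

    vad = webrtcvad.Vad(2)  # the "intensity 2" setting mentioned above

    with wave.open("sample_16bit.wav", "rb") as wf:  # assumed 16-bit mono PCM at 48 kHz
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())

    frame_ms = 30                                  # WebRTC VAD accepts 10/20/30 ms frames
    frame_bytes = int(rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample

    voiced = bytearray()
    for start in range(0, len(frames) - frame_bytes + 1, frame_bytes):
        chunk = frames[start:start + frame_bytes]
        if vad.is_speech(chunk, rate):             # keep only frames containing voice
            voiced.extend(chunk)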

  9. Household Expenditure and Income Survey 2008, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    Updated Jan 12, 2022
    + more versions
    Cite
    Department of Statistics (2022). Household Expenditure and Income Survey 2008, Economic Research Forum (ERF) Harmonization Data - Jordan [Dataset]. https://catalog.ihsn.org/index.php/catalog/7661
    Explore at:
    Dataset updated
    Jan 12, 2022
    Dataset authored and provided by
    Department of Statistics
    Time period covered
    2008 - 2009
    Area covered
    Jordan
    Description

    Abstract

    The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.

    Data collected through the survey helped in achieving the following objectives:
    1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
    2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
    3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
    4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
    5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
    6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
    7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty

    Geographic coverage

    National

    Analysis unit

    • Household/families
    • Individuals

    Universe

    The survey covered a national sample of households and all individuals permanently residing in surveyed households.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2008 Household Expenditure and Income Survey sample was designed using a two-stage stratified cluster sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn using probability proportional to size, with the number of households in each block taken as the block size. The second stage included drawing the household sample (8 households from each PSU) using the systematic sampling method. Four substitute households from each PSU were also drawn, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.

    To estimate the sample size, the coefficient of variation and design effect in each sub-district were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level, to ensure good cluster representation in the administrative areas and enable drawing poverty pockets.

    It is worth mentioning that the expected non-response, in addition to areas where poor families are concentrated in the major cities, was taken into consideration in designing the sample. Therefore, a larger sample was taken from these areas compared to other ones, in order to help reach and cover the poverty pockets.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form

    Cleaning operations

    Raw Data: The design and implementation of this survey included the following procedures:
    1. Sample design and selection
    2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparing instruction manuals
    3. Design of the table templates to be used for the dissemination of the survey results
    4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
    5. Selection and training of survey staff to collect data and run the required data checks
    6. Preparation and implementation of the pretest phase for the survey, designed to test and develop forms/questionnaires, instructions and software programs required for data processing and production of survey results
    7. Data collection
    8. Data checking and coding
    9. Data entry
    10. Data cleaning using data validation programs
    11. Data accuracy and consistency checks
    12. Data tabulation and preliminary results
    13. Preparation of the final report and dissemination of final results

    Harmonized Data:
    • The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets
    • The harmonization process started with cleaning all raw data files received from the Statistical Office
    • Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
    • A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
    • A post-harmonization cleaning process was run on the data
    • Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format
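
    The harmonization itself was done in SPSS; a hedged pandas analogue of the merge-then-recode pattern described above might look like the following (the file and variable names are hypothetical).

    import pandas as pd

    # Cleaned raw files received from the Statistical Office (hypothetical names).
    hh = pd.read_csv("heis2008_household_clean.csv")
    ind = pd.read_csv("heis2008_individual_clean.csv")

    # Merge everything onto one individual-level file carrying the household variables.
    individual = ind.merge(hh, on="household_id", how="left")

    # Country-specific recodes/renames into the harmonized variable set.
    individual = individual.rename(columns={"educ_level": "educy"})
    individual["urban"] = individual["area_type"].map({1: "urban", 2: "rural"})

    # Save harmonized files at both levels (SPSS/Stata export would follow here).
    individual.to_csv("heis2008_individual_harmonized.csv", index=False)
    hh.to_csv("heis2008_household_harmonized.csv", index=False)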

  10. College Education

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Dec 26, 2023
    Cite
    RN Uma; Alade Tokuta; Rebecca Zulli Lowe; Adrienne Smith (2023). College Education [Dataset]. http://doi.org/10.6084/m9.figshare.16571459.v3
    Explore at:
    Available download formats: pdf
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    RN Uma; Alade Tokuta; Rebecca Zulli Lowe; Adrienne Smith
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is accessed from https://www.kaggle.com/jessemostipak/college-tuition-diversity-and-pay and was downloaded on August 4, 2021.

    The following excerpt is from Kaggle regarding the sources of this dataset:

    The data this week comes from many different sources but originally came from the US Department of Education.

    • Tuition and fees by college/university for 2018-2019, along with school type, degree length, state, in-state vs out-of-state, from the Chronicle of Higher Education.
    • Diversity by college/university for 2014, along with school type, degree length, state, in-state vs out-of-state, from the Chronicle of Higher Education.
    • Example diversity graphics from Priceonomics.
    • Average net cost by income bracket from TuitionTracker.org.
    • Example price trend and graduation rates from TuitionTracker.org.
    • Salary potential data from payscale.com.

    This dataset included the following files:

    1. diversity_school.csv

    2. historical_tuition.csv

    3. salary_potential.csv

    4. tuition_cost.csv

    5. tuition_income.csv

    After data cleaning, the data in diversity_school.csv and tuition_cost.csv were merged and the data in salary_potential.csv and tuition_income.csv were merged. The combined datasets were then split based on the US Census Regions into West, Midwest, Northeast and South (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf).
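
    A minimal pandas sketch of that merging and region split follows; the join keys and the state-to-region mapping are assumptions, since the cleaning code itself is not included with the dataset.

    import pandas as pd

    diversity = pd.read_csv("diversity_school.csv")
    tuition = pd.read_csv("tuition_cost.csv")

    # Merge the two files on institution name and state (assumed join keys).
    merged = diversity.merge(tuition, on=["name", "state"], how="inner")

    # Split by US Census region (mapping abbreviated here; see the Census reference PDF).
    region_map = {"CA": "West", "IL": "Midwest", "NY": "Northeast", "TX": "South"}  # ...etc.
    merged["region"] = merged["state"].map(region_map)

    for region, df in merged.groupby("region"):
        df.to_csv(f"college_{region.lower()}.csv", index=False)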

  11. Daily Tasks Park Cleaning Records

    • splitgraph.com
    • data.cityofnewyork.us
    • +2more
    Updated Oct 14, 2024
    + more versions
    Cite
    cityofnewyork-us (2024). Daily Tasks Park Cleaning Records [Dataset]. https://www.splitgraph.com/cityofnewyork-us/daily-tasks-park-cleaning-records-kwte-dppd
    Explore at:
    Available download formats: application/vnd.splitgraph.image, application/openapi+json, json
    Dataset updated
    Oct 14, 2024
    Authors
    cityofnewyork-us
    Description

    This data contains cleaning records for City property under the jurisdiction of or maintained by NYC Parks. It includes records of tasks such as opening parks, cleaning and restocking comfort stations, and removing graffiti, litter and natural debris.

    For the User Guide, please follow this link

    For the Data Dictionary, please follow this link

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
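
    One hedged way to issue such a query from Python is to POST SQL to the Splitgraph Data Delivery Network over HTTP. The endpoint below is how I recall Splitgraph's public DDN interface; treat it, and the table name inside the repository, as assumptions and check the documentation linked below for the authoritative form.

    import requests

    # Query the dataset over HTTP (endpoint and table name are assumptions).
    resp = requests.post(
        "https://data.splitgraph.com/sql/query/ddn",
        json={"sql": 'SELECT * FROM "cityofnewyork-us/daily-tasks-park-cleaning-records-kwte-dppd".daily_tasks_park_cleaning_records LIMIT 10'},
        timeout=30,
    )
    print(resp.json())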

    See the Splitgraph documentation for more information.

  12. Street Sweeping Schedule - 2016

    • splitgraph.com
    • data.cityofchicago.org
    • +5more
    Updated Oct 25, 2016
    + more versions
    Cite
    City of Chicago (2016). Street Sweeping Schedule - 2016 [Dataset]. https://www.splitgraph.com/cityofchicago/street-sweeping-schedule-2016-x2vd-qke7/
    Explore at:
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Oct 25, 2016
    Dataset authored and provided by
    City of Chicago
    Description

    Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/icje-4fmy. For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications; see the query sketch under dataset 11 above for an example.

    See the Splitgraph documentation for more information.

  13. Ultimate Arabic News Dataset

    • data.mendeley.com
    • opendatalab.com
    • +1more
    Updated Jul 4, 2022
    + more versions
    Cite
    Ahmed Hashim Al-Dulaimi (2022). Ultimate Arabic News Dataset [Dataset]. http://doi.org/10.17632/jz56k5wxz7.2
    Explore at:
    Dataset updated
    Jul 4, 2022
    Authors
    Ahmed Hashim Al-Dulaimi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.

    Arabic news data was collected using web-scraping techniques from many well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), as well as news surfaced by the Google search engine and various other sources.

    • The data we collected consists of two primary files:

    UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.

    UltimateArabicPrePros: A file containing the data from the first file after pre-processing, which reduced it to about 188,000 text documents. Stop words, non-Arabic words, symbols and numbers have been removed so that the file is ready for direct use in various Arabic natural language processing tasks, such as text classification.

    • We have added two folders containing additional detailed datasets:

    1- Sample: This folder contains samples of the results of the web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. This folder contains two datasets:

    Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.
    Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.

    2- Dataset Versions: This folder contains four different versions of the original dataset, from which the appropriate version can be selected for use in text classification. The first version (Original) contains the raw data without any pre-processing, so its number of tokens is very high. In the second version (Original_without_Stop) the data was cleaned by removing symbols, numbers, and non-Arabic words, as well as stop words, so the number of tokens is greatly reduced. In the third version (Original_with_Stem) the data was cleaned and a text stemming technique was applied to strip affixes that might affect the accuracy of the results and to obtain the word roots. In the fourth version (Original_Without_Stop_Stem) all preprocessing techniques (data cleaning, stop-word removal and stemming) were applied, so this version has the lowest number of tokens of all the releases. A small preprocessing sketch is shown after the category list below.

    • The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
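
    As mentioned above, here is a small sketch of the kind of pre-processing applied to produce UltimateArabicPrePros and the cleaned "Dataset Versions": removing non-Arabic characters and digits, dropping stop words, and stemming with NLTK's ISRI stemmer. It is illustrative only, not the authors' pipeline, and the stop-word list is truncated.

    import re
    from nltk.stem.isri import ISRIStemmer

    ARABIC_STOPWORDS = {"في", "من", "على", "إلى", "عن"}  # truncated illustrative list
    stemmer = ISRIStemmer()

    def preprocess(text: str, stem: bool = True) -> str:
        text = re.sub(r"[^\u0621-\u064A\s]", " ", text)   # keep Arabic letters only
        tokens = [t for t in text.split() if t not in ARABIC_STOPWORDS]
        if stem:
            tokens = [stemmer.stem(t) for t in tokens]    # reduce words to their roots
        return " ".join(tokens)
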
  14. Street Sweeping Schedule - 2013

    • splitgraph.com
    • data.cityofchicago.org
    • +1more
    Updated Jul 11, 2018
    + more versions
    Cite
    City of Chicago (2018). Street Sweeping Schedule - 2013 [Dataset]. https://www.splitgraph.com/cityofchicago/street-sweeping-schedule-2013-8h6a-imtk
    Explore at:
    Available download formats: application/vnd.splitgraph.image, json, application/openapi+json
    Dataset updated
    Jul 11, 2018
    Dataset authored and provided by
    City of Chicago
    Description

    Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit http://bit.ly/Hz0aCo. For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications; see the query sketch under dataset 11 above for an example.

    See the Splitgraph documentation for more information.

  15. sweepingsections

    • splitgraph.com
    • data.cityofchicago.org
    • +3more
    Updated Apr 10, 2024
    Cite
    City of Chicago (2024). sweepingsections [Dataset]. https://www.splitgraph.com/cityofchicago/sweepingsections-eiv4-4c3n
    Explore at:
    Available download formats: application/openapi+json, application/vnd.splitgraph.image, json
    Dataset updated
    Apr 10, 2024
    Dataset authored and provided by
    City of Chicago
    Description

    Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/izuq-yb9q.

    For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP. The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications; see the query sketch under dataset 11 above for an example.

    See the Splitgraph documentation for more information.

  16. Climate Change: Earth Surface Temperature Data

    • kaggle.com
    • redivis.com
    zip
    Updated May 1, 2017
    Cite
    Berkeley Earth (2017). Climate Change: Earth Surface Temperature Data [Dataset]. https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data
    Explore at:
    Available download formats: zip (88843537 bytes)
    Dataset updated
    May 1, 2017
    Dataset authored and provided by
    Berkeley Earth (http://berkeleyearth.org/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Area covered
    Earth
    Description

    Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.

    us-climate-change

    Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.

    Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.

    We have repackaged the data from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example, by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.

    In this dataset, we have included several files:

    Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):

    • Date: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures
    • LandAverageTemperature: global average land temperature in celsius
    • LandAverageTemperatureUncertainty: the 95% confidence interval around the average
    • LandMaxTemperature: global average maximum land temperature in celsius
    • LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature
    • LandMinTemperature: global average minimum land temperature in celsius
    • LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature
    • LandAndOceanAverageTemperature: global average land and ocean temperature in celsius
    • LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature

    Other files include:

    • Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)
    • Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)
    • Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)
    • Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)

    The raw data comes from the Berkeley Earth data page.
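
    For example, the global file can be loaded and summarized with pandas. The column names follow the listing above; if the downloaded CSV uses a different date column name (e.g. "dt"), adjust accordingly.

    import pandas as pd

    temps = pd.read_csv("GlobalTemperatures.csv", parse_dates=["Date"])

    # The combined land-and-ocean series only starts in 1850, so drop earlier rows.
    recent = temps.dropna(subset=["LandAndOceanAverageTemperature"])

    # Annual mean of the global average land temperature, in Celsius.
    annual = recent.groupby(recent["Date"].dt.year)["LandAverageTemperature"].mean()
    print(annual.tail())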

  17. 3. Original EPIC-1 Data Source

    • dataverse.harvard.edu
    Updated Dec 22, 2016
    Cite
    Harvard T.H. Chan School of Public Health - Center for Health Decision Sciences (2016). 3. Original EPIC-1 Data Source [Dataset]. http://doi.org/10.7910/DVN/A4HJUR
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 22, 2016
    Dataset provided by
    Harvard Dataverse
    Authors
    Harvard T.H. Chan School of Public Health - Center for Health Decision Sciences
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.7910/DVN/A4HJUR

    Description

    Original EPIC-1 data source and documented intermediate data manipulation. These files are provided in order to ensure a complete audit trail and documentation. They include the original source data, as well as files created in the process of cleaning and preparing the datasets found in section I of the dataverse (1. Pooled and Adjusted EPIC Data). These intermediary files document any adjustments in assumptions, currency conversions, and data cleaning processes. Ordinarily, analysis would be done using the datasets in section I; researchers would not need the files in this section unless tracing the origin of the variables back to the original source.

    "Adjustments for the EPIC-2 data is conducted with advice and input from data collection team (EPIC-1). The magnitude of these adjustments are documented in the table attached. These documented adjustments explained the lion's share of the discrepancies, leaving only minor unaccounted differences in the data (Δ range 0% - 1.1%)."

    "In addition to using the sampling weights, any extrapolation to achieve nationwide cost estimates for Benin, Ghana, Zambia, and Honduras uses a scale-up factor to take into account facilities that are outside of the sampling frame. For example, after taking into account the sampling weights, the total facility-level delivery cost in the Benin sampling frame (343 facilities) is $2,094,031. To estimate the total facility-level delivery cost in the entire country of Benin (695 facilities), the sample-frame cost estimate is multiplied by 695/343."

    "Additional adjustments for the EPIC-2 analysis include the series of decisions for weighting, methods, and data sources. For EPIC-2 analyses, average costs per dose and DTP3 were calculated as total costs divided by total outputs, representing a funder's perspective. We also report results as a simple average of the site-level cost per output. All estimates were adjusted for survey weighting. In particular, the analyses in EPIC-2 relied exclusively on information from the sample, whereas in some instances EPIC-1 teams were able to strategically leverage other available data sources."
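
    As a worked instance of the scale-up factor quoted above, using the Benin figures (a sketch, not code from the dataset):

    # Scale the sample-frame cost estimate up to the national facility count.
    sample_frame_cost = 2_094_031   # USD, 343 facilities in the Benin sampling frame
    scale_up = 695 / 343            # national facilities / sampling-frame facilities
    print(round(sample_frame_cost * scale_up))  # ~4,243,000 USD nationwide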

  18. wikipedia

    • tensorflow.org
    • huggingface.co
    Cite
    wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. The Surface Water Chemistry (SWatCh) database

    • zenodo.org
    bin, csv, xml
    Updated Apr 26, 2022
    + more versions
    Cite
    Lobke Rotteveel; Franz Heubach (2022). The Surface Water Chemistry (SWatCh) database [Dataset]. http://doi.org/10.5281/zenodo.6484939
    Explore at:
    Available download formats: bin, xml, csv
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lobke Rotteveel; Franz Heubach
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is the dataset presented in the following manuscript: The Surface Water Chemistry (SWatCh) database: A standardized global database of water chemistry to facilitate large-sample hydrological research, which is currently under review at Earth System Science Data.

    Openly accessible global scale surface water chemistry datasets are urgently needed to detect widespread trends and problems, to help identify their possible solutions, and determine critical spatial data gaps where more monitoring is required. Existing datasets are limited in availability, sample size/sampling frequency, and geographic scope. These limitations inhibit the answering of emerging transboundary water chemistry questions, for example, the detection and understanding of delayed recovery from freshwater acidification. Here, we begin to address these limitations by compiling the global surface water chemistry (SWatCh) database. We collect, clean, standardize, and aggregate open access data provided by six national and international agencies to compile a database containing information on sites, methods, and samples, and a GIS shapefile of site locations. We remove poor quality data (for example, values flagged as “suspect” or “rejected”), standardize variable naming conventions and units, and perform other data cleaning steps required for statistical analysis. The database contains water chemistry data for streams, rivers, canals, ponds, lakes, and reservoirs across seven continents, 24 variables, 33,722 sites, and over 5 million samples collected between 1960 and 2022. Similar to prior research, we identify critical spatial data gaps on the African and Asian continents, highlighting the need for more data collection and sharing initiatives in these areas, especially considering freshwater ecosystems in these environs are predicted to be among the most heavily impacted by climate change. We identify the main challenges associated with compiling global databases – limited data availability, dissimilar sample collection and analysis methodology, and reporting ambiguity – and provide recommended solutions. By addressing these challenges and consolidating data from various sources into one standardized, openly available, high quality, and trans-boundary database, SWatCh allows users to conduct powerful and robust statistical analyses of global surface water chemistry.
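
    The flag-based filtering and unit standardization described above can be illustrated with a short pandas sketch; the column names and the single unit conversion shown are assumptions, not the SWatCh build scripts.

    import pandas as pd

    samples = pd.read_csv("samples_raw.csv")  # hypothetical per-agency export

    # Remove poor-quality data, e.g. values flagged as "suspect" or "rejected".
    samples = samples[~samples["quality_flag"].isin(["suspect", "rejected"])]

    # Standardize variable naming conventions...
    samples["variable"] = samples["variable"].replace({"SO4": "sulphate", "Sulfate": "sulphate"})

    # ...and units (e.g. convert ug/L to mg/L so each variable has a single unit).
    ug = samples["unit"] == "ug/L"
    samples.loc[ug, "value"] = samples.loc[ug, "value"] / 1000.0
    samples.loc[ug, "unit"] = "mg/L"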

  20. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
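
    To accompany the strategies the document surveys, here is a brief scikit-learn/pandas sketch contrasting deletion, mean imputation, and a regression-based (iterative) imputer. It illustrates the general techniques on made-up data rather than anything from the document itself.

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, SimpleImputer

    df = pd.DataFrame({"age": [23, 31, np.nan, 45, 29],
                       "income": [40_000, np.nan, 52_000, 61_000, np.nan]})

    listwise = df.dropna()  # deletion: keep complete cases only

    mean_imp = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)  # mean imputation

    iter_imp = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                            columns=df.columns)  # regression-based (iterative) imputation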
