33 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop

    • search.dataone.org
    • borealisdata.ca
    Updated Jul 31, 2024
    Cite
    Costanzo, Lucia; Jadon, Vivek (2024). Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop [Dataset]. http://doi.org/10.5683/SP3/FF6AI9
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Borealis
    Authors
    Costanzo, Lucia; Jadon, Vivek
    Description

    Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.

  3. Excel-project: Glassdoor Data Cleaning

    • kaggle.com
    Updated Sep 26, 2023
    Cite
    Luis Lira (2023). Excel-project: Glassdoor Data Cleaning [Dataset]. https://www.kaggle.com/datasets/luisliraportfolio/excel-project-clean-dataset
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Luis Lira
    Description

    Dataset

    This dataset was created by Luis Lira


  4. Global exporters importers-export import data of Clean excel

    • volza.com
    csv
    Updated May 31, 2025
    Cite
    Volza FZ LLC (2025). Global exporters importers-export import data of Clean excel [Dataset]. https://www.volza.com/trade-data-global/global-exporters-importers-export-import-data-of-clean+excel
    Available download formats: csv
    Dataset updated
    May 31, 2025
    Dataset provided by
    Volza
    Authors
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Count of exporters, Count of importers, Count of shipments, Sum of export import value
    Description

    9,130 global export-import shipment records for "Clean excel", with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.

  5. Data from: Cleaning Data with Open Refine

    • explore.openaire.eu
    Updated Jan 1, 2016
    Cite
    Dr Richard Berry; Dr Luc Small; Dr Jeff Christiansen (2016). Cleaning Data with Open Refine [Dataset]. http://doi.org/10.5281/zenodo.6423839
    Dataset updated
    Jan 1, 2016
    Authors
    Dr Richard Berry; Dr Luc Small; Dr Jeff Christiansen
    Description

    About this course

    Do you have messy data from multiple inconsistent sources, or open-responses to questionnaires? Do you want to improve the quality of your data by refining it and using the power of the internet? Open Refine is the perfect partner to Excel. It is a powerful, free tool for exploring, normalising and cleaning datasets, and extending data by accessing the internet through APIs. In this course we’ll work through the various features of Refine, including importing data, faceting, clustering, and calling remote APIs, by working on a fictional but plausible humanities research project.

    Learning Outcomes

    • Download, install and run Open Refine
    • Import data from csv, text or online sources and create projects
    • Navigate data using the Open Refine interface
    • Explore data by using facets
    • Clean data using clustering
    • Parse data using GREL syntax
    • Extend data using Application Programming Interfaces (APIs)
    • Export project for use in other applications

    Prerequisites

    The course has no prerequisites.

    Licence

    Copyright © 2021 Intersect Australia Ltd. All rights reserved.

  6. popular baby names with data cleaning

    • kaggle.com
    Updated Jun 11, 2023
    Cite
    Real Sourabh Singhal (2023). popular baby names with data cleaning [Dataset]. https://www.kaggle.com/datasets/realsourabhsinghal/popular-baby-names-with-data-cleaning/code
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Real Sourabh Singhal
    Description

    A fully cleaned Excel file, intended to support accurate data analysis with proper visualization.

  7. covid19_clean_complete & Data_Excel & Assignment_1

    • kaggle.com
    Updated Feb 7, 2025
    Cite
    Mohammed Khaled Boyka (2025). covid19_clean_complete & Data_Excel & Assignment_1 [Dataset]. https://www.kaggle.com/datasets/mohammedkhaledboyka/covid19-clean-complete-and-data-excel-and-assignment-1/code
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohammed Khaled Boyka
    Description

    Dataset

    This dataset was created by Mohammed Khaled Boyka

    Released under Other (specified in description)


  8. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key Features and Tools:

    • Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    • Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    • Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    • Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.

  9. Bike Sharing case study 1

    • kaggle.com
    Updated Nov 2, 2022
    Cite
    mukti shukla (2022). Bike Sharing case study 1 [Dataset]. https://www.kaggle.com/datasets/muktishukla/bike-sharing-case-study-1
    Dataset updated
    Nov 2, 2022
    Dataset provided by
    Kaggle
    Authors
    mukti shukla
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Case Study 1: Bike Sharing

    Introduction: In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time. Two types of riders use the bikes differently: 1.) Annual members, who bought an annual membership. 2.) Casual riders, who buy single-ride or full-day passes.

    Phase_1 - Ask: 1. Identify the business task: • How do annual members and casual riders use Cyclistic bikes differently? • Why would casual riders buy Cyclistic annual memberships? • How can Cyclistic use digital media to influence casual riders to become members? 2. Consider key stakeholders: Lily Moreno (the director of marketing and manager), the Cyclistic marketing analytics team, and the Cyclistic executive team.

    Phase_2 - Prepare: I downloaded the data and stored it in an Excel sheet. I am using only one month of data (April 2020) and using Excel to solve the task, sorting and filtering the data as required. I downloaded the data from a public source, and it is fully reliable and unbiased. The data is also complete, consistent and accurate.

    Phase_3 - Process: • I downloaded the 202004-divvy-tripdata.csv data, unzipped the file, and converted it into an .xls file. I am using only the April data because this is my first case study and is only for my learning, so I want to keep it simple. I am using Excel this time because I am more comfortable with Excel than with other tools. I also want to perform a good analysis and not get lost in multiple sheets and a large dataset at this initial stage.

    • I checked the data for errors and corrected some of them. I also did some calculations in my sheet and tried to clean the data so that I could use the sheet appropriately.

    Phase_4 - Analyze: I organized my data, sorted and filtered it multiple times as needed, did some calculations, added a few pivot tables, and tried to analyze the data properly and to identify trends and relationships.

    Phase_5 - Share: • After completing my analysis, I used some charts to present my findings. First, I found that the total ride count is 16,383; annual members took 11,552 rides, which is 71% of the total, and casual riders took only 4,831 rides, or 29%.

    • I also found that casual riders use the bikes only occasionally, whereas members ride at any time, whether they need a bike for a long or a short trip, and without a second thought, because after buying an annual pass they do not need to pay each time.

    • Clark St & Elm St is the station where the most bikes were rented: people took 180 bikes from this station, 132 of them annual members. I also found other stations where we need more bikes. Likewise, we can find the stations where most people end their rides, so that those stations have plenty of space for bikes.

    Phase_6 - Act: I am happy to share my findings with you, and I feel a little more confident after completing my first case study.
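The ride shares quoted in the Phase 5 summary can be checked with a few lines of arithmetic. This is only a sketch that reuses the three counts given in the description; the variable names are illustrative and nothing here comes from the author's actual spreadsheet.

```python
# Counts quoted in the Phase 5 summary of the case study.
total_rides = 16383
member_rides = 11552
casual_rides = 4831

# The two rider groups should account for every ride.
assert member_rides + casual_rides == total_rides

# Share of rides per group, as percentages of the total.
member_share = member_rides / total_rides * 100
casual_share = casual_rides / total_rides * 100

print(f"Annual members: {member_share:.1f}% of rides")
print(f"Casual riders:  {casual_share:.1f}% of rides")
```

Rounded to whole percentages, the two shares agree with the 71% and 29% reported in the description.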

  10. Low-Income Energy Affordability Data (LEAD) Tool

    • data.amerigeoss.org
    • datadiscoverystudio.org
    • +1 more
    csv, pdf, xls, xlsb
    Updated Jul 29, 2019
    Cite
    United States[old] (2019). Low-Income Energy Affordability Data (LEAD) Tool [Dataset]. https://data.amerigeoss.org/vi/dataset/clean-energy-for-low-income-communities-accelerator-energy-data-profiles-2fffb
    Available download formats: csv, xlsb, xls, pdf
    Dataset updated
    Jul 29, 2019
    Dataset provided by
    United States[old]
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    ABOUT THIS TOOL:

    The Better Building’s Clean Energy for Low Income Communities Accelerator (CELICA) was launched in 2016 to help state and local partners across the nation meet their goals for increasing uptake of energy efficiency and renewable energy technologies in low and moderate income communities. As a part of the Accelerator, DOE created this Low-Income Energy Affordability Data (LEAD) Tool to assist partners with understanding their LMI community characteristics. This can be utilized for low income and moderate income energy policy and program planning, as it provides interactive state, county and city level worksheets with graphs and data including number of households at different income levels and numbers of homeowners versus renters. It provides a breakdown based on fuel type, building type, and construction year. It also provides average monthly energy expenditures and energy burden (percentage of income spent on energy).

    HOW TO USE:

    The LEAD tool can be used to support program design and goal setting, and it can be paired with other data to improve LMI community energy benchmarking and program evaluation. Datasets are available for all 50 states, census divisions, and tract levels. You will have to enable macros in MS Excel to interact with the data. A description of each of the files, and of which states are included in each U.S. Census Division, can be found in the file "DESCRIPTION OF FILES".

    For more information, visit: https://betterbuildingsinitiative.energy.gov/accelerators/clean-energy-low-income-communities

  11. Real Estate Data

    • kaggle.com
    Updated Jun 7, 2024
    Cite
    AgarwalYashhh (2024). Real Estate Data [Dataset]. https://www.kaggle.com/datasets/agarwalyashhh/gurgaon-real-estate-data
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AgarwalYashhh
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset contains 4 files. The Excel file is the original file produced by scraping the data from the website; it is very raw and uncleaned. After spending a lot of time, I cleaned the data in the way I thought best represents the dataset, so that it can be used for projects. Explore all the datasets and share your notebooks and insights! Consider upvoting if you find it helpful. Thank you.

  12. Data from: Designing data science workshops for data-intensive environmental...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Dec 8, 2020
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Available download formats: zip
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    California State Polytechnic University
    Montana State University
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

    Methods

    Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    • The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
    • The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
    • The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
    • The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
    • The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
    
  13. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 3 release notes:
    • Adds data in the following formats: Excel.
    • Changes project name to avoid confusing this data for the ones done by NACJD.

    Version 2 release notes:
    • Adds data for 2017.
    • Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

    For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
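The rule for blanking suspect values can be illustrated with a short sketch. The author's cleaning was done in R (see the linked repository); the Python below is not that code. It only restates the rule described above, using plain dictionaries for rows, `None` in place of NA, and hypothetical column names chosen for the example.

```python
# Values the description lists as common data-entry errors:
# 1000..10000 in steps of 1000, 20000..100000 in steps of 10000, and 99942.
SUSPECT_VALUES = (
    set(range(1000, 10001, 1000))
    | set(range(20000, 100001, 10000))
    | {99942}
)

# Only columns whose names begin with these prefixes are cleaned;
# columns starting with "value" are deliberately left untouched.
TARGET_PREFIXES = ("offenses", "auto")


def blank_suspect_values(rows):
    """Return copies of the rows with suspect values replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for column, value in row.items():
            if column.startswith(TARGET_PREFIXES) and value in SUSPECT_VALUES:
                new_row[column] = None
        cleaned.append(new_row)
    return cleaned


# Hypothetical column names, for illustration only.
example = [{"offenses_robbery": 5000, "auto_stolen": 12, "value_robbery": 5000}]
print(blank_suspect_values(example))
```

In the example row, only the "offenses" column is blanked: the "auto" value is not on the suspect list, and the "value" column is excluded from this cleaning step.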

  14. The fractured lab notebook: undergraduate and ecological data management...

    • search.dataone.org
    Updated Nov 14, 2013
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Center for Ecological Analysis and Synthesis; Carly Strasser (2013). The fractured lab notebook: undergraduate and ecological data management training in the United States [Dataset]. https://search.dataone.org/view/knb.300.9
    Explore at:
    Dataset updated
    Nov 14, 2013
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    National Center for Ecological Analysis and Synthesis; Carly Strasser
    Time period covered
    Mar 29, 2011 - May 25, 2011
    Area covered
    Variables measured
    Answer, Coding, EndDate, Question, R script, StartDate, First Name, Param name, Description, RespondentID, and 157 more
    Description

    Data presented here are those collected from a survey of Ecology professors at 48 undergraduate institutions to assess the current state of data management education. The following files have been uploaded:

    Scripts (2):

    1. DataCleaning_20120105.R is an R script for cleaning up data prior to analysis. This script removes spaces, substitutes text for codes, removes duplicate schools, and converts questions and answers from the survey into simpler parameter names, without any numbers, spaces, or symbols. This script is heavily annotated to assist the user of the file in understanding what is being done to the data files. The script produces the file cleandata_[date].Rdata, which is called in the file DataTrimming_20120105.R.
    2. DataTrimming_20120105.R is an R script for trimming extraneous variables not used in final analyses. Some variables are combined as needed and NAs (no answers) are removed. The file is heavily annotated. It produces trimdata_[date].Rdata, which was imported into Excel for summary statistics.

    Data files (3):

    3. AdvancedSpreadsheet_20110526.csv is the output file from the SurveyMonkey online survey tool used for this project. It is a .csv sheet with the complete set of survey data, although some data (e.g., open-ended responses, institution names) are removed to prevent schools and/or instructors from being identifiable. This file is read into DataCleaning_20120105.R for cleaning and editing.
    4. VariableRenaming_20110711.csv is called into the DataCleaning_20120105.R script to convert the questions and answers from the survey into simple parameter names, without any numbers, spaces, or symbols.
    5. ParamTable.csv is a list of the parameter names used for analysis and the value codes. It can be used to understand outputs from the scripts above (cleandata_[date].Rdata and trimdata_[date].Rdata).
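The cleaning steps listed for DataCleaning_20120105.R (removing spaces, substituting text for codes, dropping duplicate schools, simplifying names) can be sketched roughly as follows. This is not the R script itself. It is a Python illustration under several assumptions: "removing spaces" is read as trimming whitespace, names are simplified with a regular expression rather than the VariableRenaming lookup file, and all function names, keys, and codes are hypothetical.

```python
import re


def simplify_name(text):
    """Reduce a survey question to letters only (no numbers, spaces, or symbols)."""
    return re.sub(r"[^A-Za-z]", "", text)


def clean_responses(rows, code_labels, school_key):
    """Trim whitespace, replace answer codes with text, simplify names,
    and keep only the first response seen for each school."""
    school_field = simplify_name(school_key)
    seen_schools = set()
    cleaned = []
    for row in rows:
        new_row = {}
        for question, answer in row.items():
            if isinstance(answer, str):
                answer = answer.strip()
            # Substitute text for a coded answer when a label is known.
            answer = code_labels.get(answer, answer)
            new_row[simplify_name(question)] = answer
        school = new_row.get(school_field)
        if school in seen_schools:
            continue  # duplicate school: skip it
        seen_schools.add(school)
        cleaned.append(new_row)
    return cleaned


# Hypothetical survey rows and answer codes, for illustration only.
survey = [
    {"Q1. School name": " A College ", "Q2. Course offered?": 1},
    {"Q1. School name": "A College", "Q2. Course offered?": 2},
    {"Q1. School name": "B University", "Q2. Course offered?": 2},
]
print(clean_responses(survey, {1: "Yes", 2: "No"}, "Q1. School name"))
```

Keeping the first response per school is one possible de-duplication rule; the description does not say which duplicate the original script retains.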

  15. KAP WASH 2019 in South Sudan's Ajuong Thok and Pamir Camps - South Sudan

    • datacatalog.ihsn.org
    • microdata.unhcr.org
    • +1 more
    Updated Oct 14, 2021
    + more versions
    Cite
    Samaritan's Purse (2021). KAP WASH 2019 in South Sudan's Ajuong Thok and Pamir Camps - South Sudan [Dataset]. https://datacatalog.ihsn.org/catalog/9787
    Dataset updated
    Oct 14, 2021
    Dataset provided by
    United Nations High Commissioner for Refugees (http://www.unhcr.org/)
    Samaritan's Purse
    Time period covered
    2019
    Area covered
    South Sudan
    Description

    Abstract

    A Knowledge, Attitudes and Practices (KAP) survey was conducted in Ajuong Thok and Pamir Refugee Camps in October 2019 to determine the current Water, Sanitation and Hygiene (WASH) conditions as well as hygiene attitudes and practices within the households (HHs) surveyed. The assessment utilized a systematic random sampling method, and a total of 1,474 HHs (735 HHs in Ajuong Thok and 739 HHs in Pamir) were surveyed using mobile data collection (MDC) within a period of 21 days. Data was cleaned and analyzed in Excel. The summary of the results is presented in this report.

    The findings show that the overall average number of liters of water per person per day was 23.4, in both Ajuong Thok and Pamir Camps, which was slightly higher than the recommended United Nations High Commissioner for Refugees (UNHCR) minimum standard of at least 20 liters of water available per person per day. This is a slight improvement from the 21 liters reported the previous year. The average HH size was six people. Women comprised 83% of the surveyed respondents and males 17%. Almost all the respondents were refugees, constituting 99.5% (n=1,466). The refugees were aware of the key health and hygiene practices, possibly as a result of routine health and hygiene messages delivered to them by Samaritan's Purse (SP) and other health partners. Most refugees had knowledge about keeping the water containers clean, washing hands during critical times, safe excreta disposal and disease prevention.

    Geographic coverage

    Ajuong Thok and Pamir Refugee Camps

    Analysis unit

    Households

    Universe

    All households in Ajuong Thok and Pamir Refugee Camps

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Households were selected using systematic random sampling. Enumerators systematically walked through the camp block by block, row by row, in such a way as to pass each HH. Within blocks, enumerators started at one corner, then systematically used the sampling interval as they walked up and down each of the rows throughout the block, covering every block in Ajuong Thok and Pamir.

    In each location, the first HH sampled in a block was generated using an Excel tool customized by UNHCR which generated a Random Start and Sampling Interval.
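The sampling procedure above relies on a random start and a fixed sampling interval. The UNHCR Excel tool itself is not described in detail, so the following is only a generic sketch of systematic random sampling; the function name and the block size and sample size in the example are hypothetical, not figures from the survey.

```python
import random


def systematic_sample(population_size, sample_size, rng=None):
    """Pick `sample_size` positions from 0..population_size-1 using a
    random start and a fixed sampling interval."""
    if sample_size < 1 or sample_size > population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    rng = rng or random.Random()
    interval = population_size // sample_size  # sampling interval
    start = rng.randrange(interval)            # random start within the first interval
    return [start + i * interval for i in range(sample_size)]


# Hypothetical block of 120 households from which 10 are to be surveyed.
positions = systematic_sample(120, 10, random.Random(2019))
print(positions)
```

Each selected position is exactly one interval after the previous one, which matches the way enumerators walk the rows of a block and stop at every k-th household after the random start.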

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The survey questionnaire used to collect the data consists of the following sections:

    • Demographics
    • Water collection and storage
    • Drinking water hygiene
    • Hygiene
    • Sanitation
    • Messaging
    • Distribution (NFI)
    • Diarrhea prevalence, knowledge and health seeking behaviour
    • Menstrual hygiene

    Cleaning operations

    The data collected was uploaded to a server at the end of each day. IFormBuilder generated a Microsoft (MS) Excel spreadsheet dataset which was then cleaned and analyzed using MS Excel.

    Given that SP is currently implementing a WASH program in Ajuong Thok and Pamir, the assessment data collected in these camps will not only serve as the endline for UNHCR 2018 programming but also as the baseline for 2019 programming.

    Data was anonymized through decoding and local suppression.

  16. ENTSO-E Hydropower modelling data (PECD) in CSV format

    • zenodo.org
    csv
    Updated Aug 14, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048
    Available download formats: csv
    Dataset updated
    Aug 14, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice; Matteo De Felice
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    PECD Hydro modelling

    This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

    The original URLs:

    The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

    As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

    Data description

    The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

    In this repository you can find 5 CSV files:

    • PECD-hydro-capacities.csv: installed capacities
    • PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
    • PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
    • PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
    • PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

    Capacities

    The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
    • sheet Reservoir, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3
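    The row and column ranges listed above can be read as rectangular blocks of a worksheet. The following is only a minimal sketch of that idea, not the author's extraction code: it assumes the quoted ranges are 1-based and inclusive (the text does not say), and the `extract_block` helper and the small grid are made up for illustration.

```python
from typing import List, Sequence


def extract_block(grid: Sequence[Sequence[object]],
                  first_row: int, last_row: int,
                  first_col: int, last_col: int) -> List[List[object]]:
    """Return the cells in an inclusive, 1-based row/column range."""
    # Assumption: ranges are 1-based and inclusive, so shift the start by one
    # to get Python's 0-based, end-exclusive slices.
    return [list(row[first_col - 1:last_col])
            for row in grid[first_row - 1:last_row]]


# Made-up 8x6 grid standing in for a worksheet; each cell is "r<row>c<col>".
sheet = [[f"r{r}c{c}" for c in range(1, 7)] for r in range(1, 9)]

# "rows from 5 to 7, columns from 2 to 5", as quoted for Run-of-River and pondage.
block = extract_block(sheet, 5, 7, 2, 5)
print(len(block), len(block[0]))    # 3 4
print(block[0][0], block[-1][-1])   # r5c2 r7c5
```

    If the ranges turn out to be 0-based, only the two `- 1` offsets in the helper would need to change.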

    Inflows

    The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 16 to 51
    • sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

    Daily run-of-river

    The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

    Minimum and maximum reservoir generation

    The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 196 to 231
    • sheet Reservoir, rows from 13 to 66, columns from 232 to 267
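    A quick sanity check on the weekly column ranges quoted above: if each column holds one climatic year (an assumption, since the text does not state the layout), every range should be as wide as the number of years in 1982-2017. The labels in this sketch are illustrative only.

```python
def span(first: int, last: int) -> int:
    """Number of items in an inclusive range."""
    return last - first + 1


# Climatic years 1982-2017, as stated in the description.
climatic_years = span(1982, 2017)

# Column ranges quoted in the text; the labels are illustrative names.
column_ranges = {
    "weekly inflows": (16, 51),
    "reservoir generation, first range": (196, 231),
    "reservoir generation, second range": (232, 267),
}

widths = {name: span(first, last)
          for name, (first, last) in column_ranges.items()}

print(climatic_years)  # 36
for name, width in widths.items():
    print(f"{name}: {width} columns, matches years: {width == climatic_years}")
```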

    Minimum/Maximum reservoir levels

    The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at the beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 14 to 66, column 12
    • sheet Reservoir, rows from 14 to 66, column 13

    CHANGELOG

    [2020/07/17] Added maximum generation for the reservoir

  17. python Data Science

    • kaggle.com
    Updated Apr 17, 2025
    Cite
    Ahmed Sadek (2025). python Data Science [Dataset]. https://www.kaggle.com/datasets/ahmedsadek07/python-data-science/discussion
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Ahmed Sadek
    Description

    It's a file that contains instructions written in Python to work with data. Data Scientists use these files to understand, clean, and analyze information. They help turn raw numbers into useful insights.

    📊 What Does It Usually Contain?

    Loading Data: The file usually starts by loading data from files like Excel or CSV. This is the raw information that needs to be studied.

    Understanding the Data: It checks how the data looks: what columns are there, how many rows, and whether there are any problems like missing values.

    Cleaning the Data: If the data has issues (like empty cells or wrong values), this part fixes them so the analysis will be correct.

    Exploring the Data: Here, the file shows basic statistics (like averages, maximum, minimum) and finds patterns.

    Visualizing the Data: This step draws charts and graphs, such as bar charts and pie charts, to help humans understand the data better.

    (Optional) Predictive Analysis or Machine Learning: Sometimes, the file includes tools that use the data to make predictions, for example predicting sales, weather, or customer behavior.

    ✅ In Simple Words: A Python Data Science file is like a recipe that:

    Takes in raw ingredients (data)

    Cleans and organizes them

    Analyzes them

    And sometimes uses them to predict the future!
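    The recipe above can be sketched in a few lines. This is an illustrative example using only the Python standard library, not the contents of the dataset's file; the sample data, the column names, and the `load` and `clean` helpers are all made up.

```python
import csv
import io
import statistics

# Made-up raw data with one empty cell and one invalid value.
RAW = """city,temp_c
Oslo,4.5
Rome,
Cairo,not_a_number
Lima,19.0
"""


def load(text):
    """Loading: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows, column):
    """Cleaning: keep only rows whose `column` parses as a number."""
    cleaned = []
    for row in rows:
        try:
            value = float(row[column])
        except (TypeError, ValueError):
            continue  # drop empty or non-numeric cells
        cleaned.append({**row, column: value})
    return cleaned


rows = load(RAW)

# Understanding: how many rows, which columns, how many missing values.
missing = sum(1 for row in rows if not row["temp_c"])
print(f"{len(rows)} rows, columns {list(rows[0])}, {missing} missing")

valid = clean(rows, "temp_c")
temps = [row["temp_c"] for row in valid]

# Exploring: basic statistics on the cleaned column.
summary = {"mean": statistics.mean(temps), "min": min(temps), "max": max(temps)}
print(summary)

# Visualizing: a crude text bar chart, one '#' per whole degree.
for row in valid:
    print(f"{row['city']:<6}{'#' * int(row['temp_c'])}")
```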

  18. s

    Clean Label Ingredients Market Size, Share, Growth Analysis, By Form(Powder,...

    • skyquestt.com
    Updated Jan 15, 2024
    Cite
    SkyQuest Technology (2024). Clean Label Ingredients Market Size, Share, Growth Analysis, By Form(Powder, Liquid, Others), By Type(Natural Colors, Natural Flavor, Fruit and Vegetable ingredient, Starch and Sweeteners), By Application(Food, Pet Food, Dairy, Non-Dairy), By Distribution Channel(B2B, B2C), By Region - Industry Forecast 2024-2031 [Dataset]. https://www.skyquestt.com/report/clean-label-ingredients-market
    Explore at:
    Dataset updated
    Jan 15, 2024
    Dataset authored and provided by
    SkyQuest Technology
    License

    https://www.skyquestt.com/privacy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Global Clean label ingredients Market size was valued at USD 47.10 Billion in 2022 and is poised to grow from USD 50.17 Billion in 2023 to USD 88.03 Billion by 2031, at a CAGR of 6.5% during the forecast period (2024-2031).
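    For readers who want to relate the quoted figures to a growth rate, the standard compound annual growth rate formula is (end / start) ** (1 / years) - 1. The sketch below applies it to the values in the description. Which base year and period the publisher used for its headline rate is not stated, so the choices here are assumptions and the printed results need not match the published figure.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Clean check of the formula: 100 growing to 121 over 2 years is 10% a year.
print(f"{cagr(100.0, 121.0, 2):.1%}")  # 10.0%

# Figures from the description (USD billion). The base year is an assumption,
# so both candidate periods are computed.
from_2023 = cagr(50.17, 88.03, 2031 - 2023)
from_2022 = cagr(47.10, 88.03, 2031 - 2022)
print(f"2023 to 2031: {from_2023:.2%}")
print(f"2022 to 2031: {from_2022:.2%}")
```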

    Report Metric: Details
    Market size value in 2022: USD 47.10 Billion
    Market size value in 2023: USD 50.17 Billion
    Market size value in 2031: USD 88.03 Billion
    Forecast Year: 2024-2031
    Growth Rate (CAGR): 6.5%
    Segments Covered:
    • Form
      • Dry and Liquid
    • Type
      • Flavors, Colorants, Preservatives, Emulsifier, Stabilizer, and Thickeners (EST) & Others
    Largest Market: North America
    Fastest Growing Market: Asia Pacific

  19. A

    ‘Cardiovascular diseases dataset (clean)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Mar 15, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Cardiovascular diseases dataset (clean)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-cardiovascular-diseases-dataset-clean-cdcb/latest
    Explore at:
    Dataset updated
    Mar 15, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Cardiovascular diseases dataset (clean)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aiaiaidavid/cardio-data-dv13032020 on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Description of the data set

    This data set is a cleaned up copy of cardio_train.csv which can be found at:

    https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

    The original data set has been analyzed with Excel, correcting negative values, and removing outliers.

    A number of features in the dataset are used to predict the presence or absence of a cardiovascular disease.

    Below is a description of the features:

    AGE: integer (years of age)
    HEIGHT: integer (cm) 
    WEIGHT: integer (kg)
    GENDER: categorical (1: female, 2: male)
    AP_HIGH: systolic blood pressure, integer
    AP_LOW: diastolic blood pressure, integer 
    CHOLESTEROL: categorical (1: normal, 2: above normal, 3: well above normal)
    GLUCOSE: categorical (1: normal, 2: above normal, 3: well above normal)
    SMOKE: categorical (0: no, 1: yes)
    ALCOHOL: categorical (0: no, 1: yes)
    PHYSICAL_ACTIVITY: categorical (0: no, 1: yes)
    

    And the target variable:

    CARDIO_DISEASE: categorical (0: no, 1: yes)
    

    --- Original source retains full ownership of the source dataset ---
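    The feature coding listed above lends itself to a simple record check. The following is a sketch under assumptions, not part of the original dataset: the `find_problems` helper and the sample records are made up, and only the field names and category codes come from the description.

```python
# Allowed codes for each categorical field, copied from the description.
CATEGORICAL_CODES = {
    "GENDER": {1, 2},
    "CHOLESTEROL": {1, 2, 3},
    "GLUCOSE": {1, 2, 3},
    "SMOKE": {0, 1},
    "ALCOHOL": {0, 1},
    "PHYSICAL_ACTIVITY": {0, 1},
    "CARDIO_DISEASE": {0, 1},
}
INTEGER_FIELDS = ("AGE", "HEIGHT", "WEIGHT", "AP_HIGH", "AP_LOW")


def find_problems(record):
    """Return human-readable problems found in one record."""
    problems = []
    for name in INTEGER_FIELDS:
        value = record.get(name)
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{name}: expected an integer, got {value!r}")
        elif value < 0:
            # The description mentions correcting negative values.
            problems.append(f"{name}: negative value {value}")
    for name, allowed in CATEGORICAL_CODES.items():
        if record.get(name) not in allowed:
            problems.append(
                f"{name}: code {record.get(name)!r} not in {sorted(allowed)}")
    return problems


# Made-up records for illustration only.
good = {"AGE": 52, "HEIGHT": 168, "WEIGHT": 70, "GENDER": 1,
        "AP_HIGH": 120, "AP_LOW": 80, "CHOLESTEROL": 1, "GLUCOSE": 1,
        "SMOKE": 0, "ALCOHOL": 0, "PHYSICAL_ACTIVITY": 1, "CARDIO_DISEASE": 0}
bad = {**good, "AP_LOW": -80, "GLUCOSE": 4}

print(find_problems(good))  # []
for problem in find_problems(bad):
    print(problem)
```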

  20. o

    Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

    • openicpsr.org
    • search.datacite.org
    Updated Aug 16, 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/E102263V5
    Explore at:
    Dataset updated
    Aug 16, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1980 - 2016
    Area covered
    United States
    Description
    Version 5 release notes:
    • Removes support for SPSS and Excel data.
    • Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    • Adds in agencies that report 0 months of the year.
    • Adds a column that indicates the number of months reported. This is generated by counting the unique months for which an agency reports data. Note that this indicates the number of months an agency reported arrests for ANY crime; it may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
    • Removes data on runaways.
    Version 4 release notes:
    • Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    • Add data for 2016.
    • Order rows by year (descending) and ORI.
    Version 2 release notes:
    • Fix bug where Philadelphia Police Department had incorrect FIPS county code.
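    The "number of months reported" column described in the Version 5 notes amounts to counting distinct months per agency and year, across all crimes. The sketch below illustrates that counting in Python. The author's own code is in R and is not reproduced here; the records and the `months_reported` helper are made up.

```python
from collections import defaultdict

# Made-up monthly reports: (agency ORI, year, month, crime).
reports = [
    ("XX00001", 2016, 1, "murder"),
    ("XX00001", 2016, 1, "robbery"),   # same month, different crime
    ("XX00001", 2016, 3, "robbery"),
    ("XX00002", 2016, 7, "murder"),
]


def months_reported(rows):
    """Count unique months with a report for ANY crime, per agency-year."""
    months = defaultdict(set)
    for ori, year, month, _crime in rows:
        months[(ori, year)].add(month)
    return {key: len(value) for key, value in months.items()}


counts = months_reported(reports)
print(counts[("XX00001", 2016)])  # 2
print(counts[("XX00002", 2016)])  # 1

# An agency with no reports has no entry, so look it up with a default of 0.
print(counts.get(("XX00003", 2016), 0))  # 0
```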

    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2016 into a single file. These files are quite large and may take some time to load.

    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions, please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
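    The two recoding rules above can be expressed compactly. This is a Python sketch for illustration, not the author's R code: `clean_arrest_count` is a hypothetical helper, and `None` stands in for R's NA.

```python
from typing import Optional, Union

# Values the description says are reported incorrectly and recoded to NA.
IMPLAUSIBLE_COUNTS = set(range(10000, 100001, 10000)) | {99999, 99998}


def clean_arrest_count(raw: Union[str, int]) -> Optional[int]:
    """Apply the two recoding rules; None stands in for R's NA."""
    if raw == "None/not reported":
        # Treated as zero, which cannot be told apart from a true missing value.
        return 0
    value = int(raw)
    if value in IMPLAUSIBLE_COUNTS:
        return None
    return value


print(clean_arrest_count("None/not reported"))  # 0
print(clean_arrest_count("17"))                 # 17
print(clean_arrest_count(99999))                # None
print(clean_arrest_count(30000))                # None
```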

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
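    One way to avoid the leading-zero problem noted above is to read FIPS codes as text and, if zeros have already been stripped, pad them back to their fixed width (two digits for a state, three for a county). The sketch below uses made-up rows and hypothetical column names, since the actual column names in the files are not given here.

```python
import csv
import io

# Made-up rows; the second one has lost its leading zeros (e.g. after Excel).
RAW = "ori,fips_state,fips_county\nXX00001,01,001\nXX00002,6,37\n"

# csv.DictReader keeps every field as a string, so "01" is not turned into 1.
rows = list(csv.DictReader(io.StringIO(RAW)))

for row in rows:
    # Restore the fixed widths: 2 digits for state, 3 for county.
    row["fips_state"] = row["fips_state"].zfill(2)
    row["fips_county"] = row["fips_county"].zfill(3)
    row["fips_state_county"] = row["fips_state"] + row["fips_county"]

print([row["fips_state_county"] for row in rows])  # ['01001', '06037']
```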

    I created 9 arrest categories myself. The categories are:
    • Total Male Juvenile
    • Total Female Juvenile
    • Total Male Adult
    • Total Female Adult
    • Total Ma

Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
151 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.
