89 datasets found

Project Python- Data Cleaning - EDA- Visualization
kaggle.com
zip
Updated Dec 10, 2023
Cite
Hussein Al Chami (2023). Project Python- Data Cleaning - EDA- Visualization [Dataset]. https://www.kaggle.com/datasets/husseinalchami/project-python-data-cleaning-eda-visualization
Explore at:
Available download formats: zip (322085 bytes)
Dataset updated
Dec 10, 2023
Authors
Hussein Al Chami
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset

This dataset was created by Hussein Al Chami

Released under MIT

Contents
Nashville Housing Data Cleaning Project
kaggle.com
zip
Updated Aug 20, 2024
Cite
Ahmed Elhelbawy (2024). Nashville Housing Data Cleaning Project [Dataset]. https://www.kaggle.com/datasets/elhelbawylogin/nashville-housing-data-cleaning-project/discussion
Explore at:
Available download formats: zip (1282 bytes)
Dataset updated
Aug 20, 2024
Authors
Ahmed Elhelbawy
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Area covered
Nashville
Description
Project Overview: This project demonstrates a thorough data cleaning process for the Nashville Housing dataset using SQL. The script performs various data cleaning and transformation operations to improve the quality and usability of the data for further analysis.

Technologies Used: SQL Server (T-SQL)

Dataset: The project uses the Nashville Housing dataset, which contains information about property sales in Nashville, Tennessee. The original dataset includes fields such as property addresses, sale dates, sale prices, and other relevant real estate information.

Data Cleaning Operations: The script performs the following data cleaning operations:

- Date Standardization: Converts the SaleDate column to a standard Date format for consistency and easier manipulation.
- Populating Missing Property Addresses: Fills in NULL values in the PropertyAddress field using data from other records with the same ParcelID.
- Breaking Down Address Components: Separates the PropertyAddress and OwnerAddress fields into individual columns for Address, City, and State, improving data granularity and queryability.
- Standardizing Values: Converts 'Y' and 'N' values to 'Yes' and 'No' in the SoldAsVacant field for clarity and consistency.
- Removing Duplicates: Identifies and removes duplicate records based on specific criteria to ensure data integrity.
- Dropping Unused Columns: Removes unnecessary columns to streamline the dataset.
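The project itself is written in T-SQL and its script is not reproduced on this page. As a rough illustration only, the same cleaning operations can be sketched in plain Python. The sample rows, the input date format, and the output column names `PropertyStreet` and `PropertyCity` are assumptions made for this sketch; only `SaleDate`, `PropertyAddress`, `ParcelID`, and `SoldAsVacant` come from the description.

```python
from datetime import datetime

# Hypothetical sample rows: column names follow the description, values are invented.
rows = [
    {"ParcelID": "A1", "PropertyAddress": "12 OAK ST, NASHVILLE",
     "SaleDate": "April 9, 2013", "SoldAsVacant": "N"},
    {"ParcelID": "A1", "PropertyAddress": None,
     "SaleDate": "June 10, 2014", "SoldAsVacant": "Y"},
    {"ParcelID": "A1", "PropertyAddress": None,
     "SaleDate": "June 10, 2014", "SoldAsVacant": "Y"},
]


def clean(rows):
    # Known address per ParcelID, used to fill missing values
    # (a stand-in for the self-join on ParcelID).
    known = {r["ParcelID"]: r["PropertyAddress"] for r in rows if r["PropertyAddress"]}
    seen, out = set(), []
    for r in rows:
        r = dict(r)
        # Date standardization (input format is an assumption).
        r["SaleDate"] = datetime.strptime(r["SaleDate"], "%B %d, %Y").date().isoformat()
        # Populate missing property addresses from rows with the same ParcelID.
        r["PropertyAddress"] = r["PropertyAddress"] or known.get(r["ParcelID"])
        # Break the address into street and city at the first comma.
        street, _, city = (r["PropertyAddress"] or "").partition(",")
        r["PropertyStreet"], r["PropertyCity"] = street.strip(), city.strip()
        # Standardize 'Y'/'N' to 'Yes'/'No'.
        r["SoldAsVacant"] = {"Y": "Yes", "N": "No"}.get(r["SoldAsVacant"], r["SoldAsVacant"])
        # Drop duplicates on a chosen key (a stand-in for ROW_NUMBER() > 1).
        key = (r["ParcelID"], r["PropertyAddress"], r["SaleDate"])
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out


cleaned = clean(rows)
```

Which columns define a duplicate is a judgment call for the real dataset; the three-column key here is only a placeholder.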

Key SQL Techniques Demonstrated:

- Data type conversion
- Self joins for data population
- String manipulation (SUBSTRING, CHARINDEX, PARSENAME)
- CASE statements
- Window functions (ROW_NUMBER)
- Common Table Expressions (CTEs)
- Data deletion
- Table alterations (adding and dropping columns)

Important Notes:

The script includes cautionary comments about data deletion and column dropping, emphasizing the importance of careful consideration in a production environment. This project showcases various SQL data cleaning techniques and can serve as a template for similar data cleaning tasks.

Potential Improvements:

- Implement error handling and transaction management for more robust execution.
- Add data validation steps to ensure the cleaned data meets specific criteria.
- Consider creating indexes on frequently queried columns for performance optimization.
Data Cleaning Methodology Source Code
data.usgs.gov
Updated Apr 30, 2024
Cite
Brian Varela (2024). Data Cleaning Methodology Source Code [Dataset]. http://doi.org/10.5066/F7TD9VG7
Explore at:
Unique identifier
https://doi.org/10.5066/F7TD9VG7
Dataset updated
Apr 30, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Authors
Brian Varela
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
Petroleum production data are usually stored in a format that makes it easy to determine the year and month production started, whether there are any breaks, and when production ends. However, in some cases, you may want to compare production runs in which every well is aligned to start at month one, regardless of the year the well started producing. This report describes the Java program the U.S. Geological Survey developed to examine water-to-oil and water-to-gas ratios in the form of month one, month two, and so on, with the objective of estimating quantities of water and proppant used in low-permeability petroleum production. The text covers the data used by the program, the challenges with production data, the program logic for checking the quality of the production data, and the program logic for checking the completeness of the data.
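The USGS program is written in Java and is not shown here. The following Python sketch only illustrates the "month one, month two" idea: each well's calendar months are re-indexed relative to its own first producing month, and a water-to-oil ratio is computed per month. The record layout, well identifiers, and values are hypothetical.

```python
def align_to_month_one(records):
    """Re-index each well's production to months since its first record.

    records: iterable of (well_id, year, month, oil, water) tuples
             (a hypothetical layout chosen for this sketch).
    Returns {well_id: [(relative_month, water_to_oil_ratio_or_None), ...]}.
    """
    by_well = {}
    for well, year, month, oil, water in records:
        # A single running month index makes year boundaries irrelevant.
        by_well.setdefault(well, []).append((year * 12 + month, oil, water))

    aligned = {}
    for well, rows in by_well.items():
        rows.sort()
        start = rows[0][0]
        aligned[well] = [
            # Month one is the first producing month; gaps in production
            # show up as gaps in the relative month numbers.
            (idx - start + 1, (water / oil) if oil else None)
            for idx, oil, water in rows
        ]
    return aligned


records = [
    ("W1", 2010, 11, 100.0, 50.0),
    ("W1", 2010, 12, 80.0, 60.0),
    ("W1", 2011, 2, 40.0, 60.0),   # January 2011 is missing: a break in production
    ("W2", 2015, 3, 200.0, 20.0),
]
aligned = align_to_month_one(records)
```

The same re-indexing would apply to water-to-gas ratios by swapping the denominator.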
Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
+ more versions
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s004
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
Netflix Movies and TV Shows Dataset Cleaned(excel)
kaggle.com
Updated Apr 8, 2025
Cite
Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 8, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gaurav Tawri
Description
This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

🎯 What’s Included:
- Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
- A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN)
- Columns like 'date_added' have been properly formatted into DMY structure
- Multi-valued columns like 'listed_in' are split for better analysis
- Null values replaced with “Unknown” for clarity
- Duration field broken into numeric + unit components
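The dataset was cleaned entirely in Excel, so the following is not the author's method. It is a small Python sketch of the same transformations for readers who want to reproduce them in code. The column name `duration`, the output names `duration_value` and `duration_unit`, the input date format, and the sample record are assumptions.

```python
import re
from datetime import datetime


def clean_record(rec):
    # Trim whitespace and replace missing or empty values with "Unknown".
    out = {
        k: (v.strip() if isinstance(v, str) and v.strip() else "Unknown")
        for k, v in rec.items()
    }
    # Break the duration into a numeric part and a unit, e.g. "90 min".
    m = re.fullmatch(r"(\d+)\s+(\w+)", out["duration"])
    out["duration_value"] = int(m.group(1)) if m else None
    out["duration_unit"] = m.group(2) if m else "Unknown"
    # Split the multi-valued genre column into a list.
    out["listed_in"] = [g.strip() for g in out["listed_in"].split(",")]
    # Reformat the date into day-month-year order (input format assumed).
    if out["date_added"] != "Unknown":
        parsed = datetime.strptime(out["date_added"], "%B %d, %Y")
        out["date_added"] = parsed.strftime("%d-%m-%Y")
    return out


# Hypothetical record for illustration.
record = {
    "title": "Example Title",
    "director": None,
    "date_added": " September 25, 2021",
    "listed_in": "Dramas, International Movies",
    "duration": "2 Seasons",
}
cleaned = clean_record(record)
```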

🔍 Dataset Purpose: Ideal for beginners and analysts who want to:
- Practice data cleaning in Excel
- Explore Netflix content trends
- Analyze content by type, country, genre, or date added

📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
Household Survey on Information and Communications Technology– 2019 - West...
pcbs.gov.ps
Updated Mar 16, 2020
+ more versions
Cite
Palestinian Central Bureau of Statistics (2020). Household Survey on Information and Communications Technology– 2019 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/489
Explore at:
Dataset updated
Mar 16, 2020
Dataset authored and provided by
Palestinian Central Bureau of Statistics (https://pcbs.gov/)
Time period covered
2019
Area covered
Gaza, Gaza Strip, West Bank
Description
Abstract

Access to information and communication technology (ICT) tools is one of the main inputs to social development and economic change in Palestinian society, given the impact of the ICT revolution that has become a feature of this era. Therefore, within the scope of the efforts of the Palestinian Central Bureau of Statistics (PCBS) to provide official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the Household Survey on Information and Communications Technology for 2019. The main objective of this report is to present trends in access to and use of ICT by households and individuals in Palestine, and to enrich the ICT database with indicators that meet national needs and are in line with international recommendations.

Geographic coverage

Palestine, West Bank, Gaza strip

Analysis unit

Household, Individual

Universe

All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the state of Palestine.

Kind of data

Sample survey data [ssd]

Sampling procedure

Sampling Frame: The sampling frame consists of the master sample of enumeration areas enumerated in the 2017 census. Each enumeration area consists of buildings and housing units with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of the sampling selection.

Sample Size: The estimated sample size is 8,040 households.

Sample Design: The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages:
Stage (1): Selection of a stratified sample of 536 enumeration areas using the PPS method.
Stage (2): Selection of a stratified random sample of 15 households from each enumeration area selected in the first stage.
Stage (3): Random selection of one person aged 10 years and above, using Kish tables.

Sample Strata: The population was divided by:
1- Governorate (16 governorates, where Jerusalem was considered as two statistical areas)
2- Type of locality (urban, rural, refugee camps).
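The stated sample size follows directly from the first two stages: 536 enumeration areas with 15 households each gives 8,040 households. The sketch below checks that arithmetic and illustrates one common way to draw a PPS (probability proportional to size) sample, namely systematic selection along cumulative sizes. The survey documentation does not say which PPS variant PCBS used, and the household counts in `ea_sizes` are invented.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Stage 1 and stage 2 figures from the sampling procedure.
N_EAS = 536
HOUSEHOLDS_PER_EA = 15
sample_size = N_EAS * HOUSEHOLDS_PER_EA  # 8,040 households


def pps_systematic(sizes, n, rng):
    """Pick n unit indices with probability proportional to size.

    Systematic PPS: walk along the cumulative sizes with a fixed step
    from a random start and take the unit each point falls into.
    """
    step = sum(sizes) / n
    start = rng.random() * step
    cumulative = list(accumulate(sizes))
    last = len(sizes) - 1
    return [min(bisect_left(cumulative, start + i * step), last) for i in range(n)]


# Hypothetical household counts for ten enumeration areas.
ea_sizes = [120, 150, 180, 140, 160, 150, 130, 170, 155, 145]
selected = pps_systematic(ea_sizes, 5, random.Random(0))
```

Because every hypothetical area is smaller than the sampling step, no area can be selected twice in this example. With very large units, a real design would handle certainty selections separately.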

Mode of data collection

Computer Assisted Personal Interview [capi]

Research instrument

Questionnaire: The survey questionnaire consists of identification data, quality controls and three main sections:

Section I: Data on household members, including identification fields and the demographic and social characteristics of household members, such as the relationship of individuals to the head of household, sex, date of birth and age.

Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.

Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.

Cleaning operations

Programming Consistency Check: The data collection program was designed in accordance with the questionnaire's design and its skip patterns. Project management tested the program more than once before the training course, and the Data Processing Department incorporated the resulting notes and modifications, ensuring the program was free of errors before it went to the field.

Using PC-tablet devices reduced the number of data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.

In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and the same database used for the PC-tablet devices.

Data Cleaning: After the data entry and audit phase was completed, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS, to extract and correct errors and discrepancies and to prepare clean, accurate data ready for tabulation and publishing.

Tabulation: After the data were checked and cleaned of errors, tables were extracted according to a prepared list of tables.

Response rate

The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.

Sampling error estimates

Sampling Errors: The data of this survey are affected by sampling errors because a sample was used rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators. There is no problem in disseminating results at the national level and at the level of the West Bank and Gaza Strip.

Non-Sampling Errors: Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course.

The survey encountered non-response, and the case 'household was not present at home' during the fieldwork visit accounted for the highest percentage of non-response cases. The total non-response rate reached 17.5%. The refusal rate reached 2.9%, which is relatively low compared with other household surveys conducted by PCBS; the reason is that the survey questionnaire is clear.
Taking Care of Business (TCB) Clean Corridors Program
catalog.data.gov
s.cnmilf.com
Updated Mar 31, 2025
Cite
City of Philadelphia (2025). Taking Care of Business (TCB) Clean Corridors Program [Dataset]. https://catalog.data.gov/dataset/taking-care-of-business-tcb-clean-corridors-program
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
City of Philadelphia
Description
The Philadelphia Taking Care of Business (PHL TCB) Clean Corridors Program funds community-based nonprofits to sweep sidewalks and remove litter within neighborhood commercial corridors. PHL TCB seeks to: (1) maintain clean commercial districts; (2) promote the economic success of neighborhood businesses by creating an inviting environment for shoppers; (3) create work opportunities for Philadelphians; and (4) grow the capacity of local small businesses and organizations that provide cleaning services.
LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
+ more versions
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
Explore at:
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v2
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)

April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.

Getting Started

This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the meaning of research texts and to make it available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of _Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing

Step 1: Downloading of the Data Online

The dataset is collected manually by exporting documents as tab-delimited files online. All documents are available online.

Step 2: Importing the Dataset to R

The LSC was collected as TXT files. All documents are extracted to R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts

Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. The tool used for extracting abstracts concatenates section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words is done by sampling medicine-related publications with human intervention. Detected concatenated words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

The section headings in such abstracts are listed below:

Background Method(s) Design Theoretical Measurement(s) Location Aim(s) Methodology Process Abstract Population Approach Objective(s) Purpose(s) Subject(s) Introduction Implication(s) Patient(s) Procedure(s) Hypothesis Measure(s) Setting(s) Limitation(s) Discussion Conclusion(s) Result(s) Finding(s) Material(s) Rationale(s) Implications for health and nursing policy

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].

According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of length on the analysis.
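Steps 4 and 5 can be illustrated with a short Python sketch. The corpus itself was processed in R and the concatenated words were identified by sampling with human intervention, so the regular expression below is only an approximation under stated assumptions: it uses a small, hand-picked subset of the section headings, and it counts words by whitespace splitting rather than by Microsoft Word's exact rule.

```python
import re

# A hand-picked subset of the section headings, plural forms first so that
# "Conclusions" is tried before "Conclusion".
HEADINGS = [
    "Conclusions", "Conclusion", "Background", "Methods", "Method",
    "Results", "Result", "Objectives", "Objective", "Introduction",
]

# A heading glued to a following capitalised word, e.g. "ConclusionHigher".
GLUED = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")


def split_headings(text):
    """Insert a space between a section heading and the word fused to it."""
    return GLUED.sub(r"\1 ", text)


def within_length(text, lo=30, hi=500):
    """Keep abstracts of lo to hi words (whitespace word count, an approximation)."""
    return lo <= len(text.split()) <= hi


fixed = split_headings("ConclusionHigher doses helped. ConclusionsRT improved.")
```

A heading list applied blindly can also split legitimate words, which is presumably why the original procedure relied on manual inspection of samples.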

Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

Publications can include a footer with a copyright notice, permission policy, journal name, licence, author’s rights or conference name below the text of the abstract, added by conferences and journals. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text. For example, our casual observation shows that copyright notices such as ‘Published by Elsevier ltd.’ are placed in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licenses and permission policies identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words). 474 texts were removed.

Step 8: Saving the Dataset into CSV Format

Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
Raw dataset of Laptop - for purpose of Cleaning
kaggle.com
zip
Updated Aug 2, 2024
Cite
rootpi3 (2024). Raw dataset of Laptop - for purpose of Cleaning [Dataset]. https://www.kaggle.com/datasets/rootpi3/raw-dataset-of-laptop-for-purpose-of-eda
Explore at:
Available download formats: zip (41633 bytes)
Dataset updated
Aug 2, 2024
Authors
rootpi3
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset was web-scraped with Selenium, so it needs considerable effort to make it useful.

Work needed:
1) Remove duplicates
2) Remove null values
3) Separate features
4) Reduce memory usage

Feel free to perform EDA using this dataset and enjoy the data. Can you find the brand of the laptop from the title? Can you separate the Rating Count and Reviews into two separate columns?

Think accordingly and perform EDA. You can use MySQL or pandas.
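The two questions above can be approached with a few lines of Python. This is only a sketch: the page does not show the raw column contents, so the "N Ratings & M Reviews" text pattern, the assumption that the brand is the first word of the title, and the sample strings are all hypothetical and should be checked against the actual file.

```python
import re

# Assumed shape of the combined field, e.g. "1,234 Ratings & 56 Reviews".
COUNTS = re.compile(r"(\d[\d,]*)\s*Ratings?\s*&\s*(\d[\d,]*)\s*Reviews?")


def split_rating_reviews(text):
    """Return (rating_count, review_count), or (None, None) if the text does not match."""
    m = COUNTS.search(text or "")
    if not m:
        return None, None
    return tuple(int(g.replace(",", "")) for g in m.groups())


def brand_from_title(title):
    """Take the first word of the title as the brand (an assumption)."""
    words = (title or "").split()
    return words[0] if words else None


counts = split_rating_reviews("1,234 Ratings & 56 Reviews")
brand = brand_from_title("Acme Book 15 (8 GB/512 GB SSD)")
```

Multi-word brand names would need a lookup list rather than a first-word rule.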
Street Sweeping Schedule - 2016
data.cityofchicago.org
gimi9.com
+4 more
csv, xlsx, xml
Updated Oct 25, 2016
Cite
City of Chicago (2016). Street Sweeping Schedule - 2016 [Dataset]. https://data.cityofchicago.org/Sanitation/Street-Sweeping-Schedule-2016/x2vd-qke7
Explore at:
Available download formats: csv, xml, xlsx
Dataset updated
Oct 25, 2016
Dataset authored and provided by
City of Chicago
Description
Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/icje-4fmy. For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.
Data from: Street Sweeping Schedules
data.boston.gov
csv
Updated Dec 3, 2025
Cite
Public Works Department (2025). Street Sweeping Schedules [Dataset]. https://data.boston.gov/dataset/street-sweeping-schedules
Explore at:
Available download formats: csv (606101)
Dataset updated
Dec 3, 2025
Dataset authored and provided by
Public Works Department
License
ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
License information was derived automatically
Description
This is a legacy dataset which contains detailed information on the timing and location of street sweeping service throughout the City. Daily street cleaning takes place April 1 to November 30 in most Boston neighborhoods (weather permitting), and over 400 curb miles of streets are maintained under the Daytime Street Sweeping Program.
Street Sweeping Schedule - 2012
datasets.ai
data.cityofchicago.org
+2 more
23, 40, 55, 8
Updated Nov 10, 2020
+ more versions
Cite
City of Chicago (2020). Street Sweeping Schedule - 2012 [Dataset]. https://datasets.ai/datasets/street-sweeping-schedule-2012
Explore at:
Available download formats: 8, 23, 40, 55
Dataset updated
Nov 10, 2020
Dataset authored and provided by
City of Chicago
Description
Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit http://bit.ly/Hz0aCo. For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.
github-code-clean
huggingface.co
opendatalab.com
Updated Aug 1, 2022
Cite
CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 1, 2022
Dataset authored and provided by
CodeParrot
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
The GitHub Code clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1 TB of text data.
Street Sweeping Schedule - 2024
datasets.ai
data.cityofchicago.org
+2 more
23, 40, 55, 8
Updated Apr 5, 2024
+ more versions
Cite
City of Chicago (2024). Street Sweeping Schedule - 2024 [Dataset]. https://datasets.ai/datasets/street-sweeping-schedule-2024
Explore at:
Available download formats: 40, 8, 55, 23
Dataset updated
Apr 5, 2024
Dataset authored and provided by
City of Chicago
Description
Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/ytfi-mzdz. For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streets_san/svcs/street_sweeping.html.

Corrections are possible during the course of the sweeping season.
SQLcleaning
kaggle.com
zip
Updated Mar 15, 2023
Cite
Stephen M Blake (2023). SQLcleaning [Dataset]. https://www.kaggle.com/datasets/stephenmblake/sqlcleaning
Explore at:
Available download formats: zip (8206870 bytes)
Dataset updated
Mar 15, 2023
Authors
Stephen M Blake
Description
Used SQL to clean up the data so that it is easier to analyze. Used JOINs, SUBSTRING, PARSENAME, UPDATE/ALTER TABLE, CTEs, CASE statements, and ROW_NUMBER. Learned many different ways to clean the data.
Street Sweeping Schedule
catalog.data.gov
data.sfgov.org
Updated Oct 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.sfgov.org (2025). Street Sweeping Schedule [Dataset]. https://catalog.data.gov/dataset/street-sweeping-schedule
Explore at:
Dataset updated
Oct 4, 2025
Dataset provided by
data.sfgov.org
Description
A. SUMMARY
Mechanical street sweeping and street cleaning schedule managed by San Francisco Public Works.

B. HOW THE DATASET IS CREATED
This dataset is created by extracting all street sweeping schedule data from a Department of Public Works database. It is then geocoded to add common identifiers such as Centerline Network Number ("CNN") and published to the open data portal.

C. UPDATE PROCESS
This dataset will be updated on an 'as needed' basis, when sweeping schedules change.

D. HOW TO USE THIS DATASET
Use this dataset to understand, track, or analyze street sweeping in San Francisco.
MoDeRn ArTs DaTa SeT CleAnInG AnD VisUaLizAtiOn
kaggle.com
zip
Updated Aug 23, 2024
Cite
sidrajaved30 (2024). MoDeRn ArTs DaTa SeT CleAnInG AnD VisUaLizAtiOn [Dataset]. https://www.kaggle.com/datasets/sidrajaved30/modern-arts-data-set-cleaning-and-visualization
Explore at:
Available download formats: zip (738683 bytes)
Dataset updated
Aug 23, 2024
Authors
sidrajaved30
License
https://cdla.io/sharing-1-0/
Description
Description: Modern art dataset cleaning and visualization involve refining raw data related to artworks, artists, styles, and exhibitions. This process ensures accuracy, consistency, and completeness, making the dataset suitable for analysis and interpretation.

Source: The data can come from museum archives, online art databases, auction records, or curated datasets from organizations like MoMA, Tate, or Kaggle. These sources provide detailed information about artists, artworks, techniques, and historical contexts.

Inspiration: The goal of modern art data visualization is to uncover trends, highlight artistic influences, and provide insights into art movements. Through charts, graphs, and interactive dashboards, we can analyze color usage, artist popularity, and regional influences, making art data more accessible and engaging.

BI intro to data cleaning eda and machine learning

kaggle.com

zip

Updated Nov 17, 2025

Cite

Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions

Available download formats: zip (9961 bytes)

Dataset updated

Nov 17, 2025

Authors

Walekhwa Tambiti Leo Philip

License

CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

Description

Real-World Data Science Challenge

Business Intelligence Program Strategy — Student Success Optimization

Hosted by: Walsoft Computer Institute

Background

Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

Admissions decision-making
Academic support strategies
Overall program impact and ROI

Your Mission

Answer this central question:

“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

Key Strategic Areas

You are required to analyze and provide actionable insights for the following three areas:

1. Admissions Optimization

Should entry exams remain the primary admissions filter?

Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

✅ Deliverables:

Feature importance ranking for predicting Python and DB scores
Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
Business rationale and risk analysis
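
One simple way to start on the feature ranking is to compare how strongly each numeric column correlates with a final score. The sketch below uses plain Pearson correlation as a stand-in for feature importance, which is an assumption: it only captures linear relationships and ignores categorical columns such as `prevEducation`. The helper names `pearson` and `rank_features` and the toy rows are hypothetical and are not taken from bi.csv.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_features(rows, features, target):
    """Rank numeric features by absolute correlation with the target column."""
    target_values = [row[target] for row in rows]
    scores = {
        name: pearson([row[name] for row in rows], target_values)
        for name in features
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy rows invented for illustration only; they are NOT values from bi.csv.
toy_rows = [
    {"entryEXAM": 40, "studyHOURS": 110, "Age": 30, "Python": 50},
    {"entryEXAM": 55, "studyHOURS": 130, "Age": 25, "Python": 62},
    {"entryEXAM": 70, "studyHOURS": 150, "Age": 41, "Python": 75},
    {"entryEXAM": 85, "studyHOURS": 160, "Age": 22, "Python": 88},
]

ranking = rank_features(toy_rows, ["entryEXAM", "studyHOURS", "Age"], "Python")
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

A model-based ranking (for example, permutation importance on a fitted regressor) would likely be a better fit for the final deliverable; the correlation table is only a quick first look.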

2. Curriculum Support Strategy

Are there at-risk student groups who need extra support?

Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

✅ Deliverables:

At-risk segment identification
Support program design (e.g., prep course, mentoring)
Expected outcomes, costs, and KPIs
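
A first pass at at-risk segment identification could compare average scores across the values of one background column. The sketch below groups rows by a category and flags groups whose mean falls below a cutoff. The function names, the cutoff of 60, and the toy rows are assumptions made for illustration; none of them come from the dataset.

```python
from collections import defaultdict


def mean_score_by_group(rows, group_key, score_key):
    """Average score per group, skipping rows whose score is missing (None)."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        score = row.get(score_key)
        if score is None:
            continue
        bucket = totals[row[group_key]]
        bucket[0] += score
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def flag_at_risk(group_means, threshold):
    """Return the groups whose mean score is below the threshold, sorted by name."""
    return sorted(group for group, mean in group_means.items() if mean < threshold)


# Toy rows invented for illustration only; they are NOT values from bi.csv.
toy_rows = [
    {"prevEducation": "High School", "DB": 50},
    {"prevEducation": "High School", "DB": 58},
    {"prevEducation": "Diploma", "DB": 70},
    {"prevEducation": "Diploma", "DB": None},
    {"prevEducation": "Bachelor", "DB": 82},
]

group_means = mean_score_by_group(toy_rows, "prevEducation", "DB")
at_risk = flag_at_risk(group_means, threshold=60)  # hypothetical cutoff
print(group_means, at_risk)
```

Small groups can produce unstable means, so group sizes are worth reporting alongside any flagged segment.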

3. Resource Allocation & Program ROI

How can we allocate resources for maximum student success?

Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

✅ Deliverables:

Performance drivers
Student segmentation
Resource allocation plan and ROI projection

🛠️ Dataset Overview

Column	Description
`fNAME`, `lNAME`	Student first and last name
`Age`	Student age (21–71 years)
`gender`	Gender (standardized as "Male"/"Female")
`country`	Student’s country of origin
`residence`	Student housing/residence type
`entryEXAM`	Entry test score (28–98)
`prevEducation`	Prior education (High School, Diploma, etc.)
`studyHOURS`	Total study hours logged
`Python`	Final Python exam score
`DB`	Final Database exam score
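
The ranges stated in the table (`Age` 21–71, `entryEXAM` 28–98) can double as a quick sanity check after loading the data. The sketch below is one possible way to do that, assuming the rows have already been parsed into dictionaries with numeric values; `EXPECTED_RANGES` and `find_out_of_range` are hypothetical names.

```python
# Ranges copied from the dataset overview table above.
EXPECTED_RANGES = {"Age": (21, 71), "entryEXAM": (28, 98)}


def find_out_of_range(rows, ranges=EXPECTED_RANGES):
    """Return (row_index, column, value) for every missing or out-of-range value."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in ranges.items():
            value = row.get(column)
            if value is None or not (low <= value <= high):
                problems.append((index, column, value))
    return problems


# Toy rows invented for illustration only; they are NOT values from bi.csv.
toy_rows = [
    {"Age": 30, "entryEXAM": 75},
    {"Age": 19, "entryEXAM": 101},
    {"Age": 45, "entryEXAM": None},
]

problems = find_out_of_range(toy_rows)
print(problems)
```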

📊 Dataset

You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

Raw Dataset (Recommended for Full Project)

Download: bi.csv

This dataset includes common data quality challenges:

Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using column mean or suitable imputation strategy

Participants using this dataset are expected to apply data cleaning techniques such as:

String standardization
Null value imputation
Type correction (e.g., scores as float)
Validation and visual verification
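
The cleaning steps listed above can be sketched with the standard library alone. The example below is a minimal illustration, not the organizers' reference solution: it assumes the rows have been read as dictionaries of strings keyed by the column names in the dataset overview, and the mapping tables contain only the variants named in the description plus a few obvious additions (such as "male"), which are assumptions.

```python
# Variant -> canonical label, keyed by lowercase text. Only the variants named
# in the dataset description are certain; the others are assumed for illustration.
COUNTRY_FIXES = {"norge": "Norway", "rsa": "South Africa", "uk": "United Kingdom"}
RESIDENCE_FIXES = {
    "bi-residence": "BI Residence",
    "biresidence": "BI Residence",
    "bi_residence": "BI Residence",
}
EDUCATION_FIXES = {"barrrchelors": "Bachelor", "diploma": "Diploma", "diplomaaa": "Diploma"}
GENDER_FIXES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def standardize(value, fixes):
    """Trim whitespace and map known variants to one canonical label."""
    text = value.strip()
    return fixes.get(text.lower(), text)


def to_float(value):
    """Convert a raw cell to float; blank or 'nan' cells become None."""
    text = str(value).strip()
    if text == "" or text.lower() == "nan":
        return None
    return float(text)


def clean_rows(rows):
    """Standardize string columns, then fill missing Python scores with the column mean."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        new_row["country"] = standardize(row["country"], COUNTRY_FIXES)
        new_row["residence"] = standardize(row["residence"], RESIDENCE_FIXES)
        new_row["prevEducation"] = standardize(row["prevEducation"], EDUCATION_FIXES)
        new_row["gender"] = standardize(row["gender"], GENDER_FIXES)
        new_row["Python"] = to_float(row["Python"])
        cleaned.append(new_row)

    known = [r["Python"] for r in cleaned if r["Python"] is not None]
    mean_score = sum(known) / len(known) if known else None
    for r in cleaned:
        if r["Python"] is None:
            r["Python"] = mean_score
    return cleaned


# Toy rows invented for illustration only; they are NOT values from bi.csv.
toy_rows = [
    {"country": "Norge", "residence": "BI_Residence", "prevEducation": "Diplomaaa",
     "gender": "F", "Python": "80"},
    {"country": "RSA", "residence": "BIResidence", "prevEducation": "Barrrchelors",
     "gender": "M", "Python": ""},
    {"country": " UK ", "residence": "BI-Residence", "prevEducation": "DIPLOMA",
     "gender": "female", "Python": "60"},
]

cleaned_rows = clean_rows(toy_rows)
for row in cleaned_rows:
    print(row)
```

For the real file, the rows could be loaded with `csv.DictReader` from the standard library. Mean imputation is only one of the strategies the description allows; a median or a group-wise mean may be more appropriate if the scores are skewed.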

✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

Cleaned Dataset (Optional Shortcut)

Download: cleaned_bi.csv

This version has been fully standardized and preprocessed:

All fields cleaned and renamed consistently
Missing Python scores filled with th...

LScDC (Leicester Scientific Dictionary-Core)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
Cite
Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
Available download formats: docx
Unique identifier
https://doi.org/10.25392/leicester.data.9896579.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LScDC (Leicester Scientific Dictionary-Core Dictionary)
April 2020
By Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3]

The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below. We did not repeat the explanation. The files provided with this description are also the same as those described for LScDC Version 2. The numbers of words in the 3rd versions of LScD and LScDC are summarized below.

Dictionary	# of words
LScD (v3)	972,060
LScDC (v3)	103,998

* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

[Version 2]

Getting Started

This file describes a sorted and cleaned list of words from LScD (Leicester Scientific Dictionary), and explains the steps for sub-setting the LScD and the basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the CSV file published. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts.

The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, using an enormous amount of text data poses a challenge to the performance and accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Words that occur rarely in a collection are not useful in discriminating texts in large corpora, as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.

To build the LScDC, we applied the following process to the LScD: removing words that appear in no more than 10 documents (...
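
The rare-word filter described above can be illustrated with a small document-frequency count. This is a sketch under stated assumptions, not the authors' pipeline: each document is assumed to be already tokenized into lemmas, the function names are hypothetical, and only the "no more than 10 documents" rule and the ordering by document count are taken from the description.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each word, the number of documents that contain it."""
    counts = Counter()
    for words in documents:
        counts.update(set(words))  # set(): count each word once per document
    return counts


def core_dictionary(documents, max_rare_docs=10):
    """Drop words found in no more than max_rare_docs documents.

    The surviving words are returned as (word, document_count) pairs, ordered
    by document count (descending), with ties broken alphabetically.
    """
    frequencies = document_frequencies(documents)
    kept = [(word, n) for word, n in frequencies.items() if n > max_rare_docs]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy corpus invented for illustration; a threshold of 1 is used so the
# effect is visible on three documents.
toy_docs = [
    ["cell", "gene", "gene"],
    ["cell", "model"],
    ["cell"],
]

core = core_dictionary(toy_docs, max_rare_docs=1)
print(core)
```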
Dry Cleaner Points
hub.arcgis.com
gis-tceq.opendata.arcgis.com
Updated Mar 17, 2020
Cite
Texas Commission on Environmental Quality (2020). Dry Cleaner Points [Dataset]. https://hub.arcgis.com/maps/TCEQ::dry-cleaner-points/about
Dataset updated
Mar 17, 2020
Dataset authored and provided by
Texas Commission on Environmental Quality
Description
The TCEQ Dry Cleaner Remediation Program (DCRP) POINTS layer is used to identify the geographic location of all "Active" and "Inactive" DCRP sites within the State. It provides the location and site description of participants in the Dry Cleaner Remediation Program. This data layer can be used for a variety of purposes, including plotting DCRP sites on maps, utilization by field personnel, and performing spatial analysis on how the sites affect their surroundings.

The purpose of the Dry Cleaner Remediation Program is to oversee the cleanup of soil and groundwater contamination caused by dry cleaning solvents from dry cleaning facilities. The goal is to ensure that the public is not exposed to hazardous levels of chemicals by requiring mitigation and/or removal of the contamination to levels protective of human health and the environment.

The Dry Cleaner Remediation Program (DCRP) was established by the Texas Legislature in 2003. It created the Dry Cleaning Facility Release Fund for state-lead cleanup of dry cleaner related contaminated sites. It also established dry cleaner facility registration requirements, fees, performance standards, distributor registration, and revenue disbursement. The Dry Cleaner Remediation Program web page is at https://www.tceq.texas.gov/remediation/dry_cleaners/index.html.

