This dataset was created by Deepali Sukhdeve
The dataset contains housing data for the Nashville, TN area. I used SQL Server to clean the data and make it easier to use. For example, I converted dates to remove unnecessary timestamps, populated null values, split address columns that combined street address, city, and state into separate columns, standardized a column that represented the same data in different ways, removed duplicate rows, and deleted unused columns.
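As a minimal T-SQL sketch of the date conversion step, assuming the table is named NashvilleHousing and the date column is SaleDate (names not confirmed by the description above):

-- Assumed table and column names; adjust to the actual schema.
ALTER TABLE NashvilleHousing ADD SaleDateConverted date;
GO

-- Populate the new column with the date portion only (timestamp removed).
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(date, SaleDate);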
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Overview: This project demonstrates a thorough data cleaning process for the Nashville Housing dataset using SQL. The script performs various data cleaning and transformation operations to improve the quality and usability of the data for further analysis.
Technologies Used: SQL Server (T-SQL)
Dataset: The project uses the Nashville Housing dataset, which contains information about property sales in Nashville, Tennessee. The original dataset includes various fields such as property addresses, sale dates, sale prices, and other relevant real estate information.
Data Cleaning Operations: The script performs the following data cleaning operations:
Date Standardization: Converts the SaleDate column to a standard Date format for consistency and easier manipulation.
Populating Missing Property Addresses: Fills in NULL values in the PropertyAddress field using data from other records with the same ParcelID.
Breaking Down Address Components: Separates the PropertyAddress and OwnerAddress fields into individual columns for Address, City, and State, improving data granularity and queryability.
Standardizing Values: Converts 'Y' and 'N' values to 'Yes' and 'No' in the SoldAsVacant field for clarity and consistency.
Removing Duplicates: Identifies and removes duplicate records based on specific criteria to ensure data integrity.
Dropping Unused Columns: Removes unnecessary columns to streamline the dataset.
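A minimal T-SQL sketch of the "Populating Missing Property Addresses" operation above; the table name NashvilleHousing and the UniqueID column are assumptions, not taken verbatim from the project script.

-- Fill NULL PropertyAddress values from another row that shares the same ParcelID.
-- Table name and UniqueID column are assumptions.
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
    ON a.ParcelID = b.ParcelID
   AND a.[UniqueID] <> b.[UniqueID]
WHERE a.PropertyAddress IS NULL;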
Key SQL Techniques Demonstrated:
Data type conversion
Self joins for data population
String manipulation (SUBSTRING, CHARINDEX, PARSENAME)
CASE statements
Window functions (ROW_NUMBER)
Common Table Expressions (CTEs)
Data deletion
Table alterations (adding and dropping columns)
Important Notes:
The script includes cautionary comments about data deletion and column dropping, emphasizing the importance of careful consideration in a production environment. This project showcases various SQL data cleaning techniques and can serve as a template for similar data cleaning tasks.
Potential Improvements:
Implement error handling and transaction management for more robust execution.
Add data validation steps to ensure the cleaned data meets specific criteria.
Consider creating indexes on frequently queried columns for performance optimization.
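A minimal T-SQL sketch of the first suggested improvement (wrapping a cleaning step in a transaction with error handling); the table and column names are assumptions carried over from the operations above, and this is illustrative rather than part of the existing script.

-- Wrap a cleaning step in a transaction so a failure leaves the data untouched.
BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE NashvilleHousing            -- table name assumed, as above
    SET SoldAsVacant = CASE SoldAsVacant WHEN 'Y' THEN 'Yes'
                                         WHEN 'N' THEN 'No'
                                         ELSE SoldAsVacant END;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;  -- re-raise the original error for the operator
END CATCH;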
Using SQL, I was able to clean up the data so that it is easier to analyze. I used JOINs, SUBSTRING, PARSENAME, UPDATE/ALTER TABLE statements, CTEs, CASE statements, and ROW_NUMBER, and learned many different ways to clean data.
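For example, PARSENAME can split a comma-delimited address into parts; the table and column names below are assumptions in the spirit of the script described here.

-- PARSENAME works on '.'-delimited strings, so replace commas with periods first.
-- Table and column names are assumptions.
SELECT
    PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3) AS OwnerSplitAddress,
    PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2) AS OwnerSplitCity,
    PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1) AS OwnerSplitState
FROM NashvilleHousing;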
https://creativecommons.org/publicdomain/zero/1.0/
Data Cleaning from Public Nashville Housing Data:
Standardize the Date Format
Populate Property Address data
Breaking out Addresses into Individual Columns (Address, City, State)
Change Y and N to Yes and No in the "Sold as Vacant" field
Remove Duplicates
Delete Unused Columns
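A minimal T-SQL sketch of the duplicate-removal step above, using a CTE and ROW_NUMBER; the table name and the columns chosen to define a duplicate are assumptions, not taken from the original script.

-- Rows sharing the same key columns are treated as duplicates; keep only the first.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
               ORDER BY UniqueID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;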
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format
On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database
The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
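As an illustration of how the database can be queried directly, here is a minimal SQLite sketch; the table name ngrams_isograms is an assumption, since the actual table names are defined in create-database.sql.

-- Table name is an assumption; the real names are defined by create-database.sql.
-- Example: second-order isograms that are also palindromes, summarized by word length.
SELECT length,
       COUNT(*)     AS n_words,
       SUM("count") AS total_tokens
FROM ngrams_isograms
WHERE isogramy = 2
  AND is_palindrome = 1
GROUP BY length
ORDER BY length;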
https://choosealicense.com/licenses/other/
Dataset Summary
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs. For more, see the full description on the dataset page: https://huggingface.co/datasets/NumbersStation/NSText2SQL.
This dataset was created by George M122
Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/k737-xg34.
For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.
The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
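As a purely illustrative sketch (the table and column names below are hypothetical; the real identifiers are listed on the dataset's Splitgraph page), a query might look like:

-- Hypothetical table and column names, for illustration only.
SELECT ward, COUNT(*) AS zones
FROM street_sweeping_zones
GROUP BY ward
ORDER BY ward;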
See the Splitgraph documentation for more information.
This dataset is a comprehensive collection of healthcare facility ratings across multiple countries. It includes detailed information on various attributes such as facility name, location, type, total beds, accreditation status, and annual visits of hospitals throughout the world. This cleaned dataset is ideal for conducting trend analysis, comparative studies between countries, or developing predictive models for facility ratings based on various factors. It offers a foundation for exploratory data analysis, machine learning modelling, and data visualization projects aimed at uncovering insights in the healthcare industry. The Project consists of the Original dataset, Data Cleaning Script and an EDA script in the data explorer tab for further analysis.
Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/ytfi-mzdz. For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.
Corrections are possible during the course of the sweeping season.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/3dx4-5j8t.
For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.
This dataset is in a format for spatial datasets that is inherently tabular but allows for a map as a derived view. Please click the indicated link below for such a map.
To export the data in either tabular or geographic format, please use the Export button on this dataset.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas, based on MPAN data.
Methodology
Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.
Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites.
Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.
Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. Any Primaries not fed from a Grid Site are marked as NULL and labeled.
Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.
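The postcode-to-Primary assignment described in the methodology above can be expressed as a single query; the following is a minimal sketch only, and the table and column names (mpan_extract, postcode, primary_site) are assumptions rather than the actual internal schema.

-- Hypothetical schema: one row per MPAN with its postcode and feeding Primary Site.
-- Each postcode is assigned to the Primary that feeds the largest number of its MPANs.
WITH counts AS (
    SELECT postcode,
           primary_site,
           COUNT(*) AS mpan_count,
           ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY COUNT(*) DESC) AS rn
    FROM mpan_extract
    GROUP BY postcode, primary_site
)
SELECT postcode, primary_site AS assigned_primary, mpan_count
FROM counts
WHERE rn = 1;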
Quality Control Statement
Quality Control Measures include:
Verification steps to match features only with confirmed functional locations.
Manual review and correction of data inconsistencies.
Use of additional verification steps to ensure accuracy in the methodology.
Regular updates and reviews documented in the version history.
Assurance Statement
The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
This dataset contains all historical Dry Cleaner Registrations in Texas. Note that most registrations listed are expired and are from previous years.
View operating dry cleaners with current and valid (unexpired) registration certificates here: https://data.texas.gov/dataset/Texas-Commission-on-Environmental-Quality-Current-/qfph-9bnd/
State law requires dry cleaning facilities and drop stations to register with TCEQ. Dry cleaning facilities and drop stations must renew their registration by August 1st of each year. The Dry Cleaners Registrations reflect self-reported registration information about whether a dry cleaning location is a facility or drop station, and whether they have opted out of the Dry Cleaning Environmental Remediation Fund. Distributors can find out whether to collect solvent fees from each registered facility as well as the registration status and delivery certificate expiration date of a location.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Street sweeping zones by Ward and Ward Section Number. The zones are the same as those used in 2014. For the corresponding schedule, see https://data.cityofchicago.org/d/waad-z968. Because the City of Chicago ward map will change on May 18, 2015, this dataset will be supplemented with an additional dataset to cover the remainder of 2015 (through November).
For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP. The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
https://creativecommons.org/publicdomain/zero/1.0/
I took this data from a Kaggle dataset and cleaned it myself in MySQL.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by christopher alverio
Released under MIT
The Annual Household Survey is conducted to meet the data requirements of the National Accounts of Lao PDR.
Sample survey data [ssd]
For details, please refer to the "AHS2004 Report - Final" manual (Lao version), page 2.
Face-to-face [f2f]
AHS 2004 has 2 forms:
(1) Household Survey - identification - household composition - labor force: labor force participation in the last seven days; overview of work in the last seven days - construction activities in the past 12 months - household businesses: establishing the existence of non-farm enterprises - agriculture: crops harvested during the last 12 months; fishery; forestry; livestock, poultry - households' purchases and sales of durables during the last 12 months - income and transfers
(2) Diary Expenditure and Consumption Household Survey - identification - households' diary sheet for household transactions
Data editing: - Office editing and coding - Microsoft Access used for data entry and checking - SQL Server used for the database - SPSS used for analysis
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a collection of SQL scripts and techniques developed by a business data analyst to assist with data optimization and cleaning tasks. The scripts cover a range of data management operations, including:
1) Data cleansing: Identifying and addressing issues such as missing values, duplicate records, formatting inconsistencies, and outliers.
2) Data normalization: Designing optimized database schemas and normalizing data structures to minimize redundancy and improve data integrity.
3) Data transformation and ETL: Developing efficient Extract, Transform, and Load (ETL) pipelines to integrate data from multiple sources and perform complex data transformations.
4) Reporting and dashboarding: Creating visually appealing and insightful reports, dashboards, and data visualizations to support informed decision-making.
The scripts and techniques in this dataset are tailored to the needs of business data analysts and can be used to enhance the quality, efficiency, and value of data-driven insights.
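As an illustration of the data cleansing operations described above (missing values, formatting inconsistencies, outliers), here is a minimal sketch; the table customer_records and its columns are hypothetical placeholders and are not part of the published scripts.

-- Hypothetical table and columns, used purely for illustration.
-- Standardize text formatting and flag rows with missing or outlying values.
SELECT
    id,
    UPPER(LTRIM(RTRIM(customer_name))) AS customer_name_clean,
    CASE WHEN order_total IS NULL THEN 'missing'
         WHEN order_total > (SELECT AVG(order_total) + 3 * STDEV(order_total)
                             FROM customer_records) THEN 'outlier'
         ELSE 'ok'
    END AS order_total_flag
FROM customer_records;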