85 datasets found

Data Cleaning Portfolio Project
kaggle.com
zip
Updated Apr 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Deepali Sukhdeve (2024). Data Cleaning Portfolio Project [Dataset]. https://www.kaggle.com/datasets/deepalisukhdeve/data-cleaning-portfolio-project
Explore at:
zip(6053781 bytes)Available download formats
Dataset updated
Apr 2, 2024
Authors
Deepali Sukhdeve
Description
Dataset

This dataset was created by Deepali Sukhdeve

Contents
Nashville Housing Data Cleaning Project
kaggle.com
zip
Updated Aug 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ahmed Elhelbawy (2024). Nashville Housing Data Cleaning Project [Dataset]. https://www.kaggle.com/datasets/elhelbawylogin/nashville-housing-data-cleaning-project/discussion
Explore at:
zip(1282 bytes)Available download formats
Dataset updated
Aug 20, 2024
Authors
Ahmed Elhelbawy
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
Nashville
Description
Project Overview : This project demonstrates a thorough data cleaning process for the Nashville Housing dataset using SQL. The script performs various data cleaning and transformation operations to improve the quality and usability of the data for further analysis.

Technologies Used : SQL Server T-SQL

Dataset: The project uses the Nashville Housing dataset, which contains information about property sales in Nashville, Tennessee. The original dataset includes various fields such as property addresses, sale dates, sale prices, and other relevant real estate information. Data Cleaning Operations The script performs the following data cleaning operations:

Date Standardization: Converts the SaleDate column to a standard Date format for consistency and easier manipulation. Populating Missing Property Addresses: Fills in NULL values in the PropertyAddress field using data from other records with the same ParcelID. Breaking Down Address Components: Separates the PropertyAddress and OwnerAddress fields into individual columns for Address, City, and State, improving data granularity and queryability. Standardizing Values: Converts 'Y' and 'N' values to 'Yes' and 'No' in the SoldAsVacant field for clarity and consistency. Removing Duplicates: Identifies and removes duplicate records based on specific criteria to ensure data integrity. Dropping Unused Columns: Removes unnecessary columns to streamline the dataset.

Key SQL Techniques Demonstrated :

Data type conversion Self joins for data population String manipulation (SUBSTRING, CHARINDEX, PARSENAME) CASE statements Window functions (ROW_NUMBER) Common Table Expressions (CTEs) Data deletion Table alterations (adding and dropping columns)

Important Notes :

The script includes cautionary comments about data deletion and column dropping, emphasizing the importance of careful consideration in a production environment. This project showcases various SQL data cleaning techniques and can serve as a template for similar data cleaning tasks.

Potential Improvements :

Implement error handling and transaction management for more robust execution. Add data validation steps to ensure the cleaned data meets specific criteria. Consider creating indexes on frequently queried columns for performance optimization.
SQL Data Cleaning Portfolio V2
kaggle.com
zip
Updated Jun 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohammad Hurairah (2023). SQL Data Cleaning Portfolio V2 [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/sql-cleaning-portfolio-v2/discussion
Explore at:
zip(6054498 bytes)Available download formats
Dataset updated
Jun 16, 2023
Authors
Mohammad Hurairah
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Data Cleaning from Public Nashville Housing Data:

Standardize the Date Format

Populate Property Address data

Breaking out Addresses into Individual Columns (Address, City, State)

Change Y and N to Yes and No in the "Sold as Vacant" field

Remove Duplicates

Delete Unused Columns
SQL Data Cleaning & EDA Project
kaggle.com
zip
Updated Oct 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bilal424 (2024). SQL Data Cleaning & EDA Project [Dataset]. https://www.kaggle.com/datasets/bilal424/sql-data-cleaning-and-eda-project/code
Explore at:
zip(5352 bytes)Available download formats
Dataset updated
Oct 15, 2024
Authors
Bilal424
Description
This dataset is a comprehensive collection of healthcare facility ratings across multiple countries. It includes detailed information on various attributes such as facility name, location, type, total beds, accreditation status, and annual visits of hospitals throughout the world. This cleaned dataset is ideal for conducting trend analysis, comparative studies between countries, or developing predictive models for facility ratings based on various factors. It offers a foundation for exploratory data analysis, machine learning modelling, and data visualization projects aimed at uncovering insights in the healthcare industry. The Project consists of the Original dataset, Data Cleaning Script and an EDA script in the data explorer tab for further analysis.
Data and tools for studying isograms
figshare.com
Updated Jul 31, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
Explore at:
application/x-sqlite3Available download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5245810.v1
Dataset updated
Jul 31, 2017
Dataset provided by
Figsharehttp://figshare.com/
Authors
Florian Breit
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC).Below follows a brief description, first, of the included datasets and, second, of the included scripts.1. DatasetsThe data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.1.1 CSV formatThe CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.The CSV files contain one row per data point, with the colums separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

Label Data type Description

isogramy int The order of isogramy, e.g. "2" is a second order isogram

length int The length of the word in letters

word text The actual word/isogram in ASCII

source_pos text The Part of Speech tag from the original corpus

count int Token count (total number of occurences)

vol_count int Volume count (number of different sources which contain the word)

count_per_million int Token count per million words

vol_count_as_percent int Volume count as percentage of the total number of volumes

is_palindrome bool Whether the word is a palindrome (1) or not (0)

is_tautonym bool Whether the word is a tautonym (1) or not (0)

The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

Label

Data type

Description

!total_1grams

int

The total number of words in the corpus

!total_volumes

int

The total number of volumes (individual sources) in the corpus

!total_isograms

int

The total number of isograms found in the corpus (before compacting)

!total_palindromes

int

How many of the isograms found are palindromes

!total_tautonyms

int

How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.1.2 SQLite database formatOn the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:• Compacted versions of each dataset, where identical headwords are combined into a single entry.• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.The intersected dataset is by far the least noisy, but is missing some real isograms, too.The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above.To get an idea of the various ways the database can be queried for various bits of data see the R script described below, which computes statistics based on the SQLite database.2. ScriptsThere are three scripts: one for tiding Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).2.1 Source dataThe scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html], (download all.al.gz).For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.2.2 Data preparationBefore processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format.Tidying and reformatting can be done by running one of the following commands:python isograms.py --ngrams --indir=INDIR --outfile=OUTFILEpython isograms.py --bnc --indir=INFILE --outfile=OUTFILEReplace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.2.3 Isogram ExtractionAfter preparing the data as above, isograms can be extracted from by running the following command on the reformatted and tidied files:python isograms.py --batch --infile=INFILE --outfile=OUTFILEHere INFILE should refer the the output from the previosu data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.2.4 Creating a SQLite3 databaseThe output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them, if you only want to load one, just create an empty file for the other one).2. Copy the “create-database.sql” script into the same directory as the two data files.3. On the command line, go to the directory where the files and the SQL script are. 4. Type: sqlite3 isograms.db 5. This will create a database called “isograms.db”.See the section 1 for a basic descript of the output data and how to work with the database.2.5 Statistical processingThe repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
MY SQL DATA CLEANING PROJECT
kaggle.com
zip
Updated Jun 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
George M122 (2024). MY SQL DATA CLEANING PROJECT [Dataset]. https://www.kaggle.com/georgem122/my-sql-data-cleaning-project
Explore at:
zip(1421 bytes)Available download formats
Dataset updated
Jun 20, 2024
Authors
George M122
Description
Dataset

This dataset was created by George M122

Contents
SQLcleaning
kaggle.com
zip
Updated Mar 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stephen M Blake (2023). SQLcleaning [Dataset]. https://www.kaggle.com/datasets/stephenmblake/sqlcleaning
Explore at:
zip(8206870 bytes)Available download formats
Dataset updated
Mar 15, 2023
Authors
Stephen M Blake
Description
Using SQL was able to cleaning up data so the it is easier to analyze. Used JOIN's, Substrings, parsename, update/alter tables, CTE, case statement, and row_number.. Learned many different ways to cleaning the data.
SQL Data Cleaning Project1
kaggle.com
zip
Updated Nov 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
christopher alverio (2024). SQL Data Cleaning Project1 [Dataset]. https://www.kaggle.com/datasets/christopheralverio/sql-data-cleaning-project1/code
Explore at:
zip(1312 bytes)Available download formats
Dataset updated
Nov 12, 2024
Authors
christopher alverio
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by christopher alverio

Released under MIT

Contents
O
NSText2SQL
opendatalab.com
huggingface.co
zip
Updated Jul 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). NSText2SQL [Dataset]. https://opendatalab.com/OpenDataLab/NSText2SQL
Explore at:
zipAvailable download formats
Dataset updated
Jul 1, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissable licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.
Z
IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal...
data.niaid.nih.gov
nde-dev.biothings.io
+2more
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Indiana University
Authors
Cains, Mariana; Anand, Srini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was clean and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

The companion paper can be found here: doi.org/10.5281/zenodo.814979

Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
S
StreetSweeping022819
splitgraph.com
data.cityofchicago.org
+2more
Updated Apr 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Chicago (2024). StreetSweeping022819 [Dataset]. https://www.splitgraph.com/cityofchicago/streetsweeping022819-jqxt-c6gd
Explore at:
application/openapi+json, json, application/vnd.splitgraph.imageAvailable download formats
Dataset updated
Apr 10, 2024
Dataset authored and provided by
City of Chicago
Description
Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/k737-xg34.

For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.

The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).

Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

See the Splitgraph documentation for more information.
Cleaning Data in SQL Portfolio Project
kaggle.com
zip
Updated Apr 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Austin Kennell (2023). Cleaning Data in SQL Portfolio Project [Dataset]. https://www.kaggle.com/austinkennell/cleaning-data-in-sql-portfolio-project
Explore at:
zip(6054868 bytes)Available download formats
Dataset updated
Apr 19, 2023
Authors
Austin Kennell
Description
The dataset contained information on housing data in the Nashville, TN area. I used SQL Server to clean the data to make it easier to use. For example, I converted some dates to remove unnecessary timestamps; I populated data for null values; I changed address columns from containing all of the address, city and state into separate columns; I changed a column that had different representations of the same data into consistent usage; I removed duplicate rows; and I deleted unused columns.
Data cleaning and analysis SQL code
kaggle.com
zip
Updated Jun 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
jccs95 (2023). Data cleaning and analysis SQL code [Dataset]. https://www.kaggle.com/datasets/jccs95/data-cleaning-and-analysis-sql-code
Explore at:
zip(2728 bytes)Available download formats
Dataset updated
Jun 21, 2023
Authors
jccs95
Description
Dataset

This dataset was created by jccs95

Contents
S
Street Sweeping Schedule - 2024
splitgraph.com
data.cityofchicago.org
+2more
Updated Mar 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Chicago (2024). Street Sweeping Schedule - 2024 [Dataset]. https://www.splitgraph.com/cityofchicago/street-sweeping-schedule-2024-3q8d-2t69
Explore at:
application/vnd.splitgraph.image, application/openapi+json, jsonAvailable download formats
Dataset updated
Mar 29, 2024
Dataset authored and provided by
City of Chicago
Description
Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/ytfi-mzdz. For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.

Corrections are possible during the course of the sweeping season.

Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

See the Splitgraph documentation for more information.
S
Street Sweeping Zones - 2023
splitgraph.com
data.cityofchicago.org
+1more
Updated Mar 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Chicago (2023). Street Sweeping Zones - 2023 [Dataset]. https://www.splitgraph.com/cityofchicago/street-sweeping-zones-2023-6c59-kupn
Explore at:
json, application/vnd.splitgraph.image, application/openapi+jsonAvailable download formats
Dataset updated
Mar 31, 2023
Dataset authored and provided by
City of Chicago
Description
Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/3dx4-5j8t.

For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.

This dataset is in a format for spatial datasets that is inherently tabular but allows for a map as a derived view. Please click the indicated link below for such a map.

To export the data in either tabular or geographic format, please use the Export button on this dataset.

Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

See the Splitgraph documentation for more information.
u
University of Cape Town Student Admissions Data 2006-2014 - South Africa
datafirst.uct.ac.za
Updated Jul 28, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/556
Explore at:
Dataset updated
Jul 28, 2020
Dataset authored and provided by
UCT Student Administration
Time period covered
2006 - 2014
Area covered
South Africa
Description
Abstract

This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between the January 2006 and December 2014. In the original form received by DataFirst the data were ill suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality direct identifiers have been removed from the data and the data is only made available to accredited researchers through DataFirst's Secure Data Service.

The dataset was separated into the following data files:

Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.

Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.

Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific and the data allows the user to distinguish between, for example, higher grade accounting and standard grade accounting. It also allows the user to identify the educational authority issuing the qualification e.g. Cambridge Internal Examinations (CIE) versus National Senior Certificate (NSC).

Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.

Analysis unit

Applications, individuals

Kind of data

Administrative records [adm]

Mode of data collection

Other [oth]

Cleaning operations

The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services . The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.
S
Texas Commission on Environmental Quality - Historical Dry Cleaner...
splitgraph.com
data.texas.gov
+2more
Updated Oct 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office of Waste (2024). Texas Commission on Environmental Quality - Historical Dry Cleaner Registrations [Dataset]. https://www.splitgraph.com/texas-gov/texas-commission-on-environmental-quality-xcc6-2a52
Explore at:
application/openapi+json, application/vnd.splitgraph.image, jsonAvailable download formats
Dataset updated
Oct 15, 2024
Dataset authored and provided by
Office of Waste
Description
This dataset contains all historical Dry Cleaner Registrations in Texas. Note that most registrations listed are expired and are from previous years.

View operating dry cleaners with current and valid (unexpired) registration certificates here: https://data.texas.gov/dataset/Texas-Commission-on-Environmental-Quality-Current-/qfph-9bnd/

State law requires dry cleaning facilities and drop stations to register with TCEQ. Dry cleaning facilities and drop stations must renew their registration by August 1st of each year. The Dry Cleaners Registrations reflect self-reported registration information about whether a dry cleaning location is a facility or drop station, and whether they have opted out of the Dry Cleaning Environmental Remediation Fund. Distributors can find out whether to collect solvent fees from each registered facility as well as the registration status and delivery certificate expiration date of a location.

Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

See the Splitgraph documentation for more information.
o
UK Power Networks Grid Substation Distribution Areas
ukpowernetworks.opendatasoft.com
Updated Mar 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). UK Power Networks Grid Substation Distribution Areas [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-grid-postcode-area/
Explore at:
Dataset updated
Mar 31, 2025
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionThis dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas based on MPAN data.

Methodology

Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.

Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites.

Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.

Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. ny Primaries not fed from a Grid Site are marked as NULL and labeled.

Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.

Quality Control Statement

Quality Control Measures include:

Verification steps to match features only with confirmed functional locations. Manual review and correct of data inconsistencies Use of additional verification steps to ensure accuracy in the methodology Regular updates and reviews documented in the version history

Assurance Statement The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.

Other

Download dataset information: Metadata (JSON)

Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/To view this data please register and login.
National Household Income and Expenditure Survey 2009-2010 - Namibia
microdata.nsanamibia.com
Updated Aug 5, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Namibia Statistics Agency (2024). National Household Income and Expenditure Survey 2009-2010 - Namibia [Dataset]. https://microdata.nsanamibia.com/index.php/catalog/6
Explore at:
Dataset updated
Aug 5, 2024
Dataset authored and provided by
Namibia Statistics Agencyhttps://nsa.org.na/
Time period covered
2009 - 2010
Area covered
Namibia
Description
Abstract

The Household Income and Expenditure Survey is a survey collecting data on income, consumption and expenditure patterns of households, in accordance with methodological principles of statistical enquiries, which are linked to demographic and socio-economic characteristics of households. A Household Income and expenditure Survey is the sole source of information on expenditure, consumption and income patterns of households, which is used to calculate poverty and income distribution indicators. It also serves as a statistical infrastructure for the compilation of the national basket of goods used to measure changes in price levels. Furthermore, it is used for updating of the national accounts.

The main objective of the NHIES 2009/2010 is to comprehensively describe the levels of living of Namibians using actual patterns of consumption and income, as well as a range of other socio-economic indicators based on collected data. This survey was designed to inform policy making at the international, national and regional levels within the context of the Fourth National Development Plan, in support of monitoring and evaluation of Vision 2030 and the Millennium Development Goals. The NHIES was designed to provide policy decision making with reliable estimates at regional levels as well as to meet rural - urban disaggregation requirements.

Geographic coverage

National Coverage

Analysis unit

Individuals and Households

Universe

Every week of the four weeks period of a survey round all persons in the household were asked if they spent at least 4 nights of the week in the household. Any person who spent at least 4 nights in the household was taken as having spent the whole week in the household. To qualify as a household member a person must have stayed in the household for at least two weeks out of four weeks.

Kind of data

Sample survey data [ssd]

Sampling procedure

The targeted population of NHIES 2009/2010 was the private households of Namibia. The population living in institutions, such as hospitals, hostels, police barracks and prisons were not covered in the survey. However, private households residing within institutional settings were covered. The sample design for the survey was a stratified two-stage probability sample, where the first stage units were geographical areas designated as the Primary Sampling Units (PSUs) and the second stage units were the households. The PSUs were based on the 2001 Census EAs and the list of PSUs serves as the national sample frame. The urban part of the sample frame was updated to include the changes that take place due to rural to urban migration and the new developments in housing. The sample frame is stratified first by region followed by urban and rural areas within region. In urban areas further stratification is carried out by level of living which is based on geographic location and housing characteristics. The first stage units were selected from the sampling frame of PSUs and the second stage units were selected from a current list of households within each selected PSU, which was compiled just before the interviews.

PSUs were selected using probability proportional to size sampling coupled with the systematic sampling procedure where the size measure was the number of households within the PSU in the 2001 Population and Housing Census. The households were selected from the current list of households using systematic sampling procedure.

The sample size was designed to achieve reliable estimates at the region level and for urban and rural areas within each region. However the actual sample sizes in urban or rural areas within some of the regions may not satisfy the expected precision levels for certain characteristics. The final sample consists of 10 660 households in 533 PSUs. The selected PSUs were randomly allocated to the 13 survey rounds.

Sampling deviation

All the expected sample of 533 PSUs was covered. However a number of originally selected PSUs had to be substituted by new ones due to the following reasons.

Urban areas: Movement of people for resettlement in informal settlement areas from one place to another caused a selected PSU to be empty of households.

Rural areas: In addition to Caprivi region (where one constituency is generally flooded every year) Ohangwena and Oshana regions were badly affected from an unusual flood situation. Although this situation was generally addressed by interchanging the PSUs betweensurvey rounds still some PSUs were under water close to the end of the survey period. There were five empty PSUs in the urban areas of Hardap (1), Karas (3) and Omaheke (1) regions. Since these PSUs were found in the low strata within the urban areas of the relevant regions the substituting PSUs were selected from the same strata. The PSUs under water were also five in rural areas of Caprivi (1), Ohangwena (2) and Oshana (2) regions. Wherever possible the substituting PSUs were selected from the same constituency where the original PSU was selected. If not, the selection was carried out from the rural stratum of the particular region. One sampled PSU in urban area of Khomas region (Windhoek city) had grown so large that it had to be split into 7 PSUs. This was incorporated into the geographical information system (GIS) and one PSU out of the seven was selected for the survey. In one PSU in Erongo region only fourteen households were listed and one in Omusati region listed only eleven households. All these households were interviewed and no additional selection was done to cover for the loss in sample.

Mode of data collection

Face-to-face [f2f]

Research instrument

The instruments for data collection were as in the previous survey the questionnaires and manuals. Form I questionnaire collected demographic and socio-economic information of household members, such as: sex, age, education, employment status among others. It also collected information on household possessions like animals, land, housing, household goods, utilities, household income and expenditure, etc.

Form II or the Daily Record Book is a diary for recording daily household transactions. A book was administered to each sample household each week for four consecutive weeks (survey round). Households were asked to record transactions, item by item, for all expenditures and receipts, including incomes and gifts received or given out. Own produce items were also recorded. Prices of items from different outlets were also collected in both rural and urban areas. The price collection was needed to supplement information from areas where price collection for consumer price indices (CPI) does not currently take place.

Cleaning operations

The questionnaires received from the regions were registered and counterchecked at the survey head office. The data processing team consisted of Systems administrator, IT technician, Programmers, Statisticians and Data typists.

Data capturing

The data capturing process was undertakenin the following ways: Form 1 was scanned, interpreted and verified using the “Scan”, “Interpret” & “Verify” modules of the Eyes & Hands software respectively. Some basic checks were carried out to ensure that each PSU was valid and every household was unique. Invalid characters were removed. The scanned and verified data was converted into text files using the “Transfer” module of the Eyes & Hands. Finally, the data was transferred to a SQL database for further processing, using the “TranScan” application. The Daily Record Books (DRB or form 2) were manually entered after the scanned data had been transferred to the SQL database. The reason was to ensure that all DRBs were linked to the correct Form 1, i.e. each household’s Form 1 was linked to the corresponding Daily Record Book. In total, 10 645 questionnaires (Form 1), comprising around 500 questions each, were scanned and close to one million transactions from the Form 2 (DRBs) were manually captured.

Response rate

Household response rate: Total number of responding households and non-responding households and the reason for non-response are shown below. Non-contacts and incomplete forms, which were rejected due to a lot of missing data in the questionnaire, at 3.4 and 4.0 percent, respectively, formed the largest part of non-response. At the regional level Erongo, Khomas, and Kunene reported the lowest response rate and Caprivi and Kavango the highest. See page 17 of the report for a detailed breakdown of response rates by region.

Data appraisal

To be able to compare with the previous survey in 2003/2004 and to follow up the development of the country, methodology and definitions were kept the same. Comparisons between the surveys can be found in the different chapters in this report. Experiences from the previous survey gave valuable input to this one and the data collection was improved to avoid earlier experienced errors. Also, some additional questions in the questionnaire helped to confirm the accuracy of reported data. During the data cleaning process it turned out, that some households had difficulty to separate their household consumption from their business consumption when recording their daily transactions in DRB. This was in particular applicable for the guest farms, the number of which has shown a big increase during the past five years. All households with extreme high consumption were examined manually and business transactions were recorded and separated from private consumption.
Housing - SQL Project
kaggle.com
Updated Jun 13, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ann Truong (2023). Housing - SQL Project [Dataset]. https://www.kaggle.com/datasets/bvanntruong/housing-sql-project
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 13, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ann Truong
Description
This dataset contains information about housing sales in Nashville, TN such as property, owner, sales, and tax information. The SQL queries I created for Data Cleaning can be found here.

Facebook

Twitter

Click to copy link

Link copied

Cite

Deepali Sukhdeve (2024). Data Cleaning Portfolio Project [Dataset]. https://www.kaggle.com/datasets/deepalisukhdeve/data-cleaning-portfolio-project

Data Cleaning Portfolio Project

Cleaning Data with SQL Queries

Explore at:

zip(6053781 bytes)Available download formats

Dataset updated

Apr 2, 2024

Authors

Deepali Sukhdeve

Description

Dataset

This dataset was created by Deepali Sukhdeve

Clear search

Close search

Google apps

Main menu

Data Cleaning Portfolio Project

Dataset

Contents

Nashville Housing Data Cleaning Project

SQL Data Cleaning Portfolio V2

SQL Data Cleaning & EDA Project

Data and tools for studying isograms

MY SQL DATA CLEANING PROJECT

Dataset

Contents

SQLcleaning

SQL Data Cleaning Project1

Dataset

Contents

NSText2SQL

IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal...

StreetSweeping022819

Cleaning Data in SQL Portfolio Project

Data cleaning and analysis SQL code

Dataset

Contents

Street Sweeping Schedule - 2024

Street Sweeping Zones - 2023

University of Cape Town Student Admissions Data 2006-2014 - South Africa

Abstract

Analysis unit

Kind of data

Mode of data collection

Cleaning operations

Texas Commission on Environmental Quality - Historical Dry Cleaner...

UK Power Networks Grid Substation Distribution Areas

National Household Income and Expenditure Survey 2009-2010 - Namibia

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Sampling deviation

Mode of data collection

Research instrument

Cleaning operations

Response rate

Data appraisal

Housing - SQL Project

Data Cleaning Portfolio Project

Cleaning Data with SQL Queries

Dataset

Contents