85 datasets found
  1. Data Cleaning Portfolio Project

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Deepali Sukhdeve (2024). Data Cleaning Portfolio Project [Dataset]. https://www.kaggle.com/datasets/deepalisukhdeve/data-cleaning-portfolio-project
    Explore at:
    zip (6053781 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Deepali Sukhdeve
    Description

    Dataset

    This dataset was created by Deepali Sukhdeve

    Contents

  2. Nashville Housing Data Cleaning Project

    • kaggle.com
    zip
    Updated Aug 20, 2024
    Cite
    Ahmed Elhelbawy (2024). Nashville Housing Data Cleaning Project [Dataset]. https://www.kaggle.com/datasets/elhelbawylogin/nashville-housing-data-cleaning-project/discussion
    Explore at:
    zip (1282 bytes)
    Dataset updated
    Aug 20, 2024
    Authors
    Ahmed Elhelbawy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Nashville
    Description

    Project Overview: This project demonstrates a thorough data cleaning process for the Nashville Housing dataset using SQL. The script performs various data cleaning and transformation operations to improve the quality and usability of the data for further analysis.

    Technologies Used: SQL Server (T-SQL)

    Dataset: The project uses the Nashville Housing dataset, which contains information about property sales in Nashville, Tennessee. The original dataset includes various fields such as property addresses, sale dates, sale prices, and other relevant real estate information.

    Data Cleaning Operations: The script performs the following data cleaning operations:

    Date Standardization: Converts the SaleDate column to a standard Date format for consistency and easier manipulation.
    Populating Missing Property Addresses: Fills in NULL values in the PropertyAddress field using data from other records with the same ParcelID (see the sketch below this list).
    Breaking Down Address Components: Separates the PropertyAddress and OwnerAddress fields into individual columns for Address, City, and State, improving data granularity and queryability.
    Standardizing Values: Converts 'Y' and 'N' values to 'Yes' and 'No' in the SoldAsVacant field for clarity and consistency.
    Removing Duplicates: Identifies and removes duplicate records based on specific criteria to ensure data integrity.
    Dropping Unused Columns: Removes unnecessary columns to streamline the dataset.
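
    For illustration, a minimal T-SQL sketch of the address-population step follows. It assumes a table named NashvilleHousing with ParcelID, UniqueID and PropertyAddress columns; the actual names used in the script may differ.

      -- Populate missing property addresses from other rows that share the
      -- same ParcelID, using a self join (table/column names are assumptions).
      UPDATE a
      SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
      FROM NashvilleHousing AS a
      JOIN NashvilleHousing AS b
        ON a.ParcelID = b.ParcelID
       AND a.UniqueID <> b.UniqueID
      WHERE a.PropertyAddress IS NULL;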

    Key SQL Techniques Demonstrated:

    Data type conversion
    Self joins for data population
    String manipulation (SUBSTRING, CHARINDEX, PARSENAME)
    CASE statements
    Window functions (ROW_NUMBER)
    Common Table Expressions (CTEs); combined with ROW_NUMBER in the duplicate-removal sketch below
    Data deletion
    Table alterations (adding and dropping columns)
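
    A hedged sketch of how a CTE and ROW_NUMBER are typically combined to remove duplicates; the partitioning columns are assumptions chosen for illustration and should match whatever defines a duplicate in the actual script.

      -- Flag rows that repeat the same parcel, address, date, price and legal
      -- reference, then delete every copy after the first (T-SQL; names assumed).
      WITH RowNumCTE AS (
          SELECT *,
                 ROW_NUMBER() OVER (
                     PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference
                     ORDER BY UniqueID
                 ) AS row_num
          FROM NashvilleHousing
      )
      DELETE FROM RowNumCTE
      WHERE row_num > 1;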

    Important Notes:

    The script includes cautionary comments about data deletion and column dropping, emphasizing the importance of careful consideration in a production environment. This project showcases various SQL data cleaning techniques and can serve as a template for similar data cleaning tasks.

    Potential Improvements:

    Implement error handling and transaction management for more robust execution (sketched below).
    Add data validation steps to ensure the cleaned data meets specific criteria.
    Consider creating indexes on frequently queried columns for performance optimization.
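
    As a sketch of the first suggested improvement, an update can be wrapped in a transaction with basic error handling. The statement inside is the SoldAsVacant standardization described above; table and column names remain assumptions.

      BEGIN TRY
          BEGIN TRANSACTION;

          -- Standardize 'Y'/'N' to 'Yes'/'No' (illustrative statement).
          UPDATE NashvilleHousing
          SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                                  WHEN SoldAsVacant = 'N' THEN 'No'
                                  ELSE SoldAsVacant
                             END;

          COMMIT TRANSACTION;
      END TRY
      BEGIN CATCH
          IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
          THROW; -- re-raise the original error after rolling back
      END CATCH;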

  3. SQL Data Cleaning Portfolio V2

    • kaggle.com
    zip
    Updated Jun 16, 2023
    Cite
    Mohammad Hurairah (2023). SQL Data Cleaning Portfolio V2 [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/sql-cleaning-portfolio-v2/discussion
    Explore at:
    zip (6054498 bytes)
    Dataset updated
    Jun 16, 2023
    Authors
    Mohammad Hurairah
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data Cleaning from Public Nashville Housing Data:

    1. Standardize the Date Format

    2. Populate Property Address data

    3. Breaking out Addresses into Individual Columns (Address, City, State), as shown in the sketch after this list

    4. Change Y and N to Yes and No in the "Sold as Vacant" field

    5. Remove Duplicates

    6. Delete Unused Columns
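
    The address break-out in step 3 is straightforward to sketch in T-SQL with SUBSTRING and CHARINDEX, assuming a NashvilleHousing table whose PropertyAddress values look like 'street, city' (table and column names are assumptions):

      -- Split 'street, city' into two columns; names are assumptions.
      SELECT PropertyAddress,
             SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1) AS SplitAddress,
             SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress)) AS SplitCity
      FROM NashvilleHousing
      WHERE CHARINDEX(',', PropertyAddress) > 0;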

  4. SQL Data Cleaning & EDA Project

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Cite
    Bilal424 (2024). SQL Data Cleaning & EDA Project [Dataset]. https://www.kaggle.com/datasets/bilal424/sql-data-cleaning-and-eda-project/code
    Explore at:
    zip (5352 bytes)
    Dataset updated
    Oct 15, 2024
    Authors
    Bilal424
    Description

    This dataset is a comprehensive collection of healthcare facility ratings across multiple countries. It includes detailed information on various attributes such as facility name, location, type, total beds, accreditation status, and annual visits of hospitals throughout the world. This cleaned dataset is ideal for conducting trend analysis, comparative studies between countries, or developing predictive models for facility ratings based on various factors. It offers a foundation for exploratory data analysis, machine learning modelling, and data visualization projects aimed at uncovering insights in the healthcare industry. The project consists of the original dataset, a data cleaning script, and an EDA script in the Data Explorer tab for further analysis.

  5. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    Compacted versions of each dataset, where identical headwords are combined into a single entry.
    A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

      python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
      python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram Extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

      python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the “create-database.sql” script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called “isograms.db”.

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
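
    As an illustration of querying the database directly, here is a minimal SQLite query sketch. The column names follow the layout documented above; the table name bnc_isograms is an assumption, so check create-database.sql for the actual table names.

      -- Count first-order isograms, and palindromes among them, grouped by length.
      -- Table name is assumed; columns follow the documented layout.
      SELECT length,
             COUNT(*) AS n_isograms,
             SUM(is_palindrome) AS n_palindromes
      FROM bnc_isograms
      WHERE isogramy = 1
      GROUP BY length
      ORDER BY length;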

  6. MY SQL DATA CLEANING PROJECT

    • kaggle.com
    zip
    Updated Jun 20, 2024
    Cite
    George M122 (2024). MY SQL DATA CLEANING PROJECT [Dataset]. https://www.kaggle.com/georgem122/my-sql-data-cleaning-project
    Explore at:
    zip (1421 bytes)
    Dataset updated
    Jun 20, 2024
    Authors
    George M122
    Description

    Dataset

    This dataset was created by George M122

    Contents

  7. SQLcleaning

    • kaggle.com
    zip
    Updated Mar 15, 2023
    Cite
    Stephen M Blake (2023). SQLcleaning [Dataset]. https://www.kaggle.com/datasets/stephenmblake/sqlcleaning
    Explore at:
    zip (8206870 bytes)
    Dataset updated
    Mar 15, 2023
    Authors
    Stephen M Blake
    Description

    Using SQL, I was able to clean up the data so that it is easier to analyze. I used JOINs, SUBSTRING, PARSENAME, UPDATE/ALTER TABLE statements, CTEs, CASE statements, and ROW_NUMBER, and learned many different ways to clean the data.
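
    As a hedged illustration of the PARSENAME technique mentioned here (PARSENAME splits on periods, so commas are first replaced with periods); the table and column names mirror the Nashville Housing examples above and are assumptions:

      -- Split a comma-separated owner address into address, city and state (T-SQL).
      -- Table/column names (NashvilleHousing, OwnerAddress) are assumptions.
      SELECT OwnerAddress,
             PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3) AS OwnerSplitAddress,
             PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2) AS OwnerSplitCity,
             PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1) AS OwnerSplitState
      FROM NashvilleHousing;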

  8. SQL Data Cleaning Project1

    • kaggle.com
    zip
    Updated Nov 12, 2024
    Cite
    christopher alverio (2024). SQL Data Cleaning Project1 [Dataset]. https://www.kaggle.com/datasets/christopheralverio/sql-data-cleaning-project1/code
    Explore at:
    zip (1312 bytes)
    Dataset updated
    Nov 12, 2024
    Authors
    christopher alverio
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by christopher alverio

    Released under MIT

    Contents

  9. NSText2SQL

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jul 1, 2024
    Cite
    (2024). NSText2SQL [Dataset]. https://opendatalab.com/OpenDataLab/NSText2SQL
    Explore at:
    zip
    Dataset updated
    Jul 1, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    NSText2SQL is the dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.

  10. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  11. StreetSweeping022819

    • splitgraph.com
    • data.cityofchicago.org
    • +2more
    Updated Apr 10, 2024
    Cite
    City of Chicago (2024). StreetSweeping022819 [Dataset]. https://www.splitgraph.com/cityofchicago/streetsweeping022819-jqxt-c6gd
    Explore at:
    application/openapi+json, json, application/vnd.splitgraph.image
    Dataset updated
    Apr 10, 2024
    Dataset authored and provided by
    City of Chicago
    Description

    Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/k737-xg34.

    For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.

    The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
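
    A hedged sketch of such a query (the Splitgraph DDN typically accepts PostgreSQL-dialect SQL): the repository path below is taken from the dataset URL above, while the table name inside the repository is an assumption, so check the repository page for the actual table name.

      -- Preview a few rows of the street-sweeping zones repository.
      SELECT *
      FROM "cityofchicago/streetsweeping022819-jqxt-c6gd"."streetsweeping022819"
      LIMIT 10;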

    See the Splitgraph documentation for more information.

  12. Cleaning Data in SQL Portfolio Project

    • kaggle.com
    zip
    Updated Apr 19, 2023
    Cite
    Austin Kennell (2023). Cleaning Data in SQL Portfolio Project [Dataset]. https://www.kaggle.com/austinkennell/cleaning-data-in-sql-portfolio-project
    Explore at:
    zip (6054868 bytes)
    Dataset updated
    Apr 19, 2023
    Authors
    Austin Kennell
    Description

    The dataset contained information on housing data in the Nashville, TN area. I used SQL Server to clean the data to make it easier to use. For example, I converted some dates to remove unnecessary timestamps; I populated data for null values; I changed address columns from containing all of the address, city and state into separate columns; I changed a column that had different representations of the same data into consistent usage; I removed duplicate rows; and I deleted unused columns.
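
    A hedged T-SQL sketch of the date conversion described here, assuming the same NashvilleHousing table and a SaleDate column stored with an unnecessary time component:

      -- Add a DATE column and populate it with the timestamp-free sale date.
      -- Table/column names are assumptions for illustration only.
      ALTER TABLE NashvilleHousing ADD SaleDateConverted DATE;

      UPDATE NashvilleHousing
      SET SaleDateConverted = CONVERT(DATE, SaleDate);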

  13. Data cleaning and analysis SQL code

    • kaggle.com
    zip
    Updated Jun 21, 2023
    Cite
    jccs95 (2023). Data cleaning and analysis SQL code [Dataset]. https://www.kaggle.com/datasets/jccs95/data-cleaning-and-analysis-sql-code
    Explore at:
    zip (2728 bytes)
    Dataset updated
    Jun 21, 2023
    Authors
    jccs95
    Description

    Dataset

    This dataset was created by jccs95

    Contents

  14. Street Sweeping Schedule - 2024

    • splitgraph.com
    • data.cityofchicago.org
    • +2more
    Updated Mar 29, 2024
    Cite
    City of Chicago (2024). Street Sweeping Schedule - 2024 [Dataset]. https://www.splitgraph.com/cityofchicago/street-sweeping-schedule-2024-3q8d-2t69
    Explore at:
    application/vnd.splitgraph.image, application/openapi+json, json
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    City of Chicago
    Description

    Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/ytfi-mzdz. For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.

    Corrections are possible during the course of the sweeping season.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

    See the Splitgraph documentation for more information.

  15. Street Sweeping Zones - 2023

    • splitgraph.com
    • data.cityofchicago.org
    • +1more
    Updated Mar 31, 2023
    Cite
    City of Chicago (2023). Street Sweeping Zones - 2023 [Dataset]. https://www.splitgraph.com/cityofchicago/street-sweeping-zones-2023-6c59-kupn
    Explore at:
    json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Mar 31, 2023
    Dataset authored and provided by
    City of Chicago
    Description

    Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/3dx4-5j8t.

    For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.

    This dataset is in a format for spatial datasets that is inherently tabular but allows for a map as a derived view. Please click the indicated link below for such a map.

    To export the data in either tabular or geographic format, please use the Export button on this dataset.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

    See the Splitgraph documentation for more information.

  16. University of Cape Town Student Admissions Data 2006-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jul 28, 2020
    Cite
    UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/556
    Explore at:
    Dataset updated
    Jul 28, 2020
    Dataset authored and provided by
    UCT Student Administration
    Time period covered
    2006 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst the data were ill-suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data and the data is only made available to accredited researchers through DataFirst's Secure Data Service.

    The dataset was separated into the following data files:

    1. Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.
    2. Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.
    3. Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific and the data allows the user to distinguish between, for example, higher grade accounting and standard grade accounting. It also allows the user to identify the educational authority issuing the qualification e.g. Cambridge Internal Examinations (CIE) versus National Senior Certificate (NSC).
    4. Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.

    Analysis unit

    Applications, individuals

    Kind of data

    Administrative records [adm]

    Mode of data collection

    Other [oth]

    Cleaning operations

    The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services. The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.

  17. Texas Commission on Environmental Quality - Historical Dry Cleaner Registrations

    • splitgraph.com
    • data.texas.gov
    • +2more
    Updated Oct 15, 2024
    Cite
    Office of Waste (2024). Texas Commission on Environmental Quality - Historical Dry Cleaner Registrations [Dataset]. https://www.splitgraph.com/texas-gov/texas-commission-on-environmental-quality-xcc6-2a52
    Explore at:
    application/openapi+json, application/vnd.splitgraph.image, json
    Dataset updated
    Oct 15, 2024
    Dataset authored and provided by
    Office of Waste
    Description

    This dataset contains all historical Dry Cleaner Registrations in Texas. Note that most registrations listed are expired and are from previous years.

    View operating dry cleaners with current and valid (unexpired) registration certificates here: https://data.texas.gov/dataset/Texas-Commission-on-Environmental-Quality-Current-/qfph-9bnd/

    State law requires dry cleaning facilities and drop stations to register with TCEQ. Dry cleaning facilities and drop stations must renew their registration by August 1st of each year. The Dry Cleaners Registrations reflect self-reported registration information about whether a dry cleaning location is a facility or drop station, and whether they have opted out of the Dry Cleaning Environmental Remediation Fund. Distributors can find out whether to collect solvent fees from each registered facility as well as the registration status and delivery certificate expiration date of a location.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

    See the Splitgraph documentation for more information.

  18. UK Power Networks Grid Substation Distribution Areas

    • ukpowernetworks.opendatasoft.com
    Updated Mar 31, 2025
    Cite
    (2025). UK Power Networks Grid Substation Distribution Areas [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-grid-postcode-area/
    Explore at:
    Dataset updated
    Mar 31, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas, based on MPAN data.

    Methodology

    Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.

    Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites.

    Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.

    Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. Any Primaries not fed from a Grid Site are marked as NULL and labeled.

    Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.

    Quality Control Statement

    Quality Control Measures include:

    Verification steps to match features only with confirmed functional locations.
    Manual review and correction of data inconsistencies.
    Use of additional verification steps to ensure accuracy in the methodology.
    Regular updates and reviews documented in the version history.

    Assurance Statement

    The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.

    Other

    Download dataset information: Metadata (JSON)

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/

  19. National Household Income and Expenditure Survey 2009-2010 - Namibia

    • microdata.nsanamibia.com
    Updated Aug 5, 2024
    Cite
    Namibia Statistics Agency (2024). National Household Income and Expenditure Survey 2009-2010 - Namibia [Dataset]. https://microdata.nsanamibia.com/index.php/catalog/6
    Explore at:
    Dataset updated
    Aug 5, 2024
    Dataset authored and provided by
    Namibia Statistics Agency: https://nsa.org.na/
    Time period covered
    2009 - 2010
    Area covered
    Namibia
    Description

    Abstract

    The Household Income and Expenditure Survey is a survey collecting data on income, consumption and expenditure patterns of households, in accordance with methodological principles of statistical enquiries, which are linked to demographic and socio-economic characteristics of households. A Household Income and expenditure Survey is the sole source of information on expenditure, consumption and income patterns of households, which is used to calculate poverty and income distribution indicators. It also serves as a statistical infrastructure for the compilation of the national basket of goods used to measure changes in price levels. Furthermore, it is used for updating of the national accounts.

    The main objective of the NHIES 2009/2010 is to comprehensively describe the levels of living of Namibians using actual patterns of consumption and income, as well as a range of other socio-economic indicators based on collected data. This survey was designed to inform policy making at the international, national and regional levels within the context of the Fourth National Development Plan, in support of monitoring and evaluation of Vision 2030 and the Millennium Development Goals. The NHIES was designed to provide policy decision making with reliable estimates at regional levels as well as to meet rural - urban disaggregation requirements.

    Geographic coverage

    National Coverage

    Analysis unit

    Individuals and Households

    Universe

    Every week of the four-week period of a survey round, all persons in the household were asked if they spent at least 4 nights of the week in the household. Any person who spent at least 4 nights in the household was taken as having spent the whole week in the household. To qualify as a household member a person must have stayed in the household for at least two weeks out of four weeks.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The targeted population of NHIES 2009/2010 was the private households of Namibia. The population living in institutions, such as hospitals, hostels, police barracks and prisons, was not covered in the survey. However, private households residing within institutional settings were covered. The sample design for the survey was a stratified two-stage probability sample, where the first stage units were geographical areas designated as the Primary Sampling Units (PSUs) and the second stage units were the households. The PSUs were based on the 2001 Census EAs and the list of PSUs serves as the national sample frame. The urban part of the sample frame was updated to include the changes that take place due to rural to urban migration and the new developments in housing. The sample frame is stratified first by region followed by urban and rural areas within region. In urban areas further stratification is carried out by level of living which is based on geographic location and housing characteristics. The first stage units were selected from the sampling frame of PSUs and the second stage units were selected from a current list of households within each selected PSU, which was compiled just before the interviews.

    PSUs were selected using probability proportional to size sampling coupled with the systematic sampling procedure where the size measure was the number of households within the PSU in the 2001 Population and Housing Census. The households were selected from the current list of households using systematic sampling procedure.

    The sample size was designed to achieve reliable estimates at the region level and for urban and rural areas within each region. However the actual sample sizes in urban or rural areas within some of the regions may not satisfy the expected precision levels for certain characteristics. The final sample consists of 10 660 households in 533 PSUs. The selected PSUs were randomly allocated to the 13 survey rounds.

    Sampling deviation

    All the expected sample of 533 PSUs was covered. However a number of originally selected PSUs had to be substituted by new ones due to the following reasons.

    Urban areas: Movement of people for resettlement in informal settlement areas from one place to another caused a selected PSU to be empty of households.

    Rural areas: In addition to Caprivi region (where one constituency is generally flooded every year), Ohangwena and Oshana regions were badly affected by an unusual flood situation. Although this situation was generally addressed by interchanging the PSUs between survey rounds, some PSUs were still under water close to the end of the survey period. There were five empty PSUs in the urban areas of Hardap (1), Karas (3) and Omaheke (1) regions. Since these PSUs were found in the low strata within the urban areas of the relevant regions, the substituting PSUs were selected from the same strata. The PSUs under water were also five in rural areas of Caprivi (1), Ohangwena (2) and Oshana (2) regions. Wherever possible the substituting PSUs were selected from the same constituency where the original PSU was selected. If not, the selection was carried out from the rural stratum of the particular region. One sampled PSU in the urban area of Khomas region (Windhoek city) had grown so large that it had to be split into 7 PSUs. This was incorporated into the geographical information system (GIS) and one PSU out of the seven was selected for the survey. In one PSU in Erongo region only fourteen households were listed and one in Omusati region listed only eleven households. All these households were interviewed and no additional selection was done to cover for the loss in sample.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The instruments for data collection were as in the previous survey the questionnaires and manuals. Form I questionnaire collected demographic and socio-economic information of household members, such as: sex, age, education, employment status among others. It also collected information on household possessions like animals, land, housing, household goods, utilities, household income and expenditure, etc.

    Form II or the Daily Record Book is a diary for recording daily household transactions. A book was administered to each sample household each week for four consecutive weeks (survey round). Households were asked to record transactions, item by item, for all expenditures and receipts, including incomes and gifts received or given out. Own produce items were also recorded. Prices of items from different outlets were also collected in both rural and urban areas. The price collection was needed to supplement information from areas where price collection for consumer price indices (CPI) does not currently take place.

    Cleaning operations

    The questionnaires received from the regions were registered and counterchecked at the survey head office. The data processing team consisted of Systems administrator, IT technician, Programmers, Statisticians and Data typists.

    Data capturing

    The data capturing process was undertaken in the following ways: Form 1 was scanned, interpreted and verified using the “Scan”, “Interpret” & “Verify” modules of the Eyes & Hands software respectively. Some basic checks were carried out to ensure that each PSU was valid and every household was unique. Invalid characters were removed. The scanned and verified data was converted into text files using the “Transfer” module of the Eyes & Hands. Finally, the data was transferred to a SQL database for further processing, using the “TranScan” application. The Daily Record Books (DRB or form 2) were manually entered after the scanned data had been transferred to the SQL database. The reason was to ensure that all DRBs were linked to the correct Form 1, i.e. each household’s Form 1 was linked to the corresponding Daily Record Book. In total, 10 645 questionnaires (Form 1), comprising around 500 questions each, were scanned and close to one million transactions from the Form 2 (DRBs) were manually captured.

    Response rate

    Household response rate: Total number of responding households and non-responding households and the reason for non-response are shown below. Non-contacts and incomplete forms, which were rejected due to a lot of missing data in the questionnaire, at 3.4 and 4.0 percent, respectively, formed the largest part of non-response. At the regional level Erongo, Khomas, and Kunene reported the lowest response rate and Caprivi and Kavango the highest. See page 17 of the report for a detailed breakdown of response rates by region.

    Data appraisal

    To be able to compare with the previous survey in 2003/2004 and to follow up the development of the country, methodology and definitions were kept the same. Comparisons between the surveys can be found in the different chapters in this report. Experiences from the previous survey gave valuable input to this one and the data collection was improved to avoid earlier experienced errors. Also, some additional questions in the questionnaire helped to confirm the accuracy of reported data. During the data cleaning process it turned out that some households had difficulty separating their household consumption from their business consumption when recording their daily transactions in the DRB. This was particularly applicable to guest farms, the number of which has shown a big increase during the past five years. All households with extremely high consumption were examined manually and business transactions were recorded and separated from private consumption.

  20. Housing - SQL Project

    • kaggle.com
    Updated Jun 13, 2023
    Cite
    Ann Truong (2023). Housing - SQL Project [Dataset]. https://www.kaggle.com/datasets/bvanntruong/housing-sql-project
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Ann Truong
    Description

    This dataset contains information about housing sales in Nashville, TN such as property, owner, sales, and tax information. The SQL queries I created for Data Cleaning can be found here.
