This dataset was created by Deepali Sukhdeve
The dataset contains housing data for the Nashville, TN area. I used SQL Server to clean the data and make it easier to use. For example, I converted dates to remove unnecessary timestamps, populated null values, split address columns that combined street address, city, and state into separate columns, standardized a column that represented the same data in different ways, removed duplicate rows, and deleted unused columns.
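As a minimal T-SQL sketch of the date conversion step, assuming the table is named NashvilleHousing and the date column is SaleDate (names not confirmed by the description above):

-- Assumed table and column names; adjust to the actual schema.
ALTER TABLE NashvilleHousing ADD SaleDateConverted date;
GO

-- Populate the new column with the date portion only (timestamp removed).
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(date, SaleDate);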
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Overview: This project demonstrates a thorough data cleaning process for the Nashville Housing dataset using SQL. The script performs various data cleaning and transformation operations to improve the quality and usability of the data for further analysis.
Technologies Used: SQL Server (T-SQL)
Dataset: The project uses the Nashville Housing dataset, which contains information about property sales in Nashville, Tennessee. The original dataset includes various fields such as property addresses, sale dates, sale prices, and other relevant real estate information.
Data Cleaning Operations: The script performs the following data cleaning operations:
Date Standardization: Converts the SaleDate column to a standard Date format for consistency and easier manipulation.
Populating Missing Property Addresses: Fills in NULL values in the PropertyAddress field using data from other records with the same ParcelID.
Breaking Down Address Components: Separates the PropertyAddress and OwnerAddress fields into individual columns for Address, City, and State, improving data granularity and queryability.
Standardizing Values: Converts 'Y' and 'N' values to 'Yes' and 'No' in the SoldAsVacant field for clarity and consistency.
Removing Duplicates: Identifies and removes duplicate records based on specific criteria to ensure data integrity.
Dropping Unused Columns: Removes unnecessary columns to streamline the dataset.
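A minimal T-SQL sketch of the "Populating Missing Property Addresses" operation above; the table name NashvilleHousing and the UniqueID column are assumptions, not taken verbatim from the project script.

-- Fill NULL PropertyAddress values from another row that shares the same ParcelID.
-- Table name and UniqueID column are assumptions.
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
    ON a.ParcelID = b.ParcelID
   AND a.[UniqueID] <> b.[UniqueID]
WHERE a.PropertyAddress IS NULL;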
Key SQL Techniques Demonstrated:
Data type conversion
Self joins for data population
String manipulation (SUBSTRING, CHARINDEX, PARSENAME)
CASE statements
Window functions (ROW_NUMBER)
Common Table Expressions (CTEs)
Data deletion
Table alterations (adding and dropping columns)
Important Notes:
The script includes cautionary comments about data deletion and column dropping, emphasizing the importance of careful consideration in a production environment. This project showcases various SQL data cleaning techniques and can serve as a template for similar data cleaning tasks.
Potential Improvements:
Implement error handling and transaction management for more robust execution.
Add data validation steps to ensure the cleaned data meets specific criteria.
Consider creating indexes on frequently queried columns for performance optimization.
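A minimal T-SQL sketch of the first suggested improvement (wrapping a cleaning step in a transaction with error handling); the table and column names are assumptions carried over from the operations above, and this is illustrative rather than part of the existing script.

-- Wrap a cleaning step in a transaction so a failure leaves the data untouched.
BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE NashvilleHousing            -- table name assumed, as above
    SET SoldAsVacant = CASE SoldAsVacant WHEN 'Y' THEN 'Yes'
                                         WHEN 'N' THEN 'No'
                                         ELSE SoldAsVacant END;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;  -- re-raise the original error for the operator
END CATCH;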
Using SQL, I was able to clean up the data so that it is easier to analyze. I used JOINs, SUBSTRING, PARSENAME, UPDATE/ALTER TABLE statements, CTEs, CASE statements, and ROW_NUMBER, and learned many different ways to clean data.
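For example, PARSENAME can split a comma-delimited address into parts; the table and column names below are assumptions in the spirit of the script described here.

-- PARSENAME works on '.'-delimited strings, so replace commas with periods first.
-- Table and column names are assumptions.
SELECT
    PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3) AS OwnerSplitAddress,
    PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2) AS OwnerSplitCity,
    PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1) AS OwnerSplitState
FROM NashvilleHousing;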
https://creativecommons.org/publicdomain/zero/1.0/
Data Cleaning from Public Nashville Housing Data:
Standardize the Date Format
Populate Property Address data
Breaking out Addresses into Individual Columns (Address, City, State)
Change Y and N to Yes and No in the "Sold as Vacant" field
Remove Duplicates
Delete Unused Columns
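A minimal T-SQL sketch of the duplicate-removal step above, using a CTE and ROW_NUMBER; the table name and the columns chosen to define a duplicate are assumptions, not taken from the original script.

-- Rows sharing the same key columns are treated as duplicates; keep only the first.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
               ORDER BY UniqueID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;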
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format
On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files; for BNC, the direct path to the *.gz file.

2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database
The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
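As an illustration of how the database can be queried directly, here is a minimal SQLite sketch; the table name ngrams_isograms is an assumption, since the actual table names are defined in create-database.sql.

-- Table name is an assumption; the real names are defined by create-database.sql.
-- Example: second-order isograms that are also palindromes, summarized by word length.
SELECT length,
       COUNT(*)     AS n_words,
       SUM("count") AS total_tokens
FROM ngrams_isograms
WHERE isogramy = 2
  AND is_palindrome = 1
GROUP BY length
ORDER BY length;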
https://choosealicense.com/licenses/other/
Dataset Summary
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs. For more, see the full description on the dataset page: https://huggingface.co/datasets/NumbersStation/NSText2SQL.
This dataset was created by George M122
Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/k737-xg34.
For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP.
The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
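As a purely illustrative sketch (the table and column names below are hypothetical; the real identifiers are listed on the dataset's Splitgraph page), a query might look like:

-- Hypothetical table and column names, for illustration only.
SELECT ward, COUNT(*) AS zones
FROM street_sweeping_zones
GROUP BY ward
ORDER BY ward;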
See the Splitgraph documentation for more information.
This dataset is a comprehensive collection of healthcare facility ratings across multiple countries. It includes detailed information on various attributes such as facility name, location, type, total beds, accreditation status, and annual visits of hospitals throughout the world. This cleaned dataset is ideal for conducting trend analysis, comparative studies between countries, or developing predictive models for facility ratings based on various factors. It offers a foundation for exploratory data analysis, machine learning modelling, and data visualization projects aimed at uncovering insights in the healthcare industry. The Project consists of the Original dataset, Data Cleaning Script and an EDA script in the data explorer tab for further analysis.
Street sweeping schedule by Ward and Ward section number. To find your Ward section, visit https://data.cityofchicago.org/d/ytfi-mzdz. For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.
Corrections are possible during the course of the sweeping season.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Street sweeping zones by Ward and Ward Section Number. For the corresponding schedule, see https://data.cityofchicago.org/d/3dx4-5j8t.
For more information about the City's Street Sweeping program, go to https://www.chicago.gov/city/en/depts/streets/provdrs/streetssan/svcs/streetsweeping.html.
This dataset is in a format for spatial datasets that is inherently tabular but allows for a map as a derived view. Please click the indicated link below for such a map.
To export the data in either tabular or geographic format, please use the Export button on this dataset.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas, based on MPAN data.
Methodology
Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.
Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites.
Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.
Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. Any Primaries not fed from a Grid Site are marked as NULL and labeled.
Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.
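The postcode-to-Primary assignment described in the methodology above can be expressed as a single query; the following is a minimal sketch only, and the table and column names (mpan_extract, postcode, primary_site) are assumptions rather than the actual internal schema.

-- Hypothetical schema: one row per MPAN with its postcode and feeding Primary Site.
-- Each postcode is assigned to the Primary that feeds the largest number of its MPANs.
WITH counts AS (
    SELECT postcode,
           primary_site,
           COUNT(*) AS mpan_count,
           ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY COUNT(*) DESC) AS rn
    FROM mpan_extract
    GROUP BY postcode, primary_site
)
SELECT postcode, primary_site AS assigned_primary, mpan_count
FROM counts
WHERE rn = 1;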
Quality Control Statement
Quality Control Measures include:
Verification steps to match features only with confirmed functional locations.
Manual review and correction of data inconsistencies.
Use of additional verification steps to ensure accuracy in the methodology.
Regular updates and reviews documented in the version history.
Assurance Statement
The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
This dataset contains all historical Dry Cleaner Registrations in Texas. Note that most registrations listed are expired and are from previous years.
View operating dry cleaners with current and valid (unexpired) registration certificates here: https://data.texas.gov/dataset/Texas-Commission-on-Environmental-Quality-Current-/qfph-9bnd/
State law requires dry cleaning facilities and drop stations to register with TCEQ. Dry cleaning facilities and drop stations must renew their registration by August 1st of each year. The Dry Cleaners Registrations reflect self-reported registration information about whether a dry cleaning location is a facility or drop station, and whether they have opted out of the Dry Cleaning Environmental Remediation Fund. Distributors can find out whether to collect solvent fees from each registered facility as well as the registration status and delivery certificate expiration date of a location.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Street sweeping zones by Ward and Ward Section Number. The zones are the same as those used in 2014. For the corresponding schedule, see https://data.cityofchicago.org/d/waad-z968. Because the City of Chicago ward map will change on May 18, 2015, this dataset will be supplemented with an additional dataset to cover the remainder of 2015 (through November).
For more information about the City's Street Sweeping program, go to http://bit.ly/H2PHUP. The data can be viewed on the Chicago Data Portal with a web browser. However, to view or use the files outside of a web browser, you will need to use compression software and special GIS software, such as ESRI ArcGIS (shapefile) or Google Earth (KML or KMZ).
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
https://creativecommons.org/publicdomain/zero/1.0/
I took this data from a Kaggle dataset and cleaned it myself in MySQL.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by christopher alverio
Released under MIT
The Annual Household Survey is conducted to meet the data requirements of the National Accounts of Lao PDR.
Sample survey data [ssd]
For details, please refer to the "AHS2004 Report - Final" manual (Lao version), page 2.
Face-to-face [f2f]
AHS 2004 has 2 forms:
(1) Household Survey - identification - household composition - labor force: labor force participation in the last seven days; overview of work in the last seven days - construction activities in the past 12 months - household businesses: establishing the existence of non-farm enterprises - agriculture: crops harvested during the last 12 months; fishery; forestry; livestock, poultry - households' purchases and sales of durables during the last 12 months - income and transfers
(2) Diary Expenditure and Consumption Household Survey - identification - households' diary sheet for household transactions
Data editing: - Office editing and coding - Microsoft Access used for data entry and checking - SQL Server used for the database - SPSS used for analysis
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a collection of SQL scripts and techniques developed by a business data analyst to assist with data optimization and cleaning tasks. The scripts cover a range of data management operations, including:
1) Data cleansing: Identifying and addressing issues such as missing values, duplicate records, formatting inconsistencies, and outliers.
2) Data normalization: Designing optimized database schemas and normalizing data structures to minimize redundancy and improve data integrity.
3) Data transformation and ETL: Developing efficient Extract, Transform, and Load (ETL) pipelines to integrate data from multiple sources and perform complex data transformations.
4) Reporting and dashboarding: Creating visually appealing and insightful reports, dashboards, and data visualizations to support informed decision-making.
The scripts and techniques in this dataset are tailored to the needs of business data analysts and can be used to enhance the quality, efficiency, and value of data-driven insights.
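As an illustration of the data cleansing operations described above (missing values, formatting inconsistencies, outliers), here is a minimal sketch; the table customer_records and its columns are hypothetical placeholders and are not part of the published scripts.

-- Hypothetical table and columns, used purely for illustration.
-- Standardize text formatting and flag rows with missing or outlying values.
SELECT
    id,
    UPPER(LTRIM(RTRIM(customer_name))) AS customer_name_clean,
    CASE WHEN order_total IS NULL THEN 'missing'
         WHEN order_total > (SELECT AVG(order_total) + 3 * STDEV(order_total)
                             FROM customer_records) THEN 'outlier'
         ELSE 'ok'
    END AS order_total_flag
FROM customer_records;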