This dataset was created by Deepali Sukhdeve
The dataset contains housing data for the Nashville, TN area. I used SQL Server to clean the data and make it easier to use: I converted dates to remove unnecessary timestamps, populated missing property addresses, split a single address field into separate address, city, and state columns, standardized a column that recorded the same values inconsistently, removed duplicate rows, and deleted unused columns.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Data Cleaning from Public Nashville Housing Data:
Standardize the Date Format
Populate Property Address data
Breaking out Addresses into Individual Columns (Address, City, State)
Change Y and N to Yes and No in the "Sold as Vacant" field
Remove Duplicates
Delete Unused Columns
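A minimal T-SQL sketch of these steps follows. The table and column names (NashvilleHousing, SaleDate, PropertyAddress, ParcelID, SoldAsVacant, UniqueID) are assumptions based on the description above, not taken from the dataset itself.

-- Sketch only: table and column names are assumed from the cleaning steps listed above.

-- 1. Standardize the date format (drop the time component)
ALTER TABLE NashvilleHousing ADD SaleDateConverted DATE;
GO
UPDATE NashvilleHousing SET SaleDateConverted = CONVERT(DATE, SaleDate);

-- 2. Populate missing property addresses from other rows with the same ParcelID
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
  ON a.ParcelID = b.ParcelID AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL;

-- 3. Break the address into individual columns
--    (a state component, if present, can be split the same way)
ALTER TABLE NashvilleHousing
    ADD PropertySplitAddress NVARCHAR(255), PropertySplitCity NVARCHAR(255);
GO
UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
    PropertySplitCity    = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress));

-- 4. Change Y and N to Yes and No in the "Sold as Vacant" field
UPDATE NashvilleHousing
SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                        WHEN SoldAsVacant = 'N' THEN 'No'
                        ELSE SoldAsVacant END;

-- 5. Remove duplicates (keep one row per repeated sale record)
WITH RowNumCTE AS (
    SELECT UniqueID,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate
               ORDER BY UniqueID) AS row_num
    FROM NashvilleHousing
)
DELETE FROM NashvilleHousing
WHERE UniqueID IN (SELECT UniqueID FROM RowNumCTE WHERE row_num > 1);

-- 6. Delete unused columns (the raw columns replaced by the converted/split versions)
ALTER TABLE NashvilleHousing DROP COLUMN PropertyAddress, SaleDate;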
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Known quirks: the UserId column in the ForumMessages table has values that do not exist in the Users table; some flag columns contain only the values "True" or "False"; and the Total columns are not always consistent with the detail tables (for example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table).
To load the data, I create the tables with the db_abd_create_tables.sql script and clean the CSVs with the clean_data.py script, which performs a series of cleaning steps for each table (for example, replacing missing values with NULL). I then add foreign keys with the add_foreign_keys.sql script and update the Total columns in the database tables by running the update_totals.sql script.
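The update_totals.sql script itself is not reproduced here. As a hedged illustration only, assuming the DatasetCount column on the Tags table is meant to agree with the DatasetTags detail table, the recomputation could look like:

-- Illustration only, not the actual contents of update_totals.sql.
UPDATE Tags
SET DatasetCount = (
    SELECT COUNT(*)
    FROM DatasetTags
    WHERE DatasetTags.TagId = Tags.Id
);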
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:
1. Customer Information (s_crm_cust_info)
This table contains information about customers, including their unique identifiers and demographic details.
Columns:
cst_id: Customer ID (Primary Key)
cst_gndr: Gender
cst_marital_status: Marital status
cst_create_date: Customer account creation date
Cleaning Steps:
Removed duplicates and handled missing or null cst_id values.
Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
Standardized gender values and identified inconsistencies in marital status.
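A minimal SQL sketch of these steps, using the s_crm_cust_info name from the heading above; the standardized output values are assumptions, not taken from the data:

-- Sketch only: keep the latest row per customer, trim text fields, standardize gender.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS rn
    FROM s_crm_cust_info
    WHERE cst_id IS NOT NULL
)
SELECT cst_id,
       CASE UPPER(TRIM(cst_gndr)) WHEN 'M' THEN 'Male'
                                  WHEN 'F' THEN 'Female'
                                  ELSE 'n/a' END AS cst_gndr,
       TRIM(cst_marital_status) AS cst_marital_status,
       cst_create_date
FROM ranked
WHERE rn = 1;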
2. Product Information
This table contains information about products, including product identifiers, names, costs, and lifecycle dates.
Columns:
prd_id: Product ID
prd_key: Product key
prd_nm: Product name
prd_cost: Product cost
prd_start_dt: Product start date
prd_end_dt: Product end date
Cleaning Steps:
Checked for duplicates and null values in the prd_key column.
Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
Corrected product costs to remove invalid entries (e.g., negative values).
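A minimal SQL sketch of these checks; the table name (s_crm_prd_info, inferred from the prd_ column prefix) and the correction applied to invalid costs are assumptions:

-- Sketch only: flag inverted date ranges and correct invalid costs.
SELECT prd_id, prd_key, prd_start_dt, prd_end_dt
FROM s_crm_prd_info
WHERE prd_end_dt IS NOT NULL
  AND prd_start_dt >= prd_end_dt;

UPDATE s_crm_prd_info
SET prd_cost = 0
WHERE prd_cost IS NULL OR prd_cost < 0;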
3. Sales Transactions
This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.
Columns:
sls_order_dt: Sales order date
sls_due_dt: Sales due date
sls_sales: Total sales amount
sls_quantity: Number of products sold
sls_price: Product unit price
Cleaning Steps:
Validated sales order dates and corrected invalid entries.
Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
Removed null and negative values from sls_sales, sls_quantity, and sls_price.
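A minimal SQL sketch of the integrity check; the table name (s_crm_sales_details, inferred from the sls_ column prefix) and the correction rules are assumptions:

-- Sketch only: recompute the sales amount where it disagrees with price * quantity.
UPDATE s_crm_sales_details
SET sls_sales = sls_quantity * ABS(sls_price)
WHERE sls_sales IS NULL
   OR sls_sales <= 0
   OR sls_sales <> sls_quantity * ABS(sls_price);

-- Derive a missing or invalid unit price from the corrected sales amount.
UPDATE s_crm_sales_details
SET sls_price = sls_sales / NULLIF(sls_quantity, 0)
WHERE sls_price IS NULL OR sls_price <= 0;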
4. Customer Demographics (ERP)
This table contains additional customer demographic data, including gender and birthdate.
Columns:
cid: Customer ID
gen: Gender
bdate: Birthdate
Cleaning Steps:
Checked for missing or null gender values and standardized inconsistent entries.
Removed leading/trailing spaces from gen and bdate.
Validated birthdates to ensure they were within a realistic range.
5. Customer Locations (ERP)
This table contains country information related to the customers' locations.
Columns:
cntry: Country
Cleaning Steps:
Standardized country names (e.g., "US" and "USA" were mapped to "United States").
Removed special characters (e.g., carriage returns) and trimmed whitespace.
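A minimal SQL sketch of this standardization; the table name (s_erp_loc) is an assumption, and only the US/USA mapping is taken from the description:

-- Sketch only: strip carriage returns, trim whitespace, and map country variants.
UPDATE s_erp_loc
SET cntry = CASE
        WHEN TRIM(REPLACE(cntry, CHAR(13), '')) IN ('US', 'USA') THEN 'United States'
        ELSE TRIM(REPLACE(cntry, CHAR(13), ''))
    END;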
6. Product Categories
This table contains product category information.
Columns:
Product category data (no significant cleaning required).
Key Features:
Customer demographics, including gender and marital status
Product details such as cost, start date, and end date
Sales data with order dates, quantities, and sales amounts
ERP-specific customer and location data
Data Cleaning Process:
This dataset underwent extensive cleaning and validation, including:
Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
Date Validations: Ensuring correct date ranges and chronological consistency.
Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RbSQLi dataset has been developed to support advanced research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,190,450 structured entries, out of which 2,699,570 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (398,070 samples), Stackqueries-based (223,800 samples), Time-based (564,900 samples), Meta-based (481,280 samples), Boolean-based (207,900 samples), and Error-based (823,620 samples).
The malicious payloads for Union-based, Time-based, and Error-based injection types were sourced directly from the widely used and reputable open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). Moreover, ChatGPT was employed to generate additional payloads for Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach ensures that the dataset reflects both known attack patterns and intelligently simulated variants, contributing to a broader representation of SQLi techniques. Additionally, some queries in the SQLi dataset are syntactically invalid yet contain malicious payloads, enabling models to detect SQL injection attempts even when attackers submit improperly formed queries. This highlights the importance of training models to recognize semantic intent rather than relying solely on syntactic correctness.
All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.
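The actual rule set used for labeling is not included in this description. As a hedged illustration of what a rule-based labeling pass could look like, the table name (payloads) and columns (payload, attack_type) below are hypothetical, and the keyword rules are simplified examples rather than the authors' rules:

-- Illustration only: simplified keyword rules over a hypothetical staging table.
UPDATE payloads
SET attack_type = CASE
        WHEN UPPER(payload) LIKE '%UNION%SELECT%' THEN 'Union-based'
        WHEN UPPER(payload) LIKE '%;%SELECT%' OR UPPER(payload) LIKE '%;%DROP%' THEN 'Stackqueries-based'
        WHEN UPPER(payload) LIKE '%SLEEP(%' OR UPPER(payload) LIKE '%WAITFOR DELAY%' THEN 'Time-based'
        WHEN UPPER(payload) LIKE '%INFORMATION_SCHEMA%' THEN 'Meta-based'
        WHEN UPPER(payload) LIKE '%OR 1=1%' THEN 'Boolean-based'
        WHEN UPPER(payload) LIKE '%EXTRACTVALUE(%' OR UPPER(payload) LIKE '%UPDATEXML(%' THEN 'Error-based'
        ELSE 'benign'
    END;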
This dataset is intended to facilitate progress in the development of more accurate and generalizable SQL injection detection systems and to serve as a reliable benchmark for the broader security and machine learning communities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server, then cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
To help Maryland homeowners invest in clean energy, the Maryland Energy Administration provides grants for clean burning wood stoves that displace electric, non-natural gas fossil fuel heating systems or old wood stoves.
More information is available on the program's website at: http://energy.maryland.gov/Residential/woodstoves/
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
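(A hedged illustration only: the table and column names below are hypothetical, not the dataset's actual schema, which is listed on its Splitgraph page.)

-- Hypothetical table and column names, for illustration only.
SELECT county, COUNT(*) AS grants_awarded
FROM wood_stove_grants
GROUP BY county
ORDER BY grants_awarded DESC;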
See the Splitgraph documentation for more information.
This data represents all water used at Pierce County owned Washington Clean Buildings Act (WCBA) Tier 1 and 2 buildings, which includes all County buildings that are larger than 20,000 square feet.
All water usage data is collected from utility bills.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas, based on MPAN data.
Methodology
Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.
Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites (a rough SQL analogue of this assignment is sketched after this list).
Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.
Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. Any Primaries not fed from a Grid Site are marked as NULL and labeled.
Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.
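The methodology above is implemented in FME. As a rough SQL analogue of the Data Filtering and Assignment step (the table and column names mpan_data, postcode, and primary_site are assumptions), each postcode is assigned to the Primary that feeds the most of its MPANs:

-- Rough SQL analogue of the FME assignment step; names are assumptions.
WITH counts AS (
    SELECT postcode,
           primary_site,
           COUNT(*) AS mpan_count,
           ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY COUNT(*) DESC) AS rn
    FROM mpan_data
    GROUP BY postcode, primary_site
)
SELECT postcode, primary_site, mpan_count
FROM counts
WHERE rn = 1;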
Quality Control Statement
Quality Control Measures include:
Verification steps to match features only with confirmed functional locations
Manual review and correction of data inconsistencies
Use of additional verification steps to ensure accuracy in the methodology
Regular updates and reviews documented in the version history
Assurance Statement The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
Port terminals within the AB617 "Path to Clean Air" emissions inventory domain. Locations of port terminals are based on bulk vessel call data from the Marine Exchange of San Francisco.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Source URL: https://www.actransit.org/data-api-resource-center
Where dataset is from: AC Transit, Golden Gate Bus Service (digitized by BAAQMD), Westcat Bus Service (digitized by BAAQMD)
When obtained: 1/13/2022
For what purpose it was obtained: General AB 617 Planning
Additional Information: AC Transit work is licensed under a Creative Commons Attribution 3.0 Unported License
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been developed to support advanced research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,304,026 structured entries, out of which 2,813,146 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (758,600 samples), Stackqueries-based (746,480 samples), Time-based (531,580 samples), Meta-based (481,280 samples), Boolean-based (226,080 samples), and Error-based (69,126 samples).
The malicious payloads for Union-based, Time-based, and Error-based injection types were sourced directly from the widely used and reputable open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). Moreover, ChatGPT was employed to generate additional payloads for Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach ensures that the dataset reflects both known attack patterns and intelligently simulated variants, contributing to a broader representation of SQLi techniques.
All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.
This dataset is intended to facilitate progress in the development of more accurate and generalizable SQL injection detection systems and to serve as a reliable benchmark for the broader security and machine learning communities.
https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Request access to the larger dataset here: Medium datasets
Check out a sample of the dataset in CSV
Training language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines
Affordable and ready-to-use
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The population of Metro Vancouver (20110729Regional Growth Strategy Projections Population, Housing and Employment 2006 – 2041 File) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015_ Water Consumption_ Statistics File) will be essential. Drinking water supply therefore needs to be estimated and optimized (Data Mining file) to support the development of the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report File) determines the amount of drinking water allocated from each reservoir to each region. The B.C. government has a specific strategy for the growing population to 2040 that guides this work. In addition, a new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible groundwater supply to 2040, so the government will have to decide how much groundwater to use. The project has two goals: (1) an optimization model for the three water reservoirs, and (2) an estimate of the new water source to 2040. The data analysis process uses six tools: Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, SQL, and Tableau.
1. Trifacta Wrangler: clean the data (Data Mining file).
2. AMPL and Excel Solver: optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
3. ArcMap (ArcGIS): combine the raw data with the results of the reservoir optimization and the population estimate to 2040 (GIS Map for Tableau file).
4. Tableau with SQL: visualize, estimate, and optimize the source of drinking water for Metro Vancouver to 2040 (export tableau data file).
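The LP formulation itself is not reproduced in this description. A generic, hedged sketch of the kind of allocation model described, with placeholder symbols rather than the project's actual parameters (x_{rz} is the volume of treated water supplied from reservoir r to region z, C_r the capacity of reservoir r, and D_z the demand of region z), is:

\begin{align*}
\text{maximize}   \quad & \sum_{r \in R} \sum_{z \in Z} x_{rz} \\
\text{subject to} \quad & \sum_{z \in Z} x_{rz} \le C_r \quad \forall r \in R \quad \text{(reservoir capacity)} \\
                        & \sum_{r \in R} x_{rz} \ge D_z \quad \forall z \in Z \quad \text{(regional demand)} \\
                        & x_{rz} \ge 0 \quad \forall r \in R,\; z \in Z
\end{align*}

where R = {Capilano, Seymour, Coquitlam} and Z is the set of supply regions.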
This dataset was created by Peter Fonteneau
A list of permitted facilities in the Richmond/San Pablo "Path to Clean Air" community, derived from our larger planning inventory. This list includes location information.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
This dataset is based on digitized census data from census-designated places and cities. Jurisdictions included within the Richmond/San Pablo jurisdictional boundary are:
City of Richmond: the full city is included, with the exception of the following small, disconnected areas located on either side of Pinole Valley Park, to the east of the Richmond/San Pablo Boundary:
Greenridge Heights, May Valley, El Sobrante Hills, Greenbriar, Carriage Hills North, Castro Heights, and Carriage Hills south.
City of San Pablo: the full city;
City of Pinole: a small portion of the city (located in the northeast corner of the CERP Boundary);
Unincorporated Contra Costa County:
North Richmond: the full unincorporated place;
Tara Hills: the full unincorporated place;
Montalvin Manor: the full unincorporated place;
Bayview: the full unincorporated place;
East Richmond Heights: most of the unincorporated place is included, other than a small area within a census tract that included El Cerrito (Census Tract #06013384000);
Rollingwood: the full unincorporated place; and
El Sobrante: less than half of the unincorporated place.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility, and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages: click==8.0.3 cycler==0.10.0 gender-guesser==0.4.0 kiwisolver==1.3.2 matplotlib==3.4.3 numpy==1.21.3 pandas==1.3.4 patsy==0.5.2 Pillow==8.4.0 pyparsing==2.4.7 python-dateutil==2.8.2 pytz==2021.3 scipy==1.7.1 six==1.16.0 statsmodels==0.13.0
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
Run the gender guessing script to create author-fullnames-gender.csv.zst sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
Database creation and data ingestion
Create the PostgreSQL DB sh> createdb gender-commit Note that from now on, commands shown at the psql> prompt are assumed to run in psql connected to the gender-commit database.
Import data into PostgreSQL DB sh> ./import_data.sh
Zone detection
Extract commits data from the DB and create commits.tab, that is used as input for the gender detection script sh> psql -f extract_commits.sql gender-commit
Run the world zone detection script to create commit_zones.tab.zst sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Read zones assignment data from the file into the DB psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
Run the script to execute the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...). sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf. sh> ./create_charts.sh
Additional graphs
This package also includes some already-made graphs:
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period