70 datasets found
  1. Data Cleaning Portfolio Project

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    Deepali Sukhdeve (2024). Data Cleaning Portfolio Project [Dataset]. https://www.kaggle.com/datasets/deepalisukhdeve/data-cleaning-portfolio-project
    Explore at:
    Available download formats: zip (6053781 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    Deepali Sukhdeve
    Description

    Dataset

    This dataset was created by Deepali Sukhdeve


  2. Cleaning Data in SQL Portfolio Project

    • kaggle.com
    zip
    Updated Apr 19, 2023
    Cite
    Austin Kennell (2023). Cleaning Data in SQL Portfolio Project [Dataset]. https://www.kaggle.com/austinkennell/cleaning-data-in-sql-portfolio-project
    Explore at:
    Available download formats: zip (6054868 bytes)
    Dataset updated
    Apr 19, 2023
    Authors
    Austin Kennell
    Description

    The dataset contained information on housing data in the Nashville, TN area. I used SQL Server to clean the data to make it easier to use. For example, I converted some dates to remove unnecessary timestamps; I populated data for null values; I changed address columns from containing all of the address, city and state into separate columns; I changed a column that had different representations of the same data into consistent usage; I removed duplicate rows; and I deleted unused columns.
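
    As an illustration of the kind of SQL Server statements these steps involve, here is a minimal sketch of the date conversion and the null-address population; the table and column names (NashvilleHousing, SaleDate, PropertyAddress, ParcelID, UniqueID) are assumptions based on the description, not taken from the dataset itself.

    -- Sketch only: table and column names are assumed, not confirmed by the dataset.
    -- Drop the unnecessary timestamp by converting the sale date to a DATE.
    ALTER TABLE NashvilleHousing ADD SaleDateConverted DATE;
    UPDATE NashvilleHousing
    SET SaleDateConverted = CONVERT(DATE, SaleDate);

    -- Populate null property addresses from other rows that share the same parcel.
    UPDATE a
    SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
    FROM NashvilleHousing a
    JOIN NashvilleHousing b
      ON a.ParcelID = b.ParcelID
     AND a.UniqueID <> b.UniqueID
    WHERE a.PropertyAddress IS NULL;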

  3. SQL Data Cleaning Portfolio V2

    • kaggle.com
    zip
    Updated Jun 16, 2023
    Cite
    Mohammad Hurairah (2023). SQL Data Cleaning Portfolio V2 [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/sql-cleaning-portfolio-v2/discussion
    Explore at:
    Available download formats: zip (6054498 bytes)
    Dataset updated
    Jun 16, 2023
    Authors
    Mohammad Hurairah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data Cleaning from Public Nashville Housing Data:

    1. Standardize the Date Format

    2. Populate Property Address data

    3. Breaking out Addresses into Individual Columns (Address, City, State)

    4. Change Y and N to Yes and No in the "Sold as Vacant" field

    5. Remove Duplicates

    6. Delete Unused Columns
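
    A minimal SQL Server sketch of steps 3, 4, and 5 above; the table and column names (NashvilleHousing, PropertyAddress, SoldAsVacant, ParcelID, SaleDate, SalePrice, LegalReference, UniqueID) are assumptions for illustration, not taken from the dataset.

    -- Sketch only: names are assumed. Break the address into street and city parts.
    SELECT
      SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1) AS StreetAddress,
      LTRIM(SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress))) AS City
    FROM NashvilleHousing;

    -- Change Y and N to Yes and No in the "Sold as Vacant" field.
    UPDATE NashvilleHousing
    SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                            WHEN SoldAsVacant = 'N' THEN 'No'
                            ELSE SoldAsVacant END;

    -- Remove duplicates, keeping one row per assumed natural key.
    WITH RowNumCTE AS (
      SELECT *,
             ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate, SalePrice, LegalReference
               ORDER BY UniqueID) AS row_num
      FROM NashvilleHousing
    )
    DELETE FROM RowNumCTE WHERE row_num > 1;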

  4. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.


    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.
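
    For instance, the dangling-reference problem above can be surfaced with a query along these lines (a sketch; the Users primary-key column Id is an assumption, while ForumMessages, UserId, and Users come from the list above):

    -- Count ForumMessages rows whose UserId has no matching Users row (Id column assumed).
    SELECT COUNT(*) AS orphaned_messages
    FROM ForumMessages fm
    LEFT JOIN Users u ON u.Id = fm.UserId
    WHERE fm.UserId IS NOT NULL
      AND u.Id IS NULL;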

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
    • Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.
    • Downloading the CSV files from Kaggle using the Kaggle API.
    • Cleaning the data using pandas. I do that by running the clean_data.py script. The script does the following steps for each table:
      • Drops the columns that are not needed.
      • Converts each column to the right data type.
      • Replaces foreign keys that do not exist with NULL.
      • Replaces some of the missing values with default values.
      • Removes rows where there are missing values in the primary key/not null columns.
      • Removes duplicate rows.
    • Loading the data into the database using the LOAD DATA INFILE command.
    • Checking that the number of rows in the database tables is the same as the number of rows in the CSV files.
    • Adding foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.
    • Updating the Total columns in the database tables. I do that by running the update_totals.sql script.
    • Backing up the database.
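
    A minimal MySQL sketch of the loading and constraint steps described above; the file path is hypothetical and the exact table definitions are assumptions, not the author's scripts.

    -- Load one cleaned CSV into its table (path is hypothetical).
    LOAD DATA INFILE '/path/to/cleaned/Users.csv'
    INTO TABLE Users
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;

    -- Check the loaded row count against the CSV before adding constraints.
    SELECT COUNT(*) AS loaded_rows FROM Users;

    -- Add a foreign key only after cleaning, so dangling references fail loudly.
    ALTER TABLE ForumMessages
      ADD CONSTRAINT fk_forummessages_user
      FOREIGN KEY (UserId) REFERENCES Users (Id);
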
  5. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
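
    To picture the kind of rewrite described above, here is a hypothetical (invented, not taken from the dataset) question pair with its unchanged SQL; the singer/net_worth schema is purely illustrative.

    -- Spider-style question:      "What is the name of the singer with the highest net worth?"
    -- Spider-Realistic rewrite:   "What is the name of the wealthiest singer?"  (explicit column mention removed)
    -- The SQL query stays the same in both cases:
    SELECT name FROM singer ORDER BY net_worth DESC LIMIT 1;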

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  6. Cleaned Retail Customer Dataset (SQL-based ETL)

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Rizwan Bin Akbar (2025). Cleaned Retail Customer Dataset (SQL-based ETL) [Dataset]. https://www.kaggle.com/datasets/rizwanbinakbar/cleaned-retail-customer-dataset-sql-based-etl
    Explore at:
    Available download formats: zip (1249509 bytes)
    Dataset updated
    May 3, 2025
    Authors
    Rizwan Bin Akbar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:

    1. Customer Information (s_crm_cust_info)

    This table contains information about customers, including their unique identifiers and demographic details.

    Columns:
    
      cst_id: Customer ID (Primary Key)
    
      cst_gndr: Gender
    
      cst_marital_status: Marital status
    
      cst_create_date: Customer account creation date
    
    Cleaning Steps:
    
      Removed duplicates and handled missing or null cst_id values.
    
      Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
    
      Standardized gender values and identified inconsistencies in marital status.
    
    2. Product Information (s_crm_prd_info / b_crm_prd_info)

    This table contains information about products, including product identifiers, names, costs, and lifecycle dates.

    Columns:
    
      prd_id: Product ID
    
      prd_key: Product key
    
      prd_nm: Product name
    
      prd_cost: Product cost
    
      prd_start_dt: Product start date
    
      prd_end_dt: Product end date
    
    Cleaning Steps:
    
      Checked for duplicates and null values in the prd_key column.
    
      Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
    
      Corrected product costs to remove invalid entries (e.g., negative values).
    
    3. Sales Details (s_crm_sales_details / b_crm_sales_details)

    This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.

    Columns:
    
      sls_order_dt: Sales order date
    
      sls_due_dt: Sales due date
    
      sls_sales: Total sales amount
    
      sls_quantity: Number of products sold
    
      sls_price: Product unit price
    
    Cleaning Steps:
    
      Validated sales order dates and corrected invalid entries.
    
      Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
    
      Removed null and negative values from sls_sales, sls_quantity, and sls_price.
    
    4. ERP Customer Data (b_erp_cust_az12, s_erp_cust_az12)

    This table contains additional customer demographic data, including gender and birthdate.

    Columns:
    
      cid: Customer ID
    
      gen: Gender
    
      bdate: Birthdate
    
    Cleaning Steps:
    
      Checked for missing or null gender values and standardized inconsistent entries.
    
      Removed leading/trailing spaces from gen and bdate.
    
      Validated birthdates to ensure they were within a realistic range.
    
    5. Location Information (b_erp_loc_a101)

    This table contains country information related to the customers' locations.

    Columns:
    
      cntry: Country
    
    Cleaning Steps:
    
      Standardized country names (e.g., "US" and "USA" were mapped to "United States").
    
      Removed special characters (e.g., carriage returns) and trimmed whitespace.
    
    6. Product Category (b_erp_px_cat_g1v2)

    This table contains product category information.

    Columns:
    
      Product category data (no significant cleaning required).
    

    Key Features:

    Customer demographics, including gender and marital status
    
    Product details such as cost, start date, and end date
    
    Sales data with order dates, quantities, and sales amounts
    
    ERP-specific customer and location data
    

    Data Cleaning Process:

    This dataset underwent extensive cleaning and validation, including:

    Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
    
    Date Validations: Ensuring correct date ranges and chronological consistency.
    
    Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
    
    Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
    

    This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
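
    As one way to picture the sales integrity check described above, here is a minimal SQL sketch using the column names from the description (s_crm_sales_details, sls_sales, sls_quantity, sls_price); the exact correction logic used in the original ETL is not shown here and may differ.

    -- Flag rows where the stored sales amount is missing, non-positive,
    -- or disagrees with price * quantity.
    SELECT *
    FROM s_crm_sales_details
    WHERE sls_sales IS NULL
       OR sls_sales <= 0
       OR sls_sales <> sls_price * sls_quantity;

    -- One possible correction: recompute the amount from price and quantity.
    UPDATE s_crm_sales_details
    SET sls_sales = sls_price * sls_quantity
    WHERE sls_sales IS NULL
       OR sls_sales <= 0
       OR sls_sales <> sls_price * sls_quantity;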

  7. Rule-Based SQL Injection (RbSQLi) Dataset

    • data.mendeley.com
    Updated Sep 29, 2025
    Cite
    Mohammad Abu Obaida Mullick (2025). Rule-Based SQL Injection (RbSQLi) Dataset [Dataset]. http://doi.org/10.17632/xz4d5zj5yw.4
    Explore at:
    Dataset updated
    Sep 29, 2025
    Authors
    Mohammad Abu Obaida Mullick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The RbSQLi dataset has been developed to support advanced research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,190,450 structured entries, out of which 2,699,570 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (398,070 samples), Stackqueries-based (223,800 samples), Time-based (564,900 samples), Meta-based (481,280 samples), Boolean-based (207,900 samples), and Error-based (823,620 samples).

    The malicious payloads for Union-based, Time-based, and Error-based injection types were sourced directly from the widely used and reputable open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). Moreover, ChatGPT was employed to generate additional payloads for Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach ensures that the dataset reflects both known attack patterns and intelligently simulated variants, contributing to a broader representation of SQLi techniques. In addition, some queries in the SQLi dataset are syntactically invalid yet contain malicious payloads, enabling models to detect SQL injection attempts even when attackers submit improperly formed or malformed queries. This highlights the importance of training models to recognize semantic intent rather than relying solely on syntactic correctness.

    All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.
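
    One way to picture the rule-based labeling step is as a SQL CASE expression over a staging table; the keyword rules, table, and column names below are illustrative assumptions, not the authors' actual algorithm.

    -- Sketch only: payloads table, query_text column, and keyword rules are assumed.
    SELECT query_text,
           CASE
             WHEN query_text LIKE '%UNION%SELECT%'        THEN 'Union-based'
             WHEN query_text LIKE '%;%SELECT%'            THEN 'Stack-queries-based'
             WHEN query_text LIKE '%SLEEP(%'
               OR query_text LIKE '%WAITFOR DELAY%'       THEN 'Time-based'
             WHEN query_text LIKE '%information_schema%'
               OR query_text LIKE '%@@version%'           THEN 'Meta-based'
             WHEN query_text LIKE '%OR 1=1%'              THEN 'Boolean-based'
             WHEN query_text LIKE '%EXTRACTVALUE(%'
               OR query_text LIKE '%CONVERT(%'            THEN 'Error-based'
             ELSE 'benign'
           END AS label
    FROM payloads;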

    This dataset is intended to facilitate progress in the development of more accurate and generalizable SQL injection detection systems and to serve as a reliable benchmark for the broader security and machine learning communities.

  8. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  9. Clean Burning Wood Stove Grants

    • splitgraph.com
    • opendata.maryland.gov
    • +5more
    Updated Mar 6, 2020
    Cite
    opendata-maryland-gov (2020). Clean Burning Wood Stove Grants [Dataset]. https://www.splitgraph.com/opendata-maryland-gov/clean-burning-wood-stove-grants-8aku-y93i
    Explore at:
    Available download formats: application/openapi+json, application/vnd.splitgraph.image, json
    Dataset updated
    Mar 6, 2020
    Authors
    opendata-maryland-gov
    Description

    To help Maryland homeowners invest in clean energy, the Maryland Energy Administration provides grants for clean burning wood stoves that displace electric, non-natural gas fossil fuel heating systems or old wood stoves.

    More information is available on the program's website at: http://energy.maryland.gov/Residential/woodstoves/

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
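
    A minimal query of this kind, assuming a PostgreSQL client connected to the Splitgraph endpoint; the table name inside the repository is an assumption here.

    -- Preview a few rows of the wood stove grants data (table name assumed).
    SELECT *
    FROM "opendata-maryland-gov/clean-burning-wood-stove-grants-8aku-y93i".clean_burning_wood_stove_grants
    LIMIT 10;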

    See the Splitgraph documentation for more information.

  10. Annual Water Use at Washington Clean Buildings Act Tier 1 and 2 Buildings

    • splitgraph.com
    • open.piercecountywa.gov
    • +1more
    Updated Oct 7, 2024
    Cite
    internal-open-piercecountywa-gov (2024). Annual Water Use at Washington Clean Buildings Act Tier 1 and 2 Buildings [Dataset]. https://www.splitgraph.com/internal-open-piercecountywa-gov/annual-water-use-at-washington-clean-buildings-act-wma7-x5q2
    Explore at:
    Available download formats: application/vnd.splitgraph.image, json, application/openapi+json
    Dataset updated
    Oct 7, 2024
    Authors
    internal-open-piercecountywa-gov
    Area covered
    Washington
    Description

    This data represents all water used at Pierce County owned Washington Clean Buildings Act (WCBA) Tier 1 and 2 buildings, which includes all County buildings that are larger than 20,000 Square Feet.

    All water usage data is collected from utility bills.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  11. UK Power Networks Grid Substation Distribution Areas

    • ukpowernetworks.opendatasoft.com
    Updated Mar 31, 2025
    Cite
    (2025). UK Power Networks Grid Substation Distribution Areas [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-grid-postcode-area/
    Explore at:
    Dataset updated
    Mar 31, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas based on MPAN data.

    Methodology

    Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.

    Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites.

    Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.

    Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. Any Primaries not fed from a Grid Site are marked as NULL and labeled.

      Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.
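
    A minimal SQL sketch of the postcode-to-Primary assignment rule described above (for each postcode, keep the Primary feeding the largest number of MPANs); the table and column names are assumptions, and the production pipeline runs in FME rather than plain SQL.

    -- Sketch only: mpan_records, postcode, and primary_site are assumed names.
    WITH counts AS (
      SELECT postcode,
             primary_site,
             COUNT(*) AS mpan_count,
             ROW_NUMBER() OVER (PARTITION BY postcode
                                ORDER BY COUNT(*) DESC) AS rn
      FROM mpan_records
      GROUP BY postcode, primary_site
    )
    SELECT postcode, primary_site AS assigned_primary
    FROM counts
    WHERE rn = 1;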
    

    Quality Control Statement

    Quality Control Measures include:

    Verification steps to match features only with confirmed functional locations.
    Manual review and correction of data inconsistencies.
    Use of additional verification steps to ensure accuracy in the methodology.
    Regular updates and reviews documented in the version history.

    Assurance Statement

    The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.

    Other

    Download dataset information: Metadata (JSON)

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/. To view this data, please register and log in.

  12. AB617 "Path to Clean Air" - Port Terminals

    • splitgraph.com
    • data.bayareametro.gov
    Updated Apr 24, 2023
    Cite
    bayareametro-gov (2023). AB617 "Path to Clean Air" - Port Terminals [Dataset]. https://www.splitgraph.com/bayareametro-gov/ab617-path-to-clean-air-port-terminals-wega-rdzu
    Explore at:
    Available download formats: json, application/vnd.splitgraph.image, application/openapi+json
    Dataset updated
    Apr 24, 2023
    Authors
    bayareametro-gov
    Description

    Port terminals within the AB617 "Path to Clean Air" emissions inventory domain. Locations of port terminals are based on bulk vessel call data from the Marine Exchange of San Francisco.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  13. AB617 "Path to Clean Air" - Bus Routes

    • splitgraph.com
    • data.bayareametro.gov
    Updated Feb 23, 2023
    Cite
    AC Transit (2023). AB617 "Path to Clean Air" - Bus Routes [Dataset]. https://www.splitgraph.com/bayareametro-gov/ab617-path-to-clean-air-bus-routes-mv7a-yd2h
    Explore at:
    Available download formats: json, application/openapi+json, application/vnd.splitgraph.image
    Dataset updated
    Feb 23, 2023
    Dataset authored and provided by
    AC Transit
    Description

    Source URL: https://www.actransit.org/data-api-resource-center

    Where dataset is from: AC Transit, Golden Gate Bus Service (digitized by BAAQMD), Westcat Bus Service (digitized by BAAQMD)

    When obtained: 1/13/2022

    For what purpose it was obtained: General AB 617 Planning

    Additional Information: AC Transit work is licensed under a Creative Commons Attribution 3.0 Unported License

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  14. From Simulation to Classification: A Scalable Rule-Based SQL Injection Dataset Creation and Machine Learning Evaluation

    • data.mendeley.com
    Updated May 23, 2025
    Cite
    Mohammad Abu Obaida Mullick (2025). From Simulation to Classification: A Scalable Rule-Based SQL Injection Dataset Creation and Machine Learning Evaluation [Dataset]. http://doi.org/10.17632/xz4d5zj5yw.1
    Explore at:
    Dataset updated
    May 23, 2025
    Authors
    Mohammad Abu Obaida Mullick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset has been developed to support advanced research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,304,026 structured entries, out of which 2,813,146 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (758,600 samples), Stackqueries-based (746,480 samples), Time-based (531,580 samples), Meta-based (481,280 samples), Boolean-based (226,080 samples), and Error-based (69,126 samples).

    The malicious payloads for Union-based, Time-based, and Error-based injection types were sourced directly from the widely used and reputable open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). Moreover, ChatGPT was employed to generate additional payloads for Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach ensures that the dataset reflects both known attack patterns and intelligently simulated variants, contributing to a broader representation of SQLi techniques.

    All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.

    This dataset is intended to facilitate progress in the development of more accurate and generalizable SQL injection detection systems and to serve as a reliable benchmark for the broader security and machine learning communities.

  15. Medium articles dataset

    • crawlfeeds.com
    • kaggle.com
    json, zip
    Updated Aug 26, 2025
    Cite
    Crawl Feeds (2025). Medium articles dataset [Dataset]. https://crawlfeeds.com/datasets/medium-articles-dataset
    Explore at:
    Available download formats: json, zip
    Dataset updated
    Aug 26, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Buy Medium Articles Dataset – 500K+ Published Articles in JSON Format

    Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.

    Request the large dataset here: Medium datasets

    Check out the sample dataset in CSV

    Use Cases:

    • Training language models (LLMs)

    • Analyzing content trends and engagement

    • Sentiment and text classification

    • SEO research and author profiling

    • Academic or commercial research

    Why Choose This Dataset?

    • High-volume, cleanly structured JSON

    • Ideal for developers, researchers, and data scientists

    • Easy integration with Python, R, SQL, and other data pipelines

    • Affordable and ready-to-use

  16. To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040

    • borealisdata.ca
    • dataone.org
    Updated Feb 28, 2019
    Cite
    Shahram Yarmand (2019). To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040 [Dataset]. http://doi.org/10.5683/SP2/6KU4I7
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2019
    Dataset provided by
    Borealis
    Authors
    Shahram Yarmand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 2017 - Nov 2017
    Area covered
    Metro Vancouver
    Description

    The population of Metro Vancouver (20110729Regional Growth Strategy Projections Population, Housing and Employment 2006 – 2041 File) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015_ Water Consumption_ Statistics File) will be essential. This supply of drinking water needs to be estimated and optimized (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report File) illustrates the amount of drinking water for each reservoir and region. The B.C. government has a specific strategy for the growing population until 2040, which leads them toward their goal. In addition, a new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible supply until 2040; the government will have to decide how much groundwater is used. The goal of the project has two steps: (1) an optimization model for the three water reservoirs, and (2) estimating the new source of water to 2040. The data analysis process for the project uses six software tools: Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, and SQL, with visualization in Tableau.

    1. Trifacta Wrangler cleans the data (Data Mining file).

    2. AMPL and Excel Solver optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).

    3. ArcMap (ArcGIS) combines the raw data with the results of the reservoir optimization and the population estimate to 2040 (GIS Map for Tableau file).

    4. Visualizing, estimating, and optimizing the source of drinking water for Metro Vancouver until 2040 with SQL in Tableau (export tableau data file).

  17. SQL clean Fitbit data

    • kaggle.com
    zip
    Updated Jun 30, 2021
    Cite
    Peter Fonteneau (2021). SQL clean Fitbit data [Dataset]. https://www.kaggle.com/peterfonteneau/sql-clean-fitbit-data
    Explore at:
    Available download formats: zip (28842 bytes)
    Dataset updated
    Jun 30, 2021
    Authors
    Peter Fonteneau
    Description

    Dataset

    This dataset was created by Peter Fonteneau


  18. AB617 "Path to Clean Air" - Permitted Stationary Sources

    • splitgraph.com
    • data.bayareametro.gov
    Updated Feb 23, 2023
    Cite
    BAAQMD (2023). AB617 "Path to Clean Air" - Permitted Stationary Sources [Dataset]. https://www.splitgraph.com/bayareametro-gov/ab617-path-to-clean-air-permitted-stationary-gzqi-hbu9
    Explore at:
    Available download formats: application/openapi+json, application/vnd.splitgraph.image, json
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    Bay Area Air Quality Management District (http://www.baaqmd.gov/)
    Authors
    BAAQMD
    Description

    A list of permitted facilities in the Richmond/San Pablo "Path to Clean Air" community that was derived from a larger planning inventory. This list includes location information.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  19. AB 617 "Path to Clean Air" - Jurisdictional Boundaries

    • splitgraph.com
    • data.bayareametro.gov
    Updated Feb 23, 2023
    Cite
    bayareametro-gov (2023). AB 617 "Path to Clean Air" - Jurisdictional Boundaries [Dataset]. https://www.splitgraph.com/bayareametro-gov/ab-617-path-to-clean-air-jurisdictional-boundaries-pwgz-y7cs/
    Explore at:
    Available download formats: application/openapi+json, json, application/vnd.splitgraph.image
    Dataset updated
    Feb 23, 2023
    Authors
    bayareametro-gov
    Description

    This dataset is based on digitized census data from census-designated places and cities. Jurisdictions included within the Richmond/San Pablo jurisdictional boundary are:

    City of Richmond: the full city is included, with the exception of the following small, disconnected areas located on either side of Pinole Valley Park, to the east of the Richmond/San Pablo Boundary:

    Greenridge Heights, May Valley, El Sobrante Hills, Greenbriar, Carriage Hills North, Castro Heights, and Carriage Hills south.

    City of San Pablo: the full city;

    City of Pinole: a small portion of the city (located in the northeast corner of the CERP Boundary);

    Unincorporated Contra Costa County:

    North Richmond: the full unincorporated place;

    Tara Hills: the full unincorporated place;

    Montalvin Manor: the full unincorporated place;

    Bayview: the full unincorporated place;

    East Richmond Heights: most of the unincorporated place is included, other than a small area within a census tract that included El Cerrito (Census Tract #06013384000);

    Rollingwood: the full unincorporated place; and

    El Sobrante: less than half of the unincorporated place.

    Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.

    See the Splitgraph documentation for more information.

  20. Worldwide Gender Differences in Public Code Contributions - Replication Package

    • data.niaid.nih.gov
    Updated Feb 9, 2022
    Cite
    Davide Rossi; Stefano Zacchiroli (2022). Worldwide Gender Differences in Public Code Contributions - Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6020474
    Explore at:
    Dataset updated
    Feb 9, 2022
    Dataset provided by
    LTCI, Télécom Paris, Institut Polytechnique de Paris, France
    University of Bologna, Italy
    Authors
    Davide Rossi; Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Worldwide Gender Differences in Public Code Contributions - Replication Package

    This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011

    This document comes with the software needed to mine and analyze the data presented in the paper.

    Prerequisites

    These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages: click==8.0.3 cycler==0.10.0 gender-guesser==0.4.0 kiwisolver==1.3.2 matplotlib==3.4.3 numpy==1.21.3 pandas==1.3.4 patsy==0.5.2 Pillow==8.4.0 pyparsing==2.4.7 python-dateutil==2.8.2 pytz==2021.3 scipy==1.7.1 six==1.16.0 statsmodels==0.13.0

    Initial data

    swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

    names.tab - forenames and surnames per country with their frequency

    zones.acc.tab - countries/territories, timezones, population and world zones

    c_c.tab - ccTLD entities - world zones matches

    Data preparation

    Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst sh> ./export.sh

    Run the authors cleanup script to create authors--clean.csv.zst sh> ./cleanup.sh authors.csv.zst

    Filter out implausible names and create authors--plausible.csv.zst sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

    Gender detection

    Run the gender guessing script to create author-fullnames-gender.csv.zst sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst

    Database creation and data ingestion

    Create the PostgreSQL DB sh> createdb gender-commit Notice that from now on when prepending the psql> prompt we assume the execution of psql on the gender-commit database.

    Import data into PostgreSQL DB sh> ./import_data.sh

    Zone detection

    Extract commits data from the DB and create commits.tab, that is used as input for the gender detection script sh> psql -f extract_commits.sql gender-commit

    Run the world zone detection script to create commit_zones.tab.zst sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

    Read zones assignment data from the file into the DB psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

    Extraction and graphs

    Run the script to execute the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...). sh> ./extract_data.sh

    Run the script to create the graphs from all the previously extracted tabfiles. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf. sh> ./create_charts.sh

    Additional graphs

    This package also includes some already-made graphs

    authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period

    authors_zones_2.pdf: ditto with at least two commits per period

    authors_zones_10.pdf: ditto with at least ten commits per period
