This dataset was created by Deepali Sukhdeve
The dataset contains housing data for the Nashville, TN area. I used SQL Server to clean the data and make it easier to use: I converted dates to remove unnecessary timestamps, populated missing property addresses, split a single address field into separate address, city, and state columns, standardized a column that recorded the same values inconsistently, removed duplicate rows, and deleted unused columns.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Data Cleaning from Public Nashville Housing Data:
Standardize the Date Format
Populate Property Address data
Breaking out Addresses into Individual Columns (Address, City, State)
Change Y and N to Yes and No in the "Sold as Vacant" field
Remove Duplicates
Delete Unused Columns
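A minimal T-SQL sketch of these steps follows. The table and column names (NashvilleHousing, SaleDate, PropertyAddress, ParcelID, SoldAsVacant, UniqueID) are assumptions based on the description above, not taken from the dataset itself.

-- Sketch only: table and column names are assumed from the cleaning steps listed above.

-- 1. Standardize the date format (drop the time component)
ALTER TABLE NashvilleHousing ADD SaleDateConverted DATE;
GO
UPDATE NashvilleHousing SET SaleDateConverted = CONVERT(DATE, SaleDate);

-- 2. Populate missing property addresses from other rows with the same ParcelID
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing a
JOIN NashvilleHousing b
  ON a.ParcelID = b.ParcelID AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL;

-- 3. Break the address into individual columns
--    (a state component, if present, can be split the same way)
ALTER TABLE NashvilleHousing
    ADD PropertySplitAddress NVARCHAR(255), PropertySplitCity NVARCHAR(255);
GO
UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
    PropertySplitCity    = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress));

-- 4. Change Y and N to Yes and No in the "Sold as Vacant" field
UPDATE NashvilleHousing
SET SoldAsVacant = CASE WHEN SoldAsVacant = 'Y' THEN 'Yes'
                        WHEN SoldAsVacant = 'N' THEN 'No'
                        ELSE SoldAsVacant END;

-- 5. Remove duplicates (keep one row per repeated sale record)
WITH RowNumCTE AS (
    SELECT UniqueID,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate
               ORDER BY UniqueID) AS row_num
    FROM NashvilleHousing
)
DELETE FROM NashvilleHousing
WHERE UniqueID IN (SELECT UniqueID FROM RowNumCTE WHERE row_num > 1);

-- 6. Delete unused columns (the raw columns replaced by the converted/split versions)
ALTER TABLE NashvilleHousing DROP COLUMN PropertyAddress, SaleDate;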
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Known quirks: the UserId column in the ForumMessages table has values that do not exist in the Users table; some flag columns contain only the values "True" or "False"; and the Total columns are not always consistent with the detail tables (for example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table).
To load the data, I create the tables with the db_abd_create_tables.sql script and clean the CSVs with the clean_data.py script, which performs a series of cleaning steps for each table (for example, replacing missing values with NULL). I then add foreign keys with the add_foreign_keys.sql script and update the Total columns in the database tables by running the update_totals.sql script.
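The update_totals.sql script itself is not reproduced here. As a hedged illustration only, assuming the DatasetCount column on the Tags table is meant to agree with the DatasetTags detail table, the recomputation could look like:

-- Illustration only, not the actual contents of update_totals.sql.
UPDATE Tags
SET DatasetCount = (
    SELECT COUNT(*)
    FROM DatasetTags
    WHERE DatasetTags.TagId = Tags.Id
);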
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:
1. Customer Information (s_crm_cust_info)
This table contains information about customers, including their unique identifiers and demographic details.
Columns:
cst_id: Customer ID (Primary Key)
cst_gndr: Gender
cst_marital_status: Marital status
cst_create_date: Customer account creation date
Cleaning Steps:
Removed duplicates and handled missing or null cst_id values.
Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
Standardized gender values and identified inconsistencies in marital status.
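A minimal SQL sketch of these steps, using the s_crm_cust_info name from the heading above; the standardized output values are assumptions, not taken from the data:

-- Sketch only: keep the latest row per customer, trim text fields, standardize gender.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS rn
    FROM s_crm_cust_info
    WHERE cst_id IS NOT NULL
)
SELECT cst_id,
       CASE UPPER(TRIM(cst_gndr)) WHEN 'M' THEN 'Male'
                                  WHEN 'F' THEN 'Female'
                                  ELSE 'n/a' END AS cst_gndr,
       TRIM(cst_marital_status) AS cst_marital_status,
       cst_create_date
FROM ranked
WHERE rn = 1;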
2. Product Information
This table contains information about products, including product identifiers, names, costs, and lifecycle dates.
Columns:
prd_id: Product ID
prd_key: Product key
prd_nm: Product name
prd_cost: Product cost
prd_start_dt: Product start date
prd_end_dt: Product end date
Cleaning Steps:
Checked for duplicates and null values in the prd_key column.
Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
Corrected product costs to remove invalid entries (e.g., negative values).
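A minimal SQL sketch of these checks; the table name (s_crm_prd_info, inferred from the prd_ column prefix) and the correction applied to invalid costs are assumptions:

-- Sketch only: flag inverted date ranges and correct invalid costs.
SELECT prd_id, prd_key, prd_start_dt, prd_end_dt
FROM s_crm_prd_info
WHERE prd_end_dt IS NOT NULL
  AND prd_start_dt >= prd_end_dt;

UPDATE s_crm_prd_info
SET prd_cost = 0
WHERE prd_cost IS NULL OR prd_cost < 0;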
3. Sales Transactions
This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.
Columns:
sls_order_dt: Sales order date
sls_due_dt: Sales due date
sls_sales: Total sales amount
sls_quantity: Number of products sold
sls_price: Product unit price
Cleaning Steps:
Validated sales order dates and corrected invalid entries.
Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
Removed null and negative values from sls_sales, sls_quantity, and sls_price.
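A minimal SQL sketch of the integrity check; the table name (s_crm_sales_details, inferred from the sls_ column prefix) and the correction rules are assumptions:

-- Sketch only: recompute the sales amount where it disagrees with price * quantity.
UPDATE s_crm_sales_details
SET sls_sales = sls_quantity * ABS(sls_price)
WHERE sls_sales IS NULL
   OR sls_sales <= 0
   OR sls_sales <> sls_quantity * ABS(sls_price);

-- Derive a missing or invalid unit price from the corrected sales amount.
UPDATE s_crm_sales_details
SET sls_price = sls_sales / NULLIF(sls_quantity, 0)
WHERE sls_price IS NULL OR sls_price <= 0;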
4. Customer Demographics (ERP)
This table contains additional customer demographic data, including gender and birthdate.
Columns:
cid: Customer ID
gen: Gender
bdate: Birthdate
Cleaning Steps:
Checked for missing or null gender values and standardized inconsistent entries.
Removed leading/trailing spaces from gen and bdate.
Validated birthdates to ensure they were within a realistic range.
5. Customer Locations (ERP)
This table contains country information related to the customers' locations.
Columns:
cntry: Country
Cleaning Steps:
Standardized country names (e.g., "US" and "USA" were mapped to "United States").
Removed special characters (e.g., carriage returns) and trimmed whitespace.
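A minimal SQL sketch of this standardization; the table name (s_erp_loc) is an assumption, and only the US/USA mapping is taken from the description:

-- Sketch only: strip carriage returns, trim whitespace, and map country variants.
UPDATE s_erp_loc
SET cntry = CASE
        WHEN TRIM(REPLACE(cntry, CHAR(13), '')) IN ('US', 'USA') THEN 'United States'
        ELSE TRIM(REPLACE(cntry, CHAR(13), ''))
    END;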
6. Product Categories
This table contains product category information.
Columns:
Product category data (no significant cleaning required).
Key Features:
Customer demographics, including gender and marital status
Product details such as cost, start date, and end date
Sales data with order dates, quantities, and sales amounts
ERP-specific customer and location data
Data Cleaning Process:
This dataset underwent extensive cleaning and validation, including:
Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
Date Validations: Ensuring correct date ranges and chronological consistency.
Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RbSQLi dataset has been developed to support advanced research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,190,450 structured entries, out of which 2,699,570 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (398,070 samples), Stackqueries-based (223,800 samples), Time-based (564,900 samples), Meta-based (481,280 samples), Boolean-based (207,900 samples), and Error-based (823,620 samples).
The malicious payloads for Union-based, Time-based, and Error-based injection types were sourced directly from the widely used and reputable open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). Moreover, ChatGPT was employed to generate additional payloads for Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach ensures that the dataset reflects both known attack patterns and intelligently simulated variants, contributing to a broader representation of SQLi techniques. Additionally, some queries in the SQLi dataset are syntactically invalid yet contain malicious payloads, enabling models to detect SQL injection attempts even when attackers submit improperly formed queries. This highlights the importance of training models to recognize semantic intent rather than relying solely on syntactic correctness.
All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.
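The actual rule set used for labeling is not included in this description. As a hedged illustration of what a rule-based labeling pass could look like, the table name (payloads) and columns (payload, attack_type) below are hypothetical, and the keyword rules are simplified examples rather than the authors' rules:

-- Illustration only: simplified keyword rules over a hypothetical staging table.
UPDATE payloads
SET attack_type = CASE
        WHEN UPPER(payload) LIKE '%UNION%SELECT%' THEN 'Union-based'
        WHEN UPPER(payload) LIKE '%;%SELECT%' OR UPPER(payload) LIKE '%;%DROP%' THEN 'Stackqueries-based'
        WHEN UPPER(payload) LIKE '%SLEEP(%' OR UPPER(payload) LIKE '%WAITFOR DELAY%' THEN 'Time-based'
        WHEN UPPER(payload) LIKE '%INFORMATION_SCHEMA%' THEN 'Meta-based'
        WHEN UPPER(payload) LIKE '%OR 1=1%' THEN 'Boolean-based'
        WHEN UPPER(payload) LIKE '%EXTRACTVALUE(%' OR UPPER(payload) LIKE '%UPDATEXML(%' THEN 'Error-based'
        ELSE 'benign'
    END;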
This dataset is intended to facilitate progress in the development of more accurate and generalizable SQL injection detection systems and to serve as a reliable benchmark for the broader security and machine learning communities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server, then cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
To help Maryland homeowners invest in clean energy, the Maryland Energy Administration provides grants for clean burning wood stoves that displace electric, non-natural gas fossil fuel heating systems or old wood stoves.
More information is available on the program's website at: http://energy.maryland.gov/Residential/woodstoves/
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
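(A hedged illustration only: the table and column names below are hypothetical, not the dataset's actual schema, which is listed on its Splitgraph page.)

-- Hypothetical table and column names, for illustration only.
SELECT county, COUNT(*) AS grants_awarded
FROM wood_stove_grants
GROUP BY county
ORDER BY grants_awarded DESC;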
See the Splitgraph documentation for more information.
This data represents all water used at Pierce County owned Washington Clean Buildings Act (WCBA) Tier 1 and 2 buildings, which includes all County buildings that are larger than 20,000 square feet.
All water usage data is collected from utility bills.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset is a geospatial view of the areas fed by grid substations. The aim is to create an indicative map showing the extent to which individual grid substations feed areas, based on MPAN data.
Methodology
Data Extraction and Cleaning: MPAN data is queried from SQL Server and saved as a CSV. Invalid values and incorrectly formatted postcodes are removed using a Test Filter in FME.
Data Filtering and Assignment: MPAN data is categorized into EPN, LPN, and SPN based on the first two digits. Postcodes are assigned a Primary based on the highest number of MPANs fed from different Primary Sites (a rough SQL analogue of this assignment is sketched after this list).
Polygon Creation and Cleaning: Primary Feed Polygons are created and cleaned to remove holes and inclusions. Donut Polygons (holes) are identified, assigned to the nearest Primary, and merged.
Grid Supply Point Integration: Primaries are merged into larger polygons based on Grid Site relationships. Any Primaries not fed from a Grid Site are marked as NULL and labeled.
Functional Location Codes (FLOC) Matching: FLOC codes are extracted and matched to Primaries, Grid Sites and Grid Supply Points. Confirmed FLOCs are used to ensure accuracy, with any unmatched sites reviewed by the Open Data Team.
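The methodology above is implemented in FME. As a rough SQL analogue of the Data Filtering and Assignment step (the table and column names mpan_data, postcode, and primary_site are assumptions), each postcode is assigned to the Primary that feeds the most of its MPANs:

-- Rough SQL analogue of the FME assignment step; names are assumptions.
WITH counts AS (
    SELECT postcode,
           primary_site,
           COUNT(*) AS mpan_count,
           ROW_NUMBER() OVER (PARTITION BY postcode ORDER BY COUNT(*) DESC) AS rn
    FROM mpan_data
    GROUP BY postcode, primary_site
)
SELECT postcode, primary_site, mpan_count
FROM counts
WHERE rn = 1;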
Quality Control Statement
Quality Control Measures include:
Verification steps to match features only with confirmed functional locations
Manual review and correction of data inconsistencies
Use of additional verification steps to ensure accuracy in the methodology
Regular updates and reviews documented in the version history
Assurance Statement The Open Data Team and Network Data Team worked with the Geospatial Data Engineering Team to ensure data accuracy and consistency.
Other
Download dataset information: Metadata (JSON)
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
Port terminals within the AB617 "Path to Clean Air" emissions inventory domain. Locations of port terminals are based on bulk vessel call data from the Marine Exchange of San Francisco.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Source URL: https://www.actransit.org/data-api-resource-center
Where dataset is from: AC Transit, Golden Gate Bus Service (digitized by BAAQMD), Westcat Bus Service (digitized by BAAQMD)
When obtained: 1/13/2022
For what purpose it was obtained: General AB 617 Planning
Additional Information: AC Transit work is licensed under a Creative Commons Attribution 3.0 Unported License
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been developed to support advanced research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,304,026 structured entries, out of which 2,813,146 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (758,600 samples), Stackqueries-based (746,480 samples), Time-based (531,580 samples), Meta-based (481,280 samples), Boolean-based (226,080 samples), and Error-based (69,126 samples).
The malicious payloads for Union-based, Time-based, and Error-based injection types were sourced directly from the widely used and reputable open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). Moreover, ChatGPT was employed to generate additional payloads for Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach ensures that the dataset reflects both known attack patterns and intelligently simulated variants, contributing to a broader representation of SQLi techniques.
All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.
This dataset is intended to facilitate progress in the development of more accurate and generalizable SQL injection detection systems and to serve as a reliable benchmark for the broader security and machine learning communities.
https://crawlfeeds.com/privacy_policy
Get access to a premium Medium articles dataset containing 500,000+ curated articles with metadata including author profiles, publication dates, reading time, tags, claps, and more. Ideal for natural language processing (NLP), machine learning, content trend analysis, and AI model training.
Request access to the larger dataset here: Medium datasets
Check out a sample of the dataset in CSV
Training language models (LLMs)
Analyzing content trends and engagement
Sentiment and text classification
SEO research and author profiling
Academic or commercial research
High-volume, cleanly structured JSON
Ideal for developers, researchers, and data scientists
Easy integration with Python, R, SQL, and other data pipelines
Affordable and ready-to-use
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The population of Metro Vancouver (20110729Regional Growth Strategy Projections Population, Housing and Employment 2006 – 2041 File) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015_ Water Consumption_ Statistics File) will be essential. Drinking water supply therefore needs to be estimated and optimized (Data Mining file) to support the development of the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report File) determines the amount of drinking water allocated from each reservoir to each region. The B.C. government has a specific strategy for the growing population to 2040 that guides this work. In addition, a new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible groundwater supply to 2040, so the government will have to decide how much groundwater to use. The project has two goals: (1) an optimization model for the three water reservoirs, and (2) an estimate of the new water source to 2040. The data analysis process uses six tools: Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, SQL, and Tableau.
1. Trifacta Wrangler: clean the data (Data Mining file).
2. AMPL and Excel Solver: optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
3. ArcMap (ArcGIS): combine the raw data with the results of the reservoir optimization and the population estimate to 2040 (GIS Map for Tableau file).
4. Tableau with SQL: visualize, estimate, and optimize the source of drinking water for Metro Vancouver to 2040 (export tableau data file).
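The LP formulation itself is not reproduced in this description. A generic, hedged sketch of the kind of allocation model described, with placeholder symbols rather than the project's actual parameters (x_{rz} is the volume of treated water supplied from reservoir r to region z, C_r the capacity of reservoir r, and D_z the demand of region z), is:

\begin{align*}
\text{maximize}   \quad & \sum_{r \in R} \sum_{z \in Z} x_{rz} \\
\text{subject to} \quad & \sum_{z \in Z} x_{rz} \le C_r \quad \forall r \in R \quad \text{(reservoir capacity)} \\
                        & \sum_{r \in R} x_{rz} \ge D_z \quad \forall z \in Z \quad \text{(regional demand)} \\
                        & x_{rz} \ge 0 \quad \forall r \in R,\; z \in Z
\end{align*}

where R = {Capilano, Seymour, Coquitlam} and Z is the set of supply regions.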
This dataset was created by Peter Fonteneau
A list of permitted facilities in the Richmond/San Pablo "Path to Clean Air" community, derived from our larger planning inventory. This list includes location information.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
This dataset is based on digitized census data from census-designated places and cities. Jurisdictions included within the Richmond/San Pablo jurisdictional boundary are:
City of Richmond: the full city is included, with the exception of the following small, disconnected areas located on either side of Pinole Valley Park, to the east of the Richmond/San Pablo Boundary:
Greenridge Heights, May Valley, El Sobrante Hills, Greenbriar, Carriage Hills North, Castro Heights, and Carriage Hills south.
City of San Pablo: the full city;
City of Pinole: a small portion of the city (located in the northeast corner of the CERP Boundary);
Unincorporated Contra Costa County:
North Richmond: the full unincorporated place;
Tara Hills: the full unincorporated place;
Montalvin Manor: the full unincorporated place;
Bayview: the full unincorporated place;
East Richmond Heights: most of the unincorporated place is included, other than a small area within a census tract that included El Cerrito (Census Tract #06013384000);
Rollingwood: the full unincorporated place; and
El Sobrante: less than half of the unincorporated place.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility, and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages: click==8.0.3 cycler==0.10.0 gender-guesser==0.4.0 kiwisolver==1.3.2 matplotlib==3.4.3 numpy==1.21.3 pandas==1.3.4 patsy==0.5.2 Pillow==8.4.0 pyparsing==2.4.7 python-dateutil==2.8.2 pytz==2021.3 scipy==1.7.1 six==1.16.0 statsmodels==0.13.0
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
Run the gender guessing script to create author-fullnames-gender.csv.zst sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
Database creation and data ingestion
Create the PostgreSQL DB sh> createdb gender-commit Note that from now on, commands shown at the psql> prompt are assumed to run in psql connected to the gender-commit database.
Import data into PostgreSQL DB sh> ./import_data.sh
Zone detection
Extract commits data from the DB and create commits.tab, that is used as input for the gender detection script sh> psql -f extract_commits.sql gender-commit
Run the world zone detection script to create commit_zones.tab.zst sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Read zones assignment data from the file into the DB psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
Run the script to execute the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...). sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf. sh> ./create_charts.sh
Additional graphs
This package also includes some already-made graphs:
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period