Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📘 Dataset Card: SWE‑Dev
📝 Dataset Summary
SWE‑Dev (Software Engineering - Feature-driven Development) is the first large-scale dataset tailored for realistic, feature-driven software development using large language models (LLMs). Each example consists of a natural language product requirement, partial source code, and developer-authored unit tests—designed to simulate real-world software feature implementation tasks within large codebases. The dataset enables LLMs to… See the full description on the dataset page: https://huggingface.co/datasets/Dorothydu/SWE-Dev.
According to the survey, just under 18 percent of respondents identified PostgreSQL as one of the most-wanted database skills. MongoDB ranked second, with 17.89 percent stating that they are not developing with it but want to.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Language | Number of Samples |
Java | 153,119 |
Ruby | 233,710 |
Go | 137,998 |
JavaScript | 373,598 |
Python | 472,469 |
PHP | 294,394 |
At CompanyData.com (BoldData), we provide verified company data sourced directly from official trade registers. Our global IT company dataset gives you access to 6 million IT businesses worldwide, including software firms, tech consultancies, system integrators, SaaS providers, and other IT service companies. Every record is sourced from authoritative local registries, ensuring unmatched accuracy, coverage, and compliance.
This dataset is built for professionals who need reliable, structured insights into the global technology sector. Each company profile includes firmographic details such as legal entity name, registration number, business structure, size, revenue range, and industry classification (NACE/SIC). In addition, you'll find direct contact information for decision-makers—emails, mobile numbers, job titles, and department roles—helping you connect with the right people instantly.
Whether you're validating suppliers for compliance, identifying high-potential leads for sales, enriching your CRM data, or building AI models with clean and segmented business intelligence, our IT dataset is designed to support a wide range of critical use cases. From global enterprises to fast-scaling startups, our data empowers businesses to move faster and smarter.
We offer multiple delivery methods tailored to your needs. Choose from custom bulk files, access data through our self-service platform, integrate it directly into your systems via real-time API, or let us enrich your existing database with missing fields and decision-maker insights.
With a database spanning 380 million companies globally, deep IT sector segmentation, and proven expertise in sourcing from local trade registers, CompanyData.com (BoldData) helps your team identify opportunities, ensure compliance, and scale efficiently—wherever your growth takes you.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides comprehensive, up-to-date information about the top 100 Software-as-a-Service (SaaS) companies globally as of 2025. It includes detailed financial metrics, company fundamentals, and operational data that are crucial for market research, competitive analysis, investment decisions, and academic studies.
Key Features
Use Cases
Industries Covered
- Enterprise Software (CRM, ERP, HR)
- Developer Tools & DevOps
- Cybersecurity
- Data Analytics & Business Intelligence
- Marketing & Sales Technology
- Financial Technology
- Communication & Collaboration
- E-commerce Platforms
- Design & Creative Tools
- Infrastructure & Cloud Services
Why This Dataset?
The SaaS industry has grown to over $300 billion globally, with companies achieving unprecedented valuations and growth rates. This dataset captures the current state of the industry leaders, providing insights into what makes successful SaaS companies tick.
Sources/Proof of Data
Data Sources
The data has been meticulously compiled from multiple authoritative sources:
Company Financial Reports (Q4 2024 - Q1 2025)
- Official earnings releases and investor relations documents
- SEC filings for public companies
Investment Databases
- Crunchbase, PitchBook, and CB Insights for funding data
- Venture capital and private equity announcements
Market Research Reports
- Gartner, Forrester, and IDC industry analyses
- SaaS Capital Index and valuation reports
Industry Publications
- TechCrunch, Forbes, Wall Street Journal coverage
- Company press releases and official announcements
Product Review Platforms
- G2 Crowd ratings and reviews
- Capterra and GetApp user feedback
Data Verification
- Cross-referenced across multiple sources for accuracy
- Updated with latest available information as of May 2025
- Validated against official company statements where available
Success.ai’s B2B Contact Data and App Developer Data for Engineering Professionals Worldwide is a trusted resource for connecting with engineers and technical managers across industries and regions. This dataset draws from over 170 million verified professional profiles, ensuring you have access to high-quality contact data tailored to your business needs. From sales outreach to recruitment, Success.ai enables you to build meaningful relationships with engineering professionals at every level.
Why Choose Success.ai’s Engineering Professionals Data?
Data is AI-validated, ensuring 99% accuracy for your campaigns.
Global Engineering Coverage:
Includes engineers and technical managers from sectors like manufacturing, IT, construction, aerospace, automotive, and more.
Regions covered include North America, Europe, Asia-Pacific, South America, and the Middle East.
Real-Time Updates:
Continuous updates ensure you stay connected to current roles and decision-makers in engineering.
Compliance and Security:
Fully adheres to GDPR, CCPA, and other global data privacy standards, ensuring legal and ethical use.
Data Highlights:
- 170M+ Verified Professional Profiles: Comprehensive data from various industries, including engineering.
- 50M Work Emails: Accurate and AI-validated for reliable communication.
- 30M Company Profiles: Detailed insights to support targeted outreach.
- 700M Global Professional Profiles: A rich dataset designed to meet diverse business needs.
Key Features of the Dataset:
- Extensive Engineer Profiles: Covers various roles, including mechanical, software, civil, and electrical engineers, as well as engineering managers and directors.
- Customizable Filters: Segment profiles by location, industry, job title, and company size for precise targeting.
- AI-Powered Insights: Enriches profiles with contextual details to support personalization.
Strategic Use Cases:
Reach technical decision-makers to accelerate your sales cycles.
Recruitment and Talent Acquisition:
Source skilled engineers and managers for specialized roles.
Use updated profiles to connect with potential candidates effectively.
Targeted Marketing Campaigns:
Launch precision-driven marketing campaigns aimed at engineers and engineering teams.
Personalize outreach with accurate and detailed contact data.
Engineering Services and Solutions:
Pitch your engineering tools, software, or consulting services to professionals who can benefit the most.
Establish connections with managers who influence procurement decisions.
Why Success.ai Stands Out:
Best Price Guarantee: Gain access to high-quality datasets at competitive prices.
Flexible Integration Options: Choose between API access or downloadable formats for seamless integration into your systems.
High Accuracy and Coverage: Benefit from AI-validated contact data for impactful results.
Customizable Datasets: Filter and refine datasets to focus on specific engineering roles, industries, or regions.
APIs for Enhanced Functionality:
Empower your business with B2B Contact Data for Engineering Professionals Worldwide from Success.ai. With verified work emails, phone numbers, and decision-maker profiles, you can confidently target engineers and managers in any sector.
Experience the Best Price Guarantee and unlock the potential of precise, AI-validated datasets. Contact us today and start connecting with engineering leaders worldwide!
No one beats us on price. Period.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Code Review Execution Dataset
This dataset contains comprehensive code review data including pull requests, AI-generated code suggestions, human feedback, and static analysis results. It represents real-world software development workflows and code quality processes.
Dataset Details
Dataset Description
This dataset captures the complete lifecycle of code review processes in software development, including:
Pull request metadata and context… See the full description on the dataset page: https://huggingface.co/datasets/Nutanix/codereview-dataset.
Software Market Size 2025-2029
The software market size is forecast to increase by USD 30.7 billion, at a CAGR of 8.2% between 2024 and 2029.
The market is experiencing significant growth, driven primarily by the increasing volume of enterprise data and the shift towards cloud computing. Businesses are recognizing the value of leveraging data to gain insights and make informed decisions, leading to a surge in demand for software solutions that can manage and analyze large data sets. Additionally, cloud computing is becoming the preferred deployment model for software, as it offers cost savings, flexibility, and scalability. However, the market also faces challenges that require careful navigation. High costs of licensing and support continue to be a significant obstacle for many organizations, particularly smaller businesses and startups. These costs can limit their ability to implement and maintain the software solutions they need to remain competitive. Furthermore, ensuring data security and privacy in a cloud environment is a major concern, as sensitive information is increasingly being stored and processed digitally. Companies must address these challenges effectively to capitalize on the opportunities presented by the market's growth and remain competitive in the evolving software landscape.
What will be the Size of the Software Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, with dynamic market activities unfolding across various sectors. Entities such as version control systems, software quality assurance, software licensing, API integration, software maintenance, data warehousing, unit testing, project management, database management, cost optimization, and others, are seamlessly integrated into the software development lifecycle. Cloud computing is transforming the way software is deployed and accessed, while user experience remains a key focus for developers. Agile methodologies and the waterfall methodology coexist, with the former gaining popularity for its flexibility and the latter for its structured approach. Data mining and data analytics are increasingly being used to gain insights from vast amounts of data, while software security and bug tracking are essential components of any development process.
Machine learning and artificial intelligence are also making their mark, enhancing software functionality and improving user experience. Proprietary software and open source software each have their unique advantages, with CI/CD and DevOps streamlining the development process. Requirements gathering and user acceptance testing are crucial steps in ensuring software meets user needs, while code review and integration testing help maintain software quality. Technical support and software updates are ongoing requirements, with risk management and cost optimization essential for businesses to effectively manage their software investments. Business intelligence and software architecture are critical for making informed decisions and building scalable systems.
How is this Software Industry segmented?
The software industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Type
- Subscriptions
- Identity and access management
- Endpoint/network/messaging/web security
- Risk management
Deployment
- Cloud-based
- On-premises
Sector
- Large enterprises
- Small and medium enterprises
Application
- CRM
- ERP
- Cybersecurity
- Collaboration Tools
Geography
- North America: US, Canada, Mexico
- Europe: France, Germany, Italy, UK
- Middle East and Africa: UAE
- APAC: China, India, Japan
- South America: Brazil
- Rest of World (ROW)
By Type Insights
The subscriptions segment is estimated to witness significant growth during the forecast period. In the ever-evolving software market, subscription-based models are gaining significant traction as a key growth driver. This shift is driven by the increasing recognition of the benefits offered by these models, enabling businesses to adapt to their evolving needs. Subscription models provide flexibility, allowing companies to scale their software usage efficiently, adapting to expanding operations or streamlined processes. Additionally, these models promote cost optimization, enabling businesses to spread their software expenses over time, making it a more viable option for organizations of all sizes. The software development lifecycle is undergoing a transformation, with both waterfall and agile methodologies being adopted. Waterfall methodology, with its linear approach, is ideal for projects with well-defined requirements. In contrast, agile methodologies, with their iterative and collaborative nature, are more suitable for projects wit
The 2010 Report of the President's Council of Advisors on Science and Technology (PCAST), entitled "Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology," documents the transformation of our society driven by advances in networking and information technology, catalyzed by our nation's past investments in research. Our world today relies to an astonishing degree on systems, tools, and services that belong to a vast and still growing domain known as Networking and Information Technology (NIT)...
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analyzing large BigQuery datasets.
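As a rough illustration of querying these tables from Python while keeping scan costs in check, the sketch below builds a query against one table and uses a dry run to estimate the bytes scanned before executing it. It assumes the google-cloud-bigquery package and working credentials. The table name "licenses", its "license" column, the helper names, and the 1 GB limit are assumptions for illustration and are not taken from the description above.

```python
"""Sketch: estimate query cost with a dry run before scanning a large public table."""

DATASET = "bigquery-public-data.github_repos"


def build_count_query(table_name: str, column: str, limit: int = 10) -> str:
    """Return SQL that counts rows per distinct value of `column`."""
    return (
        f"SELECT {column}, COUNT(*) AS n\n"
        f"FROM `{DATASET}.{table_name}`\n"
        f"GROUP BY {column}\n"
        f"ORDER BY n DESC\n"
        f"LIMIT {limit}"
    )


def run_with_byte_limit(sql: str, max_bytes: int = 10**9):
    """Dry-run the query and execute it only if it scans at most max_bytes."""
    # Imported lazily so the query builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    estimate = client.query(sql, job_config=dry_cfg).total_bytes_processed
    if estimate > max_bytes:
        raise RuntimeError(f"query would scan {estimate} bytes (limit {max_bytes})")
    # maximum_bytes_billed makes BigQuery itself reject a larger scan.
    run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return list(client.query(sql, job_config=run_cfg).result())


if __name__ == "__main__":
    # "licenses" and "license" are assumed names used only for illustration.
    print(build_count_query("licenses", "license"))
```

Calling run_with_byte_limit(build_count_query("licenses", "license")) would then either return the result rows or raise before any large scan is billed.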
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
https://dataintelo.com/privacy-and-policy
In 2023, the global market size for Database Development and Management Tools Software was valued at approximately $XX billion. With a projected CAGR of X.XX% during the forecast period, this market is expected to reach around $XX billion by 2032. The growth of the market can be attributed to the increasing volume of data generated across various industries, the rising importance of data-driven decision-making, and the need for efficient data management solutions.
One of the primary growth factors for the Database Development and Management Tools Software market is the exponential increase in data generated by various industries. In today's digital age, organizations produce vast amounts of data through various channels including social media, e-commerce, and IoT devices. This surge in data necessitates the use of advanced database management tools that can efficiently store, process, and analyze data to derive meaningful insights. Furthermore, the need for real-time data processing and analytics has driven the demand for sophisticated database tools that can handle large volumes of data with high speed and accuracy.
Another significant growth factor is the increasing adoption of cloud-based solutions. Cloud computing has revolutionized the way data is stored and managed, offering numerous advantages such as scalability, cost-effectiveness, and flexibility. Many organizations are migrating their database management systems to cloud platforms to leverage these benefits. Cloud-based database tools allow businesses to scale their operations without the need for significant capital investment in IT infrastructure. Additionally, the cloud provides a more secure environment for data storage and management, which is crucial in an era where data breaches and cyber threats are prevalent.
Moreover, the growing emphasis on regulatory compliance and data security is driving the demand for advanced database security tools. With stringent regulations such as GDPR, HIPAA, and CCPA in place, organizations are compelled to adopt robust database security measures to protect sensitive information and avoid hefty fines. Database security tools offer features such as data encryption, access control, and activity monitoring, which help organizations safeguard their data and comply with regulatory requirements. The increasing number of cyber-attacks and data breaches further underscores the importance of database security, thereby fueling the market growth.
The role of Enterprise Database Software in this evolving landscape cannot be overstated. As businesses continue to expand and generate vast amounts of data, the need for robust and scalable database solutions becomes increasingly critical. Enterprise Database Software provides organizations with the tools necessary to manage complex data environments efficiently. These solutions offer advanced features such as data integration, real-time analytics, and automated management, which are essential for handling large datasets and ensuring data accuracy. Furthermore, Enterprise Database Software enables businesses to maintain high levels of data security and compliance, which is crucial in today's regulatory environment. By leveraging these tools, organizations can optimize their data management processes, improve operational efficiency, and drive strategic decision-making.
Regionally, North America is expected to dominate the Database Development and Management Tools Software market during the forecast period. The presence of major technology companies, high adoption of advanced technologies, and a strong focus on research and development contribute to the market growth in this region. Additionally, the Asia Pacific region is anticipated to witness significant growth due to the increasing digitalization, rapid economic development, and the growing number of small and medium enterprises (SMEs) that require efficient database management solutions.
The Database Development and Management Tools Software market can be segmented by type into Database Design Tools, Database Management Tools, Database Monitoring Tools, Database Security Tools, and others. Database Design Tools are essential for creating and structuring databases that meet the specific needs of an organization. These tools help in designing the architecture, schema, and relationships between various data entities. The demand for database design tools is drive
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.
Database Management Systems
As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
950 global export and import shipment records of software development companies, with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Perspectives of novice software engineers (NSEs) regarding hybrid work, examining their views on hybrid work conditions and their experiences with hybrid tools.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The global Open-Source Software Input Output (OSSIO) tables were built to cover five different programming languages and 15 countries. The researchers used knowledge of the geographical location of software developers and of the linkages between software projects (dependencies) to aggregate these into flows between countries. The OSSIO tables were built as part of the EU-funded research project 'Rethinking Global Supply Chains: Measurement, Impact and Policy' (RETHINK-GSC; https://rethink-gsc.eu/), which captures the impact of knowledge flows and service inputs in global supply chains (GSCs).
https://dataintelo.com/privacy-and-policy
The global database automation software market size in 2023 is projected at approximately USD 1.8 billion, and it is anticipated to reach around USD 3.9 billion by 2032, growing at a CAGR of 9.2% during the forecast period. The robust growth can be attributed to various factors, including the increasing need for businesses to manage large volumes of data efficiently, the rise of cloud computing, and the rapid adoption of automation technologies in a variety of industries.
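The relationship between the two market-size figures and the growth rate can be checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The short sketch below applies it to the rounded figures quoted above, assuming 2023 as the base year and 2032 as the end year (nine compounding periods). The function names are illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` annually for `years` periods."""
    return start_value * (1 + rate) ** years


if __name__ == "__main__":
    years = 2032 - 2023  # assumed base and end years
    print(f"implied CAGR: {cagr(1.8, 3.9, years):.2%}")
    print(f"USD 1.8 billion at 9.2% for {years} years: {project(1.8, 0.092, years):.2f} billion")
```

With these rounded inputs the implied rate comes out near 9.0%, slightly below the quoted 9.2%. The gap is plausibly due to rounding of the endpoint values or to a different base year in the underlying report.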
The growing emphasis on reducing operational costs is one of the primary factors propelling the market. Organizations are continuously looking for ways to enhance productivity while minimizing costs. Database automation software helps in achieving this by automating routine database management tasks such as backup, recovery, and performance tuning. This automation leads to significant time and cost savings, thereby driving the market. Additionally, the software minimizes human errors, which can be costly and detrimental to business operations, further fueling its adoption.
Another critical growth driver is the increasing complexity of database environments. The surge in big data, IoT, and artificial intelligence applications has led to more complex and large-scale database systems. Managing these vast and complex databases manually can be incredibly challenging and prone to errors. Database automation software simplifies these processes by providing automated solutions for database configuration, monitoring, and maintenance, thereby making it easier to manage and optimize database performance.
Furthermore, the rapid adoption of cloud computing is significantly boosting the database automation software market. Cloud-based databases are becoming increasingly popular due to their scalability, flexibility, and cost-effectiveness. Database automation software provides seamless integration with cloud services, enabling businesses to efficiently manage their cloud databases. The capabilities of database automation tools to offer real-time analytics and ensure data accuracy in cloud environments are some of the other factors driving the market growth.
As organizations continue to navigate the complexities of modern data environments, the role of Database Development and Management Tools Software becomes increasingly vital. These tools are designed to streamline the process of database creation, modification, and maintenance, allowing businesses to focus on strategic objectives rather than routine database tasks. By leveraging such software, companies can ensure that their databases are not only efficient but also scalable and secure. This is particularly important in today's data-driven world, where the ability to quickly adapt to changing data requirements can provide a competitive edge. The integration of these tools with database automation software further enhances their capabilities, providing a comprehensive solution for managing complex database environments.
Regionally, North America holds a significant share of the database automation software market due to the early adoption of advanced technologies and the presence of key market players. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by the rapid industrialization, increasing investments in IT infrastructure, and the growing adoption of cloud-based solutions in countries like China and India.
The database automation software market can be segmented into two primary components: software and services. The software segment includes tools and platforms specifically designed for automating database tasks. These tools typically feature functionalities such as automated provisioning, configuration, patching, upgrades, and monitoring. The growing need for efficient database management solutions that can handle complex and large-scale database environments is driving the demand for database automation software. Companies are increasingly investing in advanced software solutions to optimize their database performance and ensure data accuracy.
On the other hand, the services segment encompasses various services associated with the implementation, integration, and maintenance of database automation software. This includes consulting services, managed services, and training and support services. As organizations seek to leverage the full
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
slocount.py: This script calculates the number of comment lines, total lines of code (TLOC) and source lines of code (SLOC). It uses scc, a code line counter developed by Ben Boyter, which must be installed (https://github.com/boyter/scc). The links to the source code of the global impact models (GIMs) can be found in the 'ISIMIP_models.xlsx' file.
active_dev.py: This script plots the number of active developers for each GIM across 10 sectors. It utilizes data from the 'active_dev.csv' file, which lists the GIMs and their respective number of developers.
cocomo.py: This script estimates the effort required for software development using the methodology proposed by Sachan et al. 2016 (https://doi.org/10.1016/j.procs.2016.06.107). It also generates plots for these estimates.
comment_density_modularity.py: This script calculates the comment density and evaluates the modularity of the modules. It also produces plots for these metrics.
code_standard.py: This script uses Pylint (https://pylint.readthedocs.io/en/latest/user_guide/usage/output.html) to check if the source code, either in part or in its entirety, adheres to the PEP8 coding standard. It also generates lint scores for the source code.
line_count.zip: This file contains the results of counting the number of comment lines, TLOC and SLOC for each GIM.
lint_score.zip: This file contains the results of running Pylint on GIMs that include Python in their source code. The results also include the lint score per GIM.
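To make the metrics above concrete, here is a minimal sketch of how comment density and a Basic COCOMO effort estimate can be derived from line counts. It is not the authors' code. The comment-density definition (comment lines divided by comment plus source lines) is one common convention assumed here, and the coefficients are Boehm's standard Basic COCOMO values rather than the parameters used by Sachan et al. 2016. The line counts in the usage example are made up.

```python
from dataclasses import dataclass

# Boehm's Basic COCOMO coefficients (a, b): effort = a * KLOC**b, in person-months.
BASIC_COCOMO = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


@dataclass
class LineCounts:
    comment: int  # number of comment lines
    sloc: int     # number of source lines of code


def comment_density(counts: LineCounts) -> float:
    """Share of comment lines among comment plus source lines (assumed definition)."""
    total = counts.comment + counts.sloc
    return counts.comment / total if total else 0.0


def basic_cocomo_effort(sloc: int, mode: str = "organic") -> float:
    """Estimated effort in person-months for a project of `sloc` source lines."""
    a, b = BASIC_COCOMO[mode]
    return a * (sloc / 1000) ** b


if __name__ == "__main__":
    counts = LineCounts(comment=2_500, sloc=40_000)  # made-up numbers
    print(f"comment density: {comment_density(counts):.1%}")
    print(f"effort (organic): {basic_cocomo_effort(counts.sloc):.1f} person-months")
```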
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
Run the gender guessing script to create author-fullnames-gender.csv.zst
sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
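The internals of guess_gender.py are not described here, so the following is only an illustration of what guessing a gender from a full name can look like, not the script's actual logic. It assumes a simple strategy (try each whitespace-separated token and keep the first decisive answer); the function name guess_from_fullname is hypothetical, and the lookup is passed in as a callable so the sketch runs without any third-party package.

```python
# Labels treated as decisive; these are the non-neutral labels returned by
# the gender-guesser package (the neutral ones are 'andy' and 'unknown').
DECISIVE = ("male", "female", "mostly_male", "mostly_female")


def guess_from_fullname(fullname, get_gender):
    """Return the first decisive label among the name's tokens.

    get_gender is any callable mapping one forename to a label.
    Assumption: tokens are tried left to right and the first decisive
    answer wins; the real script may well work differently.
    """
    for token in fullname.split():
        label = get_gender(token)
        if label in DECISIVE:
            return label
    return "unknown"
```

With gender-guesser installed, the callable can be gender_guesser.detector.Detector().get_gender, for example guess_from_fullname("Maria Rossi", detector.get_gender).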
Database creation and data ingestion
Create the PostgreSQL DB
sh> createdb gender-commit
Notice that from now on, commands prefixed with the psql> prompt are assumed to be executed in psql on the gender-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Zone detection
Extract commits data from the DB and create commits.tab, which is used as input for the world zone detection script
sh> psql -f extract_commits.sql gender-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Read zones assignment data from the file into the DB
psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
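The shell pipeline inside the \copy command keeps the first and sixth tab-separated columns of commit_zones.tab.zst and drops every row that ends in whitespace, which is what happens when the sixth column is empty. For readers less familiar with cut and grep, a rough Python equivalent of that filter is sketched below. The function name select_zone_rows is hypothetical, and rows with fewer than six columns are simply skipped here, which is a simplification of how cut treats them.

```python
import re


def select_zone_rows(lines):
    """Yield 'col1<TAB>col6' for rows whose sixth column is not blank."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # assumption: short rows are skipped in this sketch
        row = f"{fields[0]}\t{fields[5]}"
        if re.search(r"\s$", row):
            continue  # same test as grep -Ev '\s$': drop rows ending in whitespace
        yield row
```

Applied to the decompressed file, the generator yields the two-column rows that the \copy command loads into commit_culture.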
Extraction and graphs
Run the script that executes the queries extracting the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tab files. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
sh> ./create_charts.sh
Additional graphs
This package also includes some already-made graphs:
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period
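The three graphs above differ only in the minimum number of commits an author needs in a period to be counted. The following sketch shows how such a threshold changes the ratio of female authors per world zone. It is an illustration under stated assumptions rather than the package's extraction query: the record layout (zone, year, gender, commits) is hypothetical, a period is taken to be a year, and authors whose gender is neither 'female' nor 'male' are left out of the denominator.

```python
from collections import defaultdict


def female_ratio(records, min_commits=1):
    """Share of female authors per (zone, year).

    records: iterable of (zone, year, gender, commits) tuples, one per
    author and period. Only authors with at least min_commits commits
    and a gender of 'female' or 'male' are counted (an assumption).
    """
    female = defaultdict(int)
    total = defaultdict(int)
    for zone, year, gender, commits in records:
        if commits < min_commits or gender not in ("female", "male"):
            continue
        total[(zone, year)] += 1
        if gender == "female":
            female[(zone, year)] += 1
    return {key: female[key] / total[key] for key in total}
```

Calling female_ratio(records, 1), female_ratio(records, 2) and female_ratio(records, 10) mirrors the three thresholds used for authors_zones_1.pdf, authors_zones_2.pdf and authors_zones_10.pdf.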
Success.ai’s Technographic Data for IT Decision-makers in Europe offers a comprehensive and reliable dataset designed to connect businesses with key technology leaders and professionals across Europe. Covering roles such as CIOs, IT managers, software engineers, and infrastructure specialists, this dataset provides verified LinkedIn profiles, work emails, phone numbers, and detailed decision-maker insights.
With access to over 700 million verified global profiles, Success.ai ensures your outreach, marketing, and sales strategies are powered by accurate, continuously updated, and AI-validated data. Supported by our Best Price Guarantee, this solution is ideal for businesses aiming to engage with Europe’s most influential IT professionals.
Why Choose Success.ai’s Technographic Data?
Verified Contact Data for Precision Outreach
Comprehensive Coverage Across Europe
Continuously Updated Datasets
Ethical and Compliant
Data Highlights:
Key Features of the Dataset:
Comprehensive IT Professional Profiles
Advanced Filters for Precision Campaigns
Regional and Sector-specific Insights
AI-Driven Enrichment
Strategic Use Cases:
Marketing Campaigns and Lead Generation
Sales and Business Development
Partnership Development and Collaboration
Market Research and Competitive Analysis
Why Choose Success.ai?
Best Price Guarantee
Seamless Integration
Data Accuracy with AI Validation