50 datasets found

Stack Overflow Developer Survey Dataset
kaggle.com
zip
Updated Jan 8, 2024
Cite
Palvinder (2024). Stack Overflow Developer Survey Dataset [Dataset]. https://www.kaggle.com/datasets/palvinder2006/stackoverflow
Explore at:
Available download formats: zip (9459089 bytes)
Dataset updated
Jan 8, 2024
Authors
Palvinder
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Overview

The Stack Overflow Developer Survey Dataset represents one of the most trusted and comprehensive sources of information about the global developer community. Collected by Stack Overflow through its annual survey, the dataset provides insights into the demographics, preferences, habits, and career paths of developers.

This dataset is frequently used for:
- Analyzing trends in programming languages, tools, and technologies.
- Understanding developer job satisfaction, compensation, and work environments.
- Studying global and regional differences in developer demographics and experience.

The data consists of two CSV files: "survey_results_public", which contains the survey responses, and "survey_results_schema", which describes each column in detail.

Data Dictionary: All the details are in "survey_results_schema.csv"

Features of the Stack Overflow Developer Survey Dataset

Demographic & Background Information
- Respondent: A unique identifier for each survey participant.
- MainBranch: Describes whether the respondent is a professional developer, student, hobbyist, etc.
- Country: The country where the respondent lives.
- Age: The respondent's age.
- Gender: The gender identity of the respondent.
- Ethnicity: Ethnic background (when available).
- EdLevel: The highest level of formal education completed.
- UndergradMajor: The respondent's undergraduate major.
- Hobbyist: Indicates whether the person codes as a hobby (Yes/No).

Employment & Professional Experience
- Employment: Employment status (full-time, part-time, unemployed, student, etc.).
- DevType: Types of developer roles the respondent identifies with (e.g., Web Developer, Data Scientist).
- YearsCode: Number of years the respondent has been coding.
- YearsCodePro: Number of years coding professionally.
- JobSat: Job satisfaction level.
- CareerSat: Career satisfaction level.
- WorkWeekHrs: Approximate hours worked per week.
- RemoteWork: Whether the respondent works remotely and how frequently.

Compensation
- CompTotal: Total compensation in USD (including salary, bonuses, etc.).
- CompFreq: Frequency of compensation (e.g., yearly, monthly).

Learning & Education
- LearnCode: How the respondent first learned to code (e.g., online courses, university).
- LearnCodeOnline: Online resources used (e.g., YouTube, freeCodeCamp).
- LearnCodeCoursesCert: Whether the respondent has taken online courses or earned certifications.

Technology & Tools
- LanguageHaveWorkedWith: Programming languages the respondent has used.
- LanguageWantToWorkWith: Languages the respondent is interested in learning or using more.
- DatabaseHaveWorkedWith: Databases the respondent has experience with.
- PlatformHaveWorkedWith: Platforms used (e.g., Linux, AWS, Android).
- OpSys: The operating system used most often.
- NEWCollabToolsHaveWorkedWith: Collaboration tools used (e.g., Slack, Teams, Zoom).
- NEWStuck: How often the respondent feels stuck when coding.
- ToolsTechHaveWorkedWith: Frameworks and technologies respondents have worked with.

Online Presence & Community
- SOAccount: Whether the respondent has a Stack Overflow account.
- SOPartFreq: How often the respondent participates on Stack Overflow.
- SOVisitFreq: Frequency of visiting Stack Overflow.
- SOComm: Whether the respondent feels welcome in the Stack Overflow community.
- OpenSourcer: Level of involvement in open-source contributions.

Opinions & Preferences
- WorkChallenge: Challenges faced at work (e.g., unclear requirements, unrealistic expectations).
- JobFactors: Important job factors (e.g., salary, work-life balance, technologies used).
- MentalHealth: Questions on how mental health affects or is affected by their job.
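The multi-valued columns listed above (for example LanguageHaveWorkedWith) need splitting before they can be counted. The following is a minimal sketch using only the Python standard library. It assumes that multi-select answers are stored as one semicolon-delimited string per respondent; that delimiter is an assumption and should be checked against survey_results_schema.csv. The rows in SAMPLE and the helper name count_languages are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for survey_results_public.csv.
# The semicolon delimiter inside LanguageHaveWorkedWith is an assumption.
SAMPLE = """Respondent,Country,LanguageHaveWorkedWith
1,India,Python;SQL
2,Germany,Python;Rust;SQL
3,Canada,
"""


def count_languages(csv_file, column="LanguageHaveWorkedWith", sep=";"):
    """Count how many respondents listed each value in a multi-select column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip respondents who left the question blank
        counts.update(item.strip() for item in value.split(sep))
    return counts


language_counts = count_languages(io.StringIO(SAMPLE))
for language, n in language_counts.most_common():
    print(f"{language}: {n}")
```

To run this against the real file, the io.StringIO(SAMPLE) argument could be swapped for open("survey_results_public.csv", newline="", encoding="utf-8"), assuming the file has been extracted from the zip into the working directory.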
SWE-Bench Coding Tasks Dataset
kaggle.com
zip
Updated Oct 3, 2025
+ more versions
Cite
Unidata (2025). SWE-Bench Coding Tasks Dataset [Dataset]. https://www.kaggle.com/datasets/unidpro/fermatix-swe-bench
Explore at:
Available download formats: zip (146556 bytes)
Dataset updated
Oct 3, 2025
Authors
Unidata
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
License information was derived automatically
Description
SWE-Bench Dataset

The dataset comprises 8,712 files across 6 programming languages, featuring verified tasks and benchmarks for evaluating coding agents and language models. It introduces new benchmarks with real-world coding tasks, providing datasets for software engineering problems and tests. It builds upon the original SWE-bench by evaluating repository-level challenges and scoring performance.

With its multi-language test sets and golden patches, the dataset lets researchers and developers compare how large language models and developer tools perform on real software engineering challenges.

Specifically engineered for evaluating advanced coding and software development, SWE-Bench Dataset supports research in code generation, automated patching, and fixing GitHub issues.

💵 Buy the Dataset: This is a limited preview of the data. To access the full dataset, please contact us at https://unidata.pro to discuss your requirements and pricing options.

Example of the data

Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F27063537%2F6876a1091e5e4e12d330177c6ec3a0e6%2F1.PNG?generation=1759494538704549&alt=media

The dataset provides a robust foundation for achieving higher accuracy in code generation and advancing automated software development tools, which are essential for improving developer productivity and software quality.

🌐 UniData provides high-quality datasets, content moderation, data collection and annotation for your AI/ML projects
Enterprise-Driven Open Source Software
zenodo.org
data.europa.eu
application/gzip
Updated Apr 22, 2020
+ more versions
Cite
Diomidis Spinellis; Zoe Kotti; Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas (2020). Enterprise-Driven Open Source Software [Dataset]. http://doi.org/10.5281/zenodo.3653878
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.3653878
Dataset updated
Apr 22, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Diomidis Spinellis; Zoe Kotti; Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,252 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

The main dataset is provided as a 17,252 record tab-separated file named enterprise_projects.txt with the following 27 fields.

url: the project's GitHub URL

project_id: the project's GHTorrent identifier

sdtc: true if selected using the same domain top committers heuristic (9,006 records)

mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,289 records)

mcve: true if selected using the multiple committers from a probable company heuristic (7,990 records)

star_number: number of GitHub watchers

commit_count: number of commits

files: number of files in current main branch

lines: corresponding number of lines in text files

pull_requests: number of pull requests

most_recent_commit: date of the most recent commit

committer_count: number of different committers

author_count: number of different authors

dominant_domain: the project's dominant email domain

dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain

dominant_domain_author_commits: corresponding number for commit authors

dominant_domain_committers: number of committers whose email matches the project's dominant domain

dominant_domain_authors: corresponding number of commit authors

cik: SEC's EDGAR "central index key"

fg500: true if this is a Fortune Global 500 company (2,232 records)

sec10k: true if the company files SEC 10-K forms (4,178 records)

sec20f: true if the company files SEC 20-F forms (429 records)

project_name: GitHub project name

owner_login: GitHub project's owner login

company_name: company name as derived from the SEC and Fortune 500 data

owner_company: GitHub project's owner company name

license: SPDX license identifier
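Because the main file is tab-separated with boolean selection flags, filtering it needs very little code. The sketch below uses Python's standard csv module on a made-up excerpt. It assumes the file has a header row carrying the field names above and that boolean fields are written as the literal text "true"; neither is confirmed by the description, so both should be checked against the real enterprise_projects.txt. The sample URLs and the helper name select_projects are hypothetical.

```python
import csv
import io

# Hypothetical excerpt; the real enterprise_projects.txt has 27 fields.
# Assumes a header row and boolean flags written as the literal text "true".
SAMPLE = (
    "url\tsdtc\tfg500\tcommit_count\n"
    "https://github.com/example-org/alpha\ttrue\tfalse\t120\n"
    "https://github.com/example-org/beta\tfalse\ttrue\t45\n"
)


def select_projects(tsv_file, flag):
    """Return the rows whose boolean column `flag` is true."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader if row[flag].strip().lower() == "true"]


fg500_projects = select_projects(io.StringIO(SAMPLE), "fg500")
print([row["url"] for row in fg500_projects])
```

The same helper could be called with "sdtc", "mcpc", "mcve", "sec10k", or "sec20f" to select projects by the other flags.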

The file cohort_project_details.txt provides the full set of 309,531 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

url: the project's GitHub URL

project_id: the project's GHTorrent identifier

stars: number of GitHub watchers

commit_count: number of commits
WebAutomation Employee Data | Github Developer Profiles | Global 40M+...
datarade.ai
.json, .csv
Updated Dec 5, 2022
Cite
Webautomation (2022). WebAutomation Employee Data | Github Developer Profiles | Global 40M+ Developer Records | Explore Developer Repositories, Contributions and more [Dataset]. https://datarade.ai/data-products/webautomation-github-developer-profiles-dataset-global-webautomation
Explore at:
Available download formats: .json, .csv
Dataset updated
Dec 5, 2022
Dataset authored and provided by
Webautomation
Area covered
Greenland, Montserrat, Canada, Estonia, Uruguay, Guadeloupe, Suriname, Paraguay, Ukraine, Falkland Islands (Malvinas)
Description
Extensive Developer Coverage: Our employee dataset includes a diverse range of developer profiles from GitHub, spanning various skill levels, industries, and expertise. Access information on developers from all corners of the software development world.

Developer Profiles: Explore detailed developer profiles, including user bios, locations, company affiliations, and skills. Understand developer backgrounds, experiences, and areas of expertise.

Repositories and Contributions: Access information about the repositories created by developers and their contributions to open-source projects. Analyze the projects they've worked on, their coding activity, and the impact they've made on the developer community.

Programming Languages: Gain insights into the programming languages that developers are proficient in. Identify skilled developers in specific programming languages that align with your project needs.

Customizable Data Delivery: The dataset is available in flexible formats, such as CSV, JSON, or API integration, allowing seamless integration with your existing data infrastructure. Customize the data to meet your specific research and analysis requirements.
Data from: Global Fintech Market Dataset
decipherzone.com
csv
Updated Sep 22, 2025
Cite
Decipher Zone (2025). Global Fintech Market Dataset [Dataset]. https://www.decipherzone.com/blog-detail/fintech-software-development
Explore at:
Available download formats: csv
Dataset updated
Sep 22, 2025
Dataset authored and provided by
Decipher Zone
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Dataset of fintech market growth showing $44.7B funding in H1 2025, projected to reach USD 394.88B in 2025 and USD 1,126.64B by 2032 at a CAGR of 16.2%.
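The two projected market sizes and the stated growth rate can be cross-checked with the standard compound annual growth rate formula. The sketch below assumes seven compounding periods between the 2025 and 2032 figures; the function name implied_cagr is illustrative only.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the description, in USD billions.
cagr = implied_cagr(394.88, 1126.64, 2032 - 2025)
print(f"Implied CAGR: {cagr:.1%}")
```

Under that assumption the implied rate comes out at roughly 16.2%, in line with the CAGR quoted in the description.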
An Analysis of Engineering-as-Marketing Tools
kaggle.com
Updated Jan 12, 2023
Cite
The Devastator (2023). An Analysis of Engineering-as-Marketing Tools [Dataset]. https://www.kaggle.com/datasets/thedevastator/an-analysis-of-engineering-as-marketing-tools
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 12, 2023
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
An Analysis of Engineering-as-Marketing Tools

Strategies for Expanding Business Reach

By Ian Greenleigh [source]

About this dataset

The engineering-as-marketing tools available today allow startups to maximize and take advantage of the engineering talents they possess. By creating useful tools such as calculators, widgets and microsites, businesses can get in front of potential customers and lead them to their products or services.

This dataset lists companies that use engineering as a marketing strategy, along with the tools they have built for that purpose. For each company it gives the company name, its product or service, the tool name, what the tool does, and a URL with further information. An additional notes field records details about each company's market or other facts relevant to understanding how the tool is used.

With this data you can examine how well the strategy works and compare the approaches taken in different industry verticals to convert the leads these tools generate.

How to use the dataset

Analyzing this data allows users to gain insights into how successful companies are using engineering-as-marketing techniques to generate leads and expand their customer base. It also provides a valuable resource for other organizations wanting to learn more about how other organizations have achieved success with such practices.

This dataset can be used in many ways such as:

Analyzing different trends in which engineering-as-marketing techniques are being used across multiple industries

Examining whether certain techniques lead to higher lead generation or increased customer base

Comparing effectiveness between companies using different types of tools etc.

To get started, load the CSV file into a data analysis tool that supports CSV, such as Tableau or RStudio, and label each column clearly so that you and your teammates can understand it at a glance. The dataset is then ready for analysis.

Research Ideas

Leveraging this data to understand the effectiveness of engineering-as-marketing for various companies.

Creating a sentiment analysis of customers’ responses to engineering-as-marketing tools in order to determine which tools are most popular and successful.

Analyzing what types of engineering-as-marketing tools have been most successful with specific customer segments, to inform future product development and marketing tactics

License

License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Columns

File: Engineering as Marketing.csv

| Column name | Description |
|:---------------|:-------------------------------------------------------------------|
| Company name | The name of the company. (String) |
| What co does | A brief description of what the company does. (String) |
| Tool name | The name of the engineering-as-marketing tool. (String) |
| What tool does | A brief description of what the tool does. (String) |
| URL | The URL of the engineering-as-marketing tool. (String) |
| Notes | Additional notes about the engineering-as-marketing tool. (String) |

Acknowledgements

If you use this dataset in your research, please credit Ian Greenleigh.
App Developer Data | Engineering Professionals Worldwide Contact Data |...
datarade.ai
Updated Oct 27, 2021
Cite
Success.ai (2021). App Developer Data | Engineering Professionals Worldwide Contact Data | Verified Contact Data for Engineers & IT Managers | Best Price Guaranteed [Dataset]. https://datarade.ai/data-products/app-developer-data-engineering-professionals-worldwide-cont-success-ai
Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
Dataset updated
Oct 27, 2021
Dataset provided by
Area covered
Grenada, Tuvalu, Uganda, Bangladesh, Turkmenistan, Norway, Suriname, Poland, Burkina Faso, Liberia
Description
Success.ai’s B2B Contact Data and App Developer Data for Engineering Professionals Worldwide is a trusted resource for connecting with engineers and technical managers across industries and regions. This dataset draws from over 170 million verified professional profiles, ensuring you have access to high-quality contact data tailored to your business needs. From sales outreach to recruitment, Success.ai enables you to build meaningful relationships with engineering professionals at every level.

Why Choose Success.ai’s Engineering Professionals Data?

Accurate and Comprehensive Contact Information:

Access work emails, direct phone numbers, and LinkedIn profiles of engineers and technical managers globally.

Data is AI-validated, ensuring 99% accuracy for your campaigns.

Global Engineering Coverage:

Includes engineers and technical managers from sectors like manufacturing, IT, construction, aerospace, automotive, and more.

Regions covered include North America, Europe, Asia-Pacific, South America, and the Middle East.

Real-Time Updates:

Continuous updates ensure you stay connected to current roles and decision-makers in engineering.

Compliance and Security:

Fully adheres to GDPR, CCPA, and other global data privacy standards, ensuring legal and ethical use.

Data Highlights:
- 170M+ Verified Professional Profiles: Comprehensive data from various industries, including engineering.
- 50M Work Emails: Accurate and AI-validated for reliable communication.
- 30M Company Profiles: Detailed insights to support targeted outreach.
- 700M Global Professional Profiles: A rich dataset designed to meet diverse business needs.

Key Features of the Dataset:
- Extensive Engineer Profiles: Covers various roles, including mechanical, software, civil, and electrical engineers, as well as engineering managers and directors.
- Customizable Filters: Segment profiles by location, industry, job title, and company size for precise targeting.
- AI-Powered Insights: Enriches profiles with contextual details to support personalization.

Strategic Use Cases:

Sales and Business Development:

Engage directly with engineering professionals to present tailored solutions.

Reach technical decision-makers to accelerate your sales cycles.

Recruitment and Talent Acquisition:

Source skilled engineers and managers for specialized roles.

Use updated profiles to connect with potential candidates effectively.

Targeted Marketing Campaigns:

Launch precision-driven marketing campaigns aimed at engineers and engineering teams.

Personalize outreach with accurate and detailed contact data.

Engineering Services and Solutions:

Pitch your engineering tools, software, or consulting services to professionals who can benefit the most.

Establish connections with managers who influence procurement decisions.

Why Success.ai Stands Out:

Best Price Guarantee: Gain access to high-quality datasets at competitive prices.

Flexible Integration Options: Choose between API access or downloadable formats for seamless integration into your systems.

High Accuracy and Coverage: Benefit from AI-validated contact data for impactful results.

Customizable Datasets: Filter and refine datasets to focus on specific engineering roles, industries, or regions.

APIs for Enhanced Functionality:

Data Enrichment API: Enhance your CRM with verified engineering contact details.

Lead Generation API: Seamlessly integrate new engineering leads into your existing workflow.

Empower your business with B2B Contact Data for Engineering Professionals Worldwide from Success.ai. With verified work emails, phone numbers, and decision-maker profiles, you can confidently target engineers and managers in any sector.

Experience the Best Price Guarantee and unlock the potential of precise, AI-validated datasets. Contact us today and start connecting with engineering leaders worldwide!

No one beats us on price. Period.
Software Market Analysis, Size, and Forecast 2025-2029: North America (US,...
technavio.com
pdf
Updated Feb 21, 2025
Cite
Technavio (2025). Software Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, Italy, and UK), Middle East and Africa (UAE), APAC (China, India, and Japan), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/software-market-industry-analysis
Explore at:
Available download formats: pdf
Dataset updated
Feb 21, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Area covered
Germany, United States, Canada
Description
Software Market Size 2025-2029

The software market size is forecast to increase by USD 30.7 billion, at a CAGR of 8.2% between 2024 and 2029.

The market is experiencing significant growth, driven primarily by the increasing volume of enterprise data and the shift towards cloud computing. Businesses are recognizing the value of leveraging data to gain insights and make informed decisions, leading to a surge in demand for software solutions that can manage and analyze large data sets. Additionally, cloud computing is becoming the preferred deployment model for software, as it offers cost savings, flexibility, and scalability. However, the market also faces challenges that require careful navigation. High costs of licensing and support continue to be a significant obstacle for many organizations, particularly smaller businesses and startups. These costs can limit their ability to implement and maintain the software solutions they need to remain competitive. Furthermore, ensuring data security and privacy in a cloud environment is a major concern, as sensitive information is increasingly being stored and processed digitally. Companies must address these challenges effectively to capitalize on the opportunities presented by the market's growth and remain competitive in the evolving software landscape.

What will be the Size of the Software Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, with dynamic market activities unfolding across various sectors. Entities such as version control systems, software quality assurance, software licensing, API integration, software maintenance, data warehousing, unit testing, project management, database management, cost optimization, and others, are seamlessly integrated into the software development lifecycle. Cloud computing is transforming the way software is deployed and accessed, while user experience remains a key focus for developers. Agile methodologies and the waterfall methodology coexist, with the former gaining popularity for its flexibility and the latter for its structured approach. Data mining and data analytics are increasingly being used to gain insights from vast amounts of data, while software security and bug tracking are essential components of any development process. Machine learning and artificial intelligence are also making their mark, enhancing software functionality and improving user experience. Proprietary software and open source software each have their unique advantages, with CI/CD and DevOps streamlining the development process. Requirements gathering and user acceptance testing are crucial steps in ensuring software meets user needs, while code review and integration testing help maintain software quality. Technical support and software updates are ongoing requirements, with risk management and cost optimization essential for businesses to effectively manage their software investments. Business intelligence and software architecture are critical for making informed decisions and building scalable systems.

How is this Software Industry segmented?

The software industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Type
- Subscriptions
- Identity and access management
- Endpoint/network/messaging/web security
- Risk management

Deployment
- Cloud-based
- On-premises

Sector
- Large enterprises
- Small and medium enterprises

Application
- CRM
- ERP
- Cybersecurity
- Collaboration Tools

Geography
- North America: US, Canada, Mexico
- Europe: France, Germany, Italy, UK
- Middle East and Africa: UAE
- APAC: China, India, Japan
- South America: Brazil
- Rest of World (ROW)

By Type Insights

The subscriptions segment is estimated to witness significant growth during the forecast period. In the ever-evolving market, subscription-based models are gaining significant traction as a key growth driver. This shift is driven by the increasing recognition of the benefits offered by these models, enabling businesses to adapt to their evolving needs. Subscription models provide flexibility, allowing companies to scale their software usage efficiently, adapting to expanding operations or streamlined processes. Additionally, these models promote cost optimization, enabling businesses to spread their software expenses over time, making it a more viable option for organizations of all sizes. The software development lifecycle is undergoing a transformation, with both waterfall and agile methodologies being adopted. Waterfall methodology, with its linear approach, is ideal for projects with well-defined requirements. In contrast, agile methodologies, with their iterative and collaborative nature, are more suitable for projects with evolving requirements. C
Data from: VibeCoding
huggingface.co
Updated Oct 31, 2025
Cite
Quixi AI (2025). VibeCoding [Dataset]. https://huggingface.co/datasets/QuixiAI/VibeCoding
Explore at:
Dataset updated
Oct 31, 2025
Dataset authored and provided by
Quixi AI
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
🪩 VibeCoding Dataset Project

Collecting the vibes of coding — one log at a time.

📢 Call for Volunteers

We’re building an open dataset to capture real-world coding interactions between developers and AI coding assistants — and we need your help! This dataset will help researchers and developers better understand how humans and code models interact across different tools, and improve the future of AI-assisted software development.

🎯 Project Overview

The… See the full description on the dataset page: https://huggingface.co/datasets/QuixiAI/VibeCoding.
Data from: CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java...
zenodo.org
application/gzip, bin
Updated Apr 28, 2025
Cite
Kaihang Jiang; Jin Bihui; Nie Pengyu (2025). CoUpJava: A Dataset of Code Upgrade Histories in Open-Source Java Repositories [Dataset]. http://doi.org/10.5281/zenodo.15293313
Explore at:
Available download formats: bin, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.15293313
Dataset updated
Apr 28, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Kaihang Jiang; Jin Bihui; Nie Pengyu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Modern programming languages are constantly evolving, introducing new language features and APIs to enhance software development practices. Software developers often face the tedious task of upgrading their codebase to new programming language versions. Recently, large language models (LLMs) have demonstrated potential in automating various code generation and editing tasks, suggesting their applicability in automating code upgrade. However, there exists no benchmark for evaluating the code upgrade ability of LLMs, as distilling code changes related to programming language evolution from real-world software repositories’ commit histories is a complex challenge.
In this work, we introduce CoUpJava, the first large-scale dataset for code upgrade, focusing on the code changes related to the evolution of Java. CoUpJava comprises 10,697 code upgrade samples, distilled from the commit histories of 1,379 open-source Java repositories and covering Java versions 7–23. The dataset is divided into two subsets: CoUpJava-Fine, which captures fine-grained method-level refactorings towards new language features; and CoUpJava-Coarse, which includes coarse-grained repository-level changes encompassing new language features, standard library APIs, and build configurations. Our proposed dataset provides high-quality samples by filtering irrelevant and noisy changes and verifying the compilability of upgraded code. Moreover, CoUpJava reveals diversity in code upgrade scenarios, ranging from small, fine-grained refactorings to large-scale repository modifications.
Most popular database management systems worldwide 2024
statista.com
Updated Jun 15, 2024
Cite
Statista (2024). Most popular database management systems worldwide 2024 [Dataset]. https://www.statista.com/statistics/809750/worldwide-popularity-ranking-database-management-systems/
Explore at:
Dataset updated
Jun 15, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Jun 2024
Area covered
Worldwide
Description
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry includes some of the largest companies in tech, such as Microsoft, Oracle, and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.

Database Management Systems

As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world’s growing focus on big data and data analytics, knowledge of SQL has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Job Dataset
kaggle.com
zip
Updated Sep 17, 2023
Cite
Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
Explore at:
zip(479575920 bytes)Available download formats
Dataset updated
Sep 17, 2023
Authors
Ravender Singh Rana
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Job Dataset

This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

Descriptions for each of the columns in the dataset:

Job Id: A unique identifier for each job posting.

Experience: The required or preferred years of experience for the job.

Qualifications: The educational qualifications needed for the job.

Salary Range: The range of salaries or compensation offered for the position.

Location: The city or area where the job is located.

Country: The country where the job is located.

Latitude: The latitude coordinate of the job location.

Longitude: The longitude coordinate of the job location.

Work Type: The type of employment (e.g., full-time, part-time, contract).

Company Size: The approximate size or scale of the hiring company.

Job Posting Date: The date when the job posting was made public.

Preference: Special preferences or requirements for applicants (e.g., Only Male or Only Female, or Both)

Contact Person: The name of the contact person or recruiter for the job.

Contact: Contact information for job inquiries.

Job Title: The job title or position being advertised.

Role: The role or category of the job (e.g., software developer, marketing manager).

Job Portal: The platform or website where the job was posted.

Job Description: A detailed description of the job responsibilities and requirements.

Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).

Skills: The skills or qualifications required for the job.

Responsibilities: Specific responsibilities and duties associated with the job.

Company Name: The name of the hiring company.

Company Profile: A brief overview of the company's background and mission.

Potential Use Cases:

Building predictive models to forecast job market trends.

Enhancing job recommendation systems for job seekers.

Developing NLP models for resume parsing and job matching.

Analyzing regional job market disparities and opportunities.

Exploring salary prediction models for various job roles.

Acknowledgements:

We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

Note:

Please note that the examples provided are fictional and for illustrative purposes. You can tailor the descriptions and examples to match the specifics of your dataset. It is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com
Global Open-Source Database Software Market Size By Product, By Application,...
verifiedmarketresearch.com
Updated Mar 21, 2024
Cite
VERIFIED MARKET RESEARCH (2024). Global Open-Source Database Software Market Size By Product, By Application, By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/open-source-database-software-market/
Explore at:
Dataset updated
Mar 21, 2024
Dataset provided by
Verified Market Researchhttps://www.verifiedmarketresearch.com/
Authors
VERIFIED MARKET RESEARCH
License
https://www.verifiedmarketresearch.com/privacy-policy/
Time period covered
2024 - 2030
Area covered
Global
Description
Open-Source Database Software Market size was valued at USD 10.00 Billion in 2024 and is projected to reach USD 35.83 Billion by 2032, growing at a CAGR of 20% during the forecast period 2026-2032.
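
The quoted figures can be sanity-checked with the standard compound-growth formula. The sketch below is illustrative only: the function names are made up for this example, and the choice of seven annual compounding periods is an assumption, since the report gives a 2024 valuation and a 2026-2032 forecast window without stating which year the compounding starts from.

```python
def project_value(start: float, cagr: float, periods: int) -> float:
    """Compound `start` at annual rate `cagr` for `periods` periods."""
    return start * (1.0 + cagr) ** periods


def implied_cagr(start: float, end: float, periods: int) -> float:
    """Return the constant annual growth rate that takes `start` to `end`."""
    return (end / start) ** (1.0 / periods) - 1.0


# Figures quoted above, in USD billions; seven periods is an assumption.
projected = project_value(10.00, 0.20, 7)
rate = implied_cagr(10.00, 35.83, 7)
print(f"projected: {projected:.2f}, implied CAGR: {rate:.1%}")
```

Under that assumption, seven periods at 20% reproduce the quoted USD 35.83 billion; a different period count would give a different implied rate.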

Global Open-Source Database Software Market Drivers

The market drivers for the Open-Source Database Software Market can be influenced by various factors. These may include:

Cost-Effectiveness: Compared to proprietary systems, open-source databases frequently have lower initial expenses, which attracts organizations, especially startups and small to medium-sized enterprises (SMEs) with tight budgets.

Flexibility and Customization: Open-source databases provide more possibilities for customization and flexibility, enabling businesses to modify the database to suit their unique needs and grow as necessary.

Collaboration and Community Support: Active developer communities that share best practices, support, and contribute to the continued development of open-source databases are beneficial. This cooperative setting can promote quicker problem solving and innovation.

Performance and Scalability: A lot of open-source databases are made to scale horizontally across several nodes, which helps businesses manage expanding data volumes and keep up performance levels as their requirements change.

Data Security and Sovereignty: Open-source databases provide businesses more control over their data and allow them to decide where to store and use it, which helps to allay worries about compliance and data sovereignty. Furthermore, the openness of open-source code can improve security by making it simpler to find and fix problems.

Compatibility with Contemporary Technologies: Open-source databases are well-suited for contemporary application development and deployment techniques like microservices, containers, and cloud-native architectures since they frequently support a broad range of programming languages, frameworks, and platforms.

Growing Cloud Computing Adoption: As more and more organizations move their workloads to the cloud, open-source databases offer a flexible and affordable solution for managing data in cloud environments, whether through self-managed deployments or via managed database services provided by cloud providers.

Escalating Need for Real-Time Insights and Analytics: Organizations are increasingly adopting open-source databases with integrated analytics capabilities, like NoSQL and NewSQL databases, as a means of instantly obtaining actionable insights from their data.
Technical Leverage Dataset for Java Dependencies in Maven
data.europa.eu
data-staging.niaid.nih.gov
unknown
Updated Jan 21, 2021
Cite
Zenodo (2021). Technical Leverage Dataset for Java Dependencies in Maven [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6796849?locale=cs
Explore at:
unknown(2977323)Available download formats
Dataset updated
Jan 21, 2021
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In finance, leverage is the ratio between assets borrowed from others and one's own assets. A matching situation is present in software: by using free open-source software (FOSS) libraries, a developer leverages other people's code to multiply the offered functionality with a much smaller codebase of their own. In finance as in software, leverage magnifies profits when returns from borrowing exceed costs of integration, but it may also magnify losses, in particular in the presence of security vulnerabilities. We aim to understand the level of technical leverage in the FOSS ecosystem and whether it can be a potential source of security vulnerabilities. We also introduce two metrics, change distance and change direction, to capture the amount and the evolution of the dependency on third-party libraries. Our analysis, published in [1], shows that small and medium libraries (less than 100 KLoC) have disproportionately more leverage on FOSS dependencies than large libraries. We show that leverage pays off, as leveraged libraries only add a 4% delay in the time interval between library releases while providing four times more code than their own. However, libraries with such leverage (i.e., 75% of libraries in our sample) also have 1.6 times higher odds of being vulnerable than libraries with lower leverage.

This dataset is the original dataset used in the publication [1]. It includes 8494 distinct library versions from the FOSS Maven-based Java libraries. An online demo for computing the proposed metrics for real-world software libraries is also available under the following URL: https://techleverage.eu/. An executive summary of the results is available as the publication [2]. This work has been funded by the European Union with the project AssureMOSS (https://www.assuremoss.eu).

[1] Massacci, F., & Pashchenko, I. (2021, May). Technical leverage in a software ecosystem: Development opportunities and security risks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 1386-1397). IEEE.
[2] Massacci, F., & Pashchenko, I. (2021). Technical Leverage: Dependencies Are a Mixed Blessing. IEEE Secur. Priv., 19(3), 58-62.
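
By analogy with the financial definition given in the description, technical leverage can be read as the ratio of third-party code to a library's own code. The sketch below illustrates that simplified reading only; the class, field names, and sample numbers are hypothetical, and the precise definition used in publication [1] may include terms not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryVersion:
    name: str
    own_loc: int         # lines of code written by the library's own developers
    dependency_loc: int  # lines of code pulled in through third-party dependencies


def technical_leverage(lib: LibraryVersion) -> float:
    """Ratio of borrowed (dependency) code to own code."""
    if lib.own_loc <= 0:
        raise ValueError("own_loc must be positive")
    return lib.dependency_loc / lib.own_loc


# Hypothetical library, not taken from the dataset.
example = LibraryVersion(name="example-lib", own_loc=20_000, dependency_loc=80_000)
print(technical_leverage(example))  # 4.0
```

A value of 4.0 corresponds to the "four times more code than their own" situation mentioned in the description.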
Database Development and Management Tools Software Market Research Report...
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Cite
Growth Market Reports (2025). Database Development and Management Tools Software Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/database-development-and-management-tools-software-market-global-industry-analysis
Explore at:
pptx, pdf, csvAvailable download formats
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Database Development and Management Tools Software Market Outlook

According to our latest research, the global database development and management tools software market size reached USD 15.8 billion in 2024, reflecting robust demand across diverse sectors. The market is anticipated to expand at a CAGR of 13.2% during the forecast period, propelling the market to an estimated USD 44.2 billion by 2033. This impressive growth is driven by the escalating need for efficient data management, the proliferation of cloud-based solutions, and the increasing complexity of enterprise data environments. As organizations worldwide continue to digitize their operations and harness big data analytics, the demand for advanced database development and management tools software is set to surge.

One of the primary growth factors for the database development and management tools software market is the exponential increase in data volumes generated by businesses, governments, and individuals alike. The digital transformation wave sweeping across industries necessitates robust solutions for storing, organizing, and retrieving vast datasets with high reliability and speed. Organizations are increasingly leveraging data-driven insights to enhance decision-making, optimize operations, and personalize customer experiences. This reliance on data has compelled enterprises to invest in sophisticated database development and management tools that can handle complex queries, streamline data modeling, and ensure data integrity. As a result, both established enterprises and emerging startups are prioritizing investments in this market, further fueling its expansion.

Another significant driver of market growth is the rapid adoption of cloud computing technologies. Cloud-based database management solutions offer unparalleled scalability, flexibility, and cost-effectiveness compared to traditional on-premises systems. With organizations seeking to minimize IT infrastructure costs and improve accessibility, cloud deployment models are gaining substantial traction. This shift is particularly pronounced among small and medium enterprises (SMEs), which benefit from the reduced upfront investment and operational agility provided by cloud solutions. Additionally, the integration of artificial intelligence and machine learning capabilities into database tools is enabling automated performance monitoring, predictive maintenance, and advanced security management, further enhancing the value proposition of these solutions.

The growing emphasis on data security and regulatory compliance is also shaping the trajectory of the database development and management tools software market. With the rising incidence of cyberattacks and stringent data protection regulations such as GDPR, HIPAA, and CCPA, organizations are under pressure to safeguard sensitive information and ensure compliance. Advanced database management tools now incorporate robust security features, including encryption, access controls, and real-time threat detection, to address these concerns. Vendors are continuously innovating to provide end-to-end security management and automated compliance reporting, making their solutions indispensable for businesses operating in highly regulated industries such as BFSI, healthcare, and government.

The role of a Database Management System (DBMS) is becoming increasingly pivotal as organizations strive to manage and leverage their growing data assets effectively. A DBMS provides a systematic way to create, retrieve, update, and manage data, ensuring that data remains consistent, organized, and easily accessible. With the exponential growth in data volumes, the ability to efficiently handle complex queries and transactions has become a cornerstone for businesses aiming to derive actionable insights and maintain a competitive edge. The integration of advanced functionalities such as automated backup, recovery, and real-time analytics within DBMS solutions is further enhancing their appeal, making them indispensable tools in the modern data-driven landscape.

Regionally, North America continues to dominate the market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The presence of leading technology providers, early adoption of digital technologies, and a strong focus on innovation
CodeChat
huggingface.co
Updated Dec 23, 2023
Cite
Suzhen Zhong (2023). CodeChat [Dataset]. https://huggingface.co/datasets/Suzhen/CodeChat
Explore at:
Dataset updated
Dec 23, 2023
Authors
Suzhen Zhong
License
https://choosealicense.com/licenses/odc-by/
Description
CodeChat: Developer–LLM Conversations Dataset

Paper: https://arxiv.org/abs/2509.10402
GitHub: https://github.com/Software-Evolution-Analytics-Lab-SEAL/CodeChat

CodeChat is a large-scale dataset comprising 82,845 real-world developer–LLM conversations, containing 368,506 code snippets generated across more than 20 programming languages, derived from WildChat (a general human–LLM conversation dataset). The dataset enables empirical analysis of how developers interact… See the full description on the dataset page: https://huggingface.co/datasets/Suzhen/CodeChat.
Data from: Embracing the Future: Novice Software Engineers’ Perspective on...
figshare.com
zip
Updated Mar 3, 2024
Cite
emre ilgin; ESRA KIDIMAN; Murat YILMAZ; Filiz Mumcu (2024). Embracing the Future: Novice Software Engineers’ Perspective on the Rise of Hybrid Work Models in a Post-Pandemic World [Dataset]. http://doi.org/10.6084/m9.figshare.25331593.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25331593.v1
Dataset updated
Mar 3, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
emre ilgin; ESRA KIDIMAN; Murat YILMAZ; Filiz Mumcu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
World
Description
Perspectives of novice software engineers (NSEs) regarding hybrid work, examining their views on hybrid work conditions and their experiences with hybrid tools.
Geographic Diversity in Public Code Contributions — Replication Package
data.niaid.nih.gov
Updated Mar 31, 2022
Cite
Davide Rossi; Stefano Zacchiroli (2022). Geographic Diversity in Public Code Contributions — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6390354
Explore at:
Dataset updated
Mar 31, 2022
Dataset provided by
University of Bologna, Italy
LTCI, Télécom Paris, Institut Polytechnique de Paris
Authors
Davide Rossi; Stefano Zacchiroli
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Geographic Diversity in Public Code Contributions - Replication Package

This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471

This document comes with the software needed to mine and analyze the data presented in the paper.

Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2

Initial data

swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

names.tab - forenames and surnames per country with their frequency

zones.acc.tab - countries/territories, timezones, population and world zones

c_c.tab - ccTLD entities - world zones matches

Data preparation

Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst

sh> ./export.sh

Run the authors cleanup script to create authors--clean.csv.zst

sh> ./cleanup.sh authors.csv.zst

Filter out implausible names and create authors--plausible.csv.zst

sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

Zone detection by email

Run the email detection script to create author-country-by-email.tab.zst

sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
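
As an illustration of the general idea behind email-based zone detection, the sketch below guesses a country from the country-code top-level domain (ccTLD) of an address. It is not the package's guess_country_by_email.py: the function name and the tiny mapping table are hypothetical, and the real script's logic and options (such as -f 3) are not reproduced here.

```python
from typing import Optional

# Tiny illustrative mapping; a real table would cover all ccTLDs.
CCTLD_TO_COUNTRY = {"it": "Italy", "fr": "France", "de": "Germany", "jp": "Japan"}


def guess_country(email: str) -> Optional[str]:
    """Guess a country from the last label of the email domain, if it is a known ccTLD."""
    _, sep, domain = email.rpartition("@")
    if not sep or "." not in domain:
        return None
    tld = domain.rsplit(".", 1)[1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


print(guess_country("mario@example.it"))  # Italy
print(guess_country("dev@example.com"))   # None
```

Generic domains such as .com carry no country signal, which is presumably one reason the replication steps complement this with name-based detection.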

Database creation and initial data ingestion

Create the PostgreSQL DB

sh> createdb zones-commit

Notice that from now on when prepending the psql> prompt we assume the execution of psql on the zones-commit database.

Import data into PostgreSQL DB

sh> ./import_data.sh

Zone detection by name

Extract commits data from the DB and create commits.tab, which is used as input for the zone detection script

sh> psql -f extract_commits.sql zones-commit

Run the world zone detection script to create commit_zones.tab.zst

sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst

Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

Ingest zones assignment data into the DB

psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
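
The shell pipeline inside the \copy command keeps the first and sixth tab-separated fields of each line and drops lines that end in whitespace, which removes rows whose sixth field is empty. A rough Python equivalent of that filter is sketched below; the function name and the sample rows are hypothetical, and the meaning of the columns is an assumption.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the grep call: a whitespace character at end of line.
TRAILING_WS = re.compile(r"\s$")


def select_and_filter(lines: Iterable[str]) -> Iterator[str]:
    """Keep fields 1 and 6 of each tab-separated line, dropping lines that end in whitespace."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Equivalent of `cut -f1,6`: fields are 1-based in cut, 0-based here.
        picked = "\t".join(f for i, f in enumerate(fields) if i in (0, 5))
        # Equivalent of `grep -Ev` with the trailing-whitespace pattern.
        if not TRAILING_WS.search(picked):
            yield picked


# Hypothetical rows: an identifier, four intermediate fields, and a zone.
rows = ["c1\ta\tb\tc\td\tEurope\n", "c2\ta\tb\tc\td\t\n"]
print(list(select_and_filter(rows)))  # ['c1\tEurope']
```

The second row is discarded because its sixth field is empty, leaving a trailing tab after the first field.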

Extraction and graphs

Run the script to execute the queries to extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).

sh> ./extract_data.sh

Run the script to create the graphs from all the previously extracted tabfiles.

sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
Generative AI In Coding Market Analysis, Size, and Forecast 2025-2029: North...
technavio.com
pdf
Updated Jul 26, 2025
Cite
Technavio (2025). Generative AI In Coding Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/generative-ai-in-coding-market-industry-analysis
Explore at:
pdfAvailable download formats
Dataset updated
Jul 26, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Area covered
United States
Description
Generative AI In Coding Market Size 2025-2029

The generative AI in coding market size is forecast to increase by USD 10.22 billion, at a CAGR of 32.7% between 2024 and 2029.

The market is experiencing significant growth, driven by growing demand for higher developer productivity and faster innovation cycles. Companies are recognizing the potential of generative AI to automate coding tasks, reducing the time and effort required for software development. However, this shift towards AI-driven coding is not without challenges. Concerns about security, accuracy, and intellectual property are key obstacles to the adoption of generative AI in coding. Ensuring the security of code generated by AI is essential, as any vulnerabilities could lead to significant risks. Semantic reasoning and predictive analytics are transforming decision making, while AI-powered chatbots and virtual assistants enhance customer service. Lastly, addressing intellectual property concerns is necessary to ensure ownership and control over the generated code. As the market continues to evolve, companies must adapt to these challenges and focus on integrating generative AI into enterprise platforms rather than relying on individual tools. By doing so, they can mitigate risks, improve efficiency, and drive innovation in their software development processes. Overall, the market presents significant opportunities for businesses seeking to streamline their development processes and stay competitive in the rapidly evolving tech landscape. Real-time anomaly detection and latency reduction techniques are critical for maintaining the reliability and accuracy of these systems.

What will be the Size of the Generative AI In Coding Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.

The market for generative AI in coding continues to evolve, with applications spanning various sectors including finance, healthcare, and manufacturing. Deployment scalability and model performance benchmarking are critical factors as organizations seek to optimize their AI models. Training dataset size plays a significant role in model accuracy, with larger datasets often leading to improved results. Ethical AI considerations, such as model explainability and fairness metrics, are increasingly important as AI becomes more prevalent in business operations. One example of the market's dynamic nature can be seen in the use of code readability assessment and accuracy measurements in software development. Model bias, data privacy, and data security remain critical concerns.

By analyzing code complexity and vulnerability detection, organizations can improve code quality and reduce the risk of security flaws. Neural network training and model fine-tuning are ongoing processes, with AI models requiring continuous updates to maintain optimal performance. According to recent industry reports, the generative AI market in coding is expected to grow by over 25% annually in the coming years, driven by advancements in explainable AI, bias mitigation strategies, and the increasing demand for more efficient and accurate coding solutions. Additionally, techniques such as data augmentation, AUC calculation, and ROC curve analysis are becoming increasingly important for improving model performance and reducing the need for large training datasets.

How is this Generative AI In Coding Market segmented?

The generative AI in coding market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Application: Code generation, Code enhancement, Language translation, Code reviews

End-user: Data science and analytics, Web and application development, Game development and design, IoT and smart devices, Others

Type: Python, JavaScript, Java, Others

Geography: North America (US, Canada, Mexico); Europe (France, Germany, UK); APAC (China, India, Japan, South Korea); Rest of World (ROW)

By Application Insights

The Code generation segment is estimated to witness significant growth during the forecast period. The market is witnessing significant advancements in automating software development processes. Code generation AI, a key segment, automates the creation of new source code from user inputs, addressing the time-consuming aspect of writing boilerplate or repetitive code. This technology has evolved from simple code completions to generating complex functions, classes, and even entire application scaffolds. Integration with version control systems and IDEs, such as GitHub Copilot, enhances developer productivity. Program synthesis
Open Poetry Vision Object Detection Dataset - 512x512
public.roboflow.com
zip
Updated Apr 7, 2022
Cite
Brad Dwyer (2022). Open Poetry Vision Object Detection Dataset - 512x512 [Dataset]. https://public.roboflow.com/object-detection/open-poetry-vision/1
Explore at:
zipAvailable download formats
Dataset updated
Apr 7, 2022
Dataset authored and provided by
Brad Dwyer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Bounding Boxes of text
Description
Overview

The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.

It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.

Example Image: https://i.imgur.com/sZT516a.png

Use Cases

A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.

Alternatively, you could try your hand at using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.

Using this Dataset

Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

Version 5 of this dataset (classes_all_text-raw-images) has all classes remapped to be labeled as "text." This was accomplished by using Modify Classes as a preprocessing step.

Version 6 of this dataset (classes_all_text-augmented-FAST) has all classes remapped to be labeled as "text." and was trained with Roboflow's Fast Model.

Version 7 of this dataset (classes_all_text-augmented-ACCURATE) has all classes remapped to be labeled as "text." and was trained with Roboflow's Accurate Model.

Introducing the New Roboflow Train

What to Think About When Choosing Model Sizes

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.

Cite

Palvinder (2024). Stack Overflow Developer Survey Dataset [Dataset]. https://www.kaggle.com/datasets/palvinder2006/stackoverflow

Stack Overflow Developer Survey Dataset

Data from world's largest and most trusted community of software developers.

Explore at:

zip(9459089 bytes)Available download formats

Dataset updated

Jan 8, 2024

Authors

Palvinder

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Overview

The Stack Overflow Developer Survey Dataset represents one of the most trusted and comprehensive sources of information about the global developer community. Collected by Stack Overflow through its annual survey, the dataset provides insights into the demographics, preferences, habits, and career paths of developers.

This dataset is frequently used for:
- Analyzing trends in programming languages, tools, and technologies.
- Understanding developer job satisfaction, compensation, and work environments.
- Studying global and regional differences in developer demographics and experience.

The data consists of two CSV files: "survey_results_public", which contains the survey responses, and "survey_results_schema", which describes each column in detail.

Data Dictionary: All the details are in "survey_results_schema.csv"
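As a minimal sketch of how the two files fit together, the snippet below builds a column-to-description lookup from the schema file using only Python's standard library. The schema header names used here (`Column`, `QuestionText`) are assumptions, since they are not stated on this page and may differ between survey years; check the first row of "survey_results_schema.csv" and adjust the arguments accordingly. The in-memory rows are illustrative stand-ins, not real survey data.

```python
import csv
import io


def load_schema(schema_file, name_field="Column", text_field="QuestionText"):
    """Return {column name: description} from an open schema CSV.

    name_field and text_field are assumed header names, not confirmed
    by the dataset page; pass the ones your copy of the schema uses.
    """
    reader = csv.DictReader(schema_file)
    return {row[name_field]: row[text_field] for row in reader}


def describe_columns(data_file, schema):
    """Pair each column of the results CSV with its schema description."""
    header = next(csv.reader(data_file))
    return [(col, schema.get(col, "(no description found)")) for col in header]


# Tiny in-memory stand-ins for the two CSV files (illustrative values only).
demo_schema = io.StringIO(
    "Column,QuestionText\n"
    "Respondent,Unique identifier\n"
    "Country,Where the respondent lives\n"
)
demo_data = io.StringIO("Respondent,Country,Age\n1,India,25\n")

schema = load_schema(demo_schema)
for column, description in describe_columns(demo_data, schema):
    print(f"{column}: {description}")
```

With the real files, the `io.StringIO` objects would be replaced by `open("survey_results_schema.csv", newline="", encoding="utf-8")` and the matching call for the results file.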

Features of the Stack Overflow Developer Survey Dataset

Demographic & Background Information
- Respondent: A unique identifier for each survey participant.
- MainBranch: Describes whether the respondent is a professional developer, student, hobbyist, etc.
- Country: The country where the respondent lives.
- Age: The respondent's age.
- Gender: The gender identity of the respondent.
- Ethnicity: Ethnic background (when available).
- EdLevel: The highest level of formal education completed.
- UndergradMajor: The respondent's undergraduate major.
- Hobbyist: Indicates whether the person codes as a hobby (Yes/No).

Employment & Professional Experience
- Employment: Employment status (full-time, part-time, unemployed, student, etc.).
- DevType: Types of developer roles the respondent identifies with (e.g., Web Developer, Data Scientist).
- YearsCode: Number of years the respondent has been coding.
- YearsCodePro: Number of years coding professionally.
- JobSat: Job satisfaction level.
- CareerSat: Career satisfaction level.
- WorkWeekHrs: Approximate hours worked per week.
- RemoteWork: Whether the respondent works remotely and how frequently.

Compensation
- CompTotal: Total compensation in USD (including salary, bonuses, etc.).
- CompFreq: Frequency of compensation (e.g., yearly, monthly).

Learning & Education
- LearnCode: How the respondent first learned to code (e.g., online courses, university).
- LearnCodeOnline: Online resources used (e.g., YouTube, freeCodeCamp).
- LearnCodeCoursesCert: Whether the respondent has taken online courses or earned certifications.

Technology & Tools
- LanguageHaveWorkedWith: Programming languages the respondent has used.
- LanguageWantToWorkWith: Languages the respondent is interested in learning or using more.
- DatabaseHaveWorkedWith: Databases the respondent has experience with.
- PlatformHaveWorkedWith: Platforms used (e.g., Linux, AWS, Android).
- OpSys: The operating system used most often.
- NEWCollabToolsHaveWorkedWith: Collaboration tools used (e.g., Slack, Teams, Zoom).
- NEWStuck: How often the respondent feels stuck when coding.
- ToolsTechHaveWorkedWith: Frameworks and technologies respondents have worked with.

Online Presence & Community
- SOAccount: Whether the respondent has a Stack Overflow account.
- SOPartFreq: How often the respondent participates on Stack Overflow.
- SOVisitFreq: Frequency of visiting Stack Overflow.
- SOComm: Whether the respondent feels welcome in the Stack Overflow community.
- OpenSourcer: Level of involvement in open-source contributions.

Opinions & Preferences
- WorkChallenge: Challenges faced at work (e.g., unclear requirements, unrealistic expectations).
- JobFactors: Important job factors (e.g., salary, work-life balance, technologies used).
- MentalHealth: Questions on how mental health affects or is affected by their job.
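Columns such as LanguageHaveWorkedWith hold several answers per respondent. A minimal sketch for tallying them follows, assuming the answers are packed into one cell separated by semicolons and that missing values appear as "NA"; both are assumptions based on how Stack Overflow's published survey files are commonly laid out, so verify them against your copy. The helper name `count_multi_select` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_multi_select(data_file, column, sep=";"):
    """Count how often each option appears in a multi-answer column.

    sep=";" is an assumption about how answers are packed into a cell.
    Empty cells and the literal "NA" are treated as missing and skipped.
    """
    counts = Counter()
    for row in csv.DictReader(data_file):
        cell = (row.get(column) or "").strip()
        if not cell or cell == "NA":
            continue
        counts.update(part.strip() for part in cell.split(sep) if part.strip())
    return counts


# Made-up rows standing in for survey_results_public.csv.
demo = io.StringIO(
    "Respondent,LanguageHaveWorkedWith\n"
    "1,Python;SQL\n"
    "2,Python;Rust\n"
    "3,NA\n"
)
language_counts = count_multi_select(demo, "LanguageHaveWorkedWith")
for language, n in language_counts.most_common():
    print(language, n)
```

The same helper could be pointed at DatabaseHaveWorkedWith or PlatformHaveWorkedWith, provided those columns follow the same separator convention.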
