Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a dataset of open source software developed mainly by enterprises rather than volunteers. It can be used to address known generalizability concerns and to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
The main dataset is provided as a 17,264 record tab-separated file named enterprise_projects.txt with the following 29 fields.
url: the project's GitHub URL
project_id: the project's GHTorrent identifier
sdtc: true if selected using the same domain top committers heuristic (9,016 records)
mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,314 records)
mcve: true if selected using the multiple committers from a probable company heuristic (8,015 records)
star_number: number of GitHub watchers
commit_count: number of commits
files: number of files in current main branch
lines: corresponding number of lines in text files
pull_requests: number of pull requests
github_repo_creation: timestamp of the GitHub repository creation
earliest_commit: timestamp of the earliest commit
most_recent_commit: date of the most recent commit
committer_count: number of different committers
author_count: number of different authors
dominant_domain: the project's dominant email domain
dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain
dominant_domain_author_commits: corresponding number for commit authors
dominant_domain_committers: number of committers whose email matches the project's dominant domain
dominant_domain_authors: corresponding number for commit authors
cik: SEC's EDGAR "central index key"
fg500: true if this is a Fortune Global 500 company (2,233 records)
sec10k: true if the company files SEC 10-K forms (4,180 records)
sec20f: true if the company files SEC 20-F forms (429 records)
project_name: GitHub project name
owner_login: GitHub project's owner login
company_name: company name as derived from the SEC and Fortune 500 data
owner_company: GitHub project's owner company name
license: SPDX license identifier
The file cohost_project_details.txt provides the full set of 311,223 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.
url: the project's GitHub URL
project_id: the project's GHTorrent identifier
stars: number of GitHub watchers
commit_count: number of commits
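The field lists above can be put to work with a few lines of Python. The following is a minimal sketch, assuming the tab-separated files start with a header row whose column names match the field names listed here; the sample records and the helper names `load_projects` and `insider_commit_share` are hypothetical and only illustrate the layout.

```python
import csv
import io


def load_projects(stream):
    """Read tab-separated project records into a list of dicts.

    Assumes the first row is a header with the documented field names.
    """
    return list(csv.DictReader(stream, delimiter="\t"))


def insider_commit_share(row):
    """Fraction of a project's commits made by dominant-domain committers."""
    total = int(row["commit_count"])
    if total == 0:
        return 0.0
    return int(row["dominant_domain_committer_commits"]) / total


# Hypothetical two-record sample using a subset of the documented fields.
SAMPLE = (
    "url\tproject_id\tcommit_count\tdominant_domain\t"
    "dominant_domain_committer_commits\n"
    "https://github.com/example/a\t1\t200\texample.com\t150\n"
    "https://github.com/example/b\t2\t80\texample.org\t80\n"
)

projects = load_projects(io.StringIO(SAMPLE))
shares = {p["url"]: insider_commit_share(p) for p in projects}
```

For the real data, the same function could be called on a file opened with `open("enterprise_projects.txt", newline="", encoding="utf-8")`, provided the header assumption holds.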
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The presentation explains in the simplest possible way what you need to know about open source licenses when starting from scratch. It also sums up the course "Open Source Licensing Basics for Software Developers (LFC191)" (Linux Foundation)
The development of business products and services underpinned by open source (OS) software and digital infrastructure is widespread. This raises important questions about how that work is resourced and managed within and beyond organisational boundaries. Our study explored how open source is located within organisations from both public and commercial sectors and the implications of this for work practices and organisational models. It aimed to provide valuable insights into both the sustainability of OS digital infrastructure and how digital technologies are transforming work in diverse ways. We set out to understand where, why and how organisations from different sectors develop open source software and digital infrastructure as part of their delivery of products and services, and how staff work to develop products and services and maintain OS digital infrastructure within and beyond organisational boundaries.
Our methods involved qualitative interviews to explore different experiences of open source. Innovative web-scraping, online research and snowball techniques were used to purposively identify organisations that were aware of and using open source. We conducted 20 online interviews with staff from organisations in four broad fields across the public and commercial sectors: global technology corporations, UK public sector (local government), UK higher education, and open source first companies. Interviewees were mostly senior technical staff managing the development of products and services. Key informant interviews were conducted with those in open source community and policy roles. Transcripts were pseudonymised, imported into NVivo and coded thematically using both inductive and deductive codes.
Our key findings in the first stage of analysis focused on providing a comparative picture of the four groups of organisations, the location of open source, its role in the delivery of products and services and organisational infrastructure, how open source was used and maintained, and where contributions were made to communities. Emerging themes indicated the embedding of open source in the commercial global technology industry, with various structures set up internally to support communities and manage licensing and contributions. OS first organisations had put open development at the heart of their mission, creating innovative practices and organisational structures to facilitate community support and contribution. In the public sector, open source was used in an ad hoc way by universities and local authorities, but increasingly off-the-shelf products with support packages had taken precedence as a result of resourcing crises and concerns about risk, compatibility and disruption caused by implementation. Further analysis will look in detail at structures and models that facilitate or prevent work beyond organisational boundaries and the implications of these new ways of working for the future of work. As such, the research contributes to Digit’s goal of understanding how digital technologies are transforming work and the theme of employers’ and employees’ experiences of digital work across sectors.
The data collection consists of 17 interview transcripts with workers in four industries: UK higher education, global technology corporations, UK public sector bodies and open source first organisations.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Electronic commerce and technology, use of open source software by North American Industry Classification System (NAICS), for Canada from 2005 to 2007. (Terminated)
This dataset lists out all software in use by NASA.
https://choosealicense.com/licenses/cc0-1.0/
The Dark Side of Openness: How Open Source Data Can Be Abused to Harm Human Life
First draft partially generated using Perplexity AI, then written and edited manually. Introduction Open-source data—the vast troves of information freely available to the public—has transformed how we innovate, collaborate, and solve problems. From scientific research to civic technology, the benefits are clear. However, the same openness that drives progress can also create serious risks. When… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/open-source-data-abuse.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thanks to a variety of software services, it has never been easier to produce, manage and publish Linked Open Data. But until now, there has been a lack of an accessible overview to help researchers make the right choice for their use case. This dataset release will be regularly updated to reflect the latest data published in a comparison table developed in Google Sheets [1]. The comparison table includes the most commonly used LOD management software tools from NFDI4Culture to illustrate what functionalities and features a service should offer for the long-term management of FAIR research data, including:
The table presents two views based on a comparison system of categories developed iteratively during workshops with expert users and developers from the respective tool communities. First, a short overview with field values coming from controlled vocabularies and multiple-choice options; and a second sheet allowing for more descriptive free text additions. The table and corresponding dataset releases for each view mode are designed to provide a well-founded basis for evaluation when deciding on a LOD management service. The Google Sheet table will remain open to collaboration and community contribution, as well as updates with new data and potentially new tools, whereas the datasets released here are meant to provide stable reference points with version control.
The research for the comparison table was first presented as a paper at DHd2023, Open Humanities – Open Culture, 13-17.03.2023, Trier and Luxembourg [2].
[1] Non-editing access is available here: docs.google.com/spreadsheets/d/1FNU8857JwUNFXmXAW16lgpjLq5TkgBUuafqZF-yo8_I/edit?usp=share_link To get editing access contact the authors.
[2] Full paper will be made available open access in the conference proceedings.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
https://dataintelo.com/privacy-and-policy
The global open source database market size was valued at approximately USD 15.5 billion in 2023 and is projected to reach around USD 40.6 billion by 2032, expanding at a compound annual growth rate (CAGR) of 11.5% during the forecast period. The growth of this market is primarily driven by the increasing adoption of open-source databases by both SMEs and large enterprises due to their cost-effectiveness and flexibility.
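The growth figures quoted above can be checked with the standard compound-growth formula. The sketch below assumes annual compounding over nine periods (2023 to 2032), which the report does not state explicitly; the function names are illustrative.

```python
def project_value(start, rate, years):
    """Value after compounding `start` at annual `rate` for `years` periods."""
    return start * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` periods."""
    return (end / start) ** (1 / years) - 1


PERIODS = 2032 - 2023  # assumption: nine annual compounding periods

# Figures in USD billions, as quoted in the report.
projected_2032 = project_value(15.5, 0.115, PERIODS)
rate_from_endpoints = implied_cagr(15.5, 40.6, PERIODS)

print(f"Projected 2032 value at 11.5% CAGR: {projected_2032:.1f} billion")
print(f"CAGR implied by 15.5 -> 40.6: {rate_from_endpoints:.1%}")
```

Under these assumptions the 11.5% rate yields roughly USD 41.3 billion, and the two endpoints imply a rate of about 11.3%, both in line with the report's rounded figures.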
A significant growth factor for the open source database market is the rising demand for data analytics and business intelligence across various industries. Organizations are increasingly leveraging big data to gain actionable insights, enhance decision-making processes, and improve operational efficiency. Open source databases provide the scalability and performance required to handle large volumes of data, making them an attractive option for businesses looking to maximize their data-driven strategies. Additionally, the continuous advancements and contributions from the open-source community help in keeping these databases at the cutting edge of technology.
Another driving factor is the cost-efficiency associated with open-source databases. Unlike proprietary databases, which can be expensive due to licensing fees, open-source databases are usually free to use, offering a significant cost advantage. This factor is especially crucial for small and medium enterprises (SMEs), which often operate with limited budgets. The lower total cost of ownership, combined with the flexibility to customize the database according to specific needs, makes open-source solutions highly appealing for businesses of all sizes.
The increasing trend of digital transformation is also playing a crucial role in the growth of the open source database market. As businesses across various sectors accelerate their digital initiatives, the need for robust, scalable, and efficient data management solutions becomes paramount. Open-source databases provide the agility and innovation that organizations require to keep up with the rapidly changing digital landscape. Moreover, the support for cloud deployment further enhances their appeal, providing businesses with the scalability and flexibility needed to adapt to evolving technological demands.
From a regional perspective, North America holds a significant share in the open source database market, driven by the presence of major technology companies and a highly developed IT infrastructure. The region's focus on technological innovation and early adoption of advanced technologies contributes to its dominant position. Europe follows closely, with increasing investments in digital transformation initiatives. The Asia Pacific region is expected to witness the highest growth rate during the forecast period, fueled by rapid technological advancements, a burgeoning IT sector, and increased adoption of open-source solutions by businesses.
Relational database software plays a crucial role in the open-source database market, offering structured data management solutions that are essential for various business applications. These databases are known for their ability to handle complex queries and transactions, making them ideal for industries that require high levels of data integrity and consistency. The flexibility and robustness of relational database software allow organizations to efficiently manage large volumes of structured data, which is critical for applications such as financial systems, enterprise resource planning, and customer relationship management. As businesses continue to prioritize data-driven decision-making, the demand for relational database software is expected to grow, further driving the expansion of the open-source database market.
The open source database market is segmented into SQL, NoSQL, and NewSQL databases. SQL databases are the most widely used and have been the backbone of data management for decades. They offer robust transaction management and are ideal for structured data storage and retrieval. The ongoing improvements in SQL databases, such as enhanced performance and security features, continue to make them a preferred choice for many organizations. Additionally, the availability of various SQL-based open-source solutions like MySQL, PostgreSQL, and MariaDB provides organizations with reliable options to manage their data effectively.
NoSQL databases are gainin
This study provides an evidence-based understanding on etiological issues related to school shootings and rampage shootings. It created a national, open-source database that includes all publicly known shootings that resulted in at least one injury that occurred on K-12 school grounds between 1990 and 2016. The investigators sought to better understand the nature of the problem and clarify the types of shooting incidents occurring in schools, provide information on the characteristics of school shooters, and compare fatal shooting incidents to events where only injuries resulted to identify intervention points that could be exploited to reduce the harm caused by shootings. To accomplish these objectives, the investigators used quantitative multivariate and qualitative case studies research methods to document where and when school violence occurs, and highlight key incident and perpetrator level characteristics to help law enforcement and school administrators differentiate between the kinds of school shootings that exist, to further policy responses that are appropriate for individuals and communities.
A JSON file that is used to build the content on code.nasa.gov. It contains names, descriptions, links, and keyword tags for all NASA open-sourced code projects released through the SRA (Software Release Authority) and available on code.nasa.gov. It was last updated in August 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Data Open Source is a dataset for object detection tasks - it contains Pest annotations for 476 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset lists the dependencies of the repositories contributed by the public sector in Luxembourg. The data has been crawled with codegouvfr-fetch-data. If you wish to contribute to this dataset, feel free to contribute to the following GitHub project via issues or pull requests: Open Source Software contributed by the Public sector in Luxembourg, a list of organization accounts
https://www.datainsightsmarket.com/privacy-policy
The open-source tools market is experiencing robust growth, driven by increasing demand for cost-effective, flexible, and customizable solutions across diverse sectors. The market, encompassing tools for data cleaning, visualization, mining, and applications like machine learning, natural language processing, and computer vision, is projected to witness substantial expansion over the forecast period (2025-2033). Factors such as the rising adoption of cloud computing, the growing need for data-driven decision-making, and the increasing preference for collaborative development models are key drivers. While the specific CAGR isn't provided, a conservative estimate based on industry trends suggests a compound annual growth rate of around 15-20% is realistic for the period. This growth is anticipated across all segments, with the data science and machine learning sectors exhibiting particularly strong performance. Geographic expansion is also a prominent trend, with North America and Europe leading the market initially, followed by a significant increase in adoption across Asia Pacific and other regions as digital transformation initiatives accelerate. However, challenges remain. Security concerns surrounding open-source software and the need for robust support and maintenance infrastructure could potentially restrain market growth. Nevertheless, ongoing improvements in security protocols and the burgeoning community support surrounding many open-source projects are mitigating these challenges. The diverse range of applications and tool types within the open-source market ensures its versatility. Universal tools, catering to broad needs, and specialized tools like data visualization and mining software are all experiencing increased demand. The presence of established players like IBM and Oracle alongside a large community of contributors ensures a dynamic market ecosystem. 
The continued development of innovative tools, improved documentation, and enhanced community support are expected to further fuel market growth, making open-source solutions increasingly attractive to businesses of all sizes. Specific segmentation data, while not explicitly provided, shows a spread across applications indicating a healthy, diversified market that is expected to evolve rapidly within the forecast period.
https://www.archivemarketresearch.com/privacy-policy
The open-source data labeling tool market is experiencing robust growth, driven by the increasing demand for high-quality training data in machine learning and artificial intelligence applications. The market's expansion is fueled by several key factors: the rising adoption of AI across various industries, the need for cost-effective data annotation solutions, and the growing preference for flexible and customizable tools. While precise market sizing data is unavailable, considering the substantial growth in the broader data annotation market and the increasing popularity of open-source solutions, we can reasonably estimate the 2025 market size to be approximately $500 million. This signifies a significant opportunity for providers of open-source tools, particularly those offering innovative features and strong community support. Assuming a conservative Compound Annual Growth Rate (CAGR) of 25% for the forecast period (2025-2033), the market is projected to reach approximately $4.8 billion by 2033. This growth trajectory is supported by the continuous advancements in AI and the ever-increasing volume of data requiring labeling. Several challenges restrain market growth, including the need for specialized technical expertise to effectively implement and manage open-source tools, and the potential for inconsistencies in data quality compared to commercial solutions. However, the inherent advantages of open-source tools—cost-effectiveness, customization, and community-driven improvements—are expected to outweigh these challenges. The increasing availability of user-friendly interfaces and pre-trained models is further enhancing the accessibility and appeal of open-source solutions. The market segmentation encompasses various tool types based on functionality and applications (image annotation, text annotation, video annotation etc.), deployment models (cloud-based, on-premise), and target industries (healthcare, automotive, finance etc.). 
Leading players are continuously enhancing their offerings, fostering community engagement, and expanding their service portfolios to capitalize on this expanding market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ to traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.
results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
notebooks_out.tar.gz: Tables and figures generated by notebooks.
source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.
The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
Preprint: https://arxiv.org/abs/2007.08978
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
datasets metrics
This dataset contains metrics about the huggingface/datasets package. Number of repositories in the dataset: 4997 Number of packages in the dataset: 215
Package dependents
This contains the data available in the used-by tab on GitHub.
Package & Repository star count
This section shows the package and repository star count, individually.
Package Repository
There are 22 packages that have more than 1000 stars. There are 43… See the full description on the dataset page: https://huggingface.co/datasets/open-source-metrics/datasets-dependents.
https://www.marketreportanalytics.com/privacy-policy
The open-source data acquisition (DAQ) instrument market, currently valued at $545 million in 2025, is projected to experience robust growth, fueled by a Compound Annual Growth Rate (CAGR) of 5.5% from 2025 to 2033. This growth is driven by several key factors. The increasing demand for customizable and cost-effective data acquisition solutions across diverse sectors like research, education, and industrial automation is a significant driver. Open-source DAQ instruments offer flexibility and community support, allowing users to adapt them to specific needs and integrate them seamlessly into existing workflows. Furthermore, the rising adoption of Internet of Things (IoT) devices and the need for real-time data processing are contributing to market expansion. The availability of readily accessible software libraries and extensive online resources further enhances the accessibility and appeal of these instruments, making them attractive alternatives to expensive proprietary solutions. Companies like OpenBCI, Red Pitaya, LabJack, Arduino, National Instruments, and ADLINK Technology are key players shaping the market landscape, each contributing unique features and functionalities to this dynamic sector. The market segmentation is likely diverse, with variations based on hardware capabilities (e.g., sampling rate, number of channels, input types), software interfaces (e.g., Python, MATLAB, LabVIEW), and application-specific configurations (e.g., biosignal processing, environmental monitoring). Geographic distribution will also play a crucial role; we anticipate stronger growth in regions with burgeoning technological advancements and a high concentration of research institutions and industrial automation sectors. Restraints on market growth might include the need for users to possess a reasonable level of technical expertise for setup and configuration, and the potential for variations in device quality among different open-source manufacturers. 
Nonetheless, the overall trend points toward sustained and significant growth for the open-source DAQ instrument market over the next decade.
Open Source Application Development Portal (OSADP). The system provides a place for programmers to share software code and solutions.
These data and code successfully reproduce nearly all cross-sectional stock return predictors. The 319 characteristics draw from previous meta-studies, but authors differ by comparing their t-stats to the original papers' results. For the 161 characteristics that were clearly significant in the original papers, 98% of their long-short portfolios find t-stats above 1.96. For the 44 characteristics that had mixed evidence, authors' reproductions find t-stats of 2 on average. A regression of reproduced t-stats on original long-short t-stats finds a slope of 0.90 and an R² of 83%. Mean returns are monotonic in predictive signals at the characteristic level. The remaining 114 characteristics were insignificant in the original papers or are modifications of the originals created by Hou, Xue, and Zhang (2020). These remaining characteristics are almost always significant if the original characteristic was also significant.