Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Data Open Source is an object detection dataset. It contains Pest annotations for 476 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thanks to a variety of software services, it has never been easier to produce, manage and publish Linked Open Data (LOD). Until now, however, researchers have lacked an accessible overview to help them make the right choice for their use case. This dataset release will be updated regularly to reflect the latest data published in a comparison table developed in Google Sheets [1]. The comparison table includes the most commonly used LOD management software tools from NFDI4Culture to illustrate what functionalities and features a service should offer for the long-term management of FAIR research data, including:
The table presents two views based on a system of comparison categories developed iteratively during workshops with expert users and developers from the respective tool communities: first, a short overview with field values drawn from controlled vocabularies and multiple-choice options; second, a sheet that allows more descriptive free-text additions. The table and the corresponding dataset releases for each view are designed to provide a well-founded basis for evaluation when deciding on a LOD management service. The Google Sheets table will remain open to collaboration and community contribution, as well as to updates with new data and potentially new tools, whereas the datasets released here are meant to provide stable, version-controlled reference points.
The research for the comparison table was first presented as a paper at DHd2023, Open Humanities – Open Culture, 13-17.03.2023, Trier and Luxembourg [2].
[1] Non-editing access is available here: docs.google.com/spreadsheets/d/1FNU8857JwUNFXmXAW16lgpjLq5TkgBUuafqZF-yo8_I/edit?usp=share_link To get editing access, contact the authors.
[2] Full paper will be made available open access in the conference proceedings.

These data and code successfully reproduce nearly all cross-sectional stock return predictors. The 319 characteristics draw from previous meta-studies, but the authors differ by comparing their t-stats to the original papers' results. For the 161 characteristics that were clearly significant in the original papers, 98% of their long-short portfolios find t-stats above 1.96. For the 44 characteristics that had mixed evidence, the authors' reproductions find t-stats of 2 on average. A regression of reproduced t-stats on original long-short t-stats finds a slope of 0.90 and an R² of 83%. Mean returns are monotonic in predictive signals at the characteristic level. The remaining 114 characteristics were insignificant in the original papers or are modifications of the originals created by Hou, Xue, and Zhang (2020). These remaining characteristics are almost always significant if the original characteristic was also significant.
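To make the 1.96 threshold concrete, the following is a minimal sketch of how a t-statistic for a portfolio's mean return can be computed (mean divided by its standard error). The function name and the sample returns are hypothetical and chosen only for illustration; they are not taken from the dataset, and the released code may apply adjustments that this sketch omits.

```python
import math
import statistics


def mean_return_tstat(returns):
    """t-statistic for the null hypothesis that the mean return is zero."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(returns)
    # Sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(returns)
    # Standard error of the mean is std / sqrt(n).
    return mean / (std / math.sqrt(n))


# Hypothetical monthly long-short returns in percent (illustrative only).
sample_returns = [1.2, -0.4, 0.9, 1.5, -0.2, 0.8, 1.1, 0.3, -0.6, 1.4, 0.7, 0.5]
t_stat = mean_return_tstat(sample_returns)

# Conventional two-sided 5% threshold used in the description above.
is_significant = abs(t_stat) > 1.96
```

With these made-up returns the statistic comes out a little under 3, so the hypothetical portfolio would count as significant under the 1.96 cutoff.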

This dataset lists all software in use by NASA.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
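The idea of referencing blobs from metadata files via checksums can be sketched as follows. This is an illustration under stated assumptions, not the dataset's documented layout: SHA-1 is assumed as the checksum, and the column names (`sha1`, `mime_type`, `spdx`) and helper names are hypothetical. The included README is the authority on the actual hash and CSV schema.

```python
import csv
import hashlib
import io


def sha1_hex(data: bytes) -> str:
    """Hex-encoded SHA-1 digest of a blob's raw bytes (hash choice is an assumption)."""
    return hashlib.sha1(data).hexdigest()


def index_by_checksum(csv_text: str, key: str = "sha1") -> dict:
    """Build a checksum -> metadata-row mapping from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


# Hypothetical license blob and a one-row metadata CSV that refers to it.
blob = b"Permission is hereby granted, free of charge...\n"
checksum = sha1_hex(blob)
metadata_csv = f"sha1,mime_type,spdx\n{checksum},text/plain,MIT\n"

# Look up the blob's metadata by recomputing its checksum.
index = index_by_checksum(metadata_csv)
record = index[checksum]
```

Keying metadata on a content checksum means that identical license texts found in many repositories resolve to a single record, which fits the deduplicated archive described above.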
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

https://www.archivemarketresearch.com/privacy-policy
Discover the explosive growth of the open-source big data tools market, projected at an 18% CAGR to reach $55.7 billion by 2033. This in-depth analysis explores key drivers, trends, restraints, and regional market shares, highlighting leading companies and applications. Learn how open-source solutions are revolutionizing data management and analysis.

https://www.datainsightsmarket.com/privacy-policy
The open-source data annotation tool market is experiencing robust growth, driven by the increasing demand for high-quality training data in the burgeoning fields of artificial intelligence (AI) and machine learning (ML). The market's expansion is fueled by the need for efficient and cost-effective annotation solutions, particularly for large datasets. Organizations across various sectors, including automotive, healthcare, and finance, are leveraging these tools to improve the accuracy and performance of their AI models. The availability of open-source alternatives offers a significant advantage over proprietary solutions, enabling developers and researchers to customize tools according to their specific needs and avoid vendor lock-in. Furthermore, the collaborative nature of open-source projects fosters innovation and continuous improvement, resulting in a more dynamic and rapidly evolving ecosystem. While the market is relatively nascent, it exhibits a substantial growth trajectory, attracting numerous companies and developers, as evidenced by the active participation of organizations such as Alecion, Amazon Mechanical Turk, and Appen Limited. This competitive landscape further accelerates innovation and accessibility. The open-source nature of these tools also democratizes access to advanced AI development capabilities. Smaller companies and individual researchers can now participate in the development and deployment of AI solutions, leveling the playing field and fostering wider adoption. However, the market faces challenges such as the need for ongoing community support and maintenance of these tools, ensuring their long-term viability and preventing fragmentation. Despite these challenges, the future outlook for the open-source data annotation tool market remains positive, with continued growth driven by increased adoption in various industries and advancements in AI and ML technologies. 
The market is predicted to maintain a healthy compound annual growth rate (CAGR) over the forecast period, reflecting the sustained demand for efficient and accessible data annotation solutions.

https://choosealicense.com/licenses/cc0-1.0/
The Dark Side of Openness: How Open Source Data Can Be Abused to Harm Human Life
First draft partially generated using Perplexity AI, then written and edited manually and revised using agentlans/granite-3.3-2b-reviser. Open-source data, a vast resource for innovation and collaboration, offers significant benefits. However, the same openness that empowers progress can also create serious risks. The potential for harm arises when personal and sensitive data is exposed, potentially… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/open-source-data-abuse.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
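A first sanity check on a labelled file like this is to count projects per label. The sketch below assumes a CSV with a header row; the column names (`project`, `label`, `stars`) and the label strings are hypothetical placeholders, so the real header of "NICHE.csv" should be checked before use.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column):
    """Count rows per distinct value in the given label column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Tiny in-memory stand-in for the real file (values are made up).
sample = io.StringIO(
    "project,label,stars\n"
    "org-a/repo-1,engineered,120\n"
    "org-b/repo-2,non-engineered,15\n"
    "org-c/repo-3,engineered,480\n"
)
counts = count_labels(sample, "label")
```

On the real file one would pass `open("NICHE.csv", newline="")` and the actual label column; according to the description, the totals should come to 441 engineered and 131 non-engineered projects.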

Code and data to reproduce the results and datasets from "Tools for Open Source, Subnational CGE Modeling with an Illustrative Analysis of Carbon Leakage" by Andrew Schreiber and Thomas F. Rutherford, in the Journal of Global Economic Analysis. Citation information for this dataset can be found in the EDG's Metadata Reference Information section and Data.gov's References section.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as “model hubs” support engineers in distributing and reusing deep learning models. PTM packages include pre-trained weights, documentation, model architectures, datasets, and metadata. Mining the information in PTM packages will enable the discovery of engineering phenomena and tools to support software engineers. However, accessing this information is difficult — there are many PTM registries, and both the registries and the individual packages may have rate limiting for accessing the data.
We present an open-source dataset, PTMTorrent, to facilitate the evaluation and understanding of PTM packages. This paper describes the creation, structure, usage, and limitations of the dataset. The dataset includes a snapshot of 5 model hubs and a total of 15,913 PTM packages. These packages are represented in a uniform data schema for cross-hub mining. We describe prior uses of this data and suggest research opportunities for mining using our dataset.
We provide links to the PTM Dataset and PTM Torrent Source Code.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Market research dataset covering growth of the global open-source software market, including benefits, adoption, and enterprise usage in 2025.

https://www.datainsightsmarket.com/privacy-policy
The open-source data labeling tool market is experiencing robust growth, driven by the increasing demand for high-quality training data in various AI applications. The market's expansion is fueled by several key factors: the rising adoption of machine learning and deep learning algorithms across industries, the need for efficient and cost-effective data annotation solutions, and a growing preference for customizable and flexible tools that can adapt to diverse data types and project requirements. While proprietary solutions exist, the open-source ecosystem offers advantages including community support, transparency, cost-effectiveness, and the ability to tailor tools to specific needs, fostering innovation and accessibility. The market is segmented by tool type (image, text, video, audio), deployment model (cloud, on-premise), and industry (automotive, healthcare, finance). We project a market size of approximately $500 million in 2025, with a compound annual growth rate (CAGR) of 25% from 2025 to 2033, reaching approximately $2.7 billion by 2033. This growth is tempered by challenges such as the complexities associated with data security, the need for skilled personnel to manage and use these tools effectively, and the inherent limitations of certain open-source solutions compared to their commercial counterparts. Despite these restraints, the open-source model's inherent flexibility and cost advantages will continue to attract a significant user base. The market's competitive landscape includes established players like Alecion and Appen, alongside numerous smaller companies and open-source communities actively contributing to the development and improvement of these tools. Geographical expansion is expected across North America, Europe, and Asia-Pacific, with the latter projected to witness significant growth due to the increasing adoption of AI and machine learning in developing economies. 
Future market trends point towards increased integration of automated labeling techniques within open-source tools, enhanced collaborative features to improve efficiency, and further specialization to cater to specific data types and industry-specific requirements. Continuous innovation and community contributions will remain crucial drivers of growth in this dynamic market segment.

https://www.archivemarketresearch.com/privacy-policy
The open-source big data tools market is experiencing robust growth, driven by the increasing need for scalable, cost-effective, and flexible data management and analysis solutions across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033. This significant expansion is fueled by several key factors. Firstly, the rising volume and velocity of data generated across industries necessitates sophisticated tools capable of handling massive datasets efficiently. Secondly, the cost-effectiveness of open-source solutions compared to proprietary alternatives is a major attraction for businesses of all sizes, particularly startups and SMEs. Thirdly, the active and collaborative open-source community ensures continuous innovation and improvement in these tools, making them highly adaptable to evolving technological landscapes. The increasing adoption of cloud computing further contributes to market growth, as open-source tools seamlessly integrate with cloud platforms. Growth is segmented across various tools, with data analysis tools experiencing the highest demand due to the growing focus on data-driven decision-making. Key application areas include banking, manufacturing, and government, reflecting the wide applicability of these tools across sectors. While geographical distribution is diverse, North America and Europe currently hold significant market share, though rapid growth is anticipated in the Asia-Pacific region driven by increasing digitalization and adoption of advanced analytics. However, the market faces challenges including the complexity of implementation and maintenance of some open-source tools, requiring specialized expertise, and the need for robust security measures to protect sensitive data. 
Despite these hurdles, the inherent advantages of cost-effectiveness, flexibility, and community support position the open-source big data tools market for sustained and considerable expansion in the coming years.

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Retail point-of-sale (POS) transactions and operator performance logs from a real store environment. Includes timestamps, product details, quantities, and operator IDs — enabling analysis of sales trends, product performance, and staff efficiency.
Applications:
• Sales forecasting & trend analysis
• Market basket analysis
• Employee productivity insights
• Business analytics & ML modeling
Source: MDPI Data Journal
License: CC BY-NC 4.0, non-commercial use only.
Cite:
Alves, T.M.F.; de Carvalho, A.C.P.L.F.; Cardoso, J.M.P. (2019). An Open-Source Point of Sale Dataset for the Analysis of Sales Transactions and Operator Efficiency. Data, 4(2), 67.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,252 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
The main dataset is provided as a 17,252 record tab-separated file named enterprise_projects.txt with the following 27 fields.
The file cohost_project_details.txt provides the full set of 309,531 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.
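A tab-separated file of this kind can be loaded with the standard library while checking that every record has the stated 27 fields. This is a minimal sketch under assumptions: it presumes the file starts with a header row, and the column names used in the demo (`field_1` to `field_27`) are placeholders, since the real field names are not reproduced here.

```python
import csv
import io

EXPECTED_FIELDS = 27  # field count stated in the dataset description


def read_tsv_records(tsv_file, expected_fields=EXPECTED_FIELDS):
    """Read a tab-separated file with a header row into a list of dicts."""
    # QUOTE_NONE treats quote characters as ordinary data.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if len(header) != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, found {len(header)}")
    records = []
    for line_no, row in enumerate(reader, start=2):
        # Reject short or long rows instead of silently misaligning columns.
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} values, found {len(row)}"
            )
        records.append(dict(zip(header, row)))
    return records


# Placeholder header and two made-up rows standing in for the real file.
columns = [f"field_{i}" for i in range(1, EXPECTED_FIELDS + 1)]
rows = [[f"r{r}c{c}" for c in range(1, EXPECTED_FIELDS + 1)] for r in (1, 2)]
sample = io.StringIO("\n".join("\t".join(line) for line in [columns, *rows]) + "\n")
records = read_tsv_records(sample)
```

For the real data one would pass `open("enterprise_projects.txt", newline="", encoding="utf-8")`; the encoding is also an assumption.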

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The presentation explains, in the simplest possible way, what you need to know about open source licenses when starting from scratch. It also summarizes the course "Open Source Licensing Basics for Software Developers (LFC191)" (Linux Foundation).

https://www.datainsightsmarket.com/privacy-policy
Discover the booming open-source tools market! This comprehensive analysis reveals key trends, drivers, and restraints impacting growth from 2025-2033, covering applications like machine learning & data science across major regions. Explore market size, CAGR projections, and leading companies shaping the future of open-source technology.

https://www.datainsightsmarket.com/privacy-policy
Discover the booming open-source big data tools market! This comprehensive analysis reveals key trends, growth drivers, and regional insights for 2025-2033, featuring leading companies like MongoDB and Apache. Learn about market segmentation, application areas, and future projections.

https://www.marketreportanalytics.com/privacy-policy
The global financial database market is experiencing robust growth, driven by increasing demand for real-time data and advanced analytics across various sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 8% from 2025 to 2033, reaching approximately $28 billion by 2033. This expansion is fueled by several key factors: the proliferation of algorithmic trading and quantitative analysis necessitating high-frequency data feeds; the growing adoption of cloud-based solutions enhancing accessibility and scalability; and the increasing regulatory scrutiny demanding robust and reliable financial data for compliance purposes. The market segmentation reveals a strong preference for real-time databases across both personal and commercial applications, reflecting the time-sensitive nature of financial decisions. Key players like Bloomberg, Refinitiv (formerly Thomson Reuters), and FactSet maintain significant market share due to their established brand reputation and comprehensive data offerings. However, the emergence of innovative fintech companies and the increasing availability of open-source data platforms are expected to intensify competition and foster market disruption. The geographical distribution of the market reveals North America as the dominant region, followed by Europe and Asia-Pacific. However, the Asia-Pacific region is poised for significant growth, driven by expanding financial markets in countries like China and India. While the market faces restraints such as data security concerns, increasing data costs, and complexities in data integration, the overall trend points toward sustained expansion. The continuous development of sophisticated analytical tools and the growing need for data-driven decision-making will continue to drive the adoption of financial databases across various user segments and geographies, shaping the competitive landscape in the coming years.
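The projection quoted above follows from the compound growth formula, value after n years = start × (1 + CAGR)^n. The short sketch below reruns that arithmetic with the stated figures ($15 billion in 2025, 8% CAGR, horizon 2033); the function name is an illustrative choice.

```python
def project_value(start, cagr, years):
    """Compound a starting value at a constant annual growth rate."""
    return start * (1.0 + cagr) ** years


# Figures from the description: $15 billion in 2025, 8% CAGR, through 2033.
projected_2033 = project_value(15.0, 0.08, 2033 - 2025)
```

Eight years of 8% growth multiply the base by about 1.85, giving roughly $27.8 billion, which is consistent with the "approximately $28 billion" stated in the text.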