Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Digital data from the political sphere is abundant, omnipresent, and more and more directly accessible through the Internet. Project Vote Smart (PVS) is a prominent example of this big public data and covers various aspects of U.S. politics in astonishing detail. Despite the vast potential of PVS’ data for political science, economics, and sociology, it is hardly used in empirical research. The systematic compilation of semi-structured data can be complicated and time consuming as the data format is not designed for conventional scientific research. This paper presents a new tool that makes the data easily accessible to a broad scientific community. We provide the software called pvsR as an add-on to the R programming environment for statistical computing. This open source interface (OSI) serves as a direct link between a statistical analysis and the large PVS database. The free and open code is expected to substantially reduce the cost of research with PVS’ new big public data in a vast variety of possible applications. We discuss its advantages vis-à-vis traditional methods of data generation as well as already existing interfaces. The validity of the library is documented based on an illustration involving female representation in local politics. In addition, pvsR facilitates the replication of research with PVS data at low costs, including the pre-processing of data. Similar OSIs are recommended for other big public databases.
https://www.technavio.com/content/privacy-notice
Open-Source LLM Market Size 2025-2029
The open-source LLM market size is forecast to increase by USD 54 billion, at a CAGR of 33.7%, from 2024 to 2029. Increasing democratization and compelling economics will drive the open-source LLM market.
Market Insights
North America dominated the market and is expected to account for 37% of market growth during 2025-2029.
By Application - Technology and software segment was valued at USD 4.02 billion in 2023
By Deployment - On-premises segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 575.60 million
Market Future Opportunities 2024: USD 53,995.50 million
CAGR from 2024 to 2029: 33.7%
Market Summary
The Open-Source Large Language Model (LLM) market has experienced significant growth due to the increasing democratization of artificial intelligence (AI) technology and its compelling economics. This global trend is driven by the proliferation of smaller organizations seeking to leverage advanced language models for various applications, including supply chain optimization, compliance, and operational efficiency. Open-source LLMs offer several advantages over proprietary models. They provide greater flexibility, as users can modify and adapt the models to their specific needs. Additionally, open-source models often have larger training datasets, leading to improved performance and accuracy. However, there are challenges to implementing open-source LLMs, such as the prohibitive computational costs and critical hardware dependency. These obstacles necessitate the development of more efficient algorithms and the exploration of cloud computing solutions.
A real-world business scenario illustrates the potential benefits of open-source LLMs. A manufacturing company aims to optimize its supply chain by implementing an AI-powered system to analyze customer demand patterns and predict inventory needs. The company chooses an open-source LLM due to its flexibility and cost-effectiveness. By integrating the LLM into its supply chain management system, the company can improve forecasting accuracy and reduce inventory costs, ultimately increasing operational efficiency and customer satisfaction. Despite the challenges, the market continues to grow as organizations recognize the potential benefits of advanced language models. The democratization of AI technology and the compelling economics of open-source solutions make them an attractive option for businesses of all sizes.
What will be the size of the Open-Source LLM Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
The Open-Source Large Language Model (LLM) Market continues to evolve, offering businesses innovative solutions for various applications. One notable trend is the increasing adoption of explainable AI (XAI) methods in LLMs. XAI models provide transparency into the reasoning behind their outputs, addressing concerns around bias mitigation and interpretability. This transparency is crucial for industries with stringent compliance requirements, such as finance and healthcare. For instance, a recent study reveals that companies implementing XAI models have achieved a 25% increase in model acceptance rates among stakeholders, leading to more informed decisions. This improvement can significantly impact product strategy and budgeting, as businesses can confidently invest in AI solutions that align with their ethical and regulatory standards.
Moreover, advancements in LLM architecture include encoder-decoder architectures, multi-head attention, and self-attention layers, which enhance feature extraction and model scalability. These improvements contribute to better performance and more accurate results, making LLMs an essential tool for businesses seeking to optimize their operations and gain a competitive edge. In summary, the market is characterized by continuous innovation and a strong focus on delivering human-centric solutions. The adoption of explainable AI methods and advancements in neural network architecture are just a few examples of how businesses can benefit from these technologies. By investing in Open-Source LLMs, organizations can improve efficiency, enhance decision-making, and maintain a responsible approach to AI implementation.
Unpacking the Open-Source LLM Market Landscape
In the dynamic landscape of large language models (LLMs), open-source solutions have gained significant traction, offering businesses competitive advantages through data augmentation and few-shot learning capabilities. Compared to traditional models, open-source LLMs enable a 30% reduction in optimizer selection time and a 25% improvement in model accuracy for summarization tasks. Furthermore, distributed training and model compression techniques allow businesses to process larger training dataset sizes with minimal tokenization process disruptions, result
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Supported Tasks: Training LLMs, Synthetic Data Generation, Data Augmentation
Languages: English
Version: 1.0
Owner: Databricks, Inc.
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.
Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.
For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]), which we recommend users remove for downstream applications.
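For illustration, a minimal sketch of stripping such bracketed citation markers from the context field; the regular expression and helper name below are our own, not part of the dataset card:

import re

# Matches bracketed numeric citation markers such as [42].
CITATION_RE = re.compile(r"\[\d+\]")

def strip_citations(context: str) -> str:
    """Remove bracketed Wikipedia citation numbers from a reference text."""
    return CITATION_RE.sub("", context).strip()

print(strip_citations("Paris is the capital of France.[42][7]"))
# Paris is the capital of France.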
While immediately valuable for instruction fine-tuning of large language models, as a corpus of human-generated instruction prompts this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.
Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.
As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.
To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance to an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.
The annotation guidelines for each of the categories are as follows:
https://www.expertmarketresearch.com/privacy-policy
The global open-source intelligence market attained a value of USD 12.13 Billion in 2024 and is projected to expand at a CAGR of 24.70% through 2034. The market is further expected to be valued at USD 110.29 Billion by 2034. The rapid expansion of cross-border digital ecosystems is pushing enterprises to deploy OSINT platforms that centralize multilingual threat discovery, enabling faster responses to cyber, political, and operational risks.
The global open-source intelligence (OSINT) industry is entering a new growth phase as enterprises lean heavily on real-time, AI-driven threat discovery. For instance, Dataminr offers an AI-native multimodal fusion engine that enables security teams to process open-source signals with improved contextual accuracy. This enhancement is especially relevant as global cyber incidents have surged considerably, compelling large enterprises and government agencies to invest in systems that can detect early-stage risks from open-source feeds before they escalate. The open-source intelligence market’s reliance on platforms that blend neural search, geospatial mapping, and cross-language analytics is becoming a critical priority for B2B buyers with distributed operations.
Open-source intelligence adoption is accelerating across defense, financial services, telecom, and critical infrastructure sectors as organizations modernize their security stacks. Vendors are diversifying product roadmaps toward deep web coverage, AI-assisted narrative detection, and automated influence-operation monitoring, technologies that were once limited to government intelligence programs. For example, Cyware announced the launch of a new Cyware Quarterback AI solution delivering an AI Fabric to address security use cases. As of November 2025, companies like Recorded Future and Fivecast are advancing multilingual data-fusion capabilities, with some platforms now scanning content in 100+ languages while automatically categorizing emerging geopolitical, cyber, and physical threats. This reflects a broader open-source intelligence market trend in which end users want not just data collection, but actionable intelligence pipelines that integrate cleanly into SIEM, SOC, and national security workflows.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosion in biological data generation challenges the available technologies and methodologies for data interrogation. Moreover, highly rich and complex datasets together with diverse linked data are difficult to explore when provided in flat files. Here we provide a way to filter and analyse in a systematic way a dataset with more than 18 thousand data points using Zegami (link), a solution for interactive data visualisation and exploration. The primary data we use are derived from "A systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system", which is submitted elsewhere. This manual provides the raw image data together with annotations and associated data and explains how to use Zegami for exploring all these data types together by providing specific examples. We also provide the open source python code (github link) used to annotate the figures.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).
In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file was not required to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.
Format
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.
The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:
blobs/ is the root directory containing all license blobs
8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:
$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
8624bcdae55baeef00cd11d5dfcfa60f68710a02 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):
swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
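Given the sharding scheme described above, a blob’s on-disk path can be computed directly from its SHA1; a minimal Python sketch, assuming blobs.tar.zst has been extracted into the current directory:

import os.path

def blob_path(sha1_hex: str, root: str = "blobs") -> str:
    # The first two and next two hex digits of the SHA1 name the two shard directories.
    return os.path.join(root, sha1_hex[0:2], sha1_hex[2:4], sha1_hex)

print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))
# blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02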
blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs
license-blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory
NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it’s an original typo from some repository!).
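A minimal sketch of iterating over this index without decompressing it to disk, assuming the zstandard Python package is installed:

import csv
import io
from collections import defaultdict

import zstandard  # pip install zstandard

def iter_index(path="license-blobs.csv.zst"):
    # Yield (SWHID, SHA1, NAME) rows, skipping the header line.
    with open(path, "rb") as fh:
        text = io.TextIOWrapper(
            zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8"
        )
        rows = csv.reader(text)
        next(rows)  # header: SWHID,SHA1,NAME
        yield from rows

# Example: collect all file names observed for each blob.
names_per_blob = defaultdict(set)
for swhid, sha1, name in iter_index():
    names_per_blob[sha1].add(name)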
blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
SHA1: blob SHA1
MIME_TYPE: blob MIME type, as detected by libmagic
ENCODING: blob character encoding, as detected by libmagic
LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)
WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)
SIZE: blob size in bytes
blobs-scancode.csv.zst: a Zst-compressed CSV mapping from blobs to the software licenses detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:
SHA1: blob SHA1
LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)
SCORE: confidence score in the result, as a decimal number between 0 and 100
There may be zero or arbitrarily many lines for each blob.
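As an example, a sketch that keeps only the highest-scoring detection per blob with pandas; it assumes the CSV carries a SHA1,LICENSE,SCORE header row like the main index (pass explicit column names otherwise), and reading .zst files with pandas requires the zstandard package:

import pandas as pd

# Assumed header row; use header=None, names=["SHA1", "LICENSE", "SCORE"] if absent.
detections = pd.read_csv("blobs-scancode.csv.zst", compression="zstd")

# For each blob, keep the license detection with the highest confidence score.
best = detections.loc[detections.groupby("SHA1")["SCORE"].idxmax()]
print(best.head())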
blobs-scancode.ndjson.zst: a Zst-compressed line-delimited JSON file containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:
sha1: blob SHA1
licenses: output of scancode.api.get_licenses(..., min_score=0)
copyrights: output of scancode.api.get_copyrights(...)
There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.
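A sketch of streaming this file line by line (again assuming the zstandard package), taking into account that the licenses and copyrights keys may be absent for non-text blobs:

import io
import json

import zstandard  # pip install zstandard

n_text = 0
with open("blobs-scancode.ndjson.zst", "rb") as fh:
    text = io.TextIOWrapper(
        zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8"
    )
    for line in text:
        record = json.loads(line)
        if "licenses" in record:  # key omitted for blobs not detected as plain text
            n_text += 1
print(n_text, "blobs with ScanCode license output")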
blobs-origins.csv.zst: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins, in the format SWHID URL, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis
Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.
If no origin URL is found in the Software Heritage archive, a blank is used instead. This happens when the corresponding origins were either still being loaded when the dataset was generated, or the loader process crashed before completing the ingestion of the blob’s origin.
blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob with this count, in the format SWHID NUMBER, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260
Two blobs are missing because the computation crashed:
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc
This issue will be fixed in a future version of the dataset.
blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:
SWHID: blob SWHID
EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
OCCURRENCES: number of known commits containing the blob
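For instance, the EARLIEST_TS values are plain Unix timestamps and can be turned into UTC dates in Python (the value below is made up for illustration):

from datetime import datetime, timezone

earliest_ts = 1199145600  # example value only
print(datetime.fromtimestamp(earliest_ts, tz=timezone.utc).isoformat())
# 2008-01-01T00:00:00+00:00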
replication-package.tar.gz: code and scripts used to produce the dataset
licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.
Changes since the 2021-03-23 dataset
More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.
Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.
Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.
blobs-nb-origins.csv.zst is added.
blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.
blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of a URL.
blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.
blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)
blobs-scancode.ndjson.zst is added.
Errata
A file name .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:
pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12
The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
Citation
If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:
[pdf, bib] Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).
[pdf, bib] Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
References
The dataset has been built using primarily the data sources described in the following papers:
[pdf, bib] Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
[pdf, bib] Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.
Errata (v2, 2024-01-09)
licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
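To illustrate the kind of fidelity check described above, a minimal sketch comparing one continuous parameter between a real and a synthetic column with a two-sample t-test; the column values below are simulated placeholders, not the actual VitalDB or GPT-4o data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real_age = rng.normal(58, 14, size=6000)           # placeholder for a real VitalDB parameter
synthetic_age = rng.normal(57.5, 14.5, size=6000)  # placeholder for the LLM-generated column

t_stat, p_value = stats.ttest_ind(real_age, synthetic_age, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 suggests no detectable difference in means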
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the paper "Large Language Models for Structuring and Integration of Heterogeneous Data" (add DOI).
It contains:
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset includes classes with code smells, acquired from Qualitas Corpus (QC). Folder 'all' contains data coming from the QC rev.20130901 (92 systems). Folder 'domains' contains data coming from QC rev.20111026 (76 systems updated to their most recent releases from rev.20130901). Folder 'pca' includes results of the PCA analysis, generated with the R prcomp() function for regular PCA, and logisticPCA() function for the binary data.
Filenames include information about the base release of the QC, and a number (25, 50 or 75) that specifies the minimum percentage of detectors that identified a specific smell instance (25%, 50%, and 75%, respectively). For example, if a given code smell in a class X has been identified by 1 out of 4 available detecting tools, then the smell for class X will be reported in the respective file 25, but not in 50 or 75. Please note that for smells detected by only one tool the values are equal in all datasets (in that case, the smell was detected by either 0% or 100% of the tools).
In all files, "1" denotes that the smell was identified (subject to the limitations with the number of detectors, described above), and “0” that the smell was not found in a given class.
The filename also includes the domain abbreviation (app, css, dev, dgdv) or a keyword ALL, which indicates that the dataset includes data from all domains.
The smells have been detected by 11 tools. Most of the tools detect more than one smell. Information about the tool used to detect a given smell is given in the headers of each file. Additionally, the 'smell detectors.csv' file lists the smells detected by each specific tool.
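As an example of how these files might be used, a short sketch that loads one of the CSVs with pandas and selects the classes flagged with a given smell; the file name and the 'God Class' column below are hypothetical placeholders, so check the actual file names and headers in the folder you use:

import pandas as pd

# Hypothetical file name following the convention described above
# (QC release, domain keyword or ALL, minimum share of agreeing detectors).
df = pd.read_csv("20130901_ALL_50.csv")

# Cells hold 1 where a smell was identified and 0 otherwise.
# "God Class" is a placeholder; use a smell column actually present in the file.
god_classes = df[df["God Class"] == 1]
print(len(god_classes), "classes flagged by at least 50% of the detectors")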
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The repository contains an extensive dataset of PV power measurements and a python package (qcpv) for quality controlling PV power measurements. The dataset features four years (2014-2017) of power measurements of 175 rooftop mounted residential PV systems located in Utrecht, the Netherlands. The power measurements have a 1-min resolution.
PV power measurements
Three different versions of the power measurements are included in three data subsets in the repository. Unfiltered power measurements are enclosed in unfiltered_pv_power_measurements.csv. Filtered power measurements are included as filtered_pv_power_measurements_sc.csv and filtered_pv_power_measurements_ac.csv. The former dataset contains the quality-controlled power measurements after running single-system filters only; the latter considers the output after running both single- and across-system filters. The metadata of the PV systems is provided in metadata.csv. This file holds, for each PV system, a unique ID, start and end time of registered power measurements, estimated DC and AC capacity, tilt and azimuth angle, annual yield, and mapped grids of the system location (north, south, west and east boundary).
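A minimal sketch of loading the measurements and metadata with pandas; the exact column layout is not documented here, so the assumption that the first column holds UTC timestamps and the remaining columns hold one power series per system is ours:

import pandas as pd

# Unfiltered 1-min power measurements (Watt), indexed by UTC timestamp (assumed layout).
power = pd.read_csv(
    "unfiltered_pv_power_measurements.csv",
    index_col=0,
    parse_dates=True,
)

# Per-system metadata: ID, capacities, tilt/azimuth angles, annual yield, location grid.
metadata = pd.read_csv("metadata.csv")

# Example: approximate daily energy of the first system in kWh (1-min samples of W).
system_id = power.columns[0]
daily_kwh = power[system_id].resample("D").sum() / 60 / 1000
print(daily_kwh.head())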
Quality control routine
An open-source quality control routine that can be applied to filter erroneous PV power measurements is added to the repository in the form of the Python package qcpv (qcpv.py). Sample code to call and run the functions in the qcpv package is available as example.py.
Objective
By publishing the dataset we provide access to high quality PV power measurements that can be used for research experiments on several topics related to PV power and the integration of PV in the electricity grid.
By publishing the qcpv package we strive to set a next step into developing a standardized routine for quality control of PV power measurements. We hope to stimulate others to adopt and improve the routine of quality control and work towards a widely adopted standardized routine.
Data usage
If you use the data and/or python package in a published work please cite: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Units
Timestamps are in UTC (YYYY-MM-DD HH:MM:SS+00:00).
Power measurements are in Watt.
Installed capacities (DC and AC) are in Watt-peak.
Additional information
A detailed discussion of the data and qcpv package is presented in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Acknowledgements
This work is part of the Energy Intranets (NEAT: ESI-BiDa 647.003.002) project, which is funded by the Dutch Research Council NWO in the framework of the Energy Systems Integration & Big Data programme. The authors would especially like to thank the PV owners who volunteered to take part in the measurement campaign.
https://www.nist.gov/open/license
This is the source code package for the labbench python module, version 0.20, which is its first public release. The purpose of labbench is to streamline and organize complicated laboratory automation tasks that involve large-scale benchtop automation, concurrency, and/or data management. It is built around a system of wrappers that facilitate robust, concise exception handling, type checking, API conventions, and synchronized device connection through python context blocks. The wrappers also provide convenient new functionality, such as support for automated status displays in jupyter notebooks, simplified threaded concurrency, and automated, type-safe logging to relational databases. Together, these features help to minimize the amount of "copy-and-paste" code that can make your lab automation scripts error-prone and difficult to maintain. The python code that results can be clear, concise, reusable and maintainable, and provide consistent formatting for stored data. The result helps researchers to meet NIST's open data obligations, even for complicated, large, and heterogeneous datasets. Several past and ongoing projects in the NIST Communication Technology Laboratory (CTL) published data that were acquired by automation in labbench. We release it here both for transparency and to invite public use and feedback. Ongoing updates to this source code will be maintained on the NIST github page at https://github.com/usnistgov/labbench. The code was developed in python, documented with the python sphinx package and markdown, and shared through the USNISTGOV organization on GitHub.
INSTALLATION
labbench can run on any computer that supports python 3.6. The hardware requirements are discussed here: https://docs.anaconda.com/anaconda/install/#requirements
1. Install your favorite distribution of a python version 3.6 or greater
2. In a command prompt, pip install git+https://gitlab.nist.gov/gitlab/ssm/labbench
3. (Optional) install an NI VISA [1] runtime, for example this one for windows.
USAGE
The source distribution contains detailed information including:
* README.md - documentation to get started using labbench
* LICENSE.md - license and redistribution information
* doc/labbench-api.pdf - complete listing of the module and documentation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a trace archive of metrics collected from the Lisa cluster at SURFsara associated with the article that will be published in USENIX;login: in July 2020.
Github repository which contains documentation as well as scripts required to replicate the work from the login paper: https://github.com/sara-nl/SURFace
Real-world data can be instrumental in answering detailed questions: How do we know which assumptions regarding large-scale systems are realistic? How do we know that the systems we build are practical? How do we know which metrics are important to assess when analyzing performance? To answer such questions, we need to collect and share operational traces containing real-world, detailed data. Not only is the presence of low-level metrics significant, but they also help avoid biases through their variety. To address variety, there exist several types of archives, such as the Parallel Workloads Archive, the Grid Workloads Archive, and the Google or Microsoft logs (the Appendix gives a multi-decade overview). However, such traces mostly focus on higher-level scheduling decisions and high-level, job-based resource utilization (e.g., consumed CPU and memory). Thus, they do not provide vital information to system administrators or researchers analyzing the full-stack or the OS-level operation of datacenters.
The traces we are sharing have the finest granularity of any open-source traces published so far. In addition to scheduler-level logs, they contain over 100 low-level, server-based metrics, down to the granularity of page faults or bytes transferred through a NIC.
The SURF archive
Datacenters already exhibit unprecedented scale and are becoming increasingly complex. Moreover, such computer systems have begun having a significant impact on the environment; for example, training some machine learning models has a sizable carbon footprint. As our recent work on modern datacenter networks shows, low-level data is key to understanding full-stack operation, including high-level application behavior. We advocate it is time to start using such data more systematically, unlocking its potential in helping us understand how to make (datacenter) systems more efficient. We advocate that our data can contribute to a more holistic approach, looking at how the multitude of these systems work together in a large-scale datacenter.
This archive contains data from the Dutch National Infrastructure, Lisa.
Description of the Lisa system
Description of the Cartesius system
We gather metrics, at 15-second intervals, from several data sources:
Slurm: all job, task, and scheduler related data, such as running time, queueing time, failures, servers involved in the execution, organization in partitions, and scheduling policies.
NVIDIA Management Library (NVML): per GPU, data such as power metrics, temperature, fan speed, or used memory.
IPMI: per server, data such as power metrics and temperature.
OS-level: from either procfs, sockstat, or netstat data: low-level OS metrics, regarding the state of each server, including CPU, disk, memory, network utilization, context switches, and interrupts.
We also release other kinds of novel information, related to datacenter topology and organization.
The audience we envision using these metrics is composed of systems researchers, infrastructure developers and designers, system administrators, and software developers for large-scale infrastructure. The frequency of collecting data is uniquely high for open-source data, which could allow these experts unprecedented views into the operation of a real datacenter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV: https://mdv.molbiol.ox.ac.uk), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published study "A systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system" (https://doi.org/10.1083/jcb.202205129). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
According to our latest research, the global self-supervised learning market size reached USD 10.2 billion in 2024, demonstrating rapid adoption across multiple sectors. The market is set to expand at a strong CAGR of 33.1% from 2025 to 2033, propelled by the growing need for advanced artificial intelligence solutions that minimize dependency on labeled data. By 2033, the market is forecasted to achieve an impressive size of USD 117.2 billion, underscoring the transformative potential of self-supervised learning in revolutionizing data-driven decision-making and automation across industries. This growth trajectory is supported by increasing investments in AI research, the proliferation of big data, and the urgent demand for scalable machine learning models.
The primary growth driver for the self-supervised learning market is the exponential surge in data generation across industries and the corresponding need for efficient data labeling techniques. Traditional supervised learning requires vast amounts of labeled data, which is both time-consuming and expensive to annotate. Self-supervised learning, by contrast, leverages unlabeled data to train models, significantly reducing operational costs and accelerating the deployment of AI systems. This paradigm shift is particularly critical in sectors like healthcare, finance, and autonomous vehicles, where large datasets are abundant but labeled examples are scarce. As organizations seek to unlock value from their data assets, self-supervised learning is emerging as a cornerstone technology, enabling more robust, scalable, and generalizable AI applications.
Another significant factor fueling market expansion is the rapid advancement in computing infrastructure and algorithmic innovation. The availability of high-performance hardware, such as GPUs and TPUs, coupled with breakthroughs in neural network architectures, has made it feasible to train complex self-supervised models on massive datasets. Additionally, the open-source movement and collaborative research have democratized access to state-of-the-art self-supervised learning frameworks, fostering innovation and lowering barriers to entry for enterprises of all sizes. These technological advancements are empowering organizations to experiment with self-supervised learning at scale, driving adoption across a wide range of applications, from natural language processing to computer vision and robotics.
The market is also benefiting from the growing emphasis on ethical AI and data privacy. Self-supervised learning methods, which minimize the need for sensitive labeled data, are increasingly being adopted to address privacy concerns and regulatory compliance requirements. This is particularly relevant in regions with stringent data protection regulations, such as the European Union. Furthermore, the ability of self-supervised learning to generalize across domains and tasks is enabling businesses to build more resilient and adaptable AI systems, further accelerating market growth. The convergence of these factors is positioning self-supervised learning as a key enabler of next-generation AI solutions.
Transfer Learning is emerging as a pivotal technique in the realm of self-supervised learning, offering a bridge between different domains and tasks. By leveraging knowledge from pre-trained models, transfer learning allows for the adaptation of AI systems to new, related tasks with minimal additional data. This approach is particularly beneficial in scenarios where labeled data is scarce, enabling models to generalize better and learn more efficiently. The integration of transfer learning into self-supervised frameworks is enhancing the ability of AI systems to tackle complex problems across various industries, from healthcare diagnostics to autonomous driving. As the demand for versatile and efficient AI solutions grows, transfer learning is set to play a crucial role in the evolution of self-supervised learning technologies.
From a regional perspective, North America currently leads the self-supervised learning market, accounting for the largest share due to its robust AI research ecosystem, significant investments from technology giants, and early adoption across verticals. However, Asia Pacific is projected to witness the fastest growth over the forecast period, driven by the rapid digital tran
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.
Database Management Systems
As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world’s growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains the reference reconstructions and segmentation of slices 2,001 – 3,000 from the data collection described in
Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023).
Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."
The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, 74.8 µm² each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the components to be moved independently from one another.
Please refer to the paper for all further technical details.
The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD. The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.
Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, for example by adding "export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE" to your .bashrc.
For more information or guidance in using the data collection, please get in touch with
Maximilian.Kiss [at] cwi.nl
Felix.Lucka [at] cwi.nl
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains slices 4,001 – 5,000 from the data collection described in
Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023) or arXiv:2306.05907 (2023).
Abstract: "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."
The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, 74.8 µm² each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the components to be moved independently from one another.
Please refer to the paper for all further technical details.
The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD. The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.
Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, for example by adding "export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE" to your .bashrc.
For more information or guidance in using the data collection, please get in touch with
Maximilian.Kiss [at] cwi.nl
Felix.Lucka [at] cwi.nl
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
v1.0: Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.
v1.1: The three weak copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
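For reference, a short sketch of streaming one language subset with the Hugging Face datasets library; the data_dir value is just an example subset, and access may require accepting the dataset's terms on the Hub first:

from datasets import load_dataset

# Stream the Python subset instead of downloading the multi-terabyte dataset.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",
    split="train",
    streaming=True,
)

for example in ds.take(1):
    print(example["content"][:200])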
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Cloud computing enables users to create virtual computers, each one with the optimal configuration of hardware and software for a job. The number of virtual computers can be increased to process large data sets or reduce processing time. Large scale scientific applications of the cloud, in many cases, are still in development.
For example, in the event of an environmental crisis, such as the Deepwater Horizon oil spill, tornadoes, Mississippi River flooding, or a hurricane, up-to-date information is one of the most important commodities for decision makers. The volume of remote sensing data that needs to be processed to accurately retrieve ocean properties from satellite measurements can easily exceed a terabyte, even for a small region such as the Mississippi Sound. Often, with current infrastructure, the time required to download, process and analyze the large volumes of remote sensing data limits the ability to provide timely information to emergency responders. The use of a cloud computing platform, like NASA’s Nebula, can help eliminate those barriers.
NASA Nebula was developed as an open-source cloud computing platform to provide an easily quantifiable and improved alternative to building additional expensive data centers and to provide an easier way for NASA scientists and researchers to share large, complex data sets with external partners and the public. Nebula was designed as an Infrastructure-as-a-Service (IaaS) implementation that provided scalable computing and storage for science data and Web-based applications. Nebula IaaS allowed users to unilaterally provision, manage, and decommission computing capabilities (virtual machine instances, storage, etc.) on an as-needed basis through a Web interface or a set of command-line tools.
This project demonstrated a novel way to conduct large scale scientific data processing utilizing NASA’s cloud computer, Nebula. Remote sensing data from the Deepwater Horizon oil spill site was analyzed to assess changes in concentration of suspended sediments in the area surrounding the spill site.
Software for processing time series of satellite remote sensing data was packaged together with a computer code that uses web services to download the data sets from a NASA data archive and distribution system. The new application package was able to be quickly deployed on a cloud computing platform when, and only for as long as, processing of the time series data is required to support emergency response. Fast network connection between the cloud system and the data archive enabled remote processing of the satellite data without the need for downloading the input data to a local computer system: only the output data products are transferred for further analysis.
NASA was a pioneer in cloud computing by having established its own private cloud computing data center called Nebula in 2009 at the Ames Research Center (Ames). Nebula provided high-capacity computing and data storage services to NASA Centers, Mission Directorates, and external customers. In 2012, NASA shut down Nebula based on the results of a 5-month test that benchmarked Nebula’s capabilities against those of Amazon and Microsoft. The test found that public clouds were more reliable and cost effective and offered much greater computing capacity and better IT support services than Nebula.