https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
- bigcode/the-stack-v2: the full "The Stack v2" dataset
- bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2, but further near-deduplicated
- bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset, but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
- bigcode/the-stack-v2-train-smol-ids: based on the bigcode/the-stack-v2-dedup… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids.
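As a minimal sketch of getting at the data, the subset below is streamed with the Hugging Face `datasets` library; this assumes access to the gated dataset has been granted and `huggingface-cli login` has been run, and the config name "Python" and the field names follow the dataset card's conventions (check the card to confirm):

```python
# A minimal sketch, not the canonical loading recipe from the dataset card.
from datasets import load_dataset

# Stream one language subset instead of downloading the full dataset.
ds = load_dataset("bigcode/the-stack-v2", "Python", split="train", streaming=True)

for row in ds:
    # v2 rows carry Software Heritage blob IDs, not file contents; the text
    # itself must be fetched separately, as described on the dataset card.
    print(row["blob_id"], row["repo_name"])
    break
```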
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
Release — Description
v1.0 — Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.
v1.1 — The three weak copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
No license specified (https://academictorrents.com/)
This data dump is sourced from the various sites in the Stack Exchange network of Q&A sites. This dump contains data up to and including 2025-06-30. This revision contains one bugfix over rev. 1: one site lacked content in its posts.xml. The exact license for each bit of content is embedded in each entry; for license date ranges, see the root-level license.txt. For the schema, see the sede-and-data-dump-schema.md file within each .7z archive. This torrent has also been archived at
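As a rough illustration of reading one table from the dump (assuming Posts.xml has been extracted from one of the .7z archives), records are attribute-only `<row/>` elements, so Python's standard-library iterparse can stream multi-gigabyte files with flat memory use:

```python
import xml.etree.ElementTree as ET

def iter_rows(path):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # every column is an XML attribute
            elem.clear()             # release the element to bound memory

for row in iter_rows("Posts.xml"):
    # The per-entry license mentioned above is carried on each row
    # (the ContentLicense attribute in recent dumps).
    print(row.get("Id"), row.get("ContentLicense"))
    break
```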
http://www.apache.org/licenses/LICENSE-2.0
This package contains SlowOps, an industrial dataset of stack traces introduced in our paper "Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios".
SlowOps is a dataset of stack traces (reports) and their categories (issues), aimed at evaluating different models for stack trace deduplication. The dataset includes reports related to Slow Operation Assertion, collected at JetBrains from IntelliJ-based products between 26.01.2021 and 29.02.2024. It contains 886,730 reports in 1,361 categories.
For more information about the dataset, please refer to the README.
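As a toy baseline for the deduplication task — not one of the models evaluated in the paper — a new report can be attached to the most similar known issue by Jaccard overlap of frame sets; the frame-list representation below is a hypothetical schema (see the README for the real one):

```python
def jaccard(frames_a, frames_b):
    a, b = set(frames_a), set(frames_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def assign_issue(frames, issues, threshold=0.8):
    """Return the id of the best-matching issue, or open a new one."""
    best_id, best_score = None, 0.0
    for issue_id, representative in issues.items():
        score = jaccard(frames, representative)
        if score > best_score:
            best_id, best_score = issue_id, score
    if best_score >= threshold:
        return best_id
    new_id = len(issues)
    issues[new_id] = frames  # this report seeds a new issue
    return new_id

issues = {}
print(assign_issue(["a.f", "b.g", "c.h"], issues))  # opens issue 0
print(assign_issue(["a.f", "b.g", "c.i"], issues))  # jaccard 0.5 < 0.8, opens issue 1
```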
A variety of studies and disparate datasets track state energy storage policies, but these datasets do not cover all BTM-related storage policy. Moreover, these databases do not align policies with the policy stacking framework. Thus, it is unclear which BTM storage policies are adopted across the country, what should comprise a complete storage policy framework or stack, or how states' policies compare with that stack. This first-of-its-kind BTM storage policy stack includes 11 parent policy categories and 31 policies across the market preparation, creation, and expansion policy components.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A standardised collection of environmental spatial layers for biodiversity modelling and site characterisation. These spatial layers were published in: McCarthy, J. K., Leathwick, J. R., Roudier, P., Barringer, J. R. F., Etherington, T. R., Morgan, F. J., Odgers, N. P., Price, R. H., Wiser, S. K., Richardson, S. J. (2021) New Zealand Environmental Data Stack (NZEnvDS): A standardised collection of environmental spatial layers for biodiversity modelling and site characterisation. New Zealand Journal of Ecology, 45(2): 3440. https://dx.doi.org/10.20417/nzjecol.45.31. Please cite this paper when using NZEnvDS. These layers can also be downloaded from LRIS: https://lris.scinfo.org.nz/group/nzenvds/data/

Version 1.1: We were advised of a problem with the alignment of the NZTM layers, caused by the way the "raster" R package reprojects spatial layers; the NZTM layers were shifted about 200 m south of their intended location. We have corrected this in Version 1.1 by using the replacement for the "raster" package ("terra") to perform the reprojections from NZMG to NZTM. The updated reprojection scripts and updated NZTM layers are included in this version of the data. The NZMG layers, and the scripts used to generate them, are also included but remain unchanged from Version 1.0.
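For readers working outside R, here is an analogous NZMG-to-NZTM reprojection sketched in Python with rasterio (an assumption — the authors' scripts use the R packages "raster"/"terra"); EPSG:27200 is NZMG, EPSG:2193 is NZTM2000, and the file names are placeholders:

```python
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

DST_CRS = "EPSG:2193"  # NZTM2000

with rasterio.open("layer_nzmg.tif") as src:  # source layer in NZMG (EPSG:27200)
    transform, width, height = calculate_default_transform(
        src.crs, DST_CRS, src.width, src.height, *src.bounds)
    profile = src.profile.copy()
    profile.update(crs=DST_CRS, transform=transform, width=width, height=height)
    with rasterio.open("layer_nztm.tif", "w", **profile) as dst:
        reproject(
            source=rasterio.band(src, 1),
            destination=rasterio.band(dst, 1),
            resampling=Resampling.bilinear,  # continuous environmental layers
        )
```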
A tech stack is the combination of technologies a company uses to build and run an application or project. The most popular technology skill in the data science tech stack in 2024 was Python 3.x, chosen by **** percent of respondents. ETL ranked second, used by *** percent of respondents. This comes as no surprise given Python's importance in building artificial intelligence (AI) solutions and machine learning products.
No license specified (https://academictorrents.com/)
This data dump is sourced from the various sites in the Stack Exchange network of Q&A sites. This dump contains data up to and including 2025-06-30. The exact license for each bit of content is embedded in each entry; for license date ranges, see the root-level license.txt. For the schema, see the sede-and-data-dump-schema.md file within each .7z archive. This torrent has also been archived at
This dataset (known as "Stack Exchange Data Dump", Mathematics site) has been obtained from https://math.stackexchange.com/ (see also https://math.stackexchange.com/help/data-dumps). The dataset is very similar in structure to the Stack Overflow Data Dump.
This dataset is licensed under the Creative Commons CC BY-SA licence (2.5, 3.0, and/or 4.0). For the text of the licence(s) and other details, please see license.txt and https://stackoverflow.com/help/licensing.
The Stack Overflow dataset, a detailed archive of posts, votes, tags, and badges from the world’s largest programmer community.
Aim Species distribution models (SDMs) are powerful tools for assessing suitable habitats across large areas and at fine spatial resolution. Yet, the usefulness of SDMs for mapping species' realised distributions is often limited since data biases or missing information on dispersal barriers or biotic interactions hinder them from accurately delineating species' range limits. One way to overcome this limitation is to integrate SDMs with expert range maps, which provide coarse-scale information on the extent of species' ranges and thereby range limits that are complementary to information offered by SDMs.
Innovation Here, we propose a new approach for integrating expert range maps in SDMs based on an ensemble method called stacked generalisation. Specifically, our approach relies on training a meta-learner regression model using predictions from one or more SDM algorithms alongside the distance of training points to expert-defined ranges as predictor variables. We demonstrate our app...
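A minimal sketch of that stacked-generalisation setup with scikit-learn, on synthetic data; the base learner, meta-learner, and variable layout are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# X: environmental predictors at training points; y: presence/absence;
# dist_expert: distance of each training point to the expert range edge.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
dist_expert = rng.normal(size=(500, 1))
y = (X[:, 0] + 0.5 * dist_expert[:, 0] + rng.normal(size=500) > 0).astype(int)

# Level 0: out-of-fold SDM predictions, so the meta-learner is never
# trained on fitted values from the same points.
sdm = RandomForestClassifier(n_estimators=200, random_state=0)
sdm_pred = cross_val_predict(sdm, X, y, cv=5, method="predict_proba")[:, 1]

# Level 1 meta-learner: SDM output plus distance to the expert range map.
meta_X = np.column_stack([sdm_pred, dist_expert])
meta = LogisticRegression().fit(meta_X, y)
print(meta.coef_)  # weight given to the SDM vs. the expert range signal
```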
In summer 2014, the U.S. Geological Survey conducted a 21-day geophysical program in deep water along the Atlantic continental margin using the R/V Marcus G. Langseth (Field Activity Number 2014-011-FA). The purpose of the seismic program was to collect multichannel seismic reflection and refraction data to determine sediment thickness. These data enable the United States to delineate its Extended Continental Shelf (ECS) along the Atlantic margin. The same data can also be used to understand large submarine landslides and therefore assess their potential tsunami hazard for infrastructure and communities along the eastern seaboard. Supporting geophysical data included marine magnetic data, gravity data, 3.5-kilohertz shallow seismic reflections, multibeam echo sounder bathymetry, and multibeam backscatter. The survey was conducted from water depths of approximately 1,500 meters to abyssal seafloor depths greater than 5,000 meters. Approximately 2,761 kilometers of multichannel seismic data were collected, along with 30 sonobuoy profiles. This field program had two primary objectives: (1) to collect some of the data necessary to establish the outer limits of the U.S. Continental Shelf, or Extended Continental Shelf, as defined by Article 76 of the United Nations Convention on the Law of the Sea, and (2) to study the sudden mass transport of sediments down the continental margin as submarine landslides that pose potential tsunamigenic hazards to Atlantic and Caribbean coastal communities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication package for the paper "What do Developers Discuss about Code Comments?"
Appendix.pdf
Tags-topics.md
Stack-exchange-query.md
RQ1/
    LDA_input/
        combined-so-quora-mallet-metadata.csv
        topic-input.mallet
    LDA_output/
        Mallet/
        output_csv/
            docs-in-topics.csv
            topic-words.csv
            topics-in-docs.csv
            topics-metadata.csv
        output_html/
            all_topics.html
            Docs/
            Topics/
RQ2/
    datasource_rawdata/
        quora.csv
        stackoverflow.csv
    manual_analysis_output/
        stackoverflow_quora_taxonomy.xlsx
Appendix.pdf - appendix of the paper, containing supplementary tables.
Tags-topics.md - tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2).
Stack-exchange-query.md - the query interface used to extract the posts from the Stack Exchange explorer.
RQ1/ - contains the data used to answer RQ1:
combined-so-quora-mallet-metadata.csv - Stack Overflow and Quora questions used to perform the LDA analysis (see the sketch after this list).
topic-input.mallet - input file to the Mallet tool.
docs-in-topics.csv - documents per topic.
topic-words.csv - most relevant topic words.
topics-in-docs.csv - topic probability per document.
topics-metadata.csv - metadata per document and topic probability.
all_topics.html, Docs/, Topics/ - HTML output of the LDA analysis.
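Since the Mallet invocation itself is not part of this listing, the following hedged sketch reproduces the spirit of the RQ1 pipeline in Python with gensim instead of Mallet: it fits an LDA model on the combined questions and emits analogues of the topic-word and document-topic outputs (the "text" column name is an assumption):

```python
import csv
from gensim import corpora
from gensim.models import LdaModel

# Tokenize the combined Stack Overflow / Quora questions (crude whitespace split).
with open("combined-so-quora-mallet-metadata.csv", newline="", encoding="utf-8") as f:
    docs = [row["text"].lower().split() for row in csv.DictReader(f)]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow, num_topics=10, id2word=dictionary, random_state=0)

# Rough analogues of the Mallet outputs in output_csv/:
print(lda.show_topics(num_words=10))    # cf. topic-words.csv
print(lda.get_document_topics(bow[0]))  # cf. topics-in-docs.csv
```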
RQ2/ - contains the data used to answer RQ2:
quora.csv - the processed Quora dataset (e.g., HTML tags removed). To know more about the preprocessing steps, please refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
stackoverflow.csv - the processed Stack Overflow dataset, preprocessed the same way using the Makar tool.
stackoverflow_quora_taxonomy.xlsx - the classified Stack Overflow and Quora datasets and a description of the taxonomy:
Taxonomy - describes the first-dimension and second-dimension categories. Second-dimension categories are further divided into levels, separated by the | symbol.
stackoverflow-posts - the questions are labelled relevant or irrelevant and categorized into the first-dimension and second-dimension categories.
quora-posts - the questions are labelled relevant or irrelevant and categorized into the first-dimension and second-dimension categories.

Subscribers can find export and import data of 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This bar chart displays the number of employees (people) by tech stack, aggregated by sum. The data is about companies and is filtered to the Software industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used historical crop-yield data for 27 Indian states and 3 Union Territories, covering the years 1997 to 2020. The dataset consists of 19,689 data points, each with ten features: Crop, Season, Crop_Year, State, Annual_Rainfall, Area, Production, Pesticide, Fertilizer, and Yield. The dataset encompasses 55 different types of crops cultivated across India. It was used to predict crop yield with a stacking ensemble regression model. The data is split into 80% training and 20% testing.
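A hedged sketch of such a stacking ensemble with scikit-learn; only the feature names and the 80/20 split come from the description above, while the choice of base and final estimators and the file name are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv("crop_yield.csv")  # file name is an assumption
# One-hot encode categorical features; numeric columns pass through unchanged.
X = pd.get_dummies(df[["Crop", "Season", "Crop_Year", "State",
                       "Annual_Rainfall", "Area", "Production",
                       "Pesticide", "Fertilizer"]])
y = df["Yield"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train / 20% test

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                ("ridge", Ridge())],
    final_estimator=Ridge())
stack.fit(X_train, y_train)
print("R^2 on held-out 20%:", stack.score(X_test, y_test))
```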
This statistic shows data on media meshing and stacking in the United States in 2016. During the survey period, it was found that ** percent of U.S. internet users accessed program-related content on mobile devices while watching TV.
This data set consists of 1,000+ z-stacks of actin- and nucleus-stained cells. The cells reside on 10 different scaffold types and are segmented from the z-stacks to investigate their shape changes.
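As a rough sketch of how such a z-stack might be segmented in Python (scikit-image and tifffile are assumptions; the dataset's actual file layout and segmentation pipeline may differ):

```python
import tifffile
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

stack = tifffile.imread("cell_stack.tif")  # shape assumed to be (z, y, x)
projection = stack.max(axis=0)             # max-intensity projection over z

# Threshold the projection and label connected components as cells/nuclei.
mask = projection > threshold_otsu(projection)
labels = label(mask)
for region in regionprops(labels):
    # Simple shape descriptors per segmented object.
    print(region.label, region.area, region.eccentricity)
```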
No description was included in this dataset collected from the OSF.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This bar chart displays the number of employees (people) by tech stack, aggregated by sum. The data is about companies and is filtered to the Communication Services sector.