https://choosealicense.com/licenses/other/
The Stack v2
The dataset consists of 4 versions:
- bigcode/the-stack-v2: the full "The Stack v2" dataset
- bigcode/the-stack-v2-dedup: based on bigcode/the-stack-v2, but further near-deduplicated
- bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset, but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
- bigcode/the-stack-v2-train-smol-ids: based on the bigcode/the-stack-v2-dedup… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids.
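As a minimal sketch of getting at the data, the subset below is streamed with the Hugging Face `datasets` library; this assumes access to the gated dataset has been granted and `huggingface-cli login` has been run, and the config name "Python" and the field names follow the dataset card's conventions (check the card to confirm):

```python
# A minimal sketch, not the canonical loading recipe from the dataset card.
from datasets import load_dataset

# Stream one language subset instead of downloading the full dataset.
ds = load_dataset("bigcode/the-stack-v2", "Python", split="train", streaming=True)

for row in ds:
    # v2 rows carry Software Heritage blob IDs, not file contents; the text
    # itself must be fetched separately, as described on the dataset card.
    print(row["blob_id"], row["repo_name"])
    break
```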
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
Release — Description
v1.0 — Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three of the included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.
v1.1 — The three weak copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.
No license specified (https://academictorrents.com/)
This data dump is sourced from the various sites in the Stack Exchange network of Q&A sites. This dump contains data up to and including 2025-06-30. This revision contains one bugfix over rev. 1: one site lacked content in its posts.xml. The exact license for each bit of content is embedded in each entry; for license date ranges, see the root-level license.txt. For the schema, see the sede-and-data-dump-schema.md file within each .7z archive. This torrent has also been archived at
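As a rough illustration of reading one table from the dump (assuming Posts.xml has been extracted from one of the .7z archives), records are attribute-only `<row/>` elements, so Python's standard-library iterparse can stream multi-gigabyte files with flat memory use:

```python
import xml.etree.ElementTree as ET

def iter_rows(path):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # every column is an XML attribute
            elem.clear()             # release the element to bound memory

for row in iter_rows("Posts.xml"):
    # The per-entry license mentioned above is carried on each row
    # (the ContentLicense attribute in recent dumps).
    print(row.get("Id"), row.get("ContentLicense"))
    break
```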
http://www.apache.org/licenses/LICENSE-2.0
This package contains SlowOps, an industrial dataset of stack traces introduced in our paper "Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios".
SlowOps is a dataset of stack traces (reports) and their categories (issues), aimed at evaluating different models for stack trace deduplication. The dataset includes reports related to Slow Operation Assertion, collected at JetBrains from IntelliJ-based products between 26.01.2021 and 29.02.2024. It contains 886,730 reports in 1,361 categories.
For more information about the dataset, please refer to the README.
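As a toy baseline for the deduplication task — not one of the models evaluated in the paper — a new report can be attached to the most similar known issue by Jaccard overlap of frame sets; the frame-list representation below is a hypothetical schema (see the README for the real one):

```python
def jaccard(frames_a, frames_b):
    a, b = set(frames_a), set(frames_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def assign_issue(frames, issues, threshold=0.8):
    """Return the id of the best-matching issue, or open a new one."""
    best_id, best_score = None, 0.0
    for issue_id, representative in issues.items():
        score = jaccard(frames, representative)
        if score > best_score:
            best_id, best_score = issue_id, score
    if best_score >= threshold:
        return best_id
    new_id = len(issues)
    issues[new_id] = frames  # this report seeds a new issue
    return new_id

issues = {}
print(assign_issue(["a.f", "b.g", "c.h"], issues))  # opens issue 0
print(assign_issue(["a.f", "b.g", "c.i"], issues))  # jaccard 0.5 < 0.8, opens issue 1
```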
A variety of studies and disparate datasets track state energy storage policies, but these datasets do not cover all BTM-related storage policy. Moreover, these databases do not align policies with the policy stacking framework. Thus, it is unclear which BTM storage policies are adopted across the country, what should comprise a complete storage policy framework or stack, or how states' policies compare with that stack. This first-of-its-kind BTM storage policy stack includes 11 parent policy categories and 31 policies across the market preparation, creation, and expansion policy components.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A standardised collection of environmental spatial layers for biodiversity modelling and site characterisation. These spatial layers were published in: McCarthy, J. K., Leathwick, J. R., Roudier, P., Barringer, J. R. F., Etherington, T. R., Morgan, F. J., Odgers, N. P., Price, R. H., Wiser, S. K., Richardson, S. J. (2021) New Zealand Environmental Data Stack (NZEnvDS): A standardised collection of environmental spatial layers for biodiversity modelling and site characterisation. New Zealand Journal of Ecology, 45(2): 3440. https://dx.doi.org/10.20417/nzjecol.45.31. Please cite this paper when using NZEnvDS. These layers can also be downloaded from LRIS: https://lris.scinfo.org.nz/group/nzenvds/data/

Version 1.1: We were advised of a problem with the alignment of the NZTM layers, caused by the way the "raster" R package reprojects spatial layers; the NZTM layers were shifted about 200 m south of their intended location. We have corrected this in Version 1.1 by using the replacement for the "raster" package ("terra") to perform the reprojections from NZMG to NZTM. The updated reprojection scripts and updated NZTM layers are included in this version of the data. The NZMG layers, and the scripts used to generate them, are also included but remain unchanged from Version 1.0.
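For readers working outside R, here is an analogous NZMG-to-NZTM reprojection sketched in Python with rasterio (an assumption — the authors' scripts use the R packages "raster"/"terra"); EPSG:27200 is NZMG, EPSG:2193 is NZTM2000, and the file names are placeholders:

```python
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

DST_CRS = "EPSG:2193"  # NZTM2000

with rasterio.open("layer_nzmg.tif") as src:  # source layer in NZMG (EPSG:27200)
    transform, width, height = calculate_default_transform(
        src.crs, DST_CRS, src.width, src.height, *src.bounds)
    profile = src.profile.copy()
    profile.update(crs=DST_CRS, transform=transform, width=width, height=height)
    with rasterio.open("layer_nztm.tif", "w", **profile) as dst:
        reproject(
            source=rasterio.band(src, 1),
            destination=rasterio.band(dst, 1),
            resampling=Resampling.bilinear,  # continuous environmental layers
        )
```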
A tech stack is the combination of technologies a company uses to build and run an application or project. The most popular technology skill in the data science tech stack in 2024 was Python 3.x, chosen by **** percent of respondents. ETL ranked second, used by *** percent of respondents. This comes as no surprise given Python's importance in building artificial intelligence (AI) solutions and machine learning products.
No license specified (https://academictorrents.com/)
This data dump is sourced from the various sites in the Stack Exchange network of Q&A sites. This dump contains data up to and including 2025-06-30. The exact license for each bit of content is embedded in each entry; for license date ranges, see the root-level license.txt. For the schema, see the sede-and-data-dump-schema.md file within each .7z archive. This torrent has also been archived at
This dataset (known as "Stack Exchange Data Dump", Mathematics site) has been obtained from https://math.stackexchange.com/ (see also https://math.stackexchange.com/help/data-dumps). The dataset is very similar in structure to the Stack Overflow Data Dump.
This dataset is licensed under the Creative Commons CC BY-SA licence (2.5, 3.0, and/or 4.0). For the text of the licence(s) and other details, please see license.txt and https://stackoverflow.com/help/licensing.
The Stack Overflow dataset, a detailed archive of posts, votes, tags, and badges from the world’s largest programmer community.
Aim Species distribution models (SDMs) are powerful tools for assessing suitable habitats across large areas and at fine spatial resolution. Yet, the usefulness of SDMs for mapping species' realised distributions is often limited since data biases or missing information on dispersal barriers or biotic interactions hinder them from accurately delineating species' range limits. One way to overcome this limitation is to integrate SDMs with expert range maps, which provide coarse-scale information on the extent of species' ranges and thereby range limits that are complementary to information offered by SDMs.
Innovation Here, we propose a new approach for integrating expert range maps in SDMs based on an ensemble method called stacked generalisation. Specifically, our approach relies on training a meta-learner regression model using predictions from one or more SDM algorithms alongside the distance of training points to expert-defined ranges as predictor variables. We demonstrate our app...
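A minimal sketch of that stacked-generalisation setup with scikit-learn, on synthetic data; the base learner, meta-learner, and variable layout are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# X: environmental predictors at training points; y: presence/absence;
# dist_expert: distance of each training point to the expert range edge.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
dist_expert = rng.normal(size=(500, 1))
y = (X[:, 0] + 0.5 * dist_expert[:, 0] + rng.normal(size=500) > 0).astype(int)

# Level 0: out-of-fold SDM predictions, so the meta-learner is never
# trained on fitted values from the same points.
sdm = RandomForestClassifier(n_estimators=200, random_state=0)
sdm_pred = cross_val_predict(sdm, X, y, cv=5, method="predict_proba")[:, 1]

# Level 1 meta-learner: SDM output plus distance to the expert range map.
meta_X = np.column_stack([sdm_pred, dist_expert])
meta = LogisticRegression().fit(meta_X, y)
print(meta.coef_)  # weight given to the SDM vs. the expert range signal
```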
In summer 2014, the U.S. Geological Survey conducted a 21-day geophysical program in deep water along the Atlantic continental margin using the R/V Marcus G. Langseth (Field Activity Number 2014-011-FA). The purpose of the seismic program was to collect multichannel seismic reflection and refraction data to determine sediment thickness. These data enable the United States to delineate its Extended Continental Shelf (ECS) along the Atlantic margin. The same data can also be used to understand large submarine landslides and therefore assess their potential tsunami hazard for infrastructure and communities along the eastern seaboard. Supporting geophysical data included marine magnetic data, gravity data, 3.5-kilohertz shallow seismic reflections, multibeam echo sounder bathymetry, and multibeam backscatter. The survey was conducted from water depths of approximately 1,500 meters to abyssal seafloor depths greater than 5,000 meters. Approximately 2,761 kilometers of multichannel seismic data were collected, along with 30 sonobuoy profiles. This field program had two primary objectives: (1) to collect some of the data necessary to establish the outer limits of the U.S. Continental Shelf, or Extended Continental Shelf, as defined by Article 76 of the United Nations Convention on the Law of the Sea, and (2) to study the sudden mass transport of sediments down the continental margin as submarine landslides that pose potential tsunamigenic hazards to Atlantic and Caribbean coastal communities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication package for the paper "What do Developers Discuss about Code Comments?"
Appendix.pdf
Tags-topics.md
Stack-exchange-query.md
RQ1/
    LDA_input/
        combined-so-quora-mallet-metadata.csv
        topic-input.mallet
    LDA_output/
        Mallet/
        output_csv/
            docs-in-topics.csv
            topic-words.csv
            topics-in-docs.csv
            topics-metadata.csv
        output_html/
            all_topics.html
            Docs/
            Topics/
RQ2/
    datasource_rawdata/
        quora.csv
        stackoverflow.csv
    manual_analysis_output/
        stackoverflow_quora_taxonomy.xlsx
Appendix.pdf - appendix of the paper, containing supplementary tables.
Tags-topics.md - tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2).
Stack-exchange-query.md - the query interface used to extract the posts from the Stack Exchange explorer.
RQ1/ - contains the data used to answer RQ1:
combined-so-quora-mallet-metadata.csv - Stack Overflow and Quora questions used to perform the LDA analysis (see the sketch after this list).
topic-input.mallet - input file to the Mallet tool.
docs-in-topics.csv - documents per topic.
topic-words.csv - most relevant topic words.
topics-in-docs.csv - topic probability per document.
topics-metadata.csv - metadata per document and topic probability.
all_topics.html, Docs/, Topics/ - HTML output of the LDA analysis.
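Since the Mallet invocation itself is not part of this listing, the following hedged sketch reproduces the spirit of the RQ1 pipeline in Python with gensim instead of Mallet: it fits an LDA model on the combined questions and emits analogues of the topic-word and document-topic outputs (the "text" column name is an assumption):

```python
import csv
from gensim import corpora
from gensim.models import LdaModel

# Tokenize the combined Stack Overflow / Quora questions (crude whitespace split).
with open("combined-so-quora-mallet-metadata.csv", newline="", encoding="utf-8") as f:
    docs = [row["text"].lower().split() for row in csv.DictReader(f)]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow, num_topics=10, id2word=dictionary, random_state=0)

# Rough analogues of the Mallet outputs in output_csv/:
print(lda.show_topics(num_words=10))    # cf. topic-words.csv
print(lda.get_document_topics(bow[0]))  # cf. topics-in-docs.csv
```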
RQ2/ - contains the data used to answer RQ2:
quora.csv - the processed Quora dataset (e.g., HTML tags removed). To know more about the preprocessing steps, please refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
stackoverflow.csv - the processed Stack Overflow dataset, preprocessed the same way using the Makar tool.
stackoverflow_quora_taxonomy.xlsx - the classified Stack Overflow and Quora datasets and a description of the taxonomy:
Taxonomy - describes the first-dimension and second-dimension categories. Second-dimension categories are further divided into levels, separated by the | symbol.
stackoverflow-posts - the questions are labelled relevant or irrelevant and categorized into the first-dimension and second-dimension categories.
quora-posts - the questions are labelled relevant or irrelevant and categorized into the first-dimension and second-dimension categories.

Subscribers can find export and import data of 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This bar chart displays the number of employees (people) by tech stack, aggregated by sum. The data is about companies and is filtered to the Software industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We used historical crop-yield data for 27 Indian states and 3 Union Territories, covering the years 1997 to 2020. The dataset consists of 19,689 data points, each with ten features: Crop, Season, Crop_Year, State, Annual_Rainfall, Area, Production, Pesticide, Fertilizer, and Yield. The dataset encompasses 55 different types of crops cultivated across India. It was used to predict crop yield with a stacking ensemble regression model. The data is split into 80% training and 20% testing.
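A hedged sketch of such a stacking ensemble with scikit-learn; only the feature names and the 80/20 split come from the description above, while the choice of base and final estimators and the file name are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

df = pd.read_csv("crop_yield.csv")  # file name is an assumption
# One-hot encode categorical features; numeric columns pass through unchanged.
X = pd.get_dummies(df[["Crop", "Season", "Crop_Year", "State",
                       "Annual_Rainfall", "Area", "Production",
                       "Pesticide", "Fertilizer"]])
y = df["Yield"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train / 20% test

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                ("ridge", Ridge())],
    final_estimator=Ridge())
stack.fit(X_train, y_train)
print("R^2 on held-out 20%:", stack.score(X_test, y_test))
```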
This statistic shows data on media meshing and stacking in the United States in 2016. During the survey period, it was found that ** percent of U.S. internet users accessed program-related content on mobile devices while watching TV.
This data set consists of 1,000+ z-stacks of actin- and nucleus-stained cells. The cells reside on 10 different scaffold types and are segmented from the z-stacks to investigate their shape changes.
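As a rough sketch of how such a z-stack might be segmented in Python (scikit-image and tifffile are assumptions; the dataset's actual file layout and segmentation pipeline may differ):

```python
import tifffile
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

stack = tifffile.imread("cell_stack.tif")  # shape assumed to be (z, y, x)
projection = stack.max(axis=0)             # max-intensity projection over z

# Threshold the projection and label connected components as cells/nuclei.
mask = projection > threshold_otsu(projection)
labels = label(mask)
for region in regionprops(labels):
    # Simple shape descriptors per segmented object.
    print(region.label, region.area, region.eccentricity)
```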
No description was included in this dataset collected from the OSF.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This bar chart displays the number of employees (people) by tech stack, aggregated by sum. The data is about companies and is filtered to the Communication Services sector.