100+ datasets found
  1. the-stack-v2-train-smol-ids

    • huggingface.co
    Updated Mar 1, 2024
    Cite
    BigCode (2024). the-stack-v2-train-smol-ids [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids
    Dataset updated: Mar 1, 2024
    Dataset authored and provided by: BigCode
    License: https://choosealicense.com/licenses/other/

    Description

    The Stack v2

    The dataset consists of 4 versions:

    • bigcode/the-stack-v2: the full "The Stack v2" dataset
    • bigcode/the-stack-v2-dedup: based on the bigcode/the-stack-v2 but further near-deduplicated
    • bigcode/the-stack-v2-train-full-ids: based on the bigcode/the-stack-v2-dedup dataset but further filtered with heuristics and spanning 600+ programming languages. The data is grouped into repositories.
    • bigcode/the-stack-v2-train-smol-ids: based on the bigcode/the-stack-v2-dedup…

    See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids.

  2. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    Cite
    BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Dataset updated: Oct 27, 2022
    Dataset authored and provided by: BigCode
    License: https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.

  3. Stack Exchange Data Dump (2025-06-30, revision 2)

    • academictorrents.com
    bittorrent
    Updated Aug 1, 2025
    Cite
    Stack Exchange Community (2025). Stack Exchange Data Dump (2025-06-30, revision 2) [Dataset]. https://academictorrents.com/details/53d504734619bc57bf4f4ec81fdf2a2536b3b501
    Available download formats: bittorrent (97361592423)
    Dataset updated: Aug 1, 2025
    Dataset provided by: Stack Exchange (http://stackexchange.com/)
    Authors: Stack Exchange Community
    License: https://academictorrents.com/nolicensespecified

    Description

    This data dump is sourced from the various sites in the Stack Exchange network of Q&A sites. This dump contains data up to and including 2025-06-30. This revision contains one bugfix from rev. 1: one site lacked content in its posts.xml. The exact license for each piece of content is embedded in each entry. For license date ranges, see the root-level license.txt. For the schema, see the sede-and-data-dump-schema.md file within each .7z. This torrent has also been archived at
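The description above mentions a posts.xml file in each site archive. As a rough illustration of how such a file could be streamed without loading it into memory, here is a minimal sketch. It assumes each post is stored as a `<row>` element whose fields are XML attributes; the attribute names (`Id`, `PostTypeId`, `Score`, `Title`, `ParentId`) and the two-row sample are assumptions for illustration only, and the sede-and-data-dump-schema.md file in the dump remains the authoritative reference.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-row sample standing in for a real posts.xml file.
# The attribute names here are assumptions; check the schema file in the dump.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Title="Example question" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" />
</posts>"""


def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)  # copy before clearing the element
            elem.clear()  # release the element's data to keep memory flat


rows = list(iter_rows(io.BytesIO(SAMPLE)))
# In this sketch, PostTypeId "1" is assumed to mark a question.
questions = [row for row in rows if row.get("PostTypeId") == "1"]
print(len(rows), len(questions))
```

For a real archive, the same generator could be given an open file handle for the extracted XML file instead of the in-memory sample.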

  4. SlowOps: An Industrial Dataset of Stack Traces

    • zenodo.org
    bin, zip
    Updated Dec 12, 2024
    Cite
    Egor Shibaev; Denis Sushentsev; Yaroslav Golubev; Aleksandr Khvorov (2024). SlowOps: An Industrial Dataset of Stack Traces [Dataset]. http://doi.org/10.5281/zenodo.14364858
    Available download formats: bin, zip
    Dataset updated: Dec 12, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Egor Shibaev; Denis Sushentsev; Yaroslav Golubev; Aleksandr Khvorov
    License: http://www.apache.org/licenses/LICENSE-2.0

    Description

    This package contains SlowOps, an industrial dataset of stack traces introduced in our paper "Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios".

    SlowOps is a dataset of stack traces (reports) and their categories (issues), aimed at evaluating different models for stack trace deduplication. The dataset includes reports related to Slow Operation Assertion, collected at JetBrains from IntelliJ-based products from 26.01.2021 to 29.02.2024. It contains 886,730 reports in 1,361 categories.

    For more information about the dataset, please refer to the README.
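To make the deduplication task concrete, the sketch below assigns a new report to the most similar existing category using Jaccard similarity over sets of frames. This is a deliberately naive baseline chosen for illustration, not the method from the paper; the frame names, category ids, and the 0.5 threshold are all hypothetical, and the actual file layout of SlowOps is described in its README.

```python
def jaccard(frames_a, frames_b):
    """Jaccard similarity between two stack traces, treated as sets of frames."""
    a, b = set(frames_a), set(frames_b)
    if not a and not b:
        return 1.0  # two empty traces are considered identical
    return len(a & b) / len(a | b)


def assign_category(report, categories, threshold=0.5):
    """Return the id of the most similar category, or None if none is close enough."""
    best_id, best_score = None, 0.0
    for category_id, frames in categories.items():
        score = jaccard(report, frames)
        if score > best_score:
            best_id, best_score = category_id, score
    return best_id if best_score >= threshold else None


# Hypothetical categories, each represented by one reference stack trace.
categories = {
    "issue-1": ["ui.paint", "vfs.refresh", "edt.dispatch"],
    "issue-2": ["index.lookup", "psi.resolve", "edt.dispatch"],
}
report = ["index.lookup", "psi.resolve", "editor.highlight", "edt.dispatch"]
result = assign_category(report, categories)
print(result)
```

Set-based similarity ignores frame order and depth, which is one reason real deduplication models tend to be more elaborate than this sketch.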

  5. Data from: Behind-the-Meter Storage Policy Stack

    • catalog.data.gov
    • data.openei.org
    • +1more
    Updated Jan 22, 2025
    Cite
    National Renewable Energy Laboratory (2025). Behind-the-Meter Storage Policy Stack [Dataset]. https://catalog.data.gov/dataset/behind-the-meter-storage-policy-stack-164ba
    Dataset updated: Jan 22, 2025
    Dataset provided by: National Renewable Energy Laboratory
    Description

    A variety of studies and disparate data sets track state energy storage policies, but these datasets do not cover all behind-the-meter (BTM) storage policies. Moreover, these databases do not align policies with the policy stacking framework. Thus, it is unclear which BTM storage policies are adopted across the country, what should comprise a complete storage policy framework or stack, or how states' policies compare with that stack. This first-of-its-kind BTM storage policy stack includes 11 parent policy categories and 31 policies across the market preparation, creation, and expansion policy components.

  6. New Zealand Environmental Data Stack (NZEnvDS) - Dataset - DataStore

    • datastore.landcareresearch.co.nz
    Updated Apr 8, 2022
    Cite
    (2022). New Zealand Environmental Data Stack (NZEnvDS) - Dataset - DataStore [Dataset]. https://datastore.landcareresearch.co.nz/dataset/nzenvds
    Dataset updated: Apr 8, 2022
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically
    Area covered: New Zealand
    Description

    A standardised collection of environmental spatial layers for biodiversity modelling and site characterisation.

    These spatial layers were published in: McCarthy, J. K., Leathwick, J. R., Roudier, P., Barringer, J. R. F., Etherington, T. R., Morgan, F. J., Odgers, N. P., Price, R. H., Wiser, S. K., Richardson, S. J. (2021) New Zealand Environmental Data Stack (NZEnvDS): A standardised collection of environmental spatial layers for biodiversity modelling and site characterisation. New Zealand Journal of Ecology, 45(2): 3440 https://dx.doi.org/10.20417/nzjecol.45.31

    Please cite this paper when using NZEnvDS. These layers can also be downloaded from LRIS: https://lris.scinfo.org.nz/group/nzenvds/data/

    Version 1.1

    We were advised of a problem with the alignment of the NZTM layers that was caused by an issue with the way the “raster” R-package reprojects spatial layers. This resulted in the NZTM layers being shifted about 200 m south from their intended location. We have corrected this in Version 1.1 of the data by using the replacement for the “raster” R package (“terra”) to perform the reprojections from NZMG to NZTM. The updated scripts used to reproject the layers and updated NZTM layers are included in this version of the data. The NZMG layers are also included, and the scripts used to generate them, but they remain unchanged from Version 1.0.

  7. Most used technologies in the data science tech stack worldwide 2024

    • statista.com
    Updated Jun 26, 2025
    Cite
    Statista (2025). Most used technologies in the data science tech stack worldwide 2024 [Dataset]. https://www.statista.com/statistics/1292394/popular-technologies-in-the-data-science-tech-stack/
    Dataset updated: Jun 26, 2025
    Dataset authored and provided by: Statista (http://statista.com/)
    Time period covered: Jan 1, 2024 - Jun 30, 2024
    Area covered: Worldwide
    Description

    A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the data science tech stack in 2024 was Python 3.x, chosen by **** percent of respondents. ETL ranked second, being used by *** percent of respondents. This comes as no surprise due to Python's importance in building artificial intelligence (AI) solutions and machine learning products.

  8. Stack Exchange Data Dump (2025-06-30)

    • academictorrents.com
    bittorrent
    Updated Jul 17, 2025
    Cite
    Stack Exchange Community (2025). Stack Exchange Data Dump (2025-06-30) [Dataset]. https://academictorrents.com/details/7c8c9a8ffff4d962e052674e236ea0b7390cd9c0
    Available download formats: bittorrent (97339542034)
    Dataset updated: Jul 17, 2025
    Dataset provided by: Stack Exchange (http://stackexchange.com/)
    Authors: Stack Exchange Community
    License: https://academictorrents.com/nolicensespecified

    Description

    This data dump is sourced from the various sites in the Stack Exchange network of Q&A sites. This dump contains data up to and including 2025-06-30. The exact license for each piece of content is embedded in each entry. For license date ranges, see the root-level license.txt. For the schema, see the sede-and-data-dump-schema.md file within each .7z. This torrent has also been archived at

  9. Math StackExchange Dump .parquet

    • kaggle.com
    Updated Aug 7, 2025
    Cite
    Andrey (2025). Math StackExchange Dump .parquet [Dataset]. https://www.kaggle.com/datasets/andreyvm/math-stackexchange-dump-parquet
    Croissant: a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated: Aug 7, 2025
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Andrey

  10. Data from: Stack Overflow Dataset

    • gts.ai
    json
    Updated Dec 19, 2024
    Cite
    GTS (2024). Stack Overflow Dataset [Dataset]. https://gts.ai/dataset-download/stack-overflow-dataset/
    Available download formats: json
    Dataset updated: Dec 19, 2024
    Dataset provided by: GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors: GTS
    Description

    The Stack Overflow dataset is a detailed archive of posts, votes, tags, and badges from the world’s largest programmer community.

  11. Data from: The best of two worlds: using stacked generalisation for...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Sep 25, 2024
    Cite
    Julian Oeser; Damaris Zurell; Frieder Mayer; Emrah Çoraman; Niya Toshkova; Stanimira Deleva; Ioseb Natradze; Petr Benda; Astghik Ghazaryan; Sercan Irmak; Nijat Hasanov; Gulnar Guliyeva; Mariya Gritsina; Tobias Kuemmerle (2024). The best of two worlds: using stacked generalisation for integrating expert range maps in species distribution models [Dataset]. http://doi.org/10.5061/dryad.6q573n65m
    Available download formats: zip
    Dataset updated: Sep 25, 2024
    Dataset provided by: Dryad
    Authors: Julian Oeser; Damaris Zurell; Frieder Mayer; Emrah Çoraman; Niya Toshkova; Stanimira Deleva; Ioseb Natradze; Petr Benda; Astghik Ghazaryan; Sercan Irmak; Nijat Hasanov; Gulnar Guliyeva; Mariya Gritsina; Tobias Kuemmerle
    Time period covered: Mar 18, 2024
    Description

    Aim: Species distribution models (SDMs) are powerful tools for assessing suitable habitats across large areas and at fine spatial resolution. Yet, the usefulness of SDMs for mapping species' realised distributions is often limited since data biases or missing information on dispersal barriers or biotic interactions hinder them from accurately delineating species' range limits. One way to overcome this limitation is to integrate SDMs with expert range maps, which provide coarse-scale information on the extent of species' ranges and thereby range limits that are complementary to information offered by SDMs.

    Innovation: Here, we propose a new approach for integrating expert range maps in SDMs based on an ensemble method called stacked generalisation. Specifically, our approach relies on training a meta-learner regression model using predictions from one or more SDM algorithms alongside the distance of training points to expert-defined ranges as predictor variables. We demonstrate our app...

  12. Post-stack migrated SEG-Y multi-channel seismic data collected by the U.S....

    • catalog.data.gov
    • data.usgs.gov
    • +5more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Post-stack migrated SEG-Y multi-channel seismic data collected by the U.S. Geological Survey in U.S. Atlantic Seaboard in 2014 [Dataset]. https://catalog.data.gov/dataset/post-stack-migrated-seg-y-multi-channel-seismic-data-collected-by-the-u-s-geological-surve
    Dataset updated: Jul 6, 2024
    Dataset provided by: United States Geological Survey (http://www.usgs.gov/)
    Area covered: East Coast of the United States, United States
    Description

    In summer 2014, the U.S. Geological Survey conducted a 21-day geophysical program in deep water along the Atlantic continental margin by using R/V Marcus G. Langseth (Field Activity Number 2014-011-FA). The purpose of the seismic program was to collect multichannel seismic reflection and refraction data to determine sediment thickness. These data enable the United States to delineate its Extended Continental Shelf (ECS) along the Atlantic margin. The same data can also be used to understand large submarine landslides and therefore assess their potential tsunami hazard for infrastructure and communities living along the eastern seaboard.

    Supporting geophysical data were collected as marine magnetic data, gravity data, 3.5-kilohertz shallow seismic reflections, multibeam echo sounder bathymetry, and multibeam backscatter. The survey was conducted from water depths of approximately 1,500 meters to abyssal seafloor depths greater than 5,000 meters. Approximately 2,761 kilometers of multi-channel seismic data were collected along with 30 sonobuoy profiles.

    This field program had two primary objectives: (1) to collect some of the data necessary to establish the outer limits of the U.S. Continental Shelf, or Extended Continental Shelf, as defined by Article 76 of the United Nations Convention on the Law of the Sea and (2) to study the sudden mass transport of sediments down the continental margin as submarine landslides that pose potential tsunamigenic hazards to the Atlantic and Caribbean coastal communities.

  13. Replication package for the paper "What do Developers Discuss about Code...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2021
    Cite
    Anonymous (2021). Replication package for the paper "What do Developers Discuss about Code Comments" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4470125
    Dataset updated: Jun 30, 2021
    Dataset authored and provided by: Anonymous
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RP-commenting-practices-multiple-sources

    Replication package for the paper "What do Developers Discuss about Code Comments?"

    Structure

    Appendix.pdf
    Tags-topics.md
    Stack-exchange-query.md
    
    RQ1/
      LDA_input/
        combined-so-quora-mallet-metadata.csv
        topic-input.mallet
    
      LDA_output/
        Mallet/
          output_csv/
            docs-in-topics.csv
            topic-words.csv
            topics-in-docs.csv
            topics-metadata.csv
          output_html/
            all_topics.html
            Docs/
            Topics/
    
    RQ2/
      datasource_rawdata/
        quora.csv
        stackoverflow.csv
      manual_analysis_output/
        stackoverflow_quora_taxonomy.xlsx
    

    Contents of the Replication Package

    • Appendix.pdf - appendix of the paper containing supplementary tables

    • Tags-topics.md - tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2)

    • Stack-exchange-query.md - the query interface used to extract the posts from the Stack Exchange explorer

    • RQ1/ - contains the data used to answer RQ1
      • LDA_input/ - input data used for LDA analysis
        • combined-so-quora-mallet-metadata.csv - Stack Overflow and Quora questions used to perform LDA analysis
        • topic-input.mallet - input file to the MALLET tool
      • LDA_output/
        • Mallet/ - contains the LDA output generated by the MALLET tool
          • output_csv/
            • docs-in-topics.csv - documents per topic
            • topic-words.csv - most relevant topic words
            • topics-in-docs.csv - topic probability per document
            • topics-metadata.csv - metadata per document and topic probability
          • output_html/ - browsable results of the MALLET output
            • all_topics.html
            • Docs/
            • Topics/

    • RQ2/ - contains the data used to answer RQ2
      • datasource_rawdata/ - contains the raw data for each source
        • quora.csv - contains the processed Quora dataset (for example, with HTML tags removed). For the preprocessing steps, refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
        • stackoverflow.csv - contains the processed Stack Overflow dataset. For the preprocessing steps, refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
      • manual_analysis_output/
        • stackoverflow_quora_taxonomy.xlsx - contains the classified dataset of Stack Overflow and Quora and a description of the taxonomy.
          • Taxonomy - contains the description of the first dimension and second dimension categories. Second dimension categories are further divided into levels, separated by the | symbol.
          • stackoverflow-posts - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories.
          • quota-posts - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories.

  14. Seair Exim Solutions

    • seair.co.in
    Updated Nov 17, 2016
    Cite
    Seair Exim (2016). Seair Exim Solutions [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated: Nov 17, 2016
    Dataset provided by: Seair Info Solutions PVT LTD
    Authors: Seair Exim
    Area covered: India
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  15. Distribution of employees per stack where industry equals Software

    • workwithdata.com
    Updated May 6, 2025
    Cite
    Work With Data (2025). Distribution of employees per stack where industry equals Software [Dataset]. https://www.workwithdata.com/charts/companies?agg=sum&chart=bar&f=1&fcol0=industry&fop0=%3D&fval0=Software&x=stack&y=employees
    Dataset updated: May 6, 2025
    Dataset authored and provided by: Work With Data
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This bar chart displays employees (people) by stack using the aggregation sum. The data is filtered where the industry is Software. The data is about companies.
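The aggregation described above (filter companies by industry, then sum employees per stack) can be sketched in a few lines. The field names `industry`, `stack`, and `employees` mirror the query parameters in the chart URL; the company records themselves are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical company records; only the field names follow the chart's URL.
companies = [
    {"name": "A", "industry": "Software", "stack": "Python", "employees": 120},
    {"name": "B", "industry": "Software", "stack": "Python", "employees": 80},
    {"name": "C", "industry": "Software", "stack": "Java", "employees": 50},
    {"name": "D", "industry": "Finance", "stack": "Java", "employees": 300},
]


def employees_by_stack(rows, industry):
    """Sum the employees of every company in `industry`, grouped by stack."""
    totals = defaultdict(int)
    for row in rows:
        if row["industry"] == industry:
            totals[row["stack"]] += row["employees"]
    return dict(totals)


totals = employees_by_stack(companies, "Software")
print(totals)
```

Each key of the returned mapping would correspond to one bar in the chart, with the summed employee count as its height.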

  16. Data from: Stacked Ensemble Model for Accurate Crop Yield Prediction Using...

    • data.mendeley.com
    Updated Feb 5, 2025
    Cite
    Ramesh V (2025). Stacked Ensemble Model for Accurate Crop Yield Prediction Using Machine Learning Techniques [Dataset]. http://doi.org/10.17632/ncw2vbcgnk.2
    Dataset updated: Feb 5, 2025
    Authors: Ramesh V
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used historical crop yield data for 27 Indian states and 3 Union Territories of India, covering the years 1997 to 2020. The dataset consists of 19,689 data points, each with ten features: Crop, Season, Crop_Year, State, Annual_Rainfall, Area, Production, Pesticide, Fertilizer, and Yield. The dataset encompasses 55 different types of crops cultivated across India. The crop yield dataset was used to predict crop yield using regression with a stacking ensemble model. The dataset is split into 80% training and 20% testing data.
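As a quick check of what an 80/20 split of 19,689 data points looks like, the sketch below shuffles row indices and cuts them at 80%. The shuffling, the seed, and rounding the cut point down are assumptions made for this example; the description does not say how the authors performed the split.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle is an assumption
    cut = int(n_rows * train_fraction)  # cut point is rounded down
    return indices[:cut], indices[cut:]


train_idx, test_idx = train_test_split_indices(19_689)
print(len(train_idx), len(test_idx))
```

Under these assumptions the split yields 15,751 training rows and 3,938 test rows.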

  17. United States: media meshing and stacking 2016

    • statista.com
    Updated Jun 10, 2016
    Cite
    Statista (2016). United States: media meshing and stacking 2016 [Dataset]. https://www.statista.com/statistics/370051/media-meshing-stacking-us/
    Dataset updated: Jun 10, 2016
    Dataset authored and provided by: Statista (http://statista.com/)
    Time period covered: Jan 2016 - Mar 2016
    Area covered: United States
    Description

    This statistic shows data on media meshing and stacking in the United States in 2016. During the survey period, it was found that ** percent of U.S. internet users accessed program-related content on mobile devices while watching TV.

  18. 1000+ z-stack experiment

    • catalog.data.gov
    • data.nist.gov
    • +2more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). 1000+ z-stack experiment [Dataset]. https://catalog.data.gov/dataset/1000-z-stack-experiment-c9617
    Dataset updated: Jul 29, 2022
    Dataset provided by: National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This data set consists of 1000+ z-stacks of actin- and nucleus-stained cells. The cells reside on 10 different scaffold types and are segmented from the z-stacks to investigate their shape changes.

  19. Data from: New methods for data stacking and P- and S-wave arrival time...

    • osf.io
    Updated Oct 14, 2020
    Cite
    Yuefeng Yuan (2020). New methods for data stacking and P- and S-wave arrival time determination using the deep moonquake Apollo recordings [Dataset]. http://doi.org/10.17605/OSF.IO/4WCKX
    Dataset updated: Oct 14, 2020
    Dataset provided by: Center for Open Science (https://cos.io/)
    Authors: Yuefeng Yuan
    Description

    No description was included in this Dataset collected from the OSF

  20. Distribution of employees per stack where sector equals Communication...

    • workwithdata.com
    Updated May 6, 2025
    Cite
    Work With Data (2025). Distribution of employees per stack where sector equals Communication Services [Dataset]. https://www.workwithdata.com/charts/companies?agg=sum&chart=bar&f=1&fcol0=sector&fop0=%3D&fval0=Communication+Services&x=stack&y=employees
    Dataset updated: May 6, 2025
    Dataset authored and provided by: Work With Data
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This bar chart displays employees (people) by stack using the aggregation sum. The data is filtered where the sector is Communication Services. The data is about companies.

According to Google Scholar, 3 scholarly articles cite the bigcode/the-stack-v2-train-smol-ids dataset.