Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset of high quality mathematical text.
Pile-NER Dataset in CoNLL Format
Overview
The Pile-NER-type dataset provides named entity recognition annotations derived from The Pile, a large-scale text corpus. This dataset is formatted in CoNLL style for easy use with token classification models.
Structure
Format: CoNLL Split: Train only (45,889 examples) Features: id: Unique identifier for each example words: Sequence of tokens ner_tags: Named entity tags for each token labels: Label annotations for each… See the full description on the dataset page: https://huggingface.co/datasets/milistu/Pile-NER-type-conll.
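To illustrate the CoNLL-style schema described above, the sketch below builds a hypothetical example with the documented words/ner_tags fields and groups B-/I- tags into entity spans. The sentence and tag names are illustrative only, not drawn from the dataset:

```python
# Hypothetical example following the documented schema (id, words, ner_tags);
# the sentence and tag names are illustrative, not taken from the dataset.
example = {
    "id": 0,
    "words": ["Marie", "Curie", "worked", "in", "Paris", "."],
    "ner_tags": ["B-person", "I-person", "O", "O", "B-location", "O"],
}

def extract_entities(words, ner_tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for word, tag in zip(words, ner_tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(word)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

entities = extract_entities(example["words"], example["ner_tags"])
```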
Dataset: Pile - Philpapers
This dataset is a subset of Philpapers data from The Pile dataset, containing text examples from this specific source.
Dataset Source
This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:
monology/pile-uncopyrighted
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-philpapers.
Proof-Pile II, a 55 billion token dataset of mathematical and scientific documents.
http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake, machine-generated) "Human" in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
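Given the invariants documented above (the fixed Dutch preamble and at least one occurrence of each speaker identifier), a minimal validity check could look like the sketch below. The helper name is hypothetical:

```python
PREAMBLE = "Het gesprek tussen de mens en de AI-assistent."

def is_valid_item(item):
    """Check the documented invariants of an example's input field."""
    text = item["input"]
    return (
        text.startswith(PREAMBLE)
        and "[|AI|]" in text
        and "[|Human|]" in text
    )

ok = is_valid_item({"input": PREAMBLE + " [|Human|] Hoi. [|AI|] Hallo!"})
bad = is_valid_item({"input": "Missing preamble and identifiers."})
```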
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
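Putting these pieces together, a translation request might have been assembled as sketched below. The helper function and the exact message layout are assumptions, and the prompt text is abbreviated; the API call itself is omitted:

```python
CONVERSATION_TRANSLATION_PROMPT = (
    "You are asked to translate a conversation between an AI assistant and a "
    "human from {src_lang} into {tgt_lang}.\n"
    "Here are the requirements that you should adhere to:\n"
    "..."  # the nine requirements listed above, elided here for brevity
)
SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "to the requirements that are given to you."
)

def build_request(conversation, src_lang="English", tgt_lang="Dutch"):
    """Assemble a chat-completion request dict; sending it is omitted."""
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang
    )
    return {
        "model": "gpt-3.5-turbo",
        "max_tokens": 1024,
        "temperature": 0,
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": prompt + "\n\n" + conversation},
        ],
    }
```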
Note that 82 items (0.1%) were not successfully translated: the translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs of these missing items are the same 82 IDs listed under the id field above.
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Arabic Pile
Introduction:
The Arabic Pile is a comprehensive dataset meticulously designed to parallel the structure of The Pile and The Nordic Pile. Focused on the Arabic language, the dataset encompasses a vast array of linguistic nuances, incorporating both Modern Standard Arabic (MSA) and various Levantine, North African, and Egyptian dialects. Tailored for the training and fine-tuning of large language models, the dataset consists of 13 subsets, each… See the full description on the dataset page: https://huggingface.co/datasets/premio-ai/TheArabicPile_Dialects.
The Europe steel piles market was valued at USD 6,918.1 Million in 2019 and is projected to reach USD 7,826.1 Million by 2027, expanding at a CAGR of 2.4% during the forecast period. The number of construction and development activities is increasing at a rapid pace globally owing to rising urbanization and globalization. This, in turn, is fueling the requirement for rigid foundation solutions for superstructures, thus driving the demand for steel piles.
Steel piles play a vital role in several construction industries due to their ability to provide a rigid structure. Several types of steel piles are used for applications such as ports & harbors, urban civil engineering, and bridges. These piles can be used for piers, seawalls, cofferdams, groynes, and other groundworks. Additionally, the rise in trade among European countries is boosting various infrastructure developments. This, in turn, is projected to fuel the demand for steel piles.
Rising demand for steel pipe piles in applications such as urban civil engineering and bridge engineering is driving the Europe steel piles market. Steel piles offer a large bearing capacity and can penetrate stiff layers or boulders. Additionally, steel piles are capable of withstanding rough handling. Moreover, the volume of soil displaced during the driving of steel piles is small.
Soil displacement, high initial costs, and additional requirements such as the prevention of deterioration are major factors hampering the market. Advancements in technology, such as the use of advanced fiber optic sensors in pile instrumentation, are a major factor creating opportunities in the market.
Steel pile consumption increased due to the growing population, increasing development of infrastructure, and the substitution of many metals with steel piles in various end uses, such as railways and bridges. This is expected to drive market growth during the forecast period.
On the basis of product types, the Europe steel piles market is segregated into steel sheet piles, st
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset has been compiled as part of the dissertation of M.J. Hemel (2023) entitled "Amsterdam quays under pressure - Modelling and testing of historic canal walls".
The mechanical properties of existing spruce foundation piles, retrieved from a historic quay wall at Overamstel in Amsterdam, were studied. To better understand the material properties of historic foundation piles in bending, a series of 6 four-point bending experiments was conducted. An adjustable bending test setup was used in which timber piles were bent by hoist and support slings. The forces were adjusted such that the outer fibre stress between the two forces was consistent, leading to failure at the weakest cross-section. Deflections along the piles and the force distribution were measured along the pile length during loading. In addition, biological decay in the outer layer of the timber piles, also referred to as ‘soft shell’, was identified with IML micro-drillings. For two piles (piles 5 and 6), internal strains were measured by gluing fibre-optic wires inside the soft shell of the timber piles.
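For context, the outer fibre stress that the experiments kept consistent follows from standard beam theory: in four-point bending the moment between the two load points is constant, and for a circular section the outer-fibre stress is 32M/(pi*d^3). The numbers below are purely illustrative, not values from the dataset:

```python
import math

def outer_fibre_stress_circular(M, d):
    """Bending stress at the outer fibre of a solid circular section:
    sigma = M * c / I = 32 * M / (pi * d**3)."""
    return 32.0 * M / (math.pi * d ** 3)

# In four-point bending, the moment between the two load points is constant:
# M = (F / 2) * a, with F the total applied force and a the distance from
# each support to the nearest load point.
F = 20e3   # total force [N] (illustrative)
a = 1.5    # support-to-load distance [m]
d = 0.20   # pile diameter [m]
M = (F / 2) * a                            # constant moment, 15,000 N*m
sigma = outer_fibre_stress_circular(M, d)  # outer-fibre stress [Pa]
```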
This dataset contains the following:
- Bending deflections and forces (Excel format)
- Fibre optic measurements and installation details (Text files and Excel format)
- IML micro-drilling measurements and IML guide (Text files and PDF)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data correspond to the posts (questions and answers) retrieved by querying for posts related to the tag 'machine learning' and the phrase 'best practice(s).' The data were used as the basis for a study, currently under review, on machine learning best practices as discussed by practitioners in question-and-answer communities such as Stack Exchange. The information from each type of post (i.e., questions and answers) is presented in multiple formats (i.e., .txt, .csv, and .xlsx).
Answers - Variables
AID: Unique identification of the answer in the Q&A website.
ParentId: Unique identification of the question associated with the answer in the Q&A website
AcceptedAnswerId: If an answer is the most voted answer associated with the ParentId and it differs from the accepted answer, an identifier different from the AID is provided. If the accepted answer had a score lower than 1, a -1 is assigned.
ABody: HTML text of the answer.
Score: Upvotes - downvotes of the answer.
url_Answer: URL of the answer. The question URL can be from different websites.
type: best or accepted. "Accepted" if the information belongs to the accepted answer of the ParentId question; "best" if it is the most voted answer to the ParentId question.
Date: Creation date of the answer.
Questions - Variables
QID: Unique identification of the question in the Q&A website.
AcceptedAnswerId: Unique identification of the accepted answer for a specific question in the Q&A website. In the case in which a question had a most-voted answer different from the accepted one, and the accepted one had a negative score, a -1 was assigned to the AcceptedAnswerId.
BestAnswerId: Unique identification of the most voted answer for a specific question in the Q&A website. In the case in which the most voted and accepted answers were the same, a -1 was assigned to the BestAnswerId.
Qtitle: Title of the question.
QBody: HTML text of the question.
Score: Upvotes - downvotes of the question.
QTags: Tags that are associated with each question.
url_question: URL of the question. The question URL can be from different websites.
Date: Creation date of the question
This dataset is a subset of the Stack Exchange dump of 03.2021 (https://archive.org/details/stackexchange_20210301) in which a series of filters were applied to obtain the data used in the study.
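To relate the two tables, each answer's ParentId keys into a question's QID. A pure-Python sketch with made-up rows (only a subset of the documented columns; in practice the rows would be read from the distributed .csv/.xlsx files):

```python
# Hypothetical rows mirroring a subset of the documented columns.
questions = [
    {"QID": 10, "Qtitle": "How to tune hyperparameters?", "AcceptedAnswerId": 101},
    {"QID": 20, "Qtitle": "Feature scaling best practice?", "AcceptedAnswerId": -1},
]
answers = [
    {"AID": 101, "ParentId": 10, "Score": 12, "type": "accepted"},
    {"AID": 102, "ParentId": 10, "Score": 5, "type": "best"},
    {"AID": 201, "ParentId": 20, "Score": 3, "type": "best"},
]

# Index questions by QID, then attach the question title to every answer.
by_qid = {q["QID"]: q for q in questions}
joined = [{**a, "Qtitle": by_qid[a["ParentId"]]["Qtitle"]} for a in answers]
```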
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To test the lateral response of old timber quay wall foundations, two lateral experiments were carried out on 3x4 pile groups at Overamstel in Amsterdam. In one of the experiments, the piles are loaded axially/vertically by means of a top load. The experiment without top load is labelled F1 and the experiment with top load is labelled F2. The pile group was instrumented to measure pile cap deflections, rotations, and pile cap loads. For a selection of piles, in-ground strain measurements along the piles were carried out. The instrumentation is identical for tests F1 and F2.
This dataset contains the following:
- Pile forces, deflections and cap rotations data (Excel format)
- Fibre optic measurements and installation details (Text files and Excel format)
Dataset: Pile - Ubuntu Irc
This dataset is a subset of Ubuntu Irc data from The Pile dataset, containing text examples from this specific source.
Dataset Source
This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:
monology/pile-uncopyrighted
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-ubuntu_irc.
Full text of all questions and answers from Stack Overflow that are tagged with the python tag. Useful for natural language processing and community analysis. See also the dataset of R questions.
This dataset is organized as three tables.
Questions may be deleted by the user who posted them. They can also be closed by community vote, if the question is deemed off-topic for instance. Such questions are not included in this dataset.
The dataset contains all questions asked between August 2, 2008 and October 19, 2016.
All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were sourced from the literature (Wu X.Z., Liu H. Development of environmental contours from site-specific regression parameters of load−settlement curves for piles: the global database. International Journal of Geomechanics, 2023, 23(9):04023148-1-22. https://doi.org/10.1061/IJGNAI.GMENG-8372). The qpss files can be opened with any text editor, such as Notepad or EditPlus. The tested load-settlement data for each curve at a construction site were saved in a qpss file. The data format of these qpss files is as follows:
Q11 s11
Q12 s12
... ...
Q1k s1k
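A minimal parser for this format, assuming the Q (load) and s (settlement) values are simply whitespace-separated pairs; the function name is hypothetical:

```python
def parse_qpss(text):
    """Parse whitespace-separated (load Q, settlement s) pairs
    from the contents of a qpss file."""
    values = text.split()
    if len(values) % 2:
        raise ValueError("expected an even number of values (Q/s pairs)")
    return [
        (float(values[i]), float(values[i + 1]))
        for i in range(0, len(values), 2)
    ]

# Illustrative contents of one qpss file (values are made up).
sample = "100 0.5\n200 1.2\n300 2.8"
curve = parse_qpss(sample)
```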
The dataset used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Stack Overflow is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on Stack Overflow evolves, we built SOTorrent, an open dataset based on the official Stack Exchange data dump. SOTorrent provides access to the version history of Stack Overflow content at the level of whole posts and individual text or code blocks.
This dataset has been retrieved from SOTorrent using the following scripts:
https://doi.org/10.5281/zenodo.1201679
For the MSR 2018 paper about SOTorrent, we used the following scripts to analyze the data:
https://doi.org/10.5281/zenodo.1201706
The files sample_before_10.ods and sample_after_10.ods contain our qualitative analysis of 50 comments that were made up to 10 minutes before/after an edit.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Both systems are given the same title as a prompt. CBAG receives metadata. Results truncated for space.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
https://huggingface.co/datasets/bigcode/the-stack
The Stack contains over 6TB of permissively licensed source-code files covering 358 programming languages. The dataset was created as part of the BigCode project, an open scientific collaboration working on the responsible development of large language models for code (Code LLMs). The Stack serves as a pretraining dataset for Code LLMs, i.e., code-generating AI systems able to synthesize programs from natural-language descriptions as well as from other code snippets.
Dataset: Pile - Europarl
This dataset is a subset of Europarl data from The Pile dataset, containing text examples from this specific source.
Dataset Source
This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:
monology/pile-uncopyrighted
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-europarl.
SEDE is a dataset comprising 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer during natural interaction. These pairs contain a variety of real-world challenges which have rarely been reflected so far in any other semantic parsing dataset. The goal of this dataset is to take a significant step towards the evaluation of Text-to-SQL models in a real-world setting. Compared to other Text-to-SQL datasets, SEDE contains at least 10 times more SQL query templates (queries after canonization and anonymization of values), and has the most diverse set of utterances and SQL queries (in terms of 3-grams) of all single-domain datasets. SEDE introduces real-world challenges such as under-specification, usage of parameters in queries, date manipulation, and more.
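The notion of a query template (canonization and anonymization of values) can be approximated by replacing literals with placeholders, so that queries differing only in their values collapse to the same string. This regex-based sketch is a rough illustration, not SEDE's actual canonization procedure:

```python
import re

def sql_template(query):
    """Replace literal values with placeholders so that queries
    differing only in values share one template (rough approximation)."""
    q = re.sub(r"'[^']*'", "'<str>'", query)       # string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "<num>", q)   # numeric literals
    return re.sub(r"\s+", " ", q).strip()          # normalize whitespace

a = sql_template("SELECT * FROM Posts WHERE Score > 10 AND Tags = 'python'")
b = sql_template("SELECT * FROM Posts WHERE Score > 42  AND Tags = 'sql'")
```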