Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset of high quality mathematical text.
Pile-NER Dataset in CoNLL Format
Overview
The Pile-NER-type dataset provides named entity recognition annotations derived from The Pile, a large-scale text corpus. This dataset is formatted in CoNLL style for easy use with token classification models.
Structure
Format: CoNLL Split: Train only (45,889 examples) Features: id: Unique identifier for each example words: Sequence of tokens ner_tags: Named entity tags for each token labels: Label annotations for each… See the full description on the dataset page: https://huggingface.co/datasets/milistu/Pile-NER-type-conll.
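To illustrate the CoNLL-style schema described above, the sketch below builds a hypothetical example with the documented words/ner_tags fields and groups B-/I- tags into entity spans. The sentence and tag names are illustrative only, not drawn from the dataset:

```python
# Hypothetical example following the documented schema (id, words, ner_tags);
# the sentence and tag names are illustrative, not taken from the dataset.
example = {
    "id": 0,
    "words": ["Marie", "Curie", "worked", "in", "Paris", "."],
    "ner_tags": ["B-person", "I-person", "O", "O", "B-location", "O"],
}

def extract_entities(words, ner_tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for word, tag in zip(words, ner_tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(word)
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

entities = extract_entities(example["words"], example["ner_tags"])
```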
Dataset: Pile - Philpapers
This dataset is a subset of Philpapers data from The Pile dataset, containing text examples from this specific source.
Dataset Source
This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:
monology/pile-uncopyrighted
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-philpapers.
Proof-Pile II, a 55 billion token dataset of mathematical and scientific documents.
http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake, machine-generated) "Human" in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
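Given the invariants documented above (the fixed Dutch preamble and at least one occurrence of each speaker identifier), a minimal validity check could look like the sketch below. The helper name is hypothetical:

```python
PREAMBLE = "Het gesprek tussen de mens en de AI-assistent."

def is_valid_item(item):
    """Check the documented invariants of an example's input field."""
    text = item["input"]
    return (
        text.startswith(PREAMBLE)
        and "[|AI|]" in text
        and "[|Human|]" in text
    )

ok = is_valid_item({"input": PREAMBLE + " [|Human|] Hoi. [|AI|] Hallo!"})
bad = is_valid_item({"input": "Missing preamble and identifiers."})
```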
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
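Putting these pieces together, a translation request might have been assembled as sketched below. The helper function and the exact message layout are assumptions, and the prompt text is abbreviated; the API call itself is omitted:

```python
CONVERSATION_TRANSLATION_PROMPT = (
    "You are asked to translate a conversation between an AI assistant and a "
    "human from {src_lang} into {tgt_lang}.\n"
    "Here are the requirements that you should adhere to:\n"
    "..."  # the nine requirements listed above, elided here for brevity
)
SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "to the requirements that are given to you."
)

def build_request(conversation, src_lang="English", tgt_lang="Dutch"):
    """Assemble a chat-completion request dict; sending it is omitted."""
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang
    )
    return {
        "model": "gpt-3.5-turbo",
        "max_tokens": 1024,
        "temperature": 0,
        "messages": [
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": prompt + "\n\n" + conversation},
        ],
    }
```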
Note that 82 items (0.1%) were not successfully translated: the translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs of these missing items are the same 82 IDs listed under the id field above.
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The Arabic Pile
Introduction:
The Arabic Pile is a comprehensive dataset meticulously designed to parallel the structure of The Pile and The Nordic Pile. Focused on the Arabic language, the dataset encompasses a vast array of linguistic nuances, incorporating both Modern Standard Arabic (MSA) and various Levantine, North African, and Egyptian dialects. Tailored for the training and fine-tuning of large language models, the dataset consists of 13 subsets, each… See the full description on the dataset page: https://huggingface.co/datasets/premio-ai/TheArabicPile_Dialects.
The Europe steel piles market was valued at USD 6,918.1 Million in 2019 and is projected to reach USD 7,826.1 Million by 2027, expanding at a CAGR of 2.4% during the forecast period. The number of construction and development activities is increasing at a rapid pace globally owing to rising urbanization and globalization. This, in turn, is fueling the requirement for rigid foundation solutions for superstructures, thus driving the demand for steel piles.
Steel piles play a vital role in several construction industries due to their ability to provide a rigid structure. Several types of steel piles are used for applications such as ports & harbors, urban civil engineering, and bridges. These piles can be used for piers, seawalls, cofferdams, groynes, and other groundworks. Additionally, the rise in trade among European countries is boosting various infrastructure developments. This, in turn, is projected to fuel the demand for steel piles.
Rising demand for steel pipe piles in applications such as urban civil engineering and bridge engineering is driving the Europe steel piles market. Steel piles offer a large bearing capacity and can penetrate stiff layers or boulders. Additionally, steel piles are capable of withstanding rough handling. Moreover, the volume of soil displaced during the driving of steel piles is small.
Soil displacement, high initial costs, and additional requirements such as the prevention of deterioration are major factors hampering the market. Advancements in technology, such as the use of advanced fiber optic sensors in pile instrumentation, are a major factor creating opportunities in the market.
Steel pile consumption increased due to the growing population, increasing development of infrastructure, and the substitution of many metals with steel piles in various end uses, such as railways and bridges. This is expected to drive market growth during the forecast period.
On the basis of product types, the Europe steel piles market is segregated into steel sheet piles, st
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset has been compiled as part of the dissertation of M.J. Hemel (2023) entitled "Amsterdam quays under pressure - Modelling and testing of historic canal walls".
The mechanical properties of existing spruce foundation piles, retrieved from a historic quay wall at Overamstel in Amsterdam, were studied. To better understand the material properties of historic foundation piles in bending, a series of 6 four-point bending experiments was conducted. An adjustable bending test setup was used in which timber piles were bent by hoist and support slings. The forces were adjusted such that the outer fibre stress between the two forces was consistent, leading to failure at the weakest cross-section. Deflections along the piles and the force distribution were measured along the pile length during loading. In addition, biological decay in the outer layer of the timber piles, also referred to as ‘soft shell’, was identified with IML micro-drillings. For two piles (piles 5 and 6), internal strains were measured by gluing fibre-optic wires inside the soft shell of the timber piles.
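For context, the outer fibre stress that the experiments kept consistent follows from standard beam theory: in four-point bending the moment between the two load points is constant, and for a circular section the outer-fibre stress is 32M/(pi*d^3). The numbers below are purely illustrative, not values from the dataset:

```python
import math

def outer_fibre_stress_circular(M, d):
    """Bending stress at the outer fibre of a solid circular section:
    sigma = M * c / I = 32 * M / (pi * d**3)."""
    return 32.0 * M / (math.pi * d ** 3)

# In four-point bending, the moment between the two load points is constant:
# M = (F / 2) * a, with F the total applied force and a the distance from
# each support to the nearest load point.
F = 20e3   # total force [N] (illustrative)
a = 1.5    # support-to-load distance [m]
d = 0.20   # pile diameter [m]
M = (F / 2) * a                            # constant moment, 15,000 N*m
sigma = outer_fibre_stress_circular(M, d)  # outer-fibre stress [Pa]
```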
This dataset contains the following:
- Bending deflections and forces (Excel format)
- Fibre optic measurements and installation details (Text files and Excel format)
- IML micro-drilling measurements and IML guide (Text files and PDF)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data correspond to the posts (questions and answers) retrieved by querying for posts related to the tag 'machine learning' and the phrase 'best practice(s).' The data were used as the basis for a study, currently under review, on machine learning best practices as discussed by practitioners in question-and-answer communities such as Stack Exchange. The information from each type of post (i.e., questions and answers) is presented in multiple formats (i.e., .txt, .csv, and .xlsx).
Answers - Variables
AID: Unique identification of the answer in the Q&A website.
ParentId: Unique identification of the question associated with the answer in the Q&A website
AcceptedAnswerId: If an answer is the most voted answer associated with the ParentId and it differs from the accepted answer, an identifier different from the AID is provided. If the accepted answer had a score lower than 1, a -1 is assigned.
ABody: HTML text of the answer.
Score: Upvotes - downvotes of the answer.
url_Answer: URL of the answer. The question URL can be from different websites.
type: best or accepted. "Accepted" if the information belongs to the accepted answer of the ParentId question; "best" if it is the most voted answer to the ParentId question.
Date: Creation date of the answer.
Questions - Variables
QID: Unique identification of the question in the Q&A website.
AcceptedAnswerId: Unique identification of the accepted answer for a specific question in the Q&A website. In the case in which a question had a most-voted answer different from the accepted one, and the accepted one had a negative score, a -1 was assigned to the AcceptedAnswerId.
BestAnswerId: Unique identification of the most voted answer for a specific question in the Q&A website. In the case in which the most voted and accepted answers were the same, a -1 was assigned to the BestAnswerId.
Qtitle: Title of the question.
QBody: HTML text of the question.
Score: Upvotes - downvotes of the question.
QTags: Tags that are associated with each question.
url_question: URL of the question. The question URL can be from different websites.
Date: Creation date of the question
This dataset is a subset of the Stack Exchange dump of 03.2021 (https://archive.org/details/stackexchange_20210301) in which a series of filters were applied to obtain the data used in the study.
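To relate the two tables, each answer's ParentId keys into a question's QID. A pure-Python sketch with made-up rows (only a subset of the documented columns; in practice the rows would be read from the distributed .csv/.xlsx files):

```python
# Hypothetical rows mirroring a subset of the documented columns.
questions = [
    {"QID": 10, "Qtitle": "How to tune hyperparameters?", "AcceptedAnswerId": 101},
    {"QID": 20, "Qtitle": "Feature scaling best practice?", "AcceptedAnswerId": -1},
]
answers = [
    {"AID": 101, "ParentId": 10, "Score": 12, "type": "accepted"},
    {"AID": 102, "ParentId": 10, "Score": 5, "type": "best"},
    {"AID": 201, "ParentId": 20, "Score": 3, "type": "best"},
]

# Index questions by QID, then attach the question title to every answer.
by_qid = {q["QID"]: q for q in questions}
joined = [{**a, "Qtitle": by_qid[a["ParentId"]]["Qtitle"]} for a in answers]
```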
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To test the lateral response of old timber quay wall foundations, two lateral experiments were carried out on 3x4 pile groups at Overamstel in Amsterdam. In one of the experiments, the piles are loaded axially/vertically by means of a top load. The experiment without top load is labelled F1 and the experiment with top load is labelled F2. The pile group was instrumented to measure pile cap deflections, rotations, and pile cap loads. For a selection of piles, in-ground strain measurements along the piles were carried out. The instrumentation is identical for tests F1 and F2.
This dataset contains the following:
- Pile forces, deflections and cap rotations data (Excel format)
- Fibre optic measurements and installation details (Text files and Excel format)
Dataset: Pile - Ubuntu Irc
This dataset is a subset of Ubuntu Irc data from The Pile dataset, containing text examples from this specific source.
Dataset Source
This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:
monology/pile-uncopyrighted
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-ubuntu_irc.
Full text of all questions and answers from Stack Overflow that are tagged with the python tag. Useful for natural language processing and community analysis. See also the dataset of R questions.
This dataset is organized as three tables.
Questions may be deleted by the user who posted them. They can also be closed by community vote, if the question is deemed off-topic for instance. Such questions are not included in this dataset.
The dataset contains all questions asked between August 2, 2008 and October 19, 2016.
All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were sourced from the literature (Wu X.Z., Liu H. Development of environmental contours from site-specific regression parameters of load−settlement curves for piles: the global database. International Journal of Geomechanics, 2023, 23(9):04023148-1-22. https://doi.org/10.1061/IJGNAI.GMENG-8372). The qpss files can be opened with any text editor, such as Notepad or EditPlus. The tested load-settlement data for each curve at a construction site were saved in a qpss file. The data format of these qpss files is as follows:
Q11 s11
Q12 s12
... ...
Q1k s1k
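A minimal parser for this format, assuming the Q (load) and s (settlement) values are simply whitespace-separated pairs; the function name is hypothetical:

```python
def parse_qpss(text):
    """Parse whitespace-separated (load Q, settlement s) pairs
    from the contents of a qpss file."""
    values = text.split()
    if len(values) % 2:
        raise ValueError("expected an even number of values (Q/s pairs)")
    return [
        (float(values[i]), float(values[i + 1]))
        for i in range(0, len(values), 2)
    ]

# Illustrative contents of one qpss file (values are made up).
sample = "100 0.5\n200 1.2\n300 2.8"
curve = parse_qpss(sample)
```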
The dataset used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Stack Overflow is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on Stack Overflow evolves, we built SOTorrent, an open dataset based on the official Stack Exchange data dump. SOTorrent provides access to the version history of Stack Overflow content at the level of whole posts and individual text or code blocks.
This dataset has been retrieved from SOTorrent using the following scripts:
https://doi.org/10.5281/zenodo.1201679
For the MSR 2018 paper about SOTorrent, we used the following scripts to analyze the data:
https://doi.org/10.5281/zenodo.1201706
The files sample_before_10.ods and sample_after_10.ods contain our qualitative analysis of 50 comments that were made up to 10 minutes before/after an edit.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Both systems are given the same title as a prompt. CBAG receives metadata. Results truncated for space.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
https://huggingface.co/datasets/bigcode/the-stack
The Stack contains over 6TB of permissively licensed source-code files covering 358 programming languages. The dataset was created as part of the BigCode project, an open scientific collaboration working on the responsible development of large language models for code (Code LLMs). The Stack serves as a pretraining dataset for Code LLMs, i.e., code-generating AI systems able to synthesize programs from natural-language descriptions as well as from other code snippets.
Dataset: Pile - Europarl
This dataset is a subset of Europarl data from The Pile dataset, containing text examples from this specific source.
Dataset Source
This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:
monology/pile-uncopyrighted
Citations
If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-europarl.
SEDE is a dataset comprising 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer during natural interaction. These pairs contain a variety of real-world challenges which have rarely been reflected so far in any other semantic parsing dataset. The goal of this dataset is to take a significant step towards the evaluation of Text-to-SQL models in a real-world setting. Compared to other Text-to-SQL datasets, SEDE contains at least 10 times more SQL query templates (queries after canonization and anonymization of values), and has the most diverse set of utterances and SQL queries (in terms of 3-grams) of all single-domain datasets. SEDE introduces real-world challenges such as under-specification, usage of parameters in queries, date manipulation, and more.
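The notion of a query template (canonization and anonymization of values) can be approximated by replacing literals with placeholders, so that queries differing only in their values collapse to the same string. This regex-based sketch is a rough illustration, not SEDE's actual canonization procedure:

```python
import re

def sql_template(query):
    """Replace literal values with placeholders so that queries
    differing only in values share one template (rough approximation)."""
    q = re.sub(r"'[^']*'", "'<str>'", query)       # string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "<num>", q)   # numeric literals
    return re.sub(r"\s+", " ", q).strip()          # normalize whitespace

a = sql_template("SELECT * FROM Posts WHERE Score > 10 AND Tags = 'python'")
b = sql_template("SELECT * FROM Posts WHERE Score > 42  AND Tags = 'sql'")
```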