20 datasets found
  1. proof-pile

    • huggingface.co
    Updated Dec 25, 2022
    Cite
    Hoskinson Center for Formal Mathematics (2022). proof-pile [Dataset]. https://huggingface.co/datasets/hoskinson-center/proof-pile
    Dataset updated
    Dec 25, 2022
    Dataset authored and provided by
    Hoskinson Center for Formal Mathematics
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    A dataset of high quality mathematical text.

  2. Pile-NER-type-conll

    • huggingface.co
    Updated Sep 28, 2024
    Cite
    Pile-NER-type-conll [Dataset]. https://huggingface.co/datasets/milistu/Pile-NER-type-conll
    Dataset updated
    Sep 28, 2024
    Authors
    Milutin Studen
    Description

    Pile-NER Dataset in CoNLL Format

      Overview
    

    The Pile-NER-type dataset provides named entity recognition annotations derived from The Pile, a large-scale text corpus. This dataset is formatted in CoNLL style for easy use with token classification models.

      Structure
    

    Format: CoNLL
    Split: Train only (45,889 examples)
    Features:
    • id: Unique identifier for each example
    • words: Sequence of tokens
    • ner_tags: Named entity tags for each token
    • labels: Label annotations for each…

    See the full description on the dataset page: https://huggingface.co/datasets/milistu/Pile-NER-type-conll.
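
As a rough illustration of how the words and ner_tags features line up, the sketch below renders one record as CoNLL-style lines. The record is hypothetical, and the string tag values are an assumption: the dataset page may store tags as integers or use different label names.

```python
# Hypothetical record: field names follow the description above,
# but the values (and string-valued tags) are illustrative assumptions.
example = {
    "id": "0",
    "words": ["Ada", "Lovelace", "wrote", "programs", "."],
    "ner_tags": ["B-person", "I-person", "O", "O", "O"],
}


def to_conll(record):
    """Render one record as CoNLL-style text: one 'token<TAB>tag' pair per line."""
    words, tags = record["words"], record["ner_tags"]
    if len(words) != len(tags):
        raise ValueError("words and ner_tags must have the same length")
    return "\n".join(f"{word}\t{tag}" for word, tag in zip(words, tags))


conll_text = to_conll(example)
print(conll_text)
```

Token classification models generally expect exactly this one-tag-per-token alignment, which is why the length check comes first.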

  3. pile-philpapers

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    Timaeus (2024). pile-philpapers [Dataset]. https://huggingface.co/datasets/timaeus/pile-philpapers
    Dataset updated
    Nov 17, 2024
    Dataset authored and provided by
    Timaeus
    Description

    Dataset: Pile - Philpapers

    This dataset is a subset of Philpapers data from The Pile dataset, containing text examples from this specific source.

      Dataset Source
    

    This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:

    monology/pile-uncopyrighted

      Citations
    

    If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-philpapers.

  4. Proof-Pile-2

    • opendatalab.com
    zip
    Updated Oct 1, 2023
    Cite
    Princeton University (2023). Proof-Pile-2 [Dataset]. https://opendatalab.com/OpenDataLab/Proof-Pile-2
    Available download formats: zip
    Dataset updated
    Oct 1, 2023
    Dataset provided by
    EleutherAI (https://eleuther.ai/)
    University of Toronto
    Princeton University
    Description

    Proof-Pile II, a 55 billion token dataset of mathematical and scientific documents.

  5. Stack Overflow Chat Dutch

    • data.niaid.nih.gov
    • huggingface.co
    • +1 more
    Updated Jun 20, 2023
    Cite
    Vanroy, Bram (2023). Stack Overflow Chat Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052379
    Dataset updated
    Jun 20, 2023
    Dataset authored and provided by
    Vanroy, Bram
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.

    Data Instances

    {'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},

    Data Fields

    id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]

    input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].

    topic: the topic description
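
The input field can be split into speaker turns using the [|AI|] and [|Human|] markers described above. This is a minimal sketch, not the dataset author's tooling: the function name and the shortened sample string are hypothetical.

```python
import re

# Fixed opening sentence and speaker markers, as described for the input field.
PREFIX = "Het gesprek tussen de mens en de AI-assistent."
SPEAKER = re.compile(r"\[\|(AI|Human)\|\]")


def split_turns(text):
    """Return a list of (speaker, utterance) tuples from one conversation string."""
    if not text.startswith(PREFIX):
        raise ValueError("conversation does not start with the expected prefix")
    # With one capturing group, re.split yields
    # [text_before_first_marker, speaker1, text1, speaker2, text2, ...].
    parts = SPEAKER.split(text[len(PREFIX):])
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


# Shortened, hypothetical sample in the same shape as the data instance above.
sample = (
    "Het gesprek tussen de mens en de AI-assistent. "
    "[|Human|] Ik heb wat hulp nodig. "
    "[|AI|] Natuurlijk, ik help je graag. "
    "[|Human|] "
)
turns = split_turns(sample)
print(turns)
```

As in the data instance shown earlier, the conversation ends with an open [|Human|] marker, so the last turn comes back with an empty utterance.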

    Dataset Creation

    Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to:
    1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
    2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
    3. ensure accurate translation and keep the correctness of the conversation;
    4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    5. translate the human's text using informal, but standard, language;
    6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
    8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
    9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The prompt to translate the topic is:

    TOPIC_TRANSLATION_PROMPT = (
        "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
        " summarizing manner. Translate accurately and formally. Do not provide any explanation"
        " about the translation and do not include the original title.\n\n"
    )

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

    Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].
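
The failure criterion described above (a translation missing one or both speaker identifiers) can be expressed as a simple check. This is a sketch under that stated criterion only; the function name and the two sample translations are hypothetical, not the author's actual filtering code.

```python
def has_both_speakers(translation):
    """True if the translated text still contains both speaker identifiers."""
    return "[|AI|]" in translation and "[|Human|]" in translation


# Hypothetical translations keyed by item ID; the second one lost its markers.
candidates = {
    1: "Het gesprek tussen de mens en de AI-assistent. [|Human|] Hallo [|AI|] Hoi",
    2: "Het gesprek tussen de mens en de AI-assistent. [Mens] Hallo [AI] Hoi",
}

# IDs that would be excluded under this criterion.
failed_ids = sorted(i for i, text in candidates.items() if not has_both_speakers(text))
print(failed_ids)
```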

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.

  6. TheArabicPile_Dialects

    • huggingface.co
    Updated May 1, 2024
    Cite
    Premio.AI (2024). TheArabicPile_Dialects [Dataset]. https://huggingface.co/datasets/premio-ai/TheArabicPile_Dialects
    Dataset updated
    May 1, 2024
    Dataset provided by
    Premio.AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    The Arabic Pile

      Introduction:
    

    The Arabic Pile is a comprehensive dataset meticulously designed to parallel the structure of The Pile and The Nordic Pile. Focused on the Arabic language, the dataset encompasses a vast array of linguistic nuances, incorporating both Modern Standard Arabic (MSA) and various Levantine, North African, and Egyptian dialects. Tailored for the training and fine-tuning of large language models, the dataset consists of 13 subsets, each… See the full description on the dataset page: https://huggingface.co/datasets/premio-ai/TheArabicPile_Dialects.

  7. Europe Steel Piles Market Growth, Industry & Revenue [2030]

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Feb 28, 2024
    Cite
    Growth Market Reports (2024). Europe Steel Piles Market Growth, Industry & Revenue [2030] [Dataset]. https://growthmarketreports.com/report/steel-piles-market-europe-industry-analysis
    Available download formats: pdf, pptx, csv
    Dataset updated
    Feb 28, 2024
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Europe steel piles Market Outlook



    The Europe steel piles market was valued at USD 6,918.1 Million in 2019 and is projected to reach USD 7,826.1 Million by 2027, expanding at a CAGR of 2.4% during the forecast period. The number of construction and development activities is increasing at a rapid pace globally owing to the increase in urbanization and globalization. This, in turn, is fueling the requirement for rigid foundation solutions for superstructures, thus driving the demand for steel piles.



    Steel piles play a vital role in several construction industries due to their ability to provide a rigid structure. Several types of steel piles are used for applications such as ports & harbors, urban civil engineering, and bridges. These piles can be used for piers, seawalls, cofferdams, groynes, and other groundworks. Additionally, the rise in trade among the countries in Europe is boosting various infrastructure developments. This, in turn, is projected to fuel the demand for steel piles.





    Market Dynamics



    Rising demand for steel pipe piles in various applications such as urban civil engineering and bridge engineering is driving the Europe steel piles market. Steel piles offer a large bearing capacity and can penetrate through stiff layers or boulders. Additionally, steel piles are capable of withstanding rough handling. Moreover, the volume of soil displaced during the driving of steel piles is less.



    Soil displacement, high initial costs, and additional requirements such as the prevention of deterioration associated with steel piles are major factors hampering the market. Advancements in technology, such as the use of advanced fiber optic sensors in pile instrumentation, are a major factor creating opportunities in the market.



    Steel pile consumption increased due to growing population, increasing development of infrastructure, and the substitution of many metals with steel piles in various end-uses, such as railways and bridges. This is expected to drive market growth during the forecast period.



    Segmental Outlook



    On the basis of product types, the Europe steel piles market is segregated into steel sheet piles, st

  8. Four-point-bending experiments on historic quay wall foundation piles; Data...

    • data.4tu.nl
    • 4tu.edu.hpc.n-helix.com
    zip
    Updated Jul 19, 2023
    Cite
    Mart-Jan Hemel; Jelle van Ophuizen; Dirk Jan Peters; Willem van Bommel; Eric Valckenier; Andreas den Dekker (2023). Four-point-bending experiments on historic quay wall foundation piles; Data underlying the dissertation: "Amsterdam quays under pressure - Modelling and testing of historic canal walls". [Dataset]. http://doi.org/10.4121/86c74fa5-1bcf-49f0-8396-c14f9a45eeef.v1
    Available download formats: zip
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Mart-Jan Hemel; Jelle van Ophuizen; Dirk Jan Peters; Willem van Bommel; Eric Valckenier; Andreas den Dekker
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    2022
    Area covered
    Description

    The dataset has been compiled as part of the dissertation of M.J. Hemel (2023) entitled "Amsterdam quays under pressure - Modelling and testing of historic canal walls".

    The mechanical properties of existing spruce foundation piles, retrieved from a historic quay wall at Overamstel in Amsterdam, were studied. To better understand the material properties of historic foundation piles in bending, a series of 6 four-point bending experiments were conducted. An adjustable bending test setup was used in which timber piles were bent by hoist and support slings. The forces were adjusted such that the outer fibre stress between the two forces was consistent, leading to a failure at the weakest cross-section. Measurements of the deflections and the force distribution were taken along the pile length during loading. In addition, biological decay in the outer layer of the timber piles, also referred to as ‘soft shell’, was identified with IML micro-drillings. For two piles (piles 5 and 6), internal strains were measured by gluing fibre optic wires inside the soft shell of the timber piles.

    This dataset contains the following:

    - Bending deflections and forces (Excel format)

    - Fibre optic measurements and installation details (Text files and Excel format)

    - IML micro-drilling measurements and IML guide (Text files and PDF)

  9. Dataset - What are the Machine Learning best practices reported by...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 25, 2023
    Cite
    Mario Linares-Vásquez (2023). Dataset - What are the Machine Learning best practices reported by practitioners on Stack Exchange? [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7908721
    Dataset updated
    Jun 25, 2023
    Dataset provided by
    Mario Linares-Vásquez
    Steffen Herbold
    Andrea Bayona
    Fabio A. González
    Anamaria Mojica-Hanke
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The data correspond to the posts (questions and answers) retrieved by querying for posts related to the tag 'machine learning' and the phrase 'best practice(s).' The data were used as the basis for a study, currently under review, on machine learning best practices as discussed by practitioners in question-and-answer communities such as Stack Exchange. The information from each type of post (i.e., questions and answers) is presented in multiple formats (i.e., .txt, .csv, and .xlsx).

    Answers - Variables

    AID: Unique identification of the answer in the Q&A website.

    ParentId: Unique identification of the question associated with the answer in the Q&A website

    AcceptedAnswerId: In the case in which an answer is the most voted answer associated with the ParentId, and it is different from the accepted answer, a different identifier from the AID is available. In the case in which the accepted answer had a score lower than 1, a -1 is assigned.

    ABody: HTML text of the answer.

    Score: Upvotes - downvotes of the answer.

    url_Answer: URL of the answer. The question URL can be from different websites.

    type: best or accepted. Accepted in the case that the information belongs to the accepted answer of the ParentId question, and best in the case in which it is the most voted answer of the ParentId question.

    Date: Creation date of the answer.

    Questions - Variables

    QID: Unique identification of the question in the Q&A website.

    AcceptedAnswerId: Unique identification of the accepted answer for a specific question in the Q&A website. In the case in which a question had a most-voted answer different from the accepted one, and the accepted one had a negative score, a -1 was assigned to the AcceptedAnswerId.

    BestAnswerId: Unique identification of the most voted answer for a specific question in the Q&A website. In the case in which the most voted and accepted answers were the same, a -1 was assigned to the BestAnswerId.

    Qtitle: Title of the question.

    QBody: HTML text of the question.

    Score: Upvotes - downvotes of the questions.

    QTags: Tags that are associated with each question.

    url_question: URL of the question. The question URL can be from different websites.

    Date: Creation date of the question

    This dataset is a subset of the Stack Exchange dump of 03.2021 (https://archive.org/details/stackexchange_20210301) in which a series of filters were applied to obtain the data used in the study.

  10. Lateral pile group experiments on historic quay wall foundation piles; Data...

    • 4tu.edu.hpc.n-helix.com
    zip
    Updated Jul 19, 2023
    Cite
    Mart-Jan Hemel; Jelle van Ophuizen; Eric Valckenier; Dirk Jan Peters; Mandy Korff; Willem van Bommel; Erik Hutcheson (2023). Lateral pile group experiments on historic quay wall foundation piles; Data underlying the dissertation: "Amsterdam quays under pressure - Modelling and testing of historic canal walls". [Dataset]. http://doi.org/10.4121/2f3d53fd-d4aa-4cad-9757-626cecb785a6.v1
    Available download formats: zip
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Mart-Jan Hemel; Jelle van Ophuizen; Eric Valckenier; Dirk Jan Peters; Mandy Korff; Willem van Bommel; Erik Hutcheson
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    2022
    Area covered
    Description

    To test the lateral response of old timber quay wall foundations, two lateral experiments were carried out on 3x4 pile groups in Overamstel, Amsterdam. In one of the experiments, the piles were loaded axially/vertically by means of a top load. The experiment without top load is labelled F1 and the experiment with top load is labelled F2. The pile group was instrumented to measure pile cap deflections, rotations and pile cap loads. For a selection of piles, in-ground strain measurements along the piles were carried out. For tests F1 and F2, the instrumentation is identical.


    This dataset contains the following:


    - Pile forces, deflections and cap rotations data (Excel format)

    - Fibre optic measurements and installation details (Text files and Excel format)


  11. pile-ubuntu_irc

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    Timaeus (2024). pile-ubuntu_irc [Dataset]. https://huggingface.co/datasets/timaeus/pile-ubuntu_irc
    Dataset updated
    Nov 17, 2024
    Dataset authored and provided by
    Timaeus
    Description

    Dataset: Pile - Ubuntu Irc

    This dataset is a subset of Ubuntu Irc data from The Pile dataset, containing text examples from this specific source.

      Dataset Source
    

    This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:

    monology/pile-uncopyrighted

      Citations
    

    If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-ubuntu_irc.

  12. Python Questions from Stack Overflow

    • kaggle.com
    zip
    Updated Oct 20, 2016
    Cite
    Stack Overflow (2016). Python Questions from Stack Overflow [Dataset]. https://www.kaggle.com/datasets/stackoverflow/pythonquestions/versions/1
    Available download formats: zip (0 bytes)
    Dataset updated
    Oct 20, 2016
    Dataset authored and provided by
    Stack Overflow (http://stackoverflow.com/)
    Description

    Context

    Full text of all questions and answers from Stack Overflow that are tagged with the python tag. Useful for natural language processing and community analysis. See also the dataset of R questions.

    Content

    This dataset is organized as three tables:

    • Questions contains the title, body, creation date, score, and owner ID for each Python question.
    • Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
    • Tags contains the tags on each question besides the Python tag.
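
To show how the ParentId column links answers back to questions, here is a small in-memory sketch that picks the highest-scoring answer for each question. The rows are made up, and the `Id` and `Title` column names are assumptions; only `ParentId`, score, and body are named in the description above.

```python
from collections import defaultdict

# Hypothetical rows; "Id" and "Title" are assumed column names.
questions = [
    {"Id": 1, "Title": "How do I reverse a list?", "Score": 5},
    {"Id": 2, "Title": "What does yield do?", "Score": 9},
]
answers = [
    {"Id": 10, "ParentId": 1, "Score": 3, "Body": "Use reversed()."},
    {"Id": 11, "ParentId": 1, "Score": 7, "Body": "Use slicing: lst[::-1]."},
    {"Id": 12, "ParentId": 2, "Score": 4, "Body": "It creates a generator."},
]

# Group answers under the question they belong to via ParentId.
answers_by_question = defaultdict(list)
for answer in answers:
    answers_by_question[answer["ParentId"]].append(answer)

# Map each question Id to the Id of its highest-scoring answer.
top_answer = {
    q["Id"]: max(answers_by_question[q["Id"]], key=lambda a: a["Score"])["Id"]
    for q in questions
    if answers_by_question[q["Id"]]
}
print(top_answer)
```

The same grouping step would apply when loading the real tables into a dataframe or a database, where it becomes a join on the ParentId key.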

    Questions may be deleted by the user who posted them. They can also be closed by community vote, if the question is deemed off-topic for instance. Such questions are not included in this dataset.

    The dataset contains all questions asked between August 2, 2008 and October 19, 2016.

    License

    All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.

  13. Wu and Liu 2023 9368 PileBetaG sourced qpss data

    • figshare.com
    txt
    Updated May 19, 2024
    Cite
    xing zheng wu (2024). Wu and Liu 2023 9368 PileBetaG sourced qpss data [Dataset]. http://doi.org/10.6084/m9.figshare.25855843.v1
    Available download formats: txt
    Dataset updated
    May 19, 2024
    Dataset provided by
    figshare
    Authors
    xing zheng wu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    These data were sourced from the literature (Wu X.Z., Liu H. Development of environmental contours from site-specific regression parameters of load−settlement curves for piles: the global database. International Journal of Geomechanics, 2023, 23(9):04023148-1-22.) https://doi.org/10.1061/IJGNAI.GMENG-8372 The qpss file can be opened with any text editor, such as Notepad or EditPlus. The tested load-settlement data for each curve were saved in the qpss file at a construction site. The data format of these qpss files is as follows: Q11 s11 Q12 s12 ... ... Q1k s1k
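
The listing above flattens the file layout, so the exact line structure of a qpss file is not recoverable here. The sketch below assumes, purely for illustration, that one curve is stored as whitespace-separated alternating load (Q) and settlement (s) values; the function name and sample numbers are hypothetical.

```python
def parse_curve(line):
    """Parse 'Q1 s1 Q2 s2 ...' into a list of (load, settlement) pairs.

    Assumes alternating whitespace-separated values, which may not match
    the real qpss layout.
    """
    values = [float(token) for token in line.split()]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values (Q, s pairs)")
    # Even positions are loads, odd positions are settlements.
    return list(zip(values[0::2], values[1::2]))


# Made-up numbers, not taken from the dataset.
pairs = parse_curve("100 0.5 200 1.2 300 2.4")
print(pairs)
```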

  14. Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024)....

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Yunfan Shao, Linyang Li, Zhaoye Fei, Hang Yan, Dahua Lin, Xipeng Qiu (2024). Dataset: Proof-Pile-2. https://doi.org/10.57702/7iiqekx3 [Dataset]. https://service.tib.eu/ldmservice/dataset/proof-pile-2
    Dataset updated
    Dec 16, 2024
    Description

    The dataset used for continual pre-training of large language models, with a focus on balancing the text distribution and mitigating overfitting.

  15. SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 21, 2020
    Cite
    Baltes, Sebastian (2020). SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts — Supplementary Material [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1201553
    Dataset updated
    Jan 21, 2020
    Dataset authored and provided by
    Baltes, Sebastian
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Stack Overflow is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on Stack Overflow evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on Stack Overflow evolves, we built SOTorrent, an open dataset based on the official Stack Exchange data dump. SOTorrent provides access to the version history of Stack Overflow content at the level of whole posts and individual text or code blocks.

    This dataset has been retrieved from SOTorrent using the following scripts:

    https://doi.org/10.5281/zenodo.1201679

    For the MSR 2018 paper about SOTorrent, we used the following scripts to analyze the data:

    https://doi.org/10.5281/zenodo.1201706

    The files sample_before_10.ods and sample_after_10.ods contain our qualitative analysis of 50 comments that were made up to 10 minutes before/after an edit.

  16. CBAG (left) compared to GPT-2 huge unmodified with 1.5B parameters (right).

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Justin Sybrandt; Ilya Safro (2023). CBAG (left) compared to GPT-2 huge unmodified with 1.5B parameters (right). [Dataset]. http://doi.org/10.1371/journal.pone.0253905.t004
    Available download formats: xls
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Justin Sybrandt; Ilya Safro
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Both systems are given the same title as a prompt. CBAG receives metadata. Results truncated for space.

  17. pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Cite
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Dataset updated
    Jul 10, 2022
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  18. The-Stack

    • opendatalab.com
    zip
    Updated Jan 1, 2022
    Cite
    Hugging Face (2022). The-Stack [Dataset]. https://opendatalab.com/OpenDataLab/the-stack
    Available download formats: zip
    Dataset updated
    Jan 1, 2022
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    ServiceNow Research
    License

    https://huggingface.co/datasets/bigcode/the-stack

    Description

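    The Stack contains over 6TB of permissively licensed source code files covering 358 programming languages. The dataset was created as part of the BigCode project, an open scientific collaboration working on the responsible development of large language models for code (Code LLMs). The Stack serves as a pre-training dataset for Code LLMs, i.e., code-generating AI systems that can synthesize programs from natural language descriptions as well as from other code snippets.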

  19. pile-europarl

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    Timaeus (2024). pile-europarl [Dataset]. https://huggingface.co/datasets/timaeus/pile-europarl
    Dataset updated
    Nov 17, 2024
    Dataset authored and provided by
    Timaeus
    Description

    Dataset: Pile - Europarl

    This dataset is a subset of Europarl data from The Pile dataset, containing text examples from this specific source.

      Dataset Source
    

    This dataset is derived from the uncopyrighted portion of The Pile dataset, specifically accessed through:

    monology/pile-uncopyrighted

      Citations
    

    If you use this dataset, please cite the original Pile papers: @article{gao2020pile, title={The Pile: An 800GB dataset of diverse text for language modeling}… See the full description on the dataset page: https://huggingface.co/datasets/timaeus/pile-europarl.

  20. SEDE Dataset

    • paperswithcode.com
    Updated Aug 8, 2024
    Cite
    Moshe Hazoom; Vibhor Malik; Ben Bogin (2024). SEDE Dataset [Dataset]. https://paperswithcode.com/dataset/sede
    Dataset updated
    Aug 8, 2024
    Authors
    Moshe Hazoom; Vibhor Malik; Ben Bogin
    Description

    SEDE is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer out of a natural interaction. These pairs contain a variety of real-world challenges which have rarely been reflected so far in any other semantic parsing dataset. The goal of this dataset is to take a significant step towards evaluation of Text-to-SQL models in a real-world setting. Compared to other Text-to-SQL datasets, SEDE contains at least 10 times more SQL query templates (queries after canonization and anonymization of values), and has the most diverse set of utterances and SQL queries (in terms of 3-grams) out of all single-domain datasets. SEDE introduces real-world challenges, such as under-specification, usage of parameters in queries, date manipulation and more.

hoskinson-center/proof-pile: 5 scholarly articles cite this dataset.