9 datasets found
  1. Drive_Stats

    • huggingface.co
    Cite
    Backblaze, Drive_Stats [Dataset]. https://huggingface.co/datasets/backblaze/Drive_Stats
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Backblaze (http://www.backblaze.com/)
    Authors
    Backblaze
    License

    https://choosealicense.com/licenses/other/

    Description

    Drive Stats

    Drive Stats is a public dataset of daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has been open-sourcing since April 2013. Currently, Drive Stats comprises over 388 million records, growing by over 240,000 records per day. Drive Stats is an append-only dataset: it logs daily statistics that, once written, are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.

  2. pubmed25

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Hà Huy Hoàng (2025). pubmed25 [Dataset]. https://huggingface.co/datasets/HoangHa/pubmed25
    Authors
    Hà Huy Hoàng
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information. This version is modified to extract the full text from structured abstracts.

  3. US-PD-Books

    • huggingface.co
    Updated Mar 12, 2024
    Cite
    Sebastian Majstorovic (2024). US-PD-Books [Dataset]. https://huggingface.co/datasets/storytracer/US-PD-Books
    Authors
    Sebastian Majstorovic
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    UPDATE: The Internet Archive has requested that this dataset be deleted (see discussion #2) because they consider the IA's metadata too unreliable to determine whether a book is in the public domain. To alleviate the IA's concerns, the full texts of the books have been removed from this dataset until a more reliable way to curate public domain books from the IA collections is established. The metadata and documentation remain for reference purposes. I was able to recreate one subcollection… See the full description on the dataset page: https://huggingface.co/datasets/storytracer/US-PD-Books.

  4. JARVIS QMOF

    • materials.colabfit.org
    • huggingface.co
    Updated Jan 21, 2024
    Cite
    Andrew S. Rosen; Shaelyn M. Iyer; Debmalya Ray; Zhenpeng Yao; Alán Aspuru-Guzik; Laura Gagliardi; Justin M. Notestein; Randall Q. Snurr (2024). JARVIS QMOF [Dataset]. https://materials.colabfit.org/id/DS_221svb9fxfk7_0
    Dataset provided by
    ColabFit
    Authors
    Andrew S. Rosen; Shaelyn M. Iyer; Debmalya Ray; Zhenpeng Yao; Alán Aspuru-Guzik; Laura Gagliardi; Justin M. Notestein; Randall Q. Snurr
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The JARVIS_QMOF dataset is part of the Joint Automated Repository for Various Integrated Simulations (JARVIS) database. This dataset contains configurations from the Quantum Metal-Organic Frameworks (QMOF) dataset, comprising quantum-chemical properties for >14,000 experimentally synthesized MOFs. QMOF contains "DFT-ready" data: filtered to remove omitted, overlapping, unbonded or deleted atoms, along with other kinds of problematic structures commented on in the literature. Data were generated via a high-throughput DFT workflow at the PBE-D3(BJ) level of theory using VASP software. JARVIS is a set of tools and collected datasets built to meet current materials design challenges.

  5. IGC-2022-1

    • huggingface.co
    Updated Feb 1, 2022
    Cite
    Stofnun Árna Magnússonar í íslenskum fræðum (2022). IGC-2022-1 [Dataset]. https://huggingface.co/datasets/arnastofnun/IGC-2022-1
    Dataset authored and provided by
    Stofnun Árna Magnússonar í íslenskum fræðum
    Description

    🔴IMPORTANT❗🔴 There is now a newer and bigger version of IGC available here: https://huggingface.co/datasets/arnastofnun/IGC-2024. The data has been deleted from this dataset.

    THE ICELANDIC GIGAWORD CORPUS - JSONL-FORMAT

    This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with an open licence (CC-BY), in a jsonl format, which is suitable for LLM training.

    ABOUT THE… See the full description on the dataset page: https://huggingface.co/datasets/arnastofnun/IGC-2022-1.

  6. vi_pubmed

    • huggingface.co
    Updated Mar 14, 2023
    Cite
    Long Phan (2023). vi_pubmed [Dataset]. https://huggingface.co/datasets/justinphan3110/vi_pubmed
    Authors
    Long Phan
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for PubMed

    Dataset Summary

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

    Supported Tasks and Leaderboards

    [More Information Needed]

    Languages

    English

    Dataset Structure

    Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.

  7. MSR_data_cleaned

    • huggingface.co
    Updated Jan 16, 2016
    Cite
    Mohammad Taghavi (2016). MSR_data_cleaned [Dataset]. https://huggingface.co/datasets/starsofchance/MSR_data_cleaned
    Authors
    Mohammad Taghavi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    MSR Data Cleaned - C/C++ Code Vulnerability Dataset

    📌 Dataset Description

    A curated collection of C/C++ code vulnerabilities paired with:

    • CVE details (scores, classifications, exploit status)
    • Code changes (commit messages, added/deleted lines)
    • File-level and function-level diffs

    🔍 Sample Data Structure from original file

    +---------------+-----------------+----------------------+---------------------------+
    | CVE ID | Attack Origin | Publish Date…

    See the full description on the dataset page: https://huggingface.co/datasets/starsofchance/MSR_data_cleaned.

  8. agentlans_multilingual-sentences_paired_10_sts

    • huggingface.co
    Updated Nov 4, 2016
    Cite
    Erick Merino (2016). agentlans_multilingual-sentences_paired_10_sts [Dataset]. https://huggingface.co/datasets/erickfmm/agentlans_multilingual-sentences_paired_10_sts
    Authors
    Erick Merino
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Spanish sentences from agentlans/multilingual-sentences, scored for sentence similarity (cosine) with the model hiiamsid/sentence_similarity_spanish_es. Each sentence in the original dataset was randomly paired with 10 rows (sentences) within a batch of 1000, the sentence similarity was calculated for each pair, and duplicate pairs were then deleted. The code for processing can be found here. Useful for data distillation, training or benchmarking. It's recommended resampling the dataset to undersample to get a… See the full description on the dataset page: https://huggingface.co/datasets/erickfmm/agentlans_multilingual-sentences_paired_10_sts.
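
    The pairing step described above can be sketched roughly as follows. This is not the author's processing code (which the description links to but does not show here); the function names `sample_pairs` and `cosine`, the seeding, and the choice to treat a pair as unordered are all assumptions made for illustration. Embeddings would come from the similarity model and are not computed in this sketch.

```python
import math
import random


def sample_pairs(n_sentences, batch_size=1000, partners=10, seed=0):
    """Hypothetical helper: pair each sentence index with up to `partners`
    random other indices from the same consecutive batch, then drop
    duplicate pairs by treating (i, j) and (j, i) as the same pair."""
    rng = random.Random(seed)
    pairs = set()
    for start in range(0, n_sentences, batch_size):
        batch = range(start, min(start + batch_size, n_sentences))
        for i in batch:
            # A sentence is never paired with itself.
            candidates = [j for j in batch if j != i]
            k = min(partners, len(candidates))
            for j in rng.sample(candidates, k):
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

    Storing each pair as a sorted tuple in a set is one simple way to remove duplicates; the original pipeline may define "duplicate" differently.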

  9. ru-paraphrase-NMT-Leipzig-cleaned

    • huggingface.co
    Updated Aug 19, 2022
    Cite
    Fedor Yaronskiy (2022). ru-paraphrase-NMT-Leipzig-cleaned [Dataset]. https://huggingface.co/datasets/fyaronskiy/ru-paraphrase-NMT-Leipzig-cleaned
    Authors
    Fedor Yaronskiy
    Description

    Dataset Description

    The dataset is obtained by filtering David Dale's dataset of Russian paraphrases with automatic metrics. The data structure is preserved. The following have been deleted:

    • Paraphrases that have cosine LaBSE similarity with source sentences < 0.75.
    • Paraphrases that are more than 2.5 times longer than source sentences. (Most of them are looped errors of back translation.)
    • Paraphrases that are similar in spelling to the original texts (paraphrases that have ChrF++ similarity > 0.6… See the full description on the dataset page: https://huggingface.co/datasets/fyaronskiy/ru-paraphrase-NMT-Leipzig-cleaned.
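
    As a rough illustration of the three criteria listed above, a filter over precomputed scores might look like the sketch below. The function name `keep_paraphrase`, measuring length in characters, and a 0-to-1 scale for ChrF++ are assumptions; the ChrF++ criterion is truncated in the description, so any further conditions it carries are not modeled here.

```python
def keep_paraphrase(source, paraphrase, labse_cosine, chrf_pp,
                    min_labse=0.75, max_len_ratio=2.5, max_chrf=0.6):
    """Hypothetical filter: return True if the pair survives all three checks.

    labse_cosine and chrf_pp are assumed to be computed elsewhere
    and passed in as floats (ChrF++ assumed to be on a 0-1 scale).
    """
    # Too far from the source in meaning.
    if labse_cosine < min_labse:
        return False
    # Much longer than the source (length in characters is an assumption).
    if len(paraphrase) > max_len_ratio * len(source):
        return False
    # Too close to the source in spelling.
    if chrf_pp > max_chrf:
        return False
    return True


# Usage: keep only the rows that pass, given (source, paraphrase, labse, chrf) tuples.
rows = [
    ("короткий текст", "небольшой текст", 0.90, 0.40),
    ("короткий текст", "короткий текст!", 0.99, 0.95),
]
kept = [r for r in rows if keep_paraphrase(*r)]
```

    Taking the scores as arguments keeps the sketch independent of any particular embedding or metric library.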
