9 datasets found

Drive_Stats
huggingface.co
Cite
Backblaze, Drive_Stats [Dataset]. https://huggingface.co/datasets/backblaze/Drive_Stats
Dataset provided by
Backblaze (http://www.backblaze.com/)
Authors
Backblaze
License
https://choosealicense.com/licenses/other/
Description
Drive Stats

Drive Stats is a public data set of daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records, rising by over 240,000 records per day. Drive Stats is an append-only dataset effectively logging daily statistics that once written are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.
pubmed25
huggingface.co
Updated Apr 26, 2025
Cite
Hà Huy Hoàng (2025). pubmed25 [Dataset]. https://huggingface.co/datasets/HoangHa/pubmed25
Dataset updated
Apr 26, 2025
Authors
Hà Huy Hoàng
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information. This version is modified to extract the full text from structured abstracts.
US-PD-Books
huggingface.co
Updated Mar 12, 2024
Cite
Sebastian Majstorovic (2024). US-PD-Books [Dataset]. https://huggingface.co/datasets/storytracer/US-PD-Books
Dataset updated
Mar 12, 2024
Authors
Sebastian Majstorovic
License
https://choosealicense.com/licenses/cc0-1.0/
Description
UPDATE: The Internet Archive has requested that this dataset be deleted (see discussion #2) because they consider the IA's metadata too unreliable to determine whether a book is in the public domain. To alleviate the IA's concerns, the full texts of the books have been removed from this dataset until a more reliable way to curate public domain books from the IA collections is established. The metadata and documentation remain for reference purposes. I was able to recreate one subcollection… See the full description on the dataset page: https://huggingface.co/datasets/storytracer/US-PD-Books.
JARVIS QMOF
materials.colabfit.org
huggingface.co
Updated Jan 21, 2024
Cite
Andrew S. Rosen; Shaelyn M. Iyer; Debmalya Ray; Zhenpeng Yao; Alán Aspuru-Guzik; Laura Gagliardi; Justin M. Notestein; Randall Q. Snurr (2024). JARVIS QMOF [Dataset]. https://materials.colabfit.org/id/DS_221svb9fxfk7_0
Dataset updated
Jan 21, 2024
Dataset provided by
ColabFit
Authors
Andrew S. Rosen; Shaelyn M. Iyer; Debmalya Ray; Zhenpeng Yao; Alán Aspuru-Guzik; Laura Gagliardi; Justin M. Notestein; Randall Q. Snurr
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The JARVIS_QMOF dataset is part of the Joint Automated Repository for Various Integrated Simulations (JARVIS) database. This dataset contains configurations from the Quantum Metal-Organic Frameworks (QMOF) dataset, comprising quantum-chemical properties for >14,000 experimentally synthesized MOFs. QMOF contains "DFT-ready" data: filtered to remove omitted, overlapping, unbonded or deleted atoms, along with other kinds of problematic structures commented on in the literature. Data were generated via a high-throughput DFT workflow at the PBE-D3(BJ) level of theory using VASP software. JARVIS is a set of tools and collected datasets built to meet current materials design challenges.
IGC-2022-1
huggingface.co
Updated Feb 1, 2022
Cite
Stofnun Árna Magnússonar í íslenskum fræðum (2022). IGC-2022-1 [Dataset]. https://huggingface.co/datasets/arnastofnun/IGC-2022-1
Dataset updated
Feb 1, 2022
Dataset authored and provided by
Stofnun Árna Magnússonar í íslenskum fræðum
Description
🔴IMPORTANT❗🔴 There is now a newer and bigger version of IGC available here: https://huggingface.co/datasets/arnastofnun/IGC-2024. The data has been deleted from this dataset.

THE ICELANDIC GIGAWORD CORPUS - JSONL-FORMAT

This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with an open licence (CC-BY), in a jsonl format, which is suitable for LLM training.

ABOUT THE… See the full description on the dataset page: https://huggingface.co/datasets/arnastofnun/IGC-2022-1.
vi_pubmed
huggingface.co
Updated Mar 14, 2023
Cite
Long Phan (2023). vi_pubmed [Dataset]. https://huggingface.co/datasets/justinphan3110/vi_pubmed
Dataset updated
Mar 14, 2023
Authors
Long Phan
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for PubMed

Dataset Summary

NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.
MSR_data_cleaned
huggingface.co
Updated Jan 16, 2016
Cite
Mohammad Taghavi (2016). MSR_data_cleaned [Dataset]. https://huggingface.co/datasets/starsofchance/MSR_data_cleaned
Dataset updated
Jan 16, 2016
Authors
Mohammad Taghavi
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
MSR Data Cleaned - C/C++ Code Vulnerability Dataset

📌 Dataset Description

A curated collection of C/C++ code vulnerabilities paired with:

CVE details (scores, classifications, exploit status)
Code changes (commit messages, added/deleted lines)
File-level and function-level diffs

🔍 Sample Data Structure from original file

+---------------+-----------------+----------------------+---------------------------+ | CVE ID | Attack Origin | Publish Date… See the full description on the dataset page: https://huggingface.co/datasets/starsofchance/MSR_data_cleaned.
agentlans_multilingual-sentences_paired_10_sts
huggingface.co
Updated Nov 4, 2016
Cite
Erick Merino (2016). agentlans_multilingual-sentences_paired_10_sts [Dataset]. https://huggingface.co/datasets/erickfmm/agentlans_multilingual-sentences_paired_10_sts
Dataset updated
Nov 4, 2016
Authors
Erick Merino
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Spanish sentences from agentlans/multilingual-sentences, scored for sentence similarity (cosine scores) with the model hiiamsid/sentence_similarity_spanish_es. Each sentence in the original dataset was randomly assigned 10 rows (sentences) within a batch of 1000, the sentence similarity was calculated, and duplicate pairs were then deleted. The code for processing can be found here. Useful for data distillation, training or benchmarking. It is recommended to resample the dataset to undersample to get a… See the full description on the dataset page: https://huggingface.co/datasets/erickfmm/agentlans_multilingual-sentences_paired_10_sts.
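The pairing procedure described above (random partners within a batch, a cosine score per pair, then removal of duplicate pairs) can be sketched roughly as follows. This is only an illustration and not the author's processing code: `pair_batch`, `cosine`, and the `embed` callable are hypothetical names, and `embed` stands in for whatever sentence-embedding model is used.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_batch(sentences, embed, partners=10, seed=0):
    """Pair each sentence with random partners drawn from the same batch.

    `embed` is a hypothetical callable mapping a sentence to a vector.
    Returns {(i, j): score} with i < j, so (i, j) and (j, i) collapse
    into one entry, which is how duplicate pairs are dropped here.
    """
    rng = random.Random(seed)
    vectors = [embed(s) for s in sentences]
    indices = range(len(sentences))
    pairs = {}
    for i in indices:
        others = [j for j in indices if j != i]
        for j in rng.sample(others, min(partners, len(others))):
            key = (min(i, j), max(i, j))
            if key not in pairs:
                pairs[key] = cosine(vectors[i], vectors[j])
    return pairs
```

In this sketch a batch would be a list of up to 1000 sentences, and `partners=10` mirrors the 10 rows assigned per sentence; both values come from the description, while everything else is an assumption.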
ru-paraphrase-NMT-Leipzig-cleaned
huggingface.co
Updated Aug 19, 2022
Cite
Fedor Yaronskiy (2022). ru-paraphrase-NMT-Leipzig-cleaned [Dataset]. https://huggingface.co/datasets/fyaronskiy/ru-paraphrase-NMT-Leipzig-cleaned
Dataset updated
Aug 19, 2022
Authors
Fedor Yaronskiy
Description
Dataset Description

The dataset was obtained by filtering David Dale's dataset of Russian paraphrases with automatic metrics. The data structure is preserved. The following were deleted:

Paraphrases that have cosine LaBSE similarity with source sentences < 0.75.
Paraphrases that are more than 2.5 times longer than source sentences (most of them are looped back-translation errors).
Paraphrases that are similar in spelling to the original texts (paraphrases that have ChrF++ similarity > 0.6… See the full description on the dataset page: https://huggingface.co/datasets/fyaronskiy/ru-paraphrase-NMT-Leipzig-cleaned.
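The first two deletion criteria listed above can be expressed as a small filter. This is a hedged sketch rather than the dataset author's code: `keep_paraphrase` is a hypothetical helper, the similarity score is assumed to be precomputed, and length is assumed to be measured in characters (the description does not say which unit is used). The ChrF++ criterion is left out because its description is truncated in the listing.

```python
def keep_paraphrase(source, paraphrase, labse_similarity,
                    min_similarity=0.75, max_length_ratio=2.5):
    """Return True if a paraphrase passes the two fully stated filters.

    labse_similarity: precomputed cosine similarity between the source
    and the paraphrase (assumed to be supplied by the caller).
    Length is compared in characters, which is an assumption.
    """
    if labse_similarity < min_similarity:
        return False  # too dissimilar in meaning from the source
    if len(paraphrase) > max_length_ratio * len(source):
        return False  # much longer than the source
    return True
```

A caller would apply this to each (source, paraphrase, score) row and keep only the rows for which it returns True.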