https://choosealicense.com/licenses/other/
Drive Stats
Drive Stats is a public dataset of daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records, rising by over 240,000 records per day. Drive Stats is an append-only dataset: it effectively logs daily statistics that, once written, are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information. This version is modified to extract the full text from structured abstracts.
https://choosealicense.com/licenses/cc0-1.0/
UPDATE: The Internet Archive has requested that this dataset be deleted (see discussion #2) because they consider the IA's metadata too unreliable to determine whether a book is in the public domain. To alleviate the IA's concerns, the full texts of the books have been removed from this dataset until a more reliable way to curate public domain books from the IA collections is established. The metadata and documentation remain for reference purposes. I was able to recreate one subcollection… See the full description on the dataset page: https://huggingface.co/datasets/storytracer/US-PD-Books.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The JARVIS_QMOF dataset is part of the joint automated repository for various integrated simulations (JARVIS) database. This dataset contains configurations from the Quantum Metal-Organic Frameworks (QMOF) dataset, comprising quantum-chemical properties for >14,000 experimentally synthesized MOFs. QMOF contains "DFT-ready" data: filtered to remove omitted, overlapping, unbonded or deleted atoms, along with other kinds of problematic structures commented on in the literature. Data were generated via a high-throughput DFT workflow at the PBE-D3(BJ) level of theory using VASP software. JARVIS is a set of tools and collected datasets built to meet current materials design challenges.
🔴IMPORTANT❗🔴 There is now a newer and bigger version of IGC available here: https://huggingface.co/datasets/arnastofnun/IGC-2024. The data has been deleted from this dataset.
THE ICELANDIC GIGAWORD CORPUS - JSONL-FORMAT
This package contains those subcorpora of the Icelandic Gigaword Corpus, version 22.10 (http://hdl.handle.net/20.500.12537/253), that have been published with an open licence (CC-BY), in a jsonl format, which is suitable for LLM training.
ABOUT THE… See the full description on the dataset page: https://huggingface.co/datasets/arnastofnun/IGC-2022-1.
https://choosealicense.com/licenses/other/
Dataset Card for PubMed
Dataset Summary
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
English
Dataset Structure
Bear… See the full description on the dataset page: https://huggingface.co/datasets/justinphan3110/vi_pubmed.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
MSR Data Cleaned - C/C++ Code Vulnerability Dataset
📌 Dataset Description
A curated collection of C/C++ code vulnerabilities paired with:
CVE details (scores, classifications, exploit status)
Code changes (commit messages, added/deleted lines)
File-level and function-level diffs
🔍 Sample Data Structure from original file
+---------------+-----------------+----------------------+---------------------------+ | CVE ID | Attack Origin | Publish Date… See the full description on the dataset page: https://huggingface.co/datasets/starsofchance/MSR_data_cleaned.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Spanish sentences from agentlans/multilingual-sentences, scored for sentence similarity (cosine) with the model hiiamsid/sentence_similarity_spanish_es. Each sentence in the original dataset was randomly paired with 10 rows (sentences) within a batch of 1000, the sentence similarity was calculated, and duplicate pairs were then deleted. The code for processing can be found here. Useful for data distillation, training or benchmarking. It's recommended to resample the dataset to undersample to get a… See the full description on the dataset page: https://huggingface.co/datasets/erickfmm/agentlans_multilingual-sentences_paired_10_sts.
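The pairing procedure described for this dataset can be sketched roughly as follows. This is only an illustration under stated assumptions, not the dataset author's code: the function names `sample_pairs` and `cosine` are hypothetical, the exact sampling scheme is not given in the description, and the sentence embeddings that the cosine score would be applied to are assumed to come from the named model in a separate step.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_pairs(n_sentences, batch_size=1000, partners=10, seed=0):
    """Pair each sentence index with random partners from its own batch.

    Assumption: batches are consecutive slices of the dataset, and a pair
    (i, j) counts as a duplicate of (j, i), so pairs are stored in sorted order.
    """
    rng = random.Random(seed)
    pairs = set()
    for start in range(0, n_sentences, batch_size):
        batch = range(start, min(start + batch_size, n_sentences))
        for i in batch:
            candidates = [j for j in batch if j != i]
            for j in rng.sample(candidates, min(partners, len(candidates))):
                # Storing (low, high) in a set removes duplicate pairs.
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)
```

Each resulting index pair would then be scored with `cosine` on the two sentences' embedding vectors.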
Dataset Description
The dataset was obtained by filtering David Dale's dataset of Russian paraphrases with automatic metrics. The data structure is preserved. The following were removed:
Paraphrases that have cosine LaBSE similarity with source sentences < 0.75.
Paraphrases that are more than 2.5 times longer than source sentences (most of them are looped back-translation errors).
Paraphrases that are similar in spelling to the original texts (paraphrases that have ChrF++ similarity > 0.6… See the full description on the dataset page: https://huggingface.co/datasets/fyaronskiy/ru-paraphrase-NMT-Leipzig-cleaned.
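The removal rules listed for this dataset could be applied along these lines. This is a minimal sketch, not the author's pipeline: the `Pair` record and its field names are hypothetical, the similarity scores are assumed to be precomputed and scaled to 0..1, length is assumed to be measured in characters, and the third rule is truncated in the description, so only its visible threshold (ChrF++ > 0.6) is used.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    source: str
    paraphrase: str
    labse_sim: float  # hypothetical: precomputed cosine LaBSE similarity
    chrf_pp: float    # hypothetical: precomputed ChrF++ similarity in 0..1


def keep(pair, min_labse=0.75, max_len_ratio=2.5, max_chrf=0.6):
    """Return True if the pair survives all three visible filters."""
    if pair.labse_sim < min_labse:
        return False  # too dissimilar in meaning
    if len(pair.paraphrase) > max_len_ratio * len(pair.source):
        return False  # much longer than the source (assumed: character count)
    if pair.chrf_pp > max_chrf:
        return False  # too close in spelling to the source
    return True


def filter_pairs(pairs):
    """Keep only the pairs that pass every filter."""
    return [p for p in pairs if keep(p)]
```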