Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Downloading the Options IV SP500 Dataset
This document will guide you through the steps to download the Options IV SP500 dataset from Hugging Face Datasets. The dataset includes data on S&P 500 options, including implied volatility. To start, you'll need to install Hugging Face's datasets library if you haven't done so already. You can do this with the following pip command:
!pip install datasets
Here's the Python code to load the Options IV SP500 dataset from Hugging… See the full description on the dataset page: https://huggingface.co/datasets/gauss314/options-IV-SP500.
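As a minimal sketch of that loading step (the dataset ID comes from the page URL above; the "train" split and the use of pandas are assumptions to verify against the returned object):

from datasets import load_dataset

# Load the dataset by the ID shown in the page URL.
dataset = load_dataset("gauss314/options-IV-SP500")
print(dataset)                      # shows splits, features, and row counts
df = dataset["train"].to_pandas()   # assumes a "train" split exists
print(df.head())                    # peek at the option records and IV columns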
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Microbiome Immunity Project: Protein Universe
~200,000 predicted structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life, annotated functionally on a per-residue basis.
Quickstart Usage
Install HuggingFace Datasets package
Each subset can be loaded into Python using the Huggingface datasets library. First, from the command line, install the datasets library:
$ pip install datasets
Optionally set the… See the full description on the dataset page: https://huggingface.co/datasets/RosettaCommons/MIP.
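A hedged sketch of the quickstart above; the subset names are not shown in this excerpt, so they are discovered at runtime rather than assumed:

import datasets

# Discover the available subsets rather than assuming any particular name.
subsets = datasets.get_dataset_config_names("RosettaCommons/MIP")
print(subsets)

# Load one subset; replace subsets[0] with the subset you need.
mip = datasets.load_dataset("RosettaCommons/MIP", name=subsets[0])
print(mip)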
Quickstart Usage
This dataset can be loaded into Python using the Huggingface datasets library. First, install the datasets library via the command line:
$ pip install datasets
With datasets installed, the user should then import it into their python script / environment:
import datasets
The user can then load the CF-MS_Homo_sapiens_PPI dataset using datasets.load_dataset(...). There are two configurations, or 'views' for the set. The user can choose between them via the name… See the full description on the dataset page: https://huggingface.co/datasets/viridono/CF-MS_Homo_sapiens_PPI.
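Since the two view names are cut off above, a minimal sketch can list them at runtime and then load one by name:

import datasets

# List the available configurations ('views') instead of guessing their names.
views = datasets.get_dataset_config_names("viridono/CF-MS_Homo_sapiens_PPI")
print(views)  # expect two entries

# Load the chosen view by passing its name to load_dataset.
ppi = datasets.load_dataset("viridono/CF-MS_Homo_sapiens_PPI", name=views[0])
print(ppi)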
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Persian Question Answering (PersianQA) Dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced dataset consists of more than 9,000 entries. Each entry is either an impossible-to-answer question or a question with one or more answer spans in the passage (the context) from which the question was posed. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".
Moreover, around 900 test examples are available. On top of that, the very first models trained on the dataset, based on Transformers, are available online.
All the crowdworkers of the dataset are native Persian speakers. It is also worth mentioning that the contexts are collected from all categories of the Persian Wikipedia (history, religion, geography, science, etc.).
At the moment, each context has 7 question-answer pairs and 3 impossible questions.
You can find the dataset under the dataset directory and use it as shown below:
from read_ds import read_qa  # read_ds.py is available under src/
train_ds = read_qa('pqa_train.json')
test_ds = read_qa('pqa_test.json')
Alternatively, you can also access the data through the HuggingFace🤗 datasets library. For that, you need to install datasets using this command in your terminal:
pip install -q datasets
Afterwards, load the persian_qa dataset using load_dataset:
from datasets import load_dataset
dataset = load_dataset("SajjadAyoubi/persian_qa")
| Split | # of instances | # of unanswerables | avg. question length | avg. paragraph length | avg. answer length |
|---|---|---|---|---|---|
| Train | 9,000 | 2,700 | 8.39 | 224.58 | 9.61 |
| Test | 938 | 280 | 8.02 | 220.18 | 5.99 |
The lengths are on the token level.
To learn more about the data and see more examples, take a look here.
Currently, two baseline models on the HuggingFace🤗 model hub use the dataset. The models are listed in the table below.
As of yet, we haven't published any papers on this work.
However, if you use the dataset, please cite us with an entry like the one below.
@misc{PersianQA,
  author = {Ayoubi, Sajjad and Davoodeh, Mohammad Yasin},
  title = {PersianQA: a dataset for Persian Question Answering},
  year = 2021,
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SajjjadAyobi/PersianQA}},
}
🛠️ Requirements and Installation
git clone https://github.com/Yofuria/ICE.git
cd ICE
conda create -n ICE python=3.10
conda activate ICE
pip install -r requirements.txt
In lines 32 and 33 of examples/run_knowedit_llama2.py, you need to download the punkt package.
If your Internet speed is fast enough, you can run the code directly from the command line.
if __name__ == "__main__":  # If you have a slow Internet connection and… See the full description on the dataset page: https://huggingface.co/datasets/kailinjiang/punkt.
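A minimal sketch of that download step, assuming the punkt package here refers to NLTK's punkt tokenizer data:

import nltk

if __name__ == "__main__":
    # With a fast connection, fetch the punkt tokenizer data directly.
    # With a slow connection, download the files from this dataset instead
    # and point NLTK at the local copy, e.g. nltk.data.path.append("<local_punkt_dir>").
    nltk.download("punkt")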
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Attentive Skin
To Predict Skin Corrosion/Irritation Potentials of Chemicals via Explainable Machine Learning Methods
Download: https://github.com/BeeBeeWong/AttentiveSkin/releases/tag/v1.0
Quickstart Usage
Load a dataset in python
Each subset can be loaded into Python using the Huggingface datasets library. First, from the command line, install the datasets library:
$ pip install datasets
Then, from within Python, load the datasets library:
import datasets… See the full description on the dataset page: https://huggingface.co/datasets/maomlab/AttentiveSkin.
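A hedged sketch of that quickstart; the default subset and split layout are assumptions, so the returned object is printed to see what actually exists:

import datasets

# If the dataset defines several subsets, pass name=<subset>; the available
# names can be listed with datasets.get_dataset_config_names("maomlab/AttentiveSkin").
skin = datasets.load_dataset("maomlab/AttentiveSkin")
print(skin)

# Peek at a few raw records from the first available split.
split = next(iter(skin))
for record in skin[split].select(range(3)):
    print(record)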
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the complete code, models, and datasets for the article ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation.
In case you cannot access the article, this preprint is available: ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.
Portela, J.R., Pérez-Terán, N., Manrique, R. (2026). ESNLIR: Expanding Spanish NLI Benchmarks with Multi-genre and Causal Annotation. In: Florez, H., Peluffo-Ordoñez, D. (eds) Applied Informatics. ICAI 2025. Communications in Computer and Information Science, vol 2667. Springer, Cham. https://doi.org/10.1007/978-3-032-07175-0_23
If you still want to use the Zenodo repository, follow the steps below. But once again, it is way easier to work with the links above.
----------------------------------------------------------------------------------------------
This repository is a poetry project, which means that it can be installed easily by executing the following command from a shell in the repository folder:
poetry install
As this repository is script based, the README.md file contains all the commands executed to generate the dataset and train models.
----------------------------------------------------------------------------------------------
The core code used for all the experiments is in the folder auto-nli, and all the calls to the core code with the requested parameters are found in README.md.
----------------------------------------------------------------------------------------------
All the parameters to create datasets and train models with the core code are found in the folder parameters.
----------------------------------------------------------------------------------------------
For BERT-based models, all in PyTorch, two types of models from Hugging Face were used for training; they are also required to load a dataset because of the tokenizer:
The model folder contains all the trained models for the paper. There are three types of models:
Models with the suffix _annot are trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder (ex: ) contains the test results for the test set and the stress test sets (ex: )
Models are found in the folder model, and all of them are PyTorch models that can be loaded with the Hugging Face interface:
from transformers import AutoModel
model = AutoModel.from_pretrained('<model_folder>')  # placeholder path; the original value is truncated in this excerpt
----------------------------------------------------------------------------------------------
This file is included outside the ZIP containing all other files, and it contains the final test dataset with 974 examples selected by human majority label matching the original linking phrase label.
The datasets can be found in the folder data, which is divided into the following folders:
The splits to train, validate, and test the models.
Train-val-test splits extracted for each corpus. They are used to generate base_dataset.
Pairs of sentences found in each corpus. They are used to generate splits_data.
This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:
Example:
{"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews_spanish_pd_news_531537","dataset":"esnews_spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}
To load a dataset/split as a PyTorch object for training, validation, and testing, you must use the custom dataset class:
from auto_nli.model.bert_based.dataset import BERTDataset
dataset = BERTDataset(os.path.join(dataset_folder, <split_file>),
                      max_len=<max_len>,
                      model_type=<model_type>,
                      only_premise=<only_premise>,
                      max_samples=<max_samples>)
# The argument values are truncated in this excerpt; the <...> entries are placeholders.
----------------------------------------------------------------------------------------------
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for VLM4Bio
Instructions for downloading the dataset
Install Git LFS, then clone the VLM4Bio repository to download all metadata and associated files. Run the following commands in a terminal:
git clone https://huggingface.co/datasets/imageomics/VLM4Bio
cd VLM4Bio
Downloading and processing bird images
To download the bird images, run the following command:
bash download_bird_images.sh
This should download the bird images inside datasets/Bird/images… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/VLM4Bio.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
The Phi-3-Mini-4K-Instruct is a 3.8B-parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets, which include both synthetic data and filtered, publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, which is the context length (in tokens) it can support.
The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context, and logical reasoning, Phi-3 Mini-4K-Instruct showcased robust, state-of-the-art performance among models with fewer than 13 billion parameters.
Resources and Technical Documentation:
Primary use cases
The model is intended for commercial and research use in English. It is suitable for applications that require:
1) Memory/compute constrained environments
2) Latency bound scenarios
3) Strong reasoning (especially code, math, and logic)
Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
Use case considerations
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using them within a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Phi-3 Mini-4K-Instruct has been integrated in the development version (4.41.0.dev0) of transformers. Until the official version is released through pip, ensure that you are doing one of the following:
When loading the model, ensure that trust_remote_code=True is passed as an argument of the from_pretrained() function.
Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. The previous command is an alternative to cloning and installing from the source.
The current transformers version can be verified with: pip list | grep transformers.
Phi-3 Mini-4K-Instruct is also available in HuggingChat.
Phi-3 Mini-4K-Instruct supports a vocabulary size of up to 32064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
Given the nature of the training data, the Phi-3 Mini-4K-Instruct model is best suited for prompts using the chat format as follows.
You can provide the prompt as a question with a generic template as follows:
<|user|>
Question <|end|>
<|assistant|>
For example:
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
where the model generates the text after <|assistant|>. For a few-shot prompt, the prompt can be formatted as follows:
<|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:
1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic...
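Putting the loading notes and the chat format above together, a minimal sketch (the dtype, device, and generation settings are illustrative assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"

# trust_remote_code=True follows the loading note above; dtype and device
# placement are illustrative and can be adjusted to your hardware.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens (the assistant's reply).
reply = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(reply)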
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Simeonov2008
The Simeonov2008 dataset contains 7,152 compounds in the train set, with high-throughput screening (HTS) results recorded in the "Activity Outcome" column.
Quickstart Usage
Load a dataset in python
Each subset can be loaded into Python using the Huggingface datasets library. First, from the command line, install the datasets library:
$ pip install datasets
Then, from within Python, load the datasets library:
import datasets
and load the… See the full description on the dataset page: https://huggingface.co/datasets/haneulpark/Simeonov2008.
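A hedged sketch of that quickstart; the "train" split and the "Activity Outcome" column come from the description above, while the subset handling is an assumption:

import datasets

# If the dataset defines several subsets, pass name=<subset>; names can be
# listed with datasets.get_dataset_config_names("haneulpark/Simeonov2008").
sim = datasets.load_dataset("haneulpark/Simeonov2008")
train = sim["train"]
print(len(train))                      # expected: 7,152 compounds
print(train["Activity Outcome"][:5])   # peek at the HTS outcomes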
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
How to use
Make sure your environment is set up:
pip install datasets
Run the following command to download and load your data:
from datasets import load_dataset
dataset = load_dataset("aidenpan/s_clips-v1.0-safe")
Print it out
print(dataset["val"]["identifier"])
['137720:2', '221257:7', '159943:2', '124745:14', '179035:9'... ]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multimodal Pragmatic Jailbreak on Text-to-image Models
Project page | Paper | Code
The Multimodal Pragmatic Unsafe Prompts (MPUP) is a dataset designed to assess multimodal pragmatic safety in Text-to-Image (T2I) models. It comprises two key sections: image_prompt and text_prompt.
Dataset Usage
Downloading the Data
To download the dataset, install Huggingface Datasets and then use the following command:
from datasets import load_dataset
dataset =… See the full description on the dataset page: https://huggingface.co/datasets/tongliuphysics/multimodalpragmatic.
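A hedged completion of the truncated command, using the dataset ID from the page URL (configuration and split names are not shown in this excerpt):

from datasets import load_dataset

# Load the dataset and inspect its splits and the image_prompt / text_prompt fields.
dataset = load_dataset("tongliuphysics/multimodalpragmatic")
print(dataset)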
License: https://choosealicense.com/licenses/cc/
How to download
Set up environment
pip install datasets tqdm
wget https://raw.githubusercontent.com/bytedance/coconut_cvpr2024/main/download_coconut.py
Use the download script to download the COCONut dataset splits.
python download_coconut.py # default split: relabeled_coco_val
The above command should print your download status; if the download succeeds, you will see the results below:
Download other COCONut dataset splits.
If you want to download the other splits… See the full description on the dataset page: https://huggingface.co/datasets/xdeng77/relabeled_coco_val.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Preparing OpenMLPerf dataset
To process the semi-raw MLPerf data into the OpenMLPerf dataset, run the following command:
bzip2 -d semi-raw-mlperf-data.tar.bz2
tar xvf semi-raw-mlperf-data.tar
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python process.py
The processed dataset will be saved both as… See the full description on the dataset page: https://huggingface.co/datasets/gfursin/OpenMLPerf.
If you want your Termux/terminal to run very smoothly and without any problems, these commands are for you.
Install in one click!
sh install.sh / bash install.sh
Or follow the manual installation process below.
Termux Installation Commands
| PKG Command | PIP Command |
|---|---|
| termux-change-repo | pip install requests |
| pkg update | pip2 install requests |
| pkg upgrade | pip3 install requests |
| pkg install python | pip install mechanize… |

See the full description on the dataset page: https://huggingface.co/datasets/poisk-ls/jade-cmd.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SUPPORT ME ON PATREON
https://www.patreon.com/c/Rombodawg
Prerequisites:
Python: https://www.python.org/downloads/
Git: https://git-scm.com/downloads
Instructions:
Make sure Python and Git are installed. Open a command prompt in your local folder. In the command prompt, run:
git lfs install
then git clone https://huggingface.co/datasets/Rombo-Org/Easy_Galore_8bit_training_With_Native_Windows_Support
then cd Easy_Galore_8bit_training_With_Native_Windows_Support
Now… See the full description on the dataset page: https://huggingface.co/datasets/Rombo-Org/Easy_Galore_8bit_training_With_Native_Windows_Support.
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
📷 This is the code and dataset for the paper: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models. ACM MM 2024.
Preparation steps: environment installation
(1) Environment installation command:
pip install -r requirements.txt
(2) Please fill in the API information in the file:… See the full description on the dataset page: https://huggingface.co/datasets/BRZ911/ViTCoT.
Dataset
Download Data
UNPC_EN_ZH
You may download EN.txt and ZH.txt manually, or use Git:
git lfs install
git clone https://huggingface.co/datasets/ZkiZkiZki/UNPC_EN_ZH
Make sure the dataset paths are correct:
data/UNPC_EN_ZH/EN.txt
data/UNPC_EN_ZH/ZH.txt
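A minimal sketch for reading the corpus, assuming EN.txt and ZH.txt are line-aligned (line i of one file is the translation of line i of the other):

# Read the parallel corpus into (English, Chinese) sentence pairs.
with open("data/UNPC_EN_ZH/EN.txt", encoding="utf-8") as en_file, \
     open("data/UNPC_EN_ZH/ZH.txt", encoding="utf-8") as zh_file:
    pairs = [
        (en.rstrip("\n"), zh.rstrip("\n"))
        for en, zh in zip(en_file, zh_file)
    ]

print(len(pairs))   # number of sentence pairs
print(pairs[0])     # first (English, Chinese) pair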
Reference
Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
mPLUG/DocStruct4M reformatted for VSFT with TRL's SFT Trainer. Referenced the format of HuggingFaceH4/llava-instruct-mix-vsft. The dataset uses image paths instead of embedding actual images. To access the images, you'll need to download them from the original mPLUG/DocStruct4M dataset. To download the original images, use the following commands:
pip install -U "huggingface_hub[cli]"
huggingface-cli download mPLUG/DocStruct4M --repo-type dataset
As specified in the official repo, extract the… See the full description on the dataset page: https://huggingface.co/datasets/Ryoo72/DocStruct4M_ip.
Natural Reasoning Embeddings
This is a dataset containing the embeddings for the Natural Reasoning dataset, computed with the same embedding model as the original paper. The code that created these embeddings is below.