5 datasets found
  1. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the splits of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
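
    A small follow-up sketch (not part of the original example) showing how the annotation table could be filtered once loaded; the column names follow the field list above, but the filter values are illustrative:

    ```Python
    # Keep only AudioSet clips that actually have visual captions (captions_visual can be NaN).
    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    audioset = df[(df['dataset'] == 'AudioSet') & df['captions_visual'].notna()]
    print(f"{len(audioset)} AudioSet rows with visual captions")
    ```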

  2. Data from: JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability

    • researchdata.se
    • gimi9.com
    Updated Mar 21, 2025
    Cite
    Jacob Nilsson (2025). JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. http://doi.org/10.5878/e5hb-ne80
    Available download formats: (438755370), (110041420), (156812), (5417)
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Luleå University of Technology
    Authors
    Jacob Nilsson
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Luleå Municipality
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system and the messages sent within its control system-of-systems. For more information, see the attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; the training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make the data easier to use.

    The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
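
    A minimal loading sketch based on the description above; the 10% validation fraction below is an arbitrary choice, not something specified by the dataset:

    ```Python
    # Load the semicolon-separated files and carve a random validation split out of the training data.
    import pandas as pd

    train = pd.read_csv('training.csv', sep=';')
    test = pd.read_csv('test.csv', sep=';')

    val = train.sample(frac=0.1, random_state=0)  # random validation subset drawn from the training data
    train = train.drop(val.index)
    print(train.shape, val.shape, test.shape)
    ```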

    The data file with temperatures (smhi-july-23-29-2018.csv) acts as input for the thermodynamic building simulation found on GitHub, where it is used to get the outside temperature and corresponding timestamps. Temperature data for Luleå, summer 2018, were downloaded from SMHI.

  3. Flow map data of the single pendulum, double pendulum and 3-body problem

    • data.niaid.nih.gov
    Updated Apr 23, 2024
    Cite
    Horn, Philipp; Veronica, Saz Ulibarrena; Koren, Barry; Simon, Portegies Zwart (2024). Flow map data of the single pendulum, double pendulum and 3-body problem [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11032351
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Leiden Observatory
    Eindhoven University of Technology
    Authors
    Horn, Philipp; Veronica, Saz Ulibarrena; Koren, Barry; Simon, Portegies Zwart
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.

    The dataset consists of trajectory data from three different Hamiltonian systems: the single pendulum, the double pendulum and the 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The data of the double pendulum was also computed with the symplectic Euler method, however with an adaptive step size. The trajectories of the 3-body problem were calculated by the arbitrarily high-precision code Brutus.
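
    For reference, a sketch of one symplectic Euler step, shown here for a nondimensionalized single pendulum with Hamiltonian H(q, p) = p^2/2 + (1 - cos q); the concrete parameters used to generate the dataset are recorded in the files themselves and are not assumed here:

    ```Python
    import numpy as np

    def symplectic_euler_step(q, p, h=0.01):
        """One symplectic (semi-implicit) Euler step for H(q, p) = p^2/2 + (1 - cos q)."""
        p_new = p - h * np.sin(q)   # kick:  p_{n+1} = p_n - h * dH/dq(q_n)
        q_new = q + h * p_new       # drift: q_{n+1} = q_n + h * dH/dp(p_{n+1})
        return q_new, p_new
    ```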

    For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. These files are composed of several Pandas DataFrames. One DataFrame per trajectory, called "run0", "run1", ... and finally one large DataFrame in which all the trajectories are combined, called "all_runs". Additionally, one Pandas Series called "constants" is contained in these files, in which several parameters of the data are listed.

    Also, there is a second file per Hamiltonian system in which the data is prepared as features and labels ready for neural networks to be trained (*_training.h5.1). Similar to the first type of files, they contain a Series called "constants". The features and labels are then separated into 6 DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
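
    A hypothetical reading sketch for the prepared training files; the key names follow the description above, but the filename is a placeholder:

    ```Python
    import pandas as pd

    path = 'single_pendulum_training.h5.1'            # placeholder filename
    constants = pd.read_hdf(path, key='constants')    # Series of data parameters
    features = pd.read_hdf(path, key='features')      # training features
    labels = pd.read_hdf(path, key='labels')          # training labels
    print(constants)
    print(features.shape, labels.shape)
    ```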

    The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.

    Already trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.

                                   Single pendulum                  Double pendulum   3-body problem
    Number of trajectories         500                              2000              5000
    Final time in all_runs         T (one period of the pendulum)   10                10
    Final time in training data    0.25*T                           5                 5
    Step size in training data     0.1                              0.1               0.5

  4. Llama 3.1 8B Correct Labels

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
    Available download formats: zip (11853454078 bytes)
    Dataset updated
    Aug 26, 2025
    Authors
    Jatin Mehra_666
    Description

    Training code:

    ```Python
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    import os
    import pandas as pd
    import numpy as np

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    TEMP_DIR = "tmp"
    os.makedirs(TEMP_DIR, exist_ok=True)

    train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

    # Fill missing Misconception values with 'NA'
    train.Misconception = train.Misconception.fillna('NA')

    # Create a combined target label (Category:Misconception)
    train['target'] = train.Category + ":" + train.Misconception

    # Encode target labels to numerical format
    le = LabelEncoder()
    train['label'] = le.fit_transform(train['target'])
    n_classes = len(le.classes_)  # Number of unique target classes
    print(f"Train shape: {train.shape} with {n_classes} target classes")
    print("Train head:")
    train.head()

    idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
    correct = train.loc[idx].copy()
    correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
    correct = correct.sort_values('c', ascending=False)
    correct = correct.drop_duplicates(['QuestionId'])
    correct = correct[['QuestionId', 'MC_Answer']]
    correct['is_correct'] = 1  # Mark these as correct answers

    # Merge 'is_correct' flag into the main training DataFrame
    train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
    train.is_correct = train.is_correct.fillna(0)

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

    model = AutoModelForSequenceClassification.from_pretrained(
        Model_Name,
        num_labels=n_classes,
        torch_dtype=torch.bfloat16,
        device_map="balanced",
        cache_dir=TEMP_DIR,
    )

    tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

    def format_input(row):
        x = "Yes"
        if not row['is_correct']:
            x = "No"
        return (
            f"Question: {row['QuestionText']} "
            f"Answer: {row['MC_Answer']} "
            f"Correct? {x} "
            f"Student Explanation: {row['StudentExplanation']}"
        )

    train['text'] = train.apply(format_input, axis=1)
    print("Example prompt for our LLM:")
    print()
    print(train.text.values[0])

    from datasets import Dataset

    # Split data into training and validation sets
    train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

    # Convert to Hugging Face Dataset
    COLS = ['text', 'label']

    # Create clean DataFrame with the full training data
    train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

    # Ensure labels are proper integers
    train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

    # Reset index to ensure clean DataFrame structure
    train_df_clean = train_df_clean.reset_index(drop=True)

    # Create dataset with the full training data
    train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

    def tokenize(batch):
        """Tokenizes a batch of text inputs."""
        return tokenizer(batch["text"], truncation=True, max_length=256)

    # Apply tokenization to the full dataset
    train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

    # Add a new padding token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Resize the model's token embeddings to match the new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Set the pad token id in the model's config
    model.config.pad_token_id = tokenizer.pad_token_id

    # 2. Clear HF cache after loading
    import os
    from huggingface_hub import scan_cache_dir

    # Then clear cache to free ~16GB
    cache_info = scan_cache_dir()
    # delete_revisions expects commit hashes, so flatten each repo's revisions to their hashes
    cache_info.delete_revisions(
        *[rev.commit_hash for repo in cache_info.repos for rev in repo.revisions]
    ).execute()

    # --- Training Arguments ---
    from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
    import tempfile
    import shutil

    # Ensure temp directories exist
    os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
    os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

    # --- Training Arguments (FIXED) ---
    training_args = TrainingArguments(
        output_dir=f"{TEMP_DIR}/training_output/",
        do_train=True,
        do_eval=False,
        save_strategy="no",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        logging_dir=f"{TEMP_DIR}/logs/",
        logging_steps=500,
        bf16=True,
        fp16=False,
        report_to="none",
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
    )

    # --- Custom Metric Computation (MAP@3) ---
    def compute_map3(eval_pred):
        """
        Computes Mean Average Precision at 3 (MAP@3) for evaluation.
        """
        logits, labels = eval_pred
        probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

        # Get top 3 predicted class indi...
    ```
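
    The description is cut off inside compute_map3. For reference only (this is not the author's code), a minimal MAP@3 sketch for single-label data; the function name map_at_3 and its signature are hypothetical:

    ```Python
    import numpy as np

    def map_at_3(probs: np.ndarray, labels: np.ndarray) -> float:
        """Mean Average Precision at 3 when each row has exactly one true label."""
        top3 = np.argsort(-probs, axis=1)[:, :3]   # top-3 class indices, highest probability first
        hits = (top3 == labels[:, None])           # where the true label appears in the top 3
        ranks = hits.argmax(axis=1)                # 0-based rank of the hit (only meaningful if a hit exists)
        ap = np.where(hits.any(axis=1), 1.0 / (ranks + 1), 0.0)
        return float(ap.mean())
    ```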
    
  5. V2 Balloon Detection Dataset

    • kaggle.com
    zip
    Updated Jul 7, 2022
    Cite
    vbookshelf (2022). V2 Balloon Detection Dataset [Dataset]. https://www.kaggle.com/vbookshelf/v2-balloon-detection-dataset
    Available download formats: zip (49788043 bytes)
    Dataset updated
    Jul 7, 2022
    Authors
    vbookshelf
    Description

    Context

    I needed a simple image dataset that I could use when trying different object detection algorithms for the first time. It had to be something that could be quickly understood and easily loaded. I didn't want to spend a lot of time doing EDA or trying to remember how the data is structured. Moreover, I wanted to be able to clearly see when a model's prediction was correct or when it had made a mistake. When working with chest x-ray images, for example, it takes an expert to know if a model's predictions are correct.

    I found the Balloons dataset and simplified it. The original data is split into train and test sets and it has two json files that need to be parsed. In this new version, I copied all images into a single folder and replaced the json files with one csv file that can be easily loaded with Pandas.

    Content

    The dataset consists of 74 jpg images and one csv file. Each image contains one or more balloons.

    The csv file has five columns:

    fname - The image file name.
    height - The image height.
    width - The image width.
    num_balloons - The number of balloons on the image.
    bbox - The coordinates of each bounding box on the image.
    

    The coordinates of each bbox are stored in a dictionary. The format is as follows:

    {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}
    
    Where xmin and ymin are the coordinates of the top left corner, and xmax and ymax are the coordinates of the bottom right corner.
    

    Each entry in the bbox column is a list of dictionaries. For example, if an image has two balloons and hence two bounding boxes, the entry will be as follows:

    [{"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}, {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}]

    When loaded into a Pandas dataframe, all items in the bbox column are of type string. The strings can be converted to Python lists like this:

    import ast
    
    # convert each item in the bbox column from type str to type list
    df['bbox'] = df['bbox'].apply(ast.literal_eval)
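
    A hypothetical follow-on sketch showing how the parsed boxes could be drawn; it assumes the jpg files have been placed in a folder named images/ (adjust the path to wherever the images live):

    ```Python
    # Draw the bounding boxes of the first image with matplotlib.
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    from PIL import Image

    row = df.iloc[0]
    img = Image.open(f"images/{row['fname']}")   # hypothetical folder name

    fig, ax = plt.subplots()
    ax.imshow(img)
    for box in row['bbox']:
        w = box['xmax'] - box['xmin']
        h = box['ymax'] - box['ymin']
        ax.add_patch(patches.Rectangle((box['xmin'], box['ymin']), w, h,
                                       fill=False, edgecolor='red', linewidth=2))
    plt.show()
    ```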
    
    

    Acknowledgements

    Many thanks to Waleed Abdulla who created this dataset.

    The original dataset can be downloaded and unzipped using this code:

    !wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
    !unzip balloon_dataset.zip > /dev/null
    

    Inspiration

    Can you create an app that can look at an image and tell you:
    - how many balloons are on the image, and
    - what are the colours of those balloons?

    This is something that could help blind people. To help you get started, here's an example of a similar project.

    License

    In this blog post the dataset's creator mentions that the images were sourced from Flickr. All images have a "Commercial use & mods allowed" license.



    Header image by andremsantana on Pixabay.

