16 datasets found
  1. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the splits of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset                                        AudioSet
    filename                          train/---2_BBVHAA.mp3
    captions_visual     [a man in a black hat and glasses.]
    captions_auditory       [a man speaks and dishes clank.]
    tags                                           [Speech]

    Description

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, in case of missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  2. Llama 3.1 8B Correct Labels

    • kaggle.com
    zip
    Updated Aug 26, 2025
    Cite
    Jatin Mehra_666 (2025). Llama 3.1 8B Correct Labels [Dataset]. https://www.kaggle.com/datasets/jatinmehra666/llama-3-1-8b-correct-labels
    Available download formats: zip (11853454078 bytes)
    Dataset updated
    Aug 26, 2025
    Authors
    Jatin Mehra_666
    Description

    Training code:

    ```python
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    import os
    import pandas as pd
    import numpy as np

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    TEMP_DIR = "tmp"
    os.makedirs(TEMP_DIR, exist_ok=True)
    train = pd.read_csv('input/map-charting-student-math-misunderstandings/train.csv')

    # Fill missing Misconception values with 'NA'
    train.Misconception = train.Misconception.fillna('NA')

    # Create a combined target label (Category:Misconception)
    train['target'] = train.Category + ":" + train.Misconception

    # Encode target labels to numerical format
    le = LabelEncoder()
    train['label'] = le.fit_transform(train['target'])
    n_classes = len(le.classes_)  # Number of unique target classes
    print(f"Train shape: {train.shape} with {n_classes} target classes")
    print("Train head:")
    train.head()

    # Identify the correct answer per question: rows whose Category starts with
    # 'True', keeping the most frequent MC_Answer per QuestionId
    idx = train.apply(lambda row: row.Category.split('_')[0], axis=1) == 'True'
    correct = train.loc[idx].copy()
    correct['c'] = correct.groupby(['QuestionId', 'MC_Answer']).MC_Answer.transform('count')
    correct = correct.sort_values('c', ascending=False)
    correct = correct.drop_duplicates(['QuestionId'])
    correct = correct[['QuestionId', 'MC_Answer']]
    correct['is_correct'] = 1  # Mark these as correct answers

    # Merge 'is_correct' flag into the main training DataFrame
    train = train.merge(correct, on=['QuestionId', 'MC_Answer'], how='left')
    train.is_correct = train.is_correct.fillna(0)

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    Model_Name = "unsloth/Meta-Llama-3.1-8B-Instruct"

    model = AutoModelForSequenceClassification.from_pretrained(
        Model_Name,
        num_labels=n_classes,
        torch_dtype=torch.bfloat16,
        device_map="balanced",
        cache_dir=TEMP_DIR,
    )

    tokenizer = AutoTokenizer.from_pretrained(Model_Name, cache_dir=TEMP_DIR)

    def format_input(row):
        x = "Yes"
        if not row['is_correct']:
            x = "No"
        return (
            f"Question: {row['QuestionText']} "
            f"Answer: {row['MC_Answer']} "
            f"Correct? {x} "
            f"Student Explanation: {row['StudentExplanation']}"
        )

    train['text'] = train.apply(format_input, axis=1)
    print("Example prompt for our LLM:")
    print()
    print(train.text.values[0])

    from datasets import Dataset

    # Split data into training and validation sets
    train_df, val_df = train_test_split(train, test_size=0.2, random_state=42)

    # Convert to Hugging Face Dataset
    COLS = ['text', 'label']

    # Create clean DataFrame with the full training data
    train_df_clean = train[COLS].copy()  # Use 'train' instead of 'train_df'

    # Ensure labels are proper integers
    train_df_clean['label'] = train_df_clean['label'].astype(np.int64)

    # Reset index to ensure clean DataFrame structure
    train_df_clean = train_df_clean.reset_index(drop=True)

    # Create dataset with the full training data
    train_ds = Dataset.from_pandas(train_df_clean, preserve_index=False)

    def tokenize(batch):
        """Tokenizes a batch of text inputs."""
        return tokenizer(batch["text"], truncation=True, max_length=256)

    # Apply tokenization to the full dataset
    train_ds = train_ds.map(tokenize, batched=True, remove_columns=['text'])

    # Add a new padding token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

    # Resize the model's token embeddings to match the new tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Set the pad token id in the model's config
    model.config.pad_token_id = tokenizer.pad_token_id

    # 2. Clear HF cache after loading
    import os
    from huggingface_hub import scan_cache_dir

    # Then clear cache to free ~16GB (delete_revisions expects commit hashes)
    cache_info = scan_cache_dir()
    revision_hashes = [rev.commit_hash for repo in cache_info.repos for rev in repo.revisions]
    cache_info.delete_revisions(*revision_hashes).execute()

    # --- Training Arguments ---
    from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
    import tempfile
    import shutil

    # Ensure temp directories exist
    os.makedirs(f"{TEMP_DIR}/training_output/", exist_ok=True)
    os.makedirs(f"{TEMP_DIR}/logs/", exist_ok=True)

    training_args = TrainingArguments(
        output_dir=f"{TEMP_DIR}/training_output/",
        do_train=True,
        do_eval=False,
        save_strategy="no",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        logging_dir=f"{TEMP_DIR}/logs/",
        logging_steps=500,
        bf16=True,
        fp16=False,
        report_to="none",
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        dataloader_pin_memory=False,
        gradient_checkpointing=True,
    )

    # --- Custom Metric Computation (MAP@3) ---
    def compute_map3(eval_pred):
        """Computes Mean Average Precision at 3 (MAP@3) for evaluation."""
        logits, labels = eval_pred
        probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

        # Get top 3 predicted class indi...
    ```
    
  3. Modified Swiss Dwellings

    • kaggle.com
    zip
    Updated Nov 7, 2023
    Cite
    Casper van Engelenburg (2023). Modified Swiss Dwellings [Dataset]. https://www.kaggle.com/datasets/caspervanengelenburg/modified-swiss-dwellings
    Available download formats: zip (4996692802 bytes)
    Dataset updated
    Nov 7, 2023
    Authors
    Casper van Engelenburg
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Modified Swiss Dwellings

    The Modified Swiss Dwellings (MSD) dataset is an ML-ready dataset for floor plan generation and analysis at building-level scale. The MSD dataset is completely derived from the Swiss Dwellings database (v3.0.0). The MSD dataset contains 5372 highly detailed floor plans of single- as well as multi-unit building complexes across Switzerland, hence extending the building scale w.r.t. other well-known floor plan datasets like the RPLAN dataset.


    Naming and dataset split

    The naming (IDs in the folders) is based on the original dataset.

    The dataset is split into train and test based on the buildings the floor plans originate from. There is, for obvious reasons, no overlap between building identities in the train and test sets. Hence, all floor plans that originate from the same building will be either all in the train set or all in the test set. We also included a cleaned, filtered, and modified Pandas dataframe with all geometries (such as rooms, walls, etc.) derived from the original dataset. The unique floor plan IDs in the dataframe are the same as the train and test sets combined. We included it to allow users to develop their own algorithms on top of it, such as image, structure, and graph extraction.

    Example use-case: Floor plan auto-completion

    The MSD dataset is developed with the goal for the computer science community to develop (deep learning) models for tasks such as floor plan auto-completion. The floor plan auto-completion task takes as input the boundary of a building, the structural elements necessary for the building's structural integrity, and a set of user constraints formalized in a graph structure, with the goal of automatically generating the full floor plan. Specifically, the goal is to learn the correlation between the joint distribution of graph_in and struct_in with that of full_out. graph_out is provided for researchers who want to use / develop methods from graph signal processing, or graph machine learning specifically. This task was part of a challenge at the 1st Workshop on Computer-Aided Architectural Design (CVAAD), an official half-day workshop at ICCV 2023.


    Important links

  4. Flow map data of the singel pendulum, double pendulum and 3-body problem

    • data.niaid.nih.gov
    Updated Apr 23, 2024
    Cite
    Horn, Philipp; Veronica, Saz Ulibarrena; Koren, Barry; Simon, Portegies Zwart (2024). Flow map data of the singel pendulum, double pendulum and 3-body problem [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11032351
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Leiden Observatory
    Eindhoven University of Technology
    Authors
    Horn, Philipp; Veronica, Saz Ulibarrena; Koren, Barry; Simon, Portegies Zwart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.

    The dataset consists of trajectory data from three different Hamiltonian systems: the single pendulum, the double pendulum and the 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The data of the double pendulum was also computed by the symplectic Euler method, however with an adaptive step size. The trajectories of the 3-body problem were calculated by the arbitrarily high-precision code Brutus.

    For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. These files are composed of several Pandas DataFrames. One DataFrame per trajectory, called "run0", "run1", ... and finally one large DataFrame in which all the trajectories are combined, called "all_runs". Additionally, one Pandas Series called "constants" is contained in these files, in which several parameters of the data are listed.

    Also, there is a second file per Hamiltonian system in which the data is prepared as features and labels ready for neural networks to be trained (*_training.h5.1). Similar to the first type of files, they contain a Series called "constants". The features and labels are then separated into 6 DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
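    The HDF5 files can be opened directly with pandas. A minimal loading sketch, assuming a file named single_pendulum_training.h5.1 (the actual names follow the *_training.h5.1 pattern described above):

    ```python
    import pandas as pd

    path = "single_pendulum_training.h5.1"  # assumed file name

    constants = pd.read_hdf(path, key="constants")  # Series with data parameters
    features = pd.read_hdf(path, key="features")    # training features
    labels = pd.read_hdf(path, key="labels")        # training labels

    print(constants)
    print(features.shape, labels.shape)
    ```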

    The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.

    Already trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.

    |                             | Single pendulum                | Double pendulum | 3-body problem |
    |-----------------------------|--------------------------------|-----------------|----------------|
    | Number of trajectories      | 500                            | 2000            | 5000           |
    | Final time in all_runs      | T (one period of the pendulum) | 10              | 10             |
    | Final time in training data | 0.25*T                         | 5               | 5              |
    | Step size in training data  | 0.1                            | 0.1             | 0.5            |

  5. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Available download formats: zip (4333134 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets on Hugging Face about converting text with context to pandas code, but the challenge is in the context. The context in the two datasets is structured differently, which reduces the results of the model. First let's mention the data I found, and then show examples, the solution and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code txt Input | Pandas Query -----------------------------------------------------------|------------------------------------------- Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() Table Name: management (head_id (object), | temporary_acting (object)) | What are the distinct ages of the heads who are acting? |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question: the question text.
        • context: code to create a data frame with column names, unlike the first dataset, whose context gives the data frame name, column names and data types.
        • answer: the Pandas code.
    ```txt
    question                               | context                                                | answer
    ---------------------------------------|--------------------------------------------------------|------------------------------------
    What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()
    ```

    As you can see, the problem with these datasets is that they are not similar as inputs, and the structure of the context is different. My solution to this problem was:

    - Convert the first dataset to become like the second in the context. I chose this direction because it is difficult to get the data types for the columns in the second dataset. It was easy to convert the structure of the context from this shape Table Name: head (age (object), head_id (object)) to this head = pd.DataFrame(columns=['age','head_id']) through the code that I wrote below.
    - Then separate the question from the context. This was easy because, if you look at the data, you will find that the context always ends with ")" followed by a blank and then the question. You will find all of this in the code.
    - You will also notice that more than one creation statement can be returned for the context, and this has been engineered into the code.

    ```py
    import re

    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and questions from the given text.

        Args:
          text (str): The input text containing table definitions and questions.

        Returns:
          tuple: A tuple containing a concatenated DataFrame creation string and a question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Initialize a list to hold DataFrame creation statements
        df_creations = []

        for table_name, columns_str in matches:
            # Extract column names
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements, one per line
        df_creation_concat = '\n'.join(df_creations)

        # Extract and clean the question
        question = text[text.rindex(')') + 1:].strip()

        return df_creation_concat, question
    ```
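    A quick usage sketch on a single-table input in the first dataset's format (expected output shown in comments):

    ```py
    text = ("Table Name: head (age (object), head_id (object)) "
            "What are the distinct ages of the heads who are acting?")
    context, question = extract_table_creation(text)
    print(context)   # head = pd.DataFrame(columns=['age', 'head_id'])
    print(question)  # What are the distinct ages of the heads who are acting?
    ```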
    
    After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as:
    > - `Answer`: `df['Id'].count()` has been repeated, but this is plausible, so we do not need to drop these rows.
    > - `Context`: It contains `147` rows that do not contain any text. We will see through the experiment whether this affects the results negatively or positively.
    > - `Question`: It is ...
    
  6. Data from: JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability

    • gimi9.com
    • researchdata.se
    Cite
    JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-1tv7-9x76/
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or on methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system and the messages sent within control systems-of-systems. For more information see the attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random: the training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; validation data should instead be randomly selected from the training data.

    The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make the data easier to use.

    The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
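    A minimal loading sketch with pandas, using the file names and separator from the description:

    ```python
    import pandas as pd

    # The two files are semicolon-separated
    train = pd.read_csv("training.csv", sep=";")
    test = pd.read_csv("test.csv", sep=";")

    # No dedicated validation set is provided: sample one from the training data
    val = train.sample(frac=0.1, random_state=42)
    train = train.drop(val.index)
    ```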

  7. Raw data from datasets used in SIMON analysis

    • data.europa.eu
    unknown
    Updated Jan 27, 2022
    Cite
    Zenodo (2022). Raw data from datasets used in SIMON analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2580414?locale=hr
    Available download formats: unknown (312591 bytes)
    Dataset updated
    Jan 27, 2022
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON. Each dataset is stored in a separate folder which contains 4 files:

    • json_info: the number of features with their names, and the number of subjects available for the dataset
    • data_testing: data frame with the data used to test the trained model
    • data_training: data frame with the data used to train models
    • results: direct, unfiltered data from the database

    Files are written in feather format. An example of the data structure for each file in the repository is given. Files were compressed using 7-Zip, available at https://www.7-zip.org/.
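    Feather files load directly into pandas. A minimal sketch, assuming a hypothetical folder name for one of the 34 datasets:

    ```python
    import pandas as pd

    folder = "dataset_01"  # hypothetical folder name

    train = pd.read_feather(f"{folder}/data_training")
    test = pd.read_feather(f"{folder}/data_testing")
    print(train.shape, test.shape)
    ```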

  8. Train set metadata for DFDC

    • kaggle.com
    zip
    Updated Jan 4, 2020
    Cite
    nosound (2020). Train set metadata for DFDC [Dataset]. https://www.kaggle.com/zaharch/train-set-metadata-for-dfdc
    Available download formats: zip (11544720 bytes)
    Dataset updated
    Jan 4, 2020
    Authors
    nosound
    Description

    The train data for the DFDC competition is big, almost 500 GB, so I hope it can be useful to have all the json files and the metadata in one dataframe.

    The dataset includes, for each video file

    1. Info from the json files: filename, folder, label, original
    2. split: train (118346 videos), public validation test (400 videos) or train sample (400 videos). 119146 videos in total. Note that the public validation and the train sample are subsets of the full train, so it is enough to mark them in this dataframe.
    3. A full-file md5 column
    4. Hashes of the audio file sequence (wav.hash) and of a subset of pixels (pxl.hash)
    5. The rest are metadata fields from the files, obtained with ffprobe. Note that I removed many columns, which didn't give new information.

    Simple analysis of the dataset can be found at: https://www.kaggle.com/zaharch/looking-at-the-full-train-set-metadata

  9. MementoML

    • kaggle.com
    zip
    Updated Oct 1, 2020
    Cite
    MI2 DataLab (2020). MementoML [Dataset]. https://www.kaggle.com/mi2datalab/mementoml
    Available download formats: zip (68958934 bytes)
    Dataset updated
    Oct 1, 2020
    Authors
    MI2 DataLab
    Description

    Details can be found at: https://arxiv.org/abs/2008.13162

    Dataset

    This dataset contains ACC and AUC scores of 7 popular machine learning algorithms. Each algorithm was run 20 times on each dataset, once per train/test split. 39 datasets were used. Each score is reproducible.

    Train/test splits were drawn from bootstrap sampling.

    Benchmarks

    The resulting dataset is a dataframe with 7 columns. The first column, "dataset", denotes the OpenML id of the dataset, e.g. 1486; the next, "row_index", is the train/test split identifier, e.g. 12, from the splits file; the third, "model", is a model name, e.g. gbm or kknn. Fourth is "param_index", denoting the hyperparameter-set identifier from the parameters file; these identifiers start from 1001 (1001 denotes 1). Fifth is "time", the learning time measured in ms. The last two columns are the acc and auc measures.
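    A minimal sketch of working with the benchmark dataframe in pandas, assuming it has been loaded from a hypothetical file with the columns described above:

    ```python
    import pandas as pd

    df = pd.read_csv("benchmark.csv")  # hypothetical file name

    # Mean accuracy and AUC per model across all datasets and splits
    print(df.groupby("model")[["acc", "auc"]].mean())
    ```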

    Hyperparameters

    There is also a dataframe for each model with its hyperparameters. The first column is "param_index"; the rest of the columns correspond to the hyperparameters related to that model and used in the calculations.

    Train/test splits

    For each dataset used there is a separate file with the train/test splits. Each of its rows indicates the row indices of a single test subset of the mentioned dataset.

    Algorithms

    Here you can find the scores of 7 algorithms:
    * catboost
    * gbm
    * glmnet
    * kknn
    * randomforest
    * ranger
    * xgboost

    Hyperparameters

    For each machine learning model, a fixed number of the most commonly used hyperparameter settings was used.

    Datasets

    We used some of the OpenML100 datasets.

  10. V2 Balloon Detection Dataset

    • kaggle.com
    zip
    Updated Jul 7, 2022
    Cite
    vbookshelf (2022). V2 Balloon Detection Dataset [Dataset]. https://www.kaggle.com/vbookshelf/v2-balloon-detection-dataset
    Available download formats: zip (49788043 bytes)
    Dataset updated
    Jul 7, 2022
    Authors
    vbookshelf
    Description

    Context

    I needed a simple image dataset that I could use when trying different object detection algorithms for the first time. It had to be something that could be quickly understood and easily loaded. I didn't want to spend a lot of time doing EDA or trying to remember how the data is structured. Moreover, I wanted to be able to clearly see when a model's prediction was correct or when it had made a mistake. When working with chest x-ray images, for example, it takes an expert to know whether a model's predictions are correct.

    I found the Balloons dataset and simplified it. The original data is split into train and test sets and it has two json files that need to be parsed. In this new version, I copied all images into a single folder and replaced the json files with one csv file that can be easily loaded with Pandas.

    Content

    The dataset consists of 74 jpg images and one csv file. Each image contains one or more balloons.

    The csv file has five columns:

    fname - The image file name.
    height - The image height.
    width - The image width.
    num_balloons - The number of balloons on the image.
    bbox - The coordinates of each bounding box on the image.
    

    The coordinates of each bbox are stored in a dictionary. The format is as follows:

    {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}
    
    Where xmin and ymin are the coordinates of the top left corner, and xmax and ymax are the coordinates of the bottom right corner.
    

    Each entry in the bbox column is a list of dictionaries. For example, if an image has two balloons and hence two bounding boxes, the entry will be as follows:

    [{"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}, {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}]

    When loaded into a Pandas dataframe, all items in the bbox column are of type string. The strings can be converted to Python lists like this:

    import ast
    
    # convert each item in the bbox column from type str to type list
    df['bbox'] = df['bbox'].apply(ast.literal_eval)
    
    

    Acknowledgements

    Many thanks to Waleed Abdulla who created this dataset.

    The original dataset can be downloaded and unzipped using this code:

    !wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
    !unzip balloon_dataset.zip > /dev/null
    

    Inspiration

    Can you create an app that can look at an image and tell you:
    • how many balloons are on the image, and
    • what the colours of those balloons are?

    This is something that could help blind people. To help you get started, here's an example of a similar project.

    License

    In this blog post the dataset's creator mentions that the images were sourced from Flickr. All images have a "Commercial use & mods allowed" license.



    Header image by andremsantana on Pixabay.

  11. AIMO External Dataset

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    moth (2024). AIMO External Dataset [Dataset]. https://www.kaggle.com/datasets/alejopaullier/aimo-external-dataset/discussion
    Available download formats: zip (4481662 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    moth
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Description

    This dataset is a compiled version of two benchmark math dataframes for solving math problems using LLMs, namely: - MATH: "MATH is a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations." - GSM8K: "a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning."

    The dataset consists of 21k math problems with their corresponding solutions.

    Columns

    • problem: text with the mathematical problem statement.
    • level: level of difficulty (GSM8K does not provide this column).
    • type: math field (GSM8K does not provide this column).
    • solution: text with the mathematical problem solution.
    • stage: either "train" or "test". This corresponds to the original dataframe split.
    • source: either "MATH" or "GSM8K". Source of the problem.
  12. Arabic(Indian) digits MADBase

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    HOSSAM_AHMED_SALAH (2023). Arabic(Indian) digits MADBase [Dataset]. https://www.kaggle.com/datasets/hossamahmedsalah/arabicindian-digits-madbase/code
    Available download formats: zip (15373598 bytes)
    Dataset updated
    Jul 26, 2023
    Authors
    HOSSAM_AHMED_SALAH
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    India
    Description

    This dataset consists of flattened images, where each image is represented as a row.

    - Objective: Establish benchmark results for Arabic digit recognition using different classification techniques.
    - Objective: Compare performances of different classification techniques on Arabic and Latin digit recognition problems.
    - A valid comparison requires the Arabic and Latin digit databases to be in the same format.
    - A modified version of the ADBase (MADBase), with the same size and format as MNIST, was created.
    - MADBase is derived from ADBase by size-normalizing each digit to a 20x20 box while preserving the aspect ratio.
    - The size-normalization procedure results in gray levels due to the anti-aliasing filter.
    - MADBase and MNIST have the same size and format.
    - MNIST is a modified version of the NIST digits database.
    - MNIST is available for download.

    I used the following code to turn the 70k Arabic digit images into tabular data, for ease of use and to waste less time on preprocessing:

    ```python
    import os

    import numpy as np
    import pandas as pd
    from PIL import Image

    # Define the root directory of the dataset
    root_dir = "MAHD"

    # Define the names of the folders containing the images
    folder_names = ['Part{:02d}'.format(i) for i in range(1, 13)]
    # Equivalent alternative:
    # folder_names = ['Part{}'.format(i) if i > 9 else 'Part0{}'.format(i) for i in range(1, 13)]

    # Define the names of the subfolders containing the training and testing images
    train_test_folders = ['MAHDBase_TrainingSet', 'test']

    # Initialize empty lists to store the image data and labels
    data = []
    labels = []

    # Loop over the training and testing subfolders in each Part folder
    for tt in train_test_folders:
        for folder_name in folder_names:
            if tt == train_test_folders[1] and folder_name == 'Part03':
                break
            subfolder_path = os.path.join(root_dir, tt, folder_name)
            print(subfolder_path)
            print(os.listdir(subfolder_path))
            for filename in os.listdir(subfolder_path):
                # Check the file format: skip anything that is not a .bmp image
                if os.path.splitext(filename)[1].lower() != '.bmp':
                    continue

                # Load the image
                img_path = os.path.join(subfolder_path, filename)
                img = Image.open(img_path)

                # Convert the image to grayscale and flatten it into a 1D array
                img_grey = img.convert('L')
                img_data = np.array(img_grey).flatten()

                # Extract the label from the filename and convert it to an integer
                label = int(filename.split('_')[2].replace('digit', '').split('.')[0])

                # Add the image data and label to the lists
                data.append(img_data)
                labels.append(label)

    # Convert the image data and labels to a pandas dataframe
    df = pd.DataFrame(data)
    df['label'] = labels
    ```

    This dataset was made by https://datacenter.aucegypt.edu/shazeem, which provides 2 databases:
    - ADBase
    - MADBase (✅ the one this dataset is derived from; similar in form to MNIST)

  13. TPS-October-2022-data-feather

    • kaggle.com
    zip
    Updated Oct 6, 2022
    Cite
    Tim Rörup (2022). TPS-October-2022-data-feather [Dataset]. https://www.kaggle.com/datasets/timrrup/tpsoctober2022datafeather
    Available download formats: zip (4409338221 bytes)
    Dataset updated
    Oct 6, 2022
    Authors
    Tim Rörup
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the identical dataset of the Tabular Playground Series - October 2022 competition, but compressed to feather files instead of csv to enable faster processing.

    Can be parsed by pandas.read_feather(path)

    Description

    The following description is copied from the competition:

    The dataset consists of sequences of snapshots of the state of a Rocket League match, including position and velocity of all players and the ball, as well as extra information. The goal of the competition is to predict -- from a given snapshot in the game -- for each team, the probability that they will score within the next 10 seconds of game time.

    The data was taken from professional Rocket League matches. Each event consists of a chronological series of frames recorded at 10 frames per second. All events begin with a kickoff, and most end in one team scoring a goal, but some are truncated and end with no goal scored due to circumstances which can cause gameplay strategies to shift, for example 1) nearing end of regulation (where the game continues until the ball touches the ground) or 2) becoming non-competitive, eg one team winning by 3+ goals with little time remaining.

    Files:

    • train_[0-9].csv: Train set split into 10 files. Rows are sorted by game_num, event_id, and event_time, and each event is entirely contained in one file.
    • test.csv: Test set. Unlike the train set, the rows are scrambled.
    • [train|test]_dtypes.csv: pandas dtypes for the columns in the train / test set, which can be pulled and passed to pd.read_csv() on the full set to read it with correct types since by default, pd.read_csv() will use 64-bit types which will waste memory. See below for example code.
    • sample_submission.csv: A sample submission in the correct format.
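    A minimal sketch of the dtype trick mentioned above; the two-column layout of train_dtypes.csv (column name, dtype) is assumed:

    ```python
    import pandas as pd

    # Assumed layout: one row per column, mapping column name -> dtype string
    dtypes_df = pd.read_csv("train_dtypes.csv")
    dtypes = dict(zip(dtypes_df.iloc[:, 0], dtypes_df.iloc[:, 1]))

    # Read the 10 train files with compact types instead of the 64-bit defaults
    train = pd.concat(
        (pd.read_csv(f"train_{i}.csv", dtype=dtypes) for i in range(10)),
        ignore_index=True,
    )
    ```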

    Columns:

    • game_num (train only): Unique identifier for the game from which the event was taken.
    • event_id (train only): Unique identifier for the sequence of consecutive frames.
    • event_time (train only): Time in seconds before the event ended, either by a goal being scored or simply when we decided to truncate the timeseries if a goal was not scored.
    • ball_pos_[xyz]: Ball's position as a 3d vector.
    • ball_vel_[xyz]: Ball's velocity as a 3d vector.
    • For i in [0, 6):
      - p{i}_pos_[xyz]: Player i's position as a 3d vector.
      - p{i}_vel_[xyz]: Player i's velocity as a 3d vector.
      - p{i}_boost: Player i's boost remaining, in [0, 100]. A player can consume boost to substantially increase their speed, and it is required to fly up into the z dimension (besides driving up a wall, or the small air gained by a jump).
      - All p{i} columns will be NaN if and only if the player is demolished (destroyed by an enemy player; will respawn within a few seconds).
      - Players 0, 1, and 2 make up team A and players 3, 4, and 5 make up team B.
      - The orientation vector of the player's car (which way the car is facing) does not necessarily match the player's velocity vector, and this dataset does not capture orientation data.
    • For i in [0, 6):
      - boost{i}_timer: Time in seconds until big boost orb i respawns, or 0 if it's available. Big boost orbs grant a full 100 boost to a player driving over them. The orb (x, y) locations are roughly [(-61.4, -81.9), (61.4, -81.9), (-71.7, 0), (71.7, 0), (-61.4, 81.9), (61.4, 81.9)] with z = 0. (Players can also gain boost from small boost pads across the map, but we do not capture those pads in this dataset.)

    • player_scoring_next (train only): Which player scores at the end of the current event, in [0, 6), or -1 if the event does not end in a goal.

    • team_scoring_next (train only): Which team scores at the end of the current event (A or B), or NaN if the event does not end in a goal.

    • team_[A|B]_scoring_within_10sec (train only): [Target columns] Value of 1 if team_scoring_next == [A|B] and time_before_event is in [-10, 0], otherwise 0.

    • id (test and submission only): Unique identifier for each test row. Your submission should be a pair of team_A_scoring_within_10sec and team_B_scoring_within_10sec probability predictions for each id, where your predictions can range the real numbers from [0,

  14. Antibody Developability Benchmark

    • kaggle.com
    zip
    Updated Oct 1, 2025
    Cite
    Taylor (2025). Antibody Developability Benchmark [Dataset]. https://www.kaggle.com/datasets/tywangty/antibody-developability-benchmark
    Available download formats: zip (38868 bytes)
    Dataset updated
    Oct 1, 2025
    Authors
    Taylor
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    https://huggingface.co/spaces/ginkgo-datapoints/abdev-leaderboard

    Antibodies have to be manufacturable, stable in high concentrations, and have low off-target effects. Properties such as these can often hinder the progression of an antibody to the clinic, and are collectively referred to as 'developability'. Here we invite the community to submit and develop better predictors, which will be tested out on a heldout private set to assess model generalization.

    🧬 Developability properties in this competition:
    • 💧 Hydrophobicity
    • 🎯 Polyreactivity
    • 🧲 Self-association
    • 🌡️ Thermostability
    • 🧪 Titer

    Cross-validation

    For the GDPa1 cross-validation predictions (GDPa1_v1.2_sequences.csv):

    1. Split the dataset using the "hierarchical_cluster_IgG_isotype_stratified_fold" column.
    2. Train on 4 folds and predict on the held-out fold.
    3. Collect the held-out predictions for all 5 folds into one dataframe.
    4. Write this dataframe to a .csv file and submit it as your GDPa1 cross-validation predictions.

    The leaderboard will show the average Spearman rank correlation across the 5 folds. For a code example, check out our tutorial on training an antibody developability prediction model with cross-validation here; a minimal sketch of the protocol is also shown below.
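    In this sketch, train_model and predict_model are hypothetical stand-ins for your own training and inference routines; only the file and fold-column names come from the description:

    ```python
    import pandas as pd

    df = pd.read_csv("GDPa1_v1.2_sequences.csv")
    fold_col = "hierarchical_cluster_IgG_isotype_stratified_fold"

    heldout_preds = []
    for fold in sorted(df[fold_col].unique()):
        train_part = df[df[fold_col] != fold]
        heldout = df[df[fold_col] == fold].copy()
        model = train_model(train_part)                        # hypothetical
        heldout["prediction"] = predict_model(model, heldout)  # hypothetical
        heldout_preds.append(heldout)

    # One dataframe of held-out predictions across all 5 folds
    pd.concat(heldout_preds).to_csv("gdpa1_cv_predictions.csv", index=False)
    ```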

    Test set

    heldout-set-sequences.csv

  15. Edinburgh Airbnb Data

    • kaggle.com
    zip
    Updated Jun 16, 2022
    Cite
    CandiceZhao28 (2022). Edinburgh Airbnb Data [Dataset]. https://www.kaggle.com/datasets/candicezhao28/edinburgh-airbnb-data/discussion
    Available download formats: zip (28794738 bytes)
    Dataset updated
    Jun 16, 2022
    Authors
    CandiceZhao28
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Edinburgh
    Description

    This dataset provides data of Airbnb listings in the capital of Scotland, Edinburgh, for a period of one year, from 25 June 2019 to 24 June 2020.

    The dataset contains 12 files, 2 of which are original and the remaining 10 are preprocessed. The original data are uncleaned web-scraped data, which can be used for data cleaning, data engineering and exploratory data analysis (EDA), followed by any algorithms a user finds suitable. The preprocessed data, on the other hand, are provided for users who want to quickly run some regression algorithms without spending time on other aspects of a project.

    Code

    • The code for obtaining the preprocessed data is provided as notebook Price Prediction-Part 1 Feature Engineering & EDA.
    • The code using these preprocessed data to train regression models is provided as Price Prediction-Part2 Neural Network & XGBoost.

    Original Data

    Select your features, clean your data, then do EDA or apply any algorithms you find suitable.

    • original_data_listings.csv (13,245 rows, 106 columns): Contains data about 13,245 properties listed on Airbnb during the period of data collection. 106 fields about the listings are provided, such as the number of bedrooms, neighbourhood, cancellation policy, cleaning fee (averaged over the period of data collection, as hosts can change how much they charge for cleaning), etc.

    • original_data_calendar.csv (4,834,568 rows, 7 columns) Contains the status data of each property on each day over the period of data collection, such as, on a given date, whether the property was occupied and the price per night.

    Preprocessed data

    If you simply would like to run some regression models (predicting a numerical variable), use the preprocessed data. Train and test data are directly available. They were preprocessed separately to prevent data leakage. The target in the preprocessed data is the price per night averaged over the period of data collection.

    It is straightforward to tell what each preprocessed data file is for. For example, targets_train.csv contains the targets for training, and inputs_numerical_test.csv contains the numerical predictor features for testing.

    Note that the numerical and categorical features are provided in separate files, and users need to combine them before model training. The DataFrame indices of the numerical and categorical features are identical, so one can simply use a merge or join on id (see the sketch below). The reason the numerical and categorical features are stored in separate files is that one of the categorical features, neighbourhood (cardinality = 111), was handled in 3 different ways. Users can choose which version of categorical data to use based on the encoding of this feature:
    - version 1: OneHot Encoding
    - version 2: Target / Mean Encoding (with additive smoothing)
    - version 3: Replacing with a new feature, avg_price_per_bedroom_by_neighbourhood: the price per bedroom averaged over the neighbourhood.
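    A minimal sketch of combining the feature files; only targets_train.csv and inputs_numerical_test.csv are named in the description, so the train-side file names and the id index column are assumptions:

    ```python
    import pandas as pd

    # Assumed file and index-column names; adjust to the actual files
    num = pd.read_csv("inputs_numerical_train.csv", index_col="id")
    cat = pd.read_csv("inputs_categorical_train.csv", index_col="id")

    # The indices are identical, so a join on id lines the features up
    X_train = num.join(cat)
    y_train = pd.read_csv("targets_train.csv", index_col="id")
    ```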

  16. VinBigData 1024 JPG Dataset

    • kaggle.com
    zip
    Updated Feb 15, 2021
    Cite
    Sunghyun Jun (2021). VinBigData 1024 JPG Dataset [Dataset]. https://www.kaggle.com/sunghyunjun/vinbigdata-1024-jpg-dataset
    Available download formats: zip (3838894272 bytes)
    Dataset updated
    Feb 15, 2021
    Authors
    Sunghyun Jun
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A 1024px JPG X-ray image dataset converted from the original dataset of the VinBigData Chest X-ray Abnormalities Detection competition.

    The code for reading X-ray images is from: https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way https://www.kaggle.com/raddar/vinbigdata-competition-jpg-data-3x-downsampled

    The following code was used:

    from argparse import ArgumentParser
    import os
    import warnings
    
    import cv2
    import numpy as np
    import pandas as pd
    
    import pydicom
    from pydicom.pixel_data_handlers.util import apply_voi_lut
    from tqdm import tqdm
    
    warnings.filterwarnings(action="ignore", category=UserWarning)
    
    
    def read_xray(path, voi_lut=True, fix_monochrome=True, downscale_factor=1):
      # Read dicom image.
      # Original from:
      # https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
      # https://www.kaggle.com/raddar/vinbigdata-competition-jpg-data-3x-downsampled
      dicom = pydicom.read_file(path)
    
      # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
      if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
      else:
        data = dicom.pixel_array
    
      # depending on this value, X-ray may look inverted - fix that:
      if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    
      data = data - np.min(data)
      data = data / np.max(data)
      data = (data * 255).astype(np.uint8)
    
      if downscale_factor != 1:
        new_shape = tuple([int(x / downscale_factor) for x in data.shape])
        data = cv2.resize(data, (new_shape[1], new_shape[0]))
    
      return data
    
    
    def main():
      parser = ArgumentParser()
      parser.add_argument("--dataset_dir", type=str, default="dataset")
      parser.add_argument("--debug", action="store_true")
      args = parser.parse_args()
    
      raw_data_dir = os.path.join(args.dataset_dir)
      jpg_data_dir = os.path.join("dataset-jpg")
    
      os.makedirs(jpg_data_dir, exist_ok=True)
      os.makedirs(os.path.join(jpg_data_dir, "train"), exist_ok=True)
      os.makedirs(os.path.join(jpg_data_dir, "test"), exist_ok=True)
    
      train_images = os.listdir(os.path.join(raw_data_dir, "train"))
      test_images = os.listdir(os.path.join(raw_data_dir, "test"))
    
      df = pd.read_csv(os.path.join(raw_data_dir, "train.csv"))
    
      IMAGE_SIZE = 1024
    
      print(f"Making train images - {IMAGE_SIZE} px jpg")
      if args.debug:
        pbar = tqdm(train_images[:10])
      else:
        pbar = tqdm(train_images)
    
      new_df = pd.DataFrame(
        columns=[
          "image_id",
          "class_name",
          "class_id",
          "rad_id",
          "x_min",
          "y_min",
          "x_max",
          "y_max",
        ],
      )
    
      for raw_image in pbar:
        img = read_xray(
          os.path.join(raw_data_dir, "train", raw_image), downscale_factor=1
        )
    
        scale_x = IMAGE_SIZE / img.shape[1]
        scale_y = IMAGE_SIZE / img.shape[0]
    
        image_id = raw_image.split(".")[0]
    
        temp_df = df[df.image_id == image_id].copy()
    
        temp_df["raw_x_min"] = temp_df["x_min"]
        temp_df["raw_x_max"] = temp_df["x_max"]
        temp_df["raw_y_min"] = temp_df["y_min"]
        temp_df["raw_y_max"] = temp_df["y_max"]
    
        temp_df["raw_width"] = img.shape[1]
        temp_df["raw_height"] = img.shape[0]
    
        temp_df["scale_x"] = scale_x
        temp_df["scale_y"] = scale_y
    
        temp_df[["x_min", "x_max"]] = temp_df[["x_min", "x_max"]].mul(scale_x).round(0)
        temp_df[["y_min", "y_max"]] = temp_df[["y_min", "y_max"]].mul(scale_y).round(0)
    
        new_df = new_df.append(temp_df, ignore_index=True)
    
        img = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_AREA)
    
        cv2.imwrite(
          os.path.join(jpg_data_dir, "train", raw_image.replace(".dicom", ".jpg")),
          img,
        )
    
      new_df.to_csv(os.path.join(jpg_data_dir, "train.csv"))
    
      print(f"Making test images - {IMAGE_SIZE} px jpg")
      if args.debug:
        pbar = tqdm(test_images[:10])
      else:
        pbar = tqdm(test_images)
    
      for raw_image in pbar:
        img = read_xray(
          os.path.join(raw_data_dir, "test", raw_image), downscale_factor=1
        )
    
        img = cv2.resize(img, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_AREA)
    
        cv2.imwrite(
          os.path.join(jpg_data_dir, "test", raw_image.replace(".dicom", ".jpg")), img
        )
    
    
    if __name__ == "__main__":
      main()
    