4 datasets found
  1. Z

    Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report. Annotation The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library. The split into train, validation and test set follows the split of the original datasets. Installation

    pip install pandas pyarrow Example

    import pandas as pddf = pd.read_parquet('annotation_train.parquet', engine='pyarrow')print(df.iloc[0])

    dataset AudioSet filename train/---2_BBVHAA.mp3 captions_visual [a man in a black hat and glasses.] captions_auditory [a man speaks and dishes clank.] tags [Speech] Description The annotation file consists of the following fields:filename: Name of the corresponding file (video or audio file)dataset: Source dataset associated with the data pointcaptions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual contentcaptions_auditory: A list of captions related to the auditory content of the videotags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided Data files The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  2. g

    Data from: JSON Dataset of Simulated Building Heat Control for System of...

    • gimi9.com
    • researchdata.se
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-1tv7-9x76/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation. The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data. The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.

  3. h

    Animals_dataset

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Melisa Atis, Animals_dataset [Dataset]. https://huggingface.co/datasets/Melisa13/Animals_dataset
    Explore at:
    Authors
    Melisa Atis
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Animals Dataset

      Dataset Description
    

    This dataset contains images of three animal categories: cats, dogs, and pandas.

      Dataset Structure
    

    The dataset is organized into training and testing splits: Animals_dataset/ ├── train/ │ ├── cats/ │ ├── dogs/ │ └── panda/ └── test/ ├── cats/ ├── dogs/ └── panda/

      Dataset Statistics
    

    Total Images: 600 Training Images: 480 (80.0%) Testing Images: 120 (20.0%)

      Class Distribution
    

    Training… See the full description on the dataset page: https://huggingface.co/datasets/Melisa13/Animals_dataset.

  4. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    zip(4333134 bytes)Available download formats
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context. The context in both datasets is different which reduces the results of the model. First let's mention the data I found and then show examples, solution and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code txt Input | Pandas Query -----------------------------------------------------------|------------------------------------------- Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() Table Name: management (head_id (object), | temporary_acting (object)) | What are the distinct ages of the heads who are acting? |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question : text .
        • context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.
        • answer : Pandas code.
          question           |            context             |       answer 
    ----------------------------------------|--------------------------------------------------------|---------------------------------------
    What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()   
    

    As you can see, the problem with this data is that they are not similar as inputs and the structure of the context is different . My solution to this problem was: - Convert the first data set to become like the second in the context. I chose this because it is difficult to get the data type for the columns in the second data set. It was easy to convert the structure of the context from this shape Table Name: head (age (object), head_id (object)) to this head = pd.DataFrame(columns=['age','head_id']) through this code that I wrote. - Then separate the question from the context. This was easy because if you look at the data, you will find that the context always ends with "(" and then a blank and then the question. You will find all of this in this code. - You will also notice that more than one code or line can be returned to the context, and this has been engineered into the code. ```py def extract_table_creation(text:str)->(str,str): """ Extracts DataFrame creation statements and questions from the given text.

    Args:
      text (str): The input text containing table definitions and questions.
    
    Returns:
      tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'
    
    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)
    
    # Initialize a list to hold DataFrame creation statements
    df_creations = []
    
    for table_name, columns_str in matches:
      # Extract column names
      columns = re.findall(column_pattern, columns_str)
      column_names = [col[0] for col in columns]
    
      # Format DataFrame creation statement
      df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
      df_creations.append(df_creation)
    
    # Concatenate all DataFrame creation statements
    df_creation_concat = '
    

    '.join(df_creations)

    # Extract and clean the question
    question = text[text.rindex(')')+1:].strip()
    
    return df_creation_concat, question
    
    After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as
    > - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows.
    > - `Context` : We see that it contains `147` rows that do not contain any text. We will see Through the experiment if this will affect the results negatively or positively.
    > - `Question` : It is ...
    
  5. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784

Multimodal Vision-Audio-Language Dataset

Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Goethe University Frankfurt
Authors
Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report. Annotation The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library. The split into train, validation and test set follows the split of the original datasets. Installation

pip install pandas pyarrow Example

import pandas as pddf = pd.read_parquet('annotation_train.parquet', engine='pyarrow')print(df.iloc[0])

dataset AudioSet filename train/---2_BBVHAA.mp3 captions_visual [a man in a black hat and glasses.] captions_auditory [a man speaks and dishes clank.] tags [Speech] Description The annotation file consists of the following fields:filename: Name of the corresponding file (video or audio file)dataset: Source dataset associated with the data pointcaptions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual contentcaptions_auditory: A list of captions related to the auditory content of the videotags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided Data files The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

Search
Clear search
Close search
Google apps
Main menu