11 datasets found
  1. MuST-C-de

    • huggingface.co
    Updated Apr 27, 2022
    Cite
    Enim AI (2022). MuST-C-de [Dataset]. https://huggingface.co/datasets/enimai/MuST-C-de
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 27, 2022
    Dataset authored and provided by
    Enim AI
    License

    https://choosealicense.com/licenses/afl-3.0/

    Description

    enimai/MuST-C-de dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. MuST-Cinema

    • opendatalab.com
    zip
    Updated Sep 21, 2022
    Cite
    University of Trento (2022). MuST-Cinema [Dataset]. https://opendatalab.com/OpenDataLab/MuST-Cinema
    Available download formats: zip (589241693 bytes)
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Fondazione Bruno Kessler
    University of Trento
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    MuST-Cinema is a Multilingual Speech-to-Subtitles corpus ideal for building subtitle-oriented machine and speech translation systems. It comprises audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. MuST-Cinema was built by annotating MuST-C with subtitle breaks based on the original subtitle files. Special symbols have been inserted in the aligned sentences to mark subtitle breaks as follows:

  3. MuST-C: The Multi-Sensor and Multi-Temporal Dataset of Multiple Crops for...

    • phenoroam.phenorob.de
    Updated Nov 26, 2023
    Cite
    (2023). PhaseOne High resolution RGB images CKA Phenorob Central Experiment 2023 [Dataset]. https://phenoroam.phenorob.de/geonetwork/srv/search?orgName=Linn%20Chong
    Dataset updated
    Nov 26, 2023
    Description

    Phenotyping is crucial for understanding crop trait variation and advancing research, but is currently limited by expensive, labor-intensive monitoring. New methods are proposed to automate phenotypic trait monitoring and reduce this so-called phenotyping bottleneck. These methods are often data-driven, requiring a dataset for novel method development. In this paper, we present the MuST-C (Multi-Sensor, multi-Temporal, multiple Crops) data set, which contains field data from various platforms collected over one growing season, covering six different crop species. All data are georeferenced for alignment across sensors and dates. To collect our dataset, we deployed aerial and ground robotic platforms equipped with RGB cameras, LiDARs, and multispectral cameras to achieve not only a high variety of modalities but also varying viewpoints. In addition to sensor data, our data set provides destructively derived reference measurements of leaf area and biomass. Our data set enables the development of autonomous phenotypic trait estimation techniques, including novel multi-sensor approaches. Moreover, it allows method comparisons using different sensors and investigates their generalizability across crop species.

  4. must-c-en-de-wait3-01

    • huggingface.co
    Updated Nov 8, 2023
    Cite
    Max (2023). must-c-en-de-wait3-01 [Dataset]. https://huggingface.co/datasets/maxolotl/must-c-en-de-wait3-01
    Dataset updated
    Nov 8, 2023
    Authors
    Max
    Description

    Dataset Card for "must-c-en-de-wait3-01"

    More Information needed

  5. must-c-en-de-wait9-01

    • huggingface.co
    Updated Feb 3, 2024
    Cite
    Max (2024). must-c-en-de-wait9-01 [Dataset]. https://huggingface.co/datasets/maxolotl/must-c-en-de-wait9-01
    Dataset updated
    Feb 3, 2024
    Authors
    Max
    Description

    Dataset Card for "must-c-en-de-wait9-01"

    More Information needed

  6. must-c-en-de-wait5-01

    • huggingface.co
    Updated Feb 6, 2024
    Cite
    Max (2024). must-c-en-de-wait5-01 [Dataset]. https://huggingface.co/datasets/maxolotl/must-c-en-de-wait5-01
    Dataset updated
    Feb 6, 2024
    Authors
    Max
    Description

    Dataset Card for "must-c-en-de-wait5-01"

    More Information needed

  7. Data from: Lost in Translation: A Study of Bugs Introduced by Large Language...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 25, 2024
    Cite
    Ali Reza Ibrahimzada (2024). Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [Dataset]. http://doi.org/10.5281/zenodo.10447705
    Available download formats: zip
    Dataset updated
    Jan 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Reza Ibrahimzada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Artifact repository for the paper Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code, accepted at ICSE 2024, Lisbon, Portugal. Authors are Rangeet Pan*, Ali Reza Ibrahimzada*, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.

    Install

    This repository contains the source code for reproducing the results in our paper. Please start by cloning this repository:

    We recommend using a virtual environment for running the scripts. Please download conda 23.11.0 from this link. You can create a virtual environment using the following command:

    conda create -n plempirical python=3.10.13
    

    After creating the virtual environment, you can activate it using the following command:

    conda activate plempirical
    

    You can run the following command to make sure that you are using the correct version of Python:

    python3 --version && pip3 --version
    

    Dependencies

    To install all software dependencies, please execute the following command:

    pip3 install -r requirements.txt
    

    As for hardware dependencies, we used 16 NVIDIA A100 GPUs with 80 GB of memory each for model inference. The models can be run on any combination of GPUs as long as the model weights are properly distributed across them. We did not distribute weights ourselves because each GPU had enough memory (80 GB).

    Moreover, for compiling and testing the generated translations, we used Python 3.10, g++ 11, GCC Clang 14.0, Java 11, Go 1.20, Rust 1.73, and .NET 7.0.14 for Python, C++, C, Java, Go, Rust, and C#, respectively. Overall, we recommend a Linux machine with at least 32 GB of RAM for running the scripts.

    For running scripts of alternative approaches, you need to make sure you have installed C2Rust, CxGO, and Java2C# on your machine. Please refer to their repositories for installation instructions. For Java2C#, you need to create a .csproj file like below:

    
    

    Dataset

    We uploaded the dataset we used in our empirical study to Zenodo. The dataset is organized as follows:

    1. CodeNet
    2. AVATAR
    3. Evalplus
    4. Apache Commons-CLI
    5. Click

    Please download and unzip the dataset.zip file from Zenodo. After unzipping, you should see the following directory structure:

    PLTranslationEmpirical
    ├── dataset
      ├── codenet
      ├── avatar
      ├── evalplus
      ├── real-life-cli
    ├── ...
    

    The structure of each dataset is as follows:

    1. CodeNet & Avatar: Each directory in these datasets corresponds to a source language, and each includes two directories, Code and TestCases, for code snippets and test cases, respectively. Each code snippet has an id in its filename, and that id is used as a prefix for the test I/O files.

    2. Evalplus: The source language code snippets follow a similar structure to CodeNet and Avatar. However, as a one-time effort, we manually created the test cases in the target language, Java, inside a maven project, evalplus_java. To evaluate the translations from an LLM, we recommend moving the generated Java code snippets to the src/main/java directory of the maven project and then running the command mvn clean test surefire-report:report -Dmaven.test.failure.ignore=true to compile, test, and generate reports for the translations.

    3. Real-life Projects: The real-life-cli directory represents two real-life CLI projects from Java and Python. These datasets only contain code snippets as files and no test cases. As mentioned in the paper, the authors manually evaluated the translations for these datasets.
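
    The CodeNet and AVATAR layout described above can be walked with a short script. The following is a minimal sketch, not the authors' code: it assumes each source-language directory holds Code and TestCases subdirectories as stated, and that a test file name starts with the snippet id followed by a non-alphanumeric separator. The function name pair_snippets_with_tests and that separator rule are hypothetical.

```python
from pathlib import Path


def pair_snippets_with_tests(language_dir):
    """Map each snippet id to its code file and the test I/O files sharing that id prefix.

    Assumes the layout <language_dir>/Code and <language_dir>/TestCases
    described above; the exact file naming is an assumption.
    """
    language_dir = Path(language_dir)
    test_files = sorted((language_dir / "TestCases").iterdir())
    pairs = {}
    for code_file in sorted((language_dir / "Code").iterdir()):
        snippet_id = code_file.stem  # id taken from the filename without its extension
        matching = []
        for test_file in test_files:
            name = test_file.name
            if not name.startswith(snippet_id):
                continue
            # Reject longer ids that merely share the prefix (e.g. "s1" vs "s10").
            next_char = name[len(snippet_id):len(snippet_id) + 1]
            if next_char.isalnum():
                continue
            matching.append(test_file)
        pairs[snippet_id] = {"code": code_file, "tests": matching}
    return pairs
```

    Calling pair_snippets_with_tests on one language directory, for example dataset/codenet/Python if that is how the unzipped dataset is laid out, returns a dictionary keyed by snippet id.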

    Scripts

    We provide bash scripts for reproducing the results of this work. First, we discuss the translation script. To translate with a given model and dataset, first create a .env file in the repository and add the following:

    OPENAI_API_KEY=
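
    How the scripts read this file is not shown here. As a rough illustration only, a .env file of this shape can be parsed with the standard library; the helper name read_env_file is hypothetical and is not part of the repository.

```python
def read_env_file(text):
    """Parse KEY=VALUE lines from the contents of a .env file into a dict.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped. Surrounding quotes around a value are removed.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

    An empty value, as in the template above, parses to an empty string, which makes a missing key easy to detect before any API call is attempted.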

    1. Translation with GPT-4: You can run the following command to translate all Python -> Java code snippets in the codenet dataset with GPT-4, using top-k sampling with k=50, top-p sampling with p=0.95, and temperature=0.7:

    bash scripts/translate.sh GPT-4 codenet Python Java 50 0.95 0.7 0
    

    2. Translation with CodeGeeX: Prior to running the script, you need to clone the CodeGeeX repository from here and use the instructions from their artifacts to download their model weights. After cloning it inside PLTranslationEmpirical and downloading the model weights, your directory structure should be like the following:

    PLTranslationEmpirical
    ├── dataset
      ├── codenet
      ├── avatar
      ├── evalplus
      ├── real-life-cli
    ├── CodeGeeX
      ├── codegeex
      ├── codegeex_13b.pt # this file is the model weight
      ├── ...
    ├── ...
    

    You can run the following command to translate all Python -> Java code snippets in the codenet dataset with CodeGeeX, using top-k sampling with k=50, top-p sampling with p=0.95, and temperature=0.2 on GPU gpu_id=0:

    bash scripts/translate.sh CodeGeeX codenet Python Java 50 0.95 0.2 0
    

    3. For all other models (StarCoder, CodeGen, LLaMa, TB-Airoboros, TB-Vicuna), you can execute the following command to translate all Python -> Java code snippets in the codenet dataset, using top-k sampling with k=50, top-p sampling with p=0.95, and temperature=0.2 on GPU gpu_id=0. Replace StarCoder with CodeGen, LLaMa, TB-Airoboros, or TB-Vicuna as needed:

    bash scripts/translate.sh StarCoder codenet Python Java 50 0.95 0.2 0
    

    4. For translating and testing pairs with traditional techniques (i.e., C2Rust, CxGO, Java2C#), you can run the following commands:

    bash scripts/translate_transpiler.sh codenet C Rust c2rust fix_report
    bash scripts/translate_transpiler.sh codenet C Go cxgo fix_reports
    bash scripts/translate_transpiler.sh codenet Java C# java2c# fix_reports
    bash scripts/translate_transpiler.sh avatar Java C# java2c# fix_reports
    

    5. To compile and test the CodeNet, AVATAR, and Evalplus (Python to Java) translations from GPT-4 and generate fix reports, you can run the following commands:

    bash scripts/test_avatar.sh Python Java GPT-4 fix_reports 1
    bash scripts/test_codenet.sh Python Java GPT-4 fix_reports 1
    bash scripts/test_evalplus.sh Python Java GPT-4 fix_reports 1
    

    6. To repair unsuccessful Python -> Java translations in the CodeNet dataset with GPT-4, you can run the following commands:

    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 compile
    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 runtime
    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 incorrect
    

    7. For cleaning translations of open-source LLMs (i.e., StarCoder) in codenet, you can run the following command:

    bash scripts/clean_generations.sh StarCoder codenet
    

    Please note that for the above commands, you can change the dataset and model names to run the same step for other datasets and models. Moreover, you can refer to /prompts for the different vanilla and repair prompts used in our study.
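
    The translate.sh invocations above share one positional order: model, dataset, source language, target language, top-k, top-p, temperature, and a final integer. A minimal sketch of building that command line follows; the helper build_translate_command is hypothetical, and naming the last argument gpu_id is an inference from the CodeGeeX and StarCoder commands, since its meaning for GPT-4 is not stated.

```python
import shlex


def build_translate_command(model, dataset, source_lang, target_lang,
                            top_k=50, top_p=0.95, temperature=0.2, gpu_id=0):
    """Return the argv list for scripts/translate.sh in the positional order used above."""
    return [
        "bash", "scripts/translate.sh",
        model, dataset, source_lang, target_lang,
        str(top_k), str(top_p), str(temperature), str(gpu_id),
    ]


# Reproduces the StarCoder example command as a single shell-quoted string.
command = build_translate_command("StarCoder", "codenet", "Python", "Java")
print(shlex.join(command))
```

    Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting issues; shlex.join is used only for display.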

    Artifacts

    Please download the artifacts.zip file from our Zenodo repository. We have organized the artifacts as follows:

    1. RQ1 - Translations: This directory contains the translations

  8. must-c-en-fr-wait09_21.8

    • huggingface.co
    Updated Dec 3, 2023
    Cite
    Max (2023). must-c-en-fr-wait09_21.8 [Dataset]. https://huggingface.co/datasets/maxolotl/must-c-en-fr-wait09_21.8
    Dataset updated
    Dec 3, 2023
    Authors
    Max
    Description

    Dataset Card for "must-c-en-fr-wait09_21.8"

    More Information needed

  9. asr-alignment

    • huggingface.co
    Updated Jan 18, 2024
    Cite
    Binh Nguyen (2024). asr-alignment [Dataset]. https://huggingface.co/datasets/nguyenvulebinh/asr-alignment
    Dataset updated
    Jan 18, 2024
    Authors
    Binh Nguyen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Speech Recognition Alignment Dataset

    This dataset is a variation of several widely-used ASR datasets, encompassing Librispeech, MuST-C, TED-LIUM, VoxPopuli, Common Voice, and GigaSpeech. The difference is this dataset includes:

    Precise alignment between audio and text. Text that has been punctuated and made case-sensitive. Identification of named entities in the text.

      Usage
    

    First, install the latest version of the 🤗 Datasets package: pip install --upgrade pip pip… See the full description on the dataset page: https://huggingface.co/datasets/nguyenvulebinh/asr-alignment.

  10. must-c-en-es-wait7-01

    • huggingface.co
    Updated Nov 6, 2023
    Cite
    Max (2023). must-c-en-es-wait7-01 [Dataset]. https://huggingface.co/datasets/maxolotl/must-c-en-es-wait7-01
    Dataset updated
    Nov 6, 2023
    Authors
    Max
    Description

    Dataset Card for "must-c-en-es-wait7-01"

    More Information needed

  11. maxolotl_must-c-en-fr_21.8

    • huggingface.co
    Updated Jun 21, 2025
    Cite
    french-datasets (2025). maxolotl_must-c-en-fr_21.8 [Dataset]. https://huggingface.co/datasets/french-datasets/maxolotl_must-c-en-fr_21.8
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    french-datasets
    Description

    This repository is empty; it was created to improve the search indexing of the dataset maxolotl/must-c-en-fr_21.8.


MuST-C-de

enimai/MuST-C-de

99 scholarly articles cite this dataset (View in Google Scholar)