52 datasets found
  1. LAIL

    • figshare.com
    zip
    Updated Jul 30, 2024
    Cite
    Jia Li (2024). LAIL [Dataset]. http://doi.org/10.6084/m9.figshare.22014596.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    figshare
    Authors
    Jia Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LAIL is a Large language model-Aware selection approach for In-context-Learning-based code generation. LAIL uses LLMs themselves to select examples: it asks the LLM to label a candidate example as a positive or a negative example for a given requirement.

    Requirements

    • openai
    • tqdm
    • java

    We also provide a script (/Evaluation/evaluation_setup.sh) to help set up the programming language dependencies used in evaluation:

      bash evaluation_setup.sh

    Dataset

    The datasets comprise DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories, and it aligns with real-world repositories in multiple dimensions, so we take DevEval as the example to demonstrate how to process the dataset. Take ../Dataset/DevEval as an example.

    train.jsonl and test.jsonl: (1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain. (2) We randomly split the tasks of the two domains into a training set and a test set, yielding 101 examples in the training set and 49 examples in the test set. (3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all functions of the repository. (4) We treat the functions contained in the repository as the candidate pool, from which LAIL and the baselines retrieve a few functions as demonstration examples.

    The source data and test_source data folders consist of the original code repositories collected from GitHub. The estimate_prompt folder contains the constructed prompts used to estimate candidate examples. The generation_prompt folder contains the constructed prompts whose demonstration examples were selected by LAIL and the different baselines. For example: (1) the ICL_LAIL folder provides the ids of the examples selected by our LAIL in LAIL_id; developers can directly use these provided prompts through codellama_completion.py to generate programs. (2) After generating programs, developers need to process the generated programs with process_generation.py. (3) Finally, developers evaluate the generated programs with the source code in the Evaluation folder.

    LAIL

    Estimate candidate examples by LLMs themselves

    We leverage LLMs themselves to estimate candidate examples. The code is stored in the LAIL/estimate_examples package. Taking DevEval as the example: (1) the /Dataset/DevEval/estimate_prompt folder contains the constructed prompts used to estimate candidate examples. (2) Developers run the following command to estimate candidate examples by CodeLlama-7B:

      bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt

    (3) According to the probability feedback of the LLMs, we acquire the positive and negative examples.

    Train a neural retriever

    (1) We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the /LAIL/LAIL/retriever/train folder:

      export CUDA_VISIBLE_DEVICES=0
      nohup python run.py \
        --output_dir=/saved_models \
        --model_type=roberta \
        --config_name=microsoft/graphcodebert-base \
        --model_name_or_path=microsoft/graphcodebert-base \
        --tokenizer_name=microsoft/graphcodebert-base \
        --do_train \
        --train_data_file=/id.jsonl \
        --epoch 100 \
        --block_size 128 \
        --train_batch_size 16 \
        --learning_rate 1e-4 \
        --max_grad_norm 1.0 \
        --seed 123456 >mbpp.txt 2>&1 &

    Select a few demonstration examples using the trained retriever

    (2) Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the /LAIL/LAIL/retriever/train folder:

      bash run_inference.sh ../Dataset/DevEval

    Code generation

    (1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into the LLM and acquire the desired programs. For example, developers use CodeLlama (../LAIL/ICL_LAIL/codellama_completion.py) to generate programs:

      export CUDA_VISIBLE_DEVICES=0
      torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False

    (2) After generating programs, developers need to process the generated programs with ../LAIL/ICL_LAIL/process_generation.py:

      python process_generation.py

    Baselines

    This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation. (1) The source code is in the baselines folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running the source code as follows:

      python baselines.py

    (2) Then, developers use /baselines/make_prompt.py to construct a prompt context using the selected candidate examples as follows:

      python make_prompt.py ICLCoder ICLCoder -1

    Evaluation

    In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in LAIL/Evaluation. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the pipeline in /LAIL/Evaluation/ to evaluate the different approaches.

    Citation

    If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.
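
    As a quick illustration, here is a minimal Python sketch of reading the DevEval splits described above; it assumes only that train.jsonl and test.jsonl hold one JSON object per line, and leaves the field names to inspection:

      import json

      def load_split(path):
        # Each line of a DevEval split file is one JSON object.
        with open(path) as f:
          return [json.loads(line) for line in f]

      train = load_split("../Dataset/DevEval/train.jsonl")
      test = load_split("../Dataset/DevEval/test.jsonl")
      print(len(train), "training examples;", len(test), "test examples")  # expected: 101 and 49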

  2. CoSyn-point

    • huggingface.co
    Updated Feb 23, 2025
    Cite
    Ai2 (2025). CoSyn-point [Dataset]. https://huggingface.co/datasets/allenai/CoSyn-point
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/

    Description

    CoSyn-point

    CoSyn-point is a collection of diverse computer-generated images that are annotated with queries and answer points. It can be used to train models to return points in the image in response to a user query. The data was created by using the Claude large language model to generate code that can be executed to render an image. The code used to generate this data is open source. Synthetic question-answer data is also available in a separate repo. Quick links:

    📃 CoSyn… See the full description on the dataset page: https://huggingface.co/datasets/allenai/CoSyn-point.
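
    A minimal sketch of loading the dataset with the Hugging Face datasets library; the repository id comes from the URL above, while the split name and column contents are assumptions to verify against the dataset page:

      from datasets import load_dataset

      ds = load_dataset("allenai/CoSyn-point", split="train")  # split name is an assumption
      print(ds)     # prints column names and row count
      print(ds[0])  # first record; expected to pair an image with queries and answer points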

  3. Data from: Automatic Detection of Ditches and Natural Streams from Digital...

    • researchdata.se
    Updated Mar 15, 2024
    + more versions
    Cite
    Mariana dos Santos Toledo Busarello; William Lidberg; Anneli Ågren; Florian Westphal (2024). Automatic Detection of Ditches and Natural Streams from Digital Elevation Models Using Deep Learning [Dataset]. http://doi.org/10.5878/jrex-z325
    Explore at:
    Available download formats
    Dataset updated
    Mar 15, 2024
    Dataset provided by
    Swedish University of Agricultural Sciences
    Authors
    Mariana dos Santos Toledo Busarello; William Lidberg; Anneli Ågren; Florian Westphal
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    This data contains the digital elevation models and polyline shapefiles with the location of channels from the 12 study areas used in this study. It also has the code to generate the datasets used to train the deep learning models to detect channels, ditches, and streams, and calculate the topographic indices. The code to train the models is also included, along with the models with the highest performance in 0.5 m resolution. The channels were mapped differently based on their type: ditches were manually digitized based on the visual analysis of some topographic indices and orthophotos obtained from the DEM. Streams were mapped by initially detecting all natural channel heads, then tracing the downstream channels, and finally manually editing them based on orthophotos.

  4. codeparrot

    • huggingface.co
    Updated Sep 1, 2021
    + more versions
    Cite
    Natural Language Processing with Transformers (2021). codeparrot [Dataset]. https://huggingface.co/datasets/transformersbook/codeparrot
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2021
    Dataset authored and provided by
    Natural Language Processing with Transformers
    Description

    CodeParrot 🦜 Dataset

      What is it?
    

    This is the full CodeParrot dataset. It contains Python files used to train the code generation model in Chapter 10: Training Transformers from Scratch in the NLP with Transformers book. You can find the full code in the accompanying GitHub repository.

      Creation
    

    It was created with the GitHub dataset available via Google's BigQuery. It contains approximately 22 million Python files and is 180 GB (50 GB compressed) in size. The… See the full description on the dataset page: https://huggingface.co/datasets/transformersbook/codeparrot.
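
    Because the full dataset is around 180 GB, a minimal sketch would stream it rather than download it up front; the column name is an assumption:

      from datasets import load_dataset

      # Stream examples instead of materializing the full dataset locally.
      ds = load_dataset("transformersbook/codeparrot", split="train", streaming=True)
      for example in ds:
        print(example["content"][:200])  # "content" column name is an assumption
        break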

  5. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    Community Data License Agreement - Sharing 1.0: https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1: This dataset is composed of 13,712 human texts and 1,165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
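
    A minimal sketch of loading one of the CSV files above with pandas and checking the class balance; the label column name is an assumption, so inspect the columns first:

      import pandas as pd

      df = pd.read_csv("train_essays_RDizzl3_seven_v2.csv")
      print(df.columns.tolist())             # inspect the actual column names
      print(df["generated"].value_counts())  # "generated" label column is an assumption
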
  6. vfillDL: A geomorphology deep learning dataset of valley fill faces...

    • figshare.com
    bin
    Updated Mar 22, 2023
    Cite
    Aaron Maxwell (2023). vfillDL: A geomorphology deep learning dataset of valley fill faces resulting from mountaintop removal coal mining (southern West Virginia, eastern Kentucky, and southwestern Virginia, USA) [Dataset]. http://doi.org/10.6084/m9.figshare.22318522.v2
    Explore at:
    bin (available download formats)
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    figshare
    Authors
    Aaron Maxwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Southwest Virginia, Southern West Virginia, West Virginia, United States
    Description

    scripts.zip

    arcgisTools.atbx:

    • terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
    • rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

    • makeChips.R: R function to break terrain derivatives and masks into image chips of a defined size.
    • makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
    • merge_logs.R: R script to merge training logs into a single file.
    • predictToExtents.ipynb: Python notebook to use a trained model to predict to new data.
    • trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
    • assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
    • graphs_results.R: R code to make graphs with ggplot2 to summarize results.
    • makeChipsList.R: R code to generate lists of chips in a directory.
    • makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).

    vfillDL.zip

    • dems: LiDAR DTM data partitioned into training, three testing, and two validation datasets. Original DTM data were obtained from 3DEP (https://www.usgs.gov/3d-elevation-program) and the WV GIS Technical Center (https://wvgis.wvu.edu/).
    • extents: extents of the training, testing, and validation areas. These extents were defined by the researchers.
    • vectors: vector features representing valley fills, partitioned into separate training, testing, and validation datasets. Extents were created by the researchers.
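
    A minimal Python sketch of the chipping step that makeChips.R performs (the original is an R function; the chip size and array shapes here are assumptions):

      import numpy as np

      def make_chips(raster, mask, chip_size=256):
        # Split a (bands, H, W) raster and its (H, W) mask into aligned square chips.
        chips = []
        _, h, w = raster.shape
        for i in range(0, h - chip_size + 1, chip_size):
          for j in range(0, w - chip_size + 1, chip_size):
            chips.append((raster[:, i:i + chip_size, j:j + chip_size],
                          mask[i:i + chip_size, j:j + chip_size]))
        return chips

      # Random data standing in for a 6-band terrain derivative stack and its mask.
      chips = make_chips(np.random.rand(6, 1024, 1024), np.zeros((1024, 1024)))
      print(len(chips), "chips")  # 16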

  7. Code and data sets for "MS²Rescore: Data-driven rescoring dramatically...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated May 15, 2022
    Cite
    Hirschler Aurélie (2022). Code and data sets for "MS²Rescore: Data-driven rescoring dramatically boosts immunopeptide identification rates" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5615049
    Explore at:
    Dataset updated
    May 15, 2022
    Dataset provided by
    Carapito Christine
    Hirschler Aurélie
    Martens Lennart
    Bouwmeester Robbin
    Declercq Arthur
    Degroeve Sven
    Gabriels Ralf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code used to prepare data sets, train and evaluate new MS²PIP models, evaluate MS²Rescore for immunopeptidomics, and generate figures. See README.md for more information on how to use these files and reproduce the results reported in the manuscript titled "MS²Rescore: Data-driven rescoring dramatically boosts immunopeptide identification rates".

  8. Magnetic Tape Recorder Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2023
    Cite
    Moliner, Eloi (2023). Magnetic Tape Recorder Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8026271
    Explore at:
    Dataset updated
    Jun 30, 2023
    Dataset provided by
    Wright, Alec
    Moliner, Eloi
    Välimäki, Vesa
    Mikkonen, Otto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the datasets collected and used in the research project:

    O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.

    A pre-print of the article is available on arXiv. The code is open source and published on GitHub. The accompanying web page can be found here.

    Overview

    The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder, and real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, similarly to what was done in the research article.

    Toy and Real Data

    The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first variant, the fluctuating delay produced by the simulated tape transport was disabled, and in the second, the delay was enabled. The latter variants are denoted with the string WOWFLUTTER.

    The real data is collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.

    Directories

    For training the models, a fraction of the inputs from SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:

    ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)

    ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)

    Silence_AKAI_*/ (noise generator, real data)

    ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)

    ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)

    For visualizing the model behavior, the following subsets can be used:

    LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)

    SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)

    Directory structure

    Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation, and test sets from each other. Thus, a typical directory will look like the following:

    [DIRECTORY_NAME]
    ├── Train
    │   ├── input_x_.wav
    │   ...
    │   ├── target_x_.wav
    │   ...
    ├── Val
    │   ├── input_y_.wav
    │   ...
    │   ├── target_y_.wav
    │   ...
    └── Test
        ├── input_z_.wav
        ...
        ├── target_z_.wav
        ...

    While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.

    The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav make up a pair, such that the target audio is the input audio processed with one of the used effects. In some of the cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
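
    A minimal sketch of pairing the files described above; the subset directory name is an assumption, and the trajectory file is loaded only when present:

      import re
      from pathlib import Path

      import numpy as np

      def list_pairs(subset_dir):
        # Pair input_x_.wav with target_x_.wav, plus trajectory_x_.npy when it exists.
        subset = Path(subset_dir)
        pairs = []
        for inp in sorted(subset.glob("input_*.wav")):
          idx = re.search(r"input_(\d+)_", inp.name).group(1)
          target = subset / f"target_{idx}_.wav"
          trajectory = subset / f"trajectory_{idx}_.npy"
          pairs.append((inp, target, np.load(trajectory) if trajectory.exists() else None))
        return pairs

      pairs = list_pairs("ReelToReel_Dataset_MiniPulse100_CHOWTAPE/Train")  # directory name is an assumption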

  9. Blue Bot Dataset: Train, Test, Validate

    • kaggle.com
    Updated Nov 25, 2020
    Cite
    Bajan Digital Creations Incorporated (2020). Blue Bot Dataset: Train, Test, Validate [Dataset]. https://www.kaggle.com/hiyaro/bluenetpenta/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bajan Digital Creations Incorporated
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Context

    “I remember that, though of humble origin, the sea was always the living pantry. The memories of my uncles spear-fishing in the waters off Inch Marlowe are fond memories. Unfortunately, they are just that, memories!

    My children love visiting Barbados. However, their ancestral waters do not have the abundance of life I recalled. They cannot live the childhood that I had and that saddens me.”

    S. Antonio Hollingsworth, Founder BDCI Barbados

    This dataset was created to give Caribbean developers in the field of artificial intelligence and machine learning a head start in training the next generation of A.I. and machine learning applications. We believe that to meet the challenges of reef collapse due to human activity, artificial intelligence will give small island developing states the edge needed to remain competitive and survive in a rapidly changing world.

    Content


    This dataset contains image data of target fish species. It is categorical in nature and is intended for use in computer vision.

    This dataset contains images of fish in different natural positions, lighting and water conditions.

    The fish are presented in their natural environment.

    Some images may contain more than one member of the target species, or another species that, while not dominant, may influence the training process.

    Data collection period: August - November 2020. Data collection location: Barbados. General data coordinate: 13.1939° N, 59.5432° W. Data collection depth range: 0 m to 5 m. Data collection climate: Tropical, Marine, Sea. Average water temperature: 29 °C.

    Data collector: S. Antonio Hollingsworth. Camera used: BW Space Pro 4K Zoom. Platform: underwater robot.

    Acknowledgements

    We wouldn't be here without the help of others.

    Thanks to:

    The UNDP Accelerator Labs for Barbados & the Eastern Caribbean for funding The Blue-Bot Project.

    Stacy R. Phillips for project proposal presentations.

    S. Antonio Hollingsworth for piloting the remote underwater robot and curating the images of this dataset.

    Youcan Robotics for their technical and customer support.

    Those dear to us who inspire us to dream of a better tomorrow.

    Code attributions:

    tensorflow.org: MobileNet V2 pre-trained model used in the transfer-learning process of BlueNet (see the sketch below)

    python.org
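
    A minimal sketch, assuming TensorFlow/Keras, of a MobileNet V2 transfer-learning setup like the one credited above; the input size and number of target fish species are assumptions:

      import tensorflow as tf

      # Frozen MobileNet V2 backbone with a new classification head.
      base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                               include_top=False, weights="imagenet")
      base.trainable = False

      model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(5, activation="softmax"),  # class count is an assumption
      ])
      model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])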

    Inspiration

    How can we improve the data collection process in the blue economy?

    What is the best way to use A.I. in the blue economy?

    Can we use computer vision and artificial intelligence to find and learn the complex patterns that exist on coral reefs?

    How do we use this insight to create effective and long term conservation and resilience policies for small island developing states that depend on coral reefs for economic survival?

  10. Data from: Data-driven surrogate model for wind turbine damage equivalent...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2024
    Cite
    Haghi, Rad (2024). Data-driven surrogate model for wind turbine damage equivalent load [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12583596
    Explore at:
    Dataset updated
    Oct 29, 2024
    Dataset authored and provided by
    Haghi, Rad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    There are four zip files in this data set:

    PythonCode_OpenFAST: The code used to generate 32768 OpenFAST fst files to build the database.

    ML_TrainingCode: The code used to train the TCN-FCNN and FCNN models for both free-stream and wake conditions.

    Trained_Models: All the trained models are saved in Keras format. The models with max in their filenames were trained on maximum values. The models with XY in their naming were trained on wind in the X and Y directions.

    data: It includes all the CSV files for training and testing.

  11. Prediction of Venous Thromboembolism in Diverse Populations Using Machine...

    • data.mendeley.com
    Updated Oct 25, 2023
    + more versions
    Cite
    Robert Chen (2023). Prediction of Venous Thromboembolism in Diverse Populations Using Machine Learning and Electronic Health Records [Dataset]. http://doi.org/10.17632/tkwzysr4y6.6
    Explore at:
    Dataset updated
    Oct 25, 2023
    Authors
    Robert Chen
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Contains resources needed to train, test, and analyze performance of gradient boosting models used to predict venous thromboembolism (VTE) from electronic health record (EHR) data.

    "Code for analyses" folder: Contains code we used for the analyses in our paper. Prediction.ipynb: Contains code needed to run trained models. Small, Medium, and Large.xlsx: Excel templates to correctly format data for prediction generation. Models.zip: Contains trained models. Note that this is 0.4 GB once unzipped. Analysis.ipynb: Contains code used to train the models.

    Dependencies: Python 3.10.9; Pandas 1.5.1; LightGBM 3.3.2.
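
    A minimal sketch of generating predictions with the pinned dependencies above; the model file name/format and the template columns are assumptions:

      import lightgbm as lgb
      import pandas as pd

      model = lgb.Booster(model_file="Models/vte_model.txt")  # file name and format are assumptions
      features = pd.read_excel("Small.xlsx")                  # data formatted with the provided template
      print(model.predict(features))                          # predicted VTE risk scores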

  12. terraceDL: A geomorphology deep learning dataset of agricultural terraces in...

    • figshare.com
    bin
    Updated Mar 22, 2023
    Cite
    Aaron Maxwell (2023). terraceDL: A geomorphology deep learning dataset of agricultural terraces in Iowa, USA [Dataset]. http://doi.org/10.6084/m9.figshare.22320373.v2
    Explore at:
    bin (available download formats)
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Maxwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Iowa, United States
    Description

    scripts.zip

    arcgisTools.atbx:

    • terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
    • rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

    • makeChips.R: R function to break terrain derivatives and masks into image chips of a defined size.
    • makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
    • merge_logs.R: R script to merge training logs into a single file.
    • predictToExtents.ipynb: Python notebook to use a trained model to predict to new data.
    • trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
    • assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
    • graphs_results.R: R code to make graphs with ggplot2 to summarize results.
    • makeChipsList.R: R code to generate lists of chips in a directory.
    • makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).

    terraceDL.zip

    • dems: LiDAR DTM data partitioned into training, testing, and validation datasets based on HUC8 watershed boundaries. Original DTM data were provided by the Iowa BMP mapping project: https://www.gis.iastate.edu/BMPs.
    • extents: extents of the training, testing, and validation areas, as defined by HUC8 watershed boundaries.
    • vectors: vector features representing agricultural terraces, partitioned into separate training, testing, and validation datasets. Original digitized features were provided by the Iowa BMP Mapping Project: https://www.gis.iastate.edu/BMPs.

  13. Data from: Global data-driven prediction of fire activity

    • springernature.figshare.com
    bin
    Updated Apr 2, 2025
    Cite
    Francesca Di Giuseppe (2025). Global data-driven prediction of fire activity [Dataset]. http://doi.org/10.6084/m9.figshare.27269748.v1
    Explore at:
    bin (available download formats)
    Dataset updated
    Apr 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Francesca Di Giuseppe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All files within this directory are used to generate the models, data, and plots used in the submission. Software is available at Code Ocean: https://codeocean.com/capsule/8570224/tree

    For the generation of the dataframes used for training the various data-driven models, we use (input data not present in this folder are available from ERA5-Land):

    MAKE_PICKLE_*.py

    This generates the following pickle files:

    TRAIN_*_10PER_2019_2021.pk1

    Once those pickle files are generated, the XGBoost models are trained using:

    TR_*.py

    This generates the model files, which are then used with the input data to generate the ERA5 and forecast products:

    MAKE_GLOBAL_MAPS_ERA5.py

    The output modelled data are then available as :

    POF_*_ERA5.nc

    Computation of the radar plot scores is done using:

    SCORES_FULL.py

    Outputs of the statistical analysis are available as:

    correlation_fwi.pkl

    obs_correlation_fwi.pkl

    *RELIABILITY.pkl

    *RADAR.pkl

    Region-specific observations and forecasts are provided in:

    NEN*

    NWN*

    The plots are made using the following scripts:

    FIG_*.ipynb

    For access to data not present in this capsule, please use the FTP site:

    ftp server: ftp.ecmwf.int username: ecmwf_fire password: FhXekWMuy
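
    A minimal sketch of the training step that the TR_*.py scripts perform: load one of the TRAIN_*_10PER_2019_2021.pk1 pickles and fit an XGBoost model (the region tag in the file name and the column names are assumptions):

      import pandas as pd
      import xgboost as xgb

      df = pd.read_pickle("TRAIN_GLOBAL_10PER_2019_2021.pk1")  # region tag is an assumption
      X = df.drop(columns=["fire"])  # "fire" target column is an assumption
      y = df["fire"]

      model = xgb.XGBClassifier(n_estimators=200)
      model.fit(X, y)
      model.save_model("pof_model.json")  # output name is an assumption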

  14. Smartwatch Purchase Data

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Aayush Chourasiya (2022). Smartwatch Purchase Data [Dataset]. https://www.kaggle.com/datasets/albedo0/smartwatch-purchase-data/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aayush Chourasiya
    Description

    Disclaimer: This is artificially generated data, produced using a Python script based on the arbitrary assumptions listed below.

    The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.

    ----- Version 1 -------

    trainingDataV1.csv, testDataV1.csv or trainingData.csv, testData.csv

    The data includes the following features for each user:

    1. age: The age of the user (integer, 18-70)
    2. income: The income of the user (integer, 25,000-200,000)
    3. gender: The gender of the user (string, "male" or "female")
    4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
    5. hour: The hour of the day (integer, 0-23)
    6. weekend: A boolean indicating whether it is the weekend (True or False)

    The data also includes a label for each user indicating whether they are likely to buy a smart watch (string, "yes" or "no"). The label is determined based on the following arbitrary conditions:

    • If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
    • If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (i.e., assuming sales are 30% more likely to occur on weekends).
    • If the user is male and under 30 with an income over 75,000, the label is "yes".
    • If the user is female and 30 or over with an income over 100,000, the label is "yes".
    • Otherwise, the label is "no".

    The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.

    The following Python script was used to generate this dataset:

    import random
    import csv
    
    # Set the number of examples to generate
    numExamples = 100000
    
    # Generate the training data
    with open("trainingData.csv", "w", newline="") as csvfile:
      fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
      writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
      writer.writeheader()
    
      for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])
    
        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
          buySmartWatch = "no"
        # assuming sales are 30% more likely to occur on weekends.
        elif weekend == True and random.random() < 1.3:
          buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
          buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
          buySmartWatch = "yes"
        else:
          buySmartWatch = "no"
    
        writer.writerow({
          "age": age,
          "income": income,
          "gender": gender,
          "maritalStatus": maritalStatus,
          "hour": hour,
          "weekend": weekend,
          "buySmartWatch": buySmartWatch
        })
    

    ----- Version 2 -------

    trainingDataV2.csv, testDataV2.csv

    The data includes the following features for each user:

    1. age: The age of the user (integer, 18-70)
    2. income: The income of the user (integer, 25,000-200,000)
    3. gender: The gender of the user (string, "male" or "female")
    4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
    5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
    6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
    7. familySize: The number of people in the user's family (integer, 1-5)
    8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
    9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
    10. hour: The hour of the day when the user was surveyed (integer, 0-23)
    11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
    12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)

    Python script used to generate the data:

    import random
    import csv
    
    # Set the number of examples to generate
    numExamples = 100000
    
    with open("t...
    
  15. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    zip (available download formats)
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Nandana Mihindukulasooriya; Sanju Tiwari; Sanju Tiwari; Carlos F. Enguix; Carlos F. Enguix; Kusum Lata; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets: (i) Wikidata-TekGen, with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG, with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repo is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
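
    A minimal sketch of the compliance check at the heart of the task: keep only extracted triples whose relations appear in the given ontology (the relation set and file name here are illustrative assumptions):

      import json

      allowed_relations = {"publication date", "lyrics by"}  # toy subset of the music ontology

      with open("ont_music_test_output.json") as f:  # file name is an assumption
        prediction = json.load(f)

      compliant = [t for t in prediction["triples"] if t["rel"] in allowed_relations]
      print(f"{len(compliant)}/{len(prediction['triples'])} triples comply with the ontology")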

  16. Data from: Augmented base pairing networks encode RNA-small molecule binding...

    • data.niaid.nih.gov
    Updated Sep 13, 2023
    Cite
    Sarrazin Gendron, Roman (2023). Augmented base pairing networks encode RNA-small molecule binding preferences [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8338267
    Explore at:
    Dataset updated
    Sep 13, 2023
    Dataset provided by
    Mallet, Vincent
    Oliver, Carlos
    Waldispühl, Jérôme
    Sarrazin Gendron, Roman
    Hamilton, William L
    Moitessier, Nicolas
    Reinharz, Vladimir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used to train and validate the RNAmigos model from "Augmented base pairing networks encode RNA-small molecule binding preferences".

    This will give you a cleaned up version of the data used to train the RNAmigos 1.0 models.

    If you run python make_nice.py you will generate a CSV file rnamigos1_dataset.csv which contains all the info you need.

    The script will also use DecoyFinder to generate the decoys for each pocket.

    Pockets

    The CSV has one row for each binding pocket.

    The columns are:

    • pdbid: the PDBID this pocket belongs to

    • model_num: the model number inside the PDB we took

    • chain: the chain the pocket belongs to

    • ligand_id: the 3-letter code of the ligand (e.g. ATP) which you can look up on RCSB.org

    • ligand_resnum: the residue number of the ligand in the PDB

    • nodelist: a list of nodes separated by ';' in the pocket as a string in the format ..-;...

    • edgelist: a list of edges in the pocket, separated by ';', as a string. Nodes are in the same format as above, connected by a '-' char, with an additional label field; e.g., a two-edge list: 1aju.A.1-1aju.A.5-CWW;1aju.A.1-1aju.A.2-B53

    • fp_native_maccs: bit string of the MACCS for the native ligand

    • split_{k}_train: one column for each of the splits we ran (k \in {0-9}); contains True if this pocket was in the train set for this split

    • split_{k}_test: one column for each of the splits we ran (k \in {0-9}); contains True if this pocket was in the test set for this split

    Decoys

    The folder decoy_db/ has the following structure:

    decoy_db
    └── _{ligand_id}_{ligand_position}
        ├── decoyfinder
        │   ├── actives.txt
        │   └── decoys.txt
        └── pdb
            ├── actives.txt
            └── decoys.txt

    Each `actives.txt` and `decoys.txt` is a file with one SMILES per line.

    `decoyfinder/` has decoys computed by DecoyFinder; the actives are just the native ligands.

    `pdb/` has decoys taken from other pockets in the PDB; the actives are just the native ligands.
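
    A minimal sketch of using the CSV described above to build one train/test split; the column names follow the list above:

      import pandas as pd

      df = pd.read_csv("rnamigos1_dataset.csv")
      train = df[df["split_0_train"]]  # pockets in the train set of split 0
      test = df[df["split_0_test"]]    # pockets in the test set of split 0
      print(len(train), "train pockets;", len(test), "test pockets")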
    
  17. Data for three-dimensional distribution of groundwater residence time...

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data for three-dimensional distribution of groundwater residence time metrics in the glaciated United States using metamodels trained on general numerical simulation models [Dataset]. https://catalog.data.gov/dataset/data-for-three-dimensional-distribution-of-groundwater-residence-time-metrics-in-the-glaci
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    Residence time distribution (RTD) is a critically important characteristic of groundwater flow systems; however, it cannot be measured directly. RTD can be inferred from tracer data with analytical models (few parameters) or with numerical models (many parameters). The second approach permits more variation in system properties but is used less frequently than the first because large-scale numerical models can be resource intensive. With the data and computer codes in this data release users can (1) reconstruct and run 115 General Simulation Models (GSMs) of groundwater flow, (2) calculate groundwater age metrics at selected GSM cells, (3) train a boosted regression tree model using the provided data, (4) predict three-dimensional continuous groundwater age metrics across the Glacial Principal Aquifer, and (5) predict tritium concentrations at wells for comparison with measured tritium concentrations.

    The computer codes in this data release are in the form of Python scripts and Jupyter Notebooks. Users will need to have these Python resources installed on their computers to run the codes. Instructions for creating the Python environment can be found in the file Creating the Python environment.txt. Users who would rather not run the scripts but who wish to obtain the final data sets can do so by downloading the file Output--Predictions.7z. Users who wish to reproduce the data sets in this release can do so by downloading, unzipping, and running the data workflow in Starn_GW_Residence_Time_Data_and_Scripts.7z. The codes in this file use relative pathnames, so the directory structure within this file should not be changed. The ".7z" file extension indicates 7-Zip files, http://www.7-zip.org.

    • Executables--MODFLOW and MODPATH executable files, provided for convenience. These are Windows 64-bit versions.
    • Step 1--Create General Simulation Models: codes to create 115 GSMs.
    • Step 2--Data preparation: calculate residence time distributions at selected GSM cells.
    • Step 3--Metamodel training: train a boosted regression tree metamodel (XGBoost).
    • Step 4--Metamodel prediction: predict age metrics throughout the Glacial Aquifer.
    • Step 5--Tritium simulation: calculate tritium concentrations at selected wells.
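
    A minimal sketch of the Step 3 metamodel training with XGBoost (the release uses Jupyter Notebooks; the file and column names here are assumptions):

      import pandas as pd
      import xgboost as xgb

      cells = pd.read_csv("gsm_cells.csv")    # file name is an assumption
      X = cells.drop(columns=["median_age"])  # predictor columns are assumptions
      y = cells["median_age"]                 # groundwater age metric to learn

      metamodel = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
      metamodel.fit(X, y)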

  18. Data and coding used in paper entitled "MIU: Deep Embedded Building Cluster...

    • figshare.com
    bin
    Updated Jun 1, 2023
    Cite
    Anqi Lin (2023). Data and coding used in paper entitled "MIU: Deep Embedded Building Cluster Model of Urban Functional Zoning" [Dataset]. http://doi.org/10.6084/m9.figshare.23275238.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anqi Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and coding used in the paper entitled "MIU: Deep Embedded Building Cluster Model of Urban Functional Zoning". The compressed package contains 6 folders:

    • Building Footprint: building vector data used to extract geometric and compactness features.
    • Google Earth Image: VHR images used to extract spectral and textural features.
    • Luojia 1-01 Nighttime Light Image: nighttime data used to extract brightness features.
    • OSM Street: OSM road networks used to extract location features.
    • POI of Study Area: POI data used to generate labels for training the Word2Vec model.
    • Python Code: DEC code used to process the clustering for generating the MIU; Word2Vec code used to train the Word2Vec model.
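
    A minimal sketch of the Word2Vec step using gensim; the POI category sequences shown are hypothetical stand-ins for the real labels:

      from gensim.models import Word2Vec

      # Hypothetical sequences of POI category tokens, one per building cluster.
      poi_sequences = [
        ["restaurant", "cafe", "bank", "pharmacy"],
        ["school", "library", "park"],
      ]
      model = Word2Vec(sentences=poi_sequences, vector_size=32, window=3, min_count=1)
      print(model.wv["restaurant"][:5])  # embedding used as a functional-zoning feature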

  19. NLUCat

    • data.niaid.nih.gov
    • huggingface.co
    • +2 more
    Updated Mar 4, 2024
    + more versions
    Cite
    Language Technologies Unit (2024). NLUCat [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10362025
    Explore at:
    Dataset updated
    Mar 4, 2024
    Dataset authored and provided by
    Language Technologies Unit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team

    NLUCat_dataset.json: the completed NLUCat dataset

    NLUCat_stats.tsv: statistics about the NLUCat dataset

    dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers

    reports: folder with the reports done as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    example: str. Example

    annotation: dict. Annotation of the example

    intent: str. Intent tag

    slots: list. List of slots

    Tag: str. Tag assigned to the slot

    Text: str. Text of the slot

    Start_char: int. First character of the span

    End_char: int. Last character of the span

    Example

    An example looks as follows:

      {
        "example": "Demana una ambulància; la meva dona està de part.",
        "annotation": {
          "intent": "call_emergency",
          "slots": [
            {
              "Tag": "service",
              "Text": "ambulància",
              "Start_char": 11,
              "End_char": 21
            },
            {
              "Tag": "situation",
              "Text": "la meva dona està de part",
              "Start_char": 23,
              "End_char": 48
            }
          ]
        }
      },
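
    A minimal sketch of reading the dataset and collecting (example, intent) pairs for intent classification; it assumes NLUCat_dataset.json holds a JSON array of records shaped like the example above:

      import json

      with open("NLUCat_dataset.json") as f:
        data = json.load(f)  # assumed to be a list of annotated examples

      pairs = [(item["example"], item["annotation"]["intent"]) for item in data]
      print(pairs[0])  # e.g. ("Demana una ambulància; ...", "call_emergency")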
    

    Data Splits

    NLUCat.train: 9128 examples

    NLUCat.dev: 1441 examples

    NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.

    • First step: translation or elaboration of the instructions given to the annotators to write the examples.
    • Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    • Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population. Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  20. Learning to Grasp Unknown Objects in Domestic Environments with GP-net+

    • data.niaid.nih.gov
    Updated May 30, 2024
    Cite
    McDonald, John (2024). Learning to Grasp Unknown Objects in Domestic Environments with GP-net+ [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10083841
    Explore at:
    Dataset updated
    May 30, 2024
    Dataset provided by
    Konrad, Anna
    Villing, Rudi
    McDonald, John
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record includes data for the paper "Learning to Grasp Unknown Objects in Domestic Environments", currently under review.

    Simulation environment with pre-trained GP-net+ model

    The paper presents a simulation environment for grasping objects in domestic environments. The presented objects and furniture units, as well as a pre-trained GP-net+ model, can be found in the "gpnetplus_simulation_data.zip" file. After this zip file is downloaded, it can be unpacked into the GP-net+ directory. It includes all necessary data to use the simulation environment, for example for testing GP-net+ or other grasping models in simulated domestic environments.

    ROS model

    The paper additionally presents a ROS package that can be deployed for grasping unknown objects in domestic environments with simulated or real robots. We make a ROS-compatible model of GP-net+ available in the "ros_gpnet_plus.zip" file, which can be used with the ROS package.

    Training dataset

    We used the simulation environment in our paper to generate a training dataset and train GP-net+. This training dataset is included in this record and can be used to replicate our results or train modifications of GP-net+.

    To make the training dataset (total size 25 GB+) easier to handle, we split it into several .zip files, named val.zip (validation data) and train_[0-6].zip (training data). Download all files individually and extract them into a single folder, then combine the contents of all train_[0-6].zip archives into a single directory called 'train', for example by using the provided 'move_train_data.sh' script (a Python alternative is sketched after the directory tree below). The final structure of the dataset should look similar to this:

    gpnet_data
    |-- val
    |   |-- depth_image_0000000.npz
    |   |-- depth_image_0000001.npz
    |   ...
    |   |-- segmask_image_0052346.npz
    |-- train
    |   |-- depth_image_0000000.npz
    |   |-- depth_image_0000001.npz
    |   ...
    |   |-- segmask_image_0602506.npz
    |   |-- segmask_image_0602507.npz

    For the generation of the training and simulation data, the following mesh databases have been used:

    B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, "Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set," IEEE Robotics and Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015.

    A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, "BigBIRD: A large-scale 3D database of object instances," 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516, 2014.

    A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An Information-Rich 3D Model Repository," Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.

    D. Morrison, P. Corke, and J. Leitner, "EGAD! An Evolved Grasping Analysis Dataset for Diversity and Reproducibility in Robotic Manipulation," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4368–4375, 2020.

