Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
`/Evaluation/evaluation_setup.sh` helps set up the programming language dependencies used in evaluation:

```bash
bash evaluation_setup.sh
```

## Dataset

The datasets include DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories, and it aligns with real-world repositories in multiple dimensions, so we take DevEval as the example to demonstrate how to process the dataset. Take `../Dataset/DevEval` as an example.

`train.jsonl` and `test.jsonl`:
(1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain.
(2) We randomly split the tasks of the two domains into a training set and a test set, which yields 101 examples in the training set and 49 examples in the test set.
(3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all of its functions.
(4) We treat the functions contained in the repository as the candidate pool. LAIL and the baselines then retrieve a few functions from the candidate pool as demonstration examples.

The `source data` and `test_source data` folders consist of the original code repositories collected from GitHub.
The `estimate_prompt` folder contains the constructed prompts used to estimate candidate examples.
The `generation_prompt` folder contains the constructed prompts whose demonstration examples are selected by LAIL and the different baselines. For example:
(1) The `ICL_LAIL` folder provides the ids of the examples selected by our LAIL in `LAIL_id`. Developers can directly use these provided prompts through `codellama_completion.py` to generate programs.
(2) After generating programs, developers need to process the generated programs with `process_generation.py`.
(3) Finally, developers evaluate the generated programs with the source code in the `Evaluation` folder.

## LAIL

### Estimate candidate examples by the LLMs themselves

We leverage the LLMs themselves to estimate candidate examples. The code is stored in the `LAIL/estimate_examples` package. Take DevEval as an example:
(1) The `/Dataset/DevEval/estimate_prompt` folder contains the constructed prompts used to estimate candidate examples.
(2) Developers run the following command to estimate candidate examples with CodeLlama-7B:

```bash
bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt
```

(3) Based on the probability feedback of the LLMs, we acquire the positive and negative examples.

### Train a neural retriever

(1) We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the `/LAIL/LAIL/retriever/train` folder.

```bash
export CUDA_VISIBLE_DEVICES=0
nohup python run.py \
  --output_dir=/saved_models \
  --model_type=roberta \
  --config_name=microsoft/graphcodebert-base \
  --model_name_or_path=microsoft/graphcodebert-base \
  --tokenizer_name=microsoft/graphcodebert-base \
  --do_train \
  --train_data_file=/id.jsonl \
  --epoch 100 \
  --block_size 128 \
  --train_batch_size 16 \
  --learning_rate 1e-4 \
  --max_grad_norm 1.0 \
  --seed 123456 >mbpp.txt 2>&1 &
```

### Select a few demonstration examples using the trained retriever

(2) Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the `/LAIL/LAIL/retriever/train` folder.

```bash
bash run_inference.sh ../Dataset/DevEval
```

### Code Generation

(1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into the LLMs and acquire the desired programs. For example, developers use CodeLlama (`../LAIL/ICL_LAIL/codellama_completion.py`) to generate programs:

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False
```

(2) After generating programs, developers need to process the generated programs with `../LAIL/ICL_LAIL/process_generation.py`:

```bash
python process_generation.py
```

### Baselines

This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation.
(1) The source code is in the `baselines` folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running the source code as follows:

```bash
python baselines.py
```

(2) Then, developers use `/baselines/make_prompt.py` to construct a prompt context using the selected candidate examples as follows:

```bash
python make_prompt.py ICLCoder ICLCoder -1
```

### Evaluation

In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in `LAIL/Evaluation`. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the following pipeline to evaluate the different approaches with the source code in `/LAIL/Evaluation/`.
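For reference, Pass@k is typically computed with the unbiased estimator of Chen et al. (2021). The sketch below only illustrates that formula; it is not the evaluation code shipped in `LAIL/Evaluation`.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: generations sampled per task, c: generations that pass the tests,
    k: sampling budget. Returns the expected probability that at least one
    of k samples passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 generations for a task, 5 of which pass the unit tests.
print(pass_at_k(n=20, c=5, k=1))
print(pass_at_k(n=20, c=5, k=10))
```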
## Citation

If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.

https://choosealicense.com/licenses/odc-by/
CoSyn-point
CoSyn-point is a collection of diverse computer-generated images that are annotated with queries and answer points. It can be used to train models to return points in the image in response to a user query. The data was created by using the Claude large language model to generate code that can be executed to render an image. The code used to generate this data is open source. Synthetic question-answer data is also available in a separate repo.
See the full description on the dataset page: https://huggingface.co/datasets/allenai/CoSyn-point.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data contains the digital elevation models and polyline shapefiles with the location of channels from the 12 study areas used in this study. It also includes the code to generate the datasets used to train the deep learning models to detect channels, ditches, and streams, and to calculate the topographic indices. The code to train the models is included as well, along with the best-performing models at 0.5 m resolution. The channels were mapped differently based on their type: ditches were manually digitized based on visual analysis of topographic indices derived from the DEM and of orthophotos. Streams were mapped by first detecting all natural channel heads, then tracing the downstream channels, and finally manually editing them based on orthophotos.
CodeParrot 🦜 Dataset
What is it?
This is the full CodeParrot dataset. It contains Python files used to train the code generation model in Chapter 10: Training Transformers from Scratch in the NLP with Transformers book. You can find the full code in the accompanying Github repository.
Creation
It was created with the GitHub dataset available via Google's BigQuery. It contains approximately 22 million Python files and is 180 GB (50 GB compressed) in size. See the full description on the dataset page: https://huggingface.co/datasets/transformersbook/codeparrot.
https://cdla.io/sharing-1-0/
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247
LLM texts: 3004
See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts
Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv
"Car-free cities
"
"Does the electoral college work?
"
"Exploring Venus
"
"The Face on Mars
"
"Facial action coding system
"
"A Cowboy Who Rode the Waves
"
"Driverless cars
"
How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"
train_essays_7_prompts_v2.csv: This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts. Namely:
"Car-free cities"
"Does the electoral college work?"
"Exploring Venus"
"The Face on Mars"
"Facial action coding system"
"Seeking multiple opinions"
"Phones and driving"
This dataset is a derivative of the datasets as well as the original competition training dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
scripts.zip
arcgisTools.atbx:
terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

makeChips.R: R function to break terrain derivatives and chips into image chips of a defined size.
makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
merge_logs.R: R script to merge training logs into a single file.
predictToExtents.ipynb: Python notebook to use a trained model to predict to new data.
trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
graphs_results.R: R code to make graphs with ggplot2 to summarize results.
makeChipsList.R: R code to generate lists of chips in a directory.
makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).
vfillDL.zip
dems: LiDAR DTM data partitioned into training, three testing, and two validation datasets. Original DTM data were obtained from 3DEP (https://www.usgs.gov/3d-elevation-program) and the WV GIS Technical Center (https://wvgis.wvu.edu/).
extents: extents of the training, testing, and validation areas. These extents were defined by the researchers.
vectors: vector features representing valley fills and partitioned into separate training, testing, and validation datasets. Extents were created by the researchers.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code used to prepare data sets, train and evaluate new MS²PIP models, evaluate MS²Rescore for immunopeptidomics, and generate figures. See README.md for more information on how to use these files and reproduce the results reported in the manuscript titled "MS²Rescore: Data-driven rescoring dramatically boosts immunopeptide identification rates".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets collected and used in the research project:
O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.
A pre-print of the article is available on arXiv. The code is open source and published on GitHub. The accompanying web page can be found here.
Overview
The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder and real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, as was done in the research article.
Toy and Real Data
The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first variant, the fluctuating delay produced by the simulated tape transport was disabled, and in the second, the delay was enabled. The latter variant is denoted with the string WOWFLUTTER.
The real data is collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.
Directories
For training the models, a fraction of the inputs from SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:
ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)
ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)
Silence_AKAI_*/ (noise generator, real data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)
For visualizing the model behavior, the following subsets can be used:
LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)
SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)
Directory structure
Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation, and test sets from each other. Thus, a typical directory will look like the following:
[DIRECTORY_NAME]
├── Train
│ ├── input_x_.wav
│ ...
│ ├── target_x_.wav
│ ...
└── Val
│ ├── input_y_.wav
│ ...
│ ├── target_y_.wav
│ ...
├── Test
│ ├── input_z_.wav
│ ...
│ ├── target_z_.wav
│ ...
While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.
The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav, make up a pair, such that the target audio is the input audio processed with one of the used effects. In some cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
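As a hedged illustration of this naming convention (a sketch only; the dataloader used in the article is not reproduced here), the input/target pairs and optional trajectories of one split could be collected like this:

```python
# Sketch: pair input/target WAVs (and optional delay trajectories) in one split.
# The directory name below is an example taken from the subset list above.
from pathlib import Path
import re

def collect_pairs(split_dir):
    split = Path(split_dir)
    pairs = []
    for inp in sorted(split.glob("input_*_.wav")):
        idx = re.search(r"input_(\d+)_", inp.name).group(1)
        target = split / f"target_{idx}_.wav"
        trajectory = split / f"trajectory_{idx}_.npy"  # only present in some subsets
        if target.exists():
            pairs.append((inp, target, trajectory if trajectory.exists() else None))
    return pairs

pairs = collect_pairs("ReelToReel_Dataset_MiniPulse100_AKAI_1/Train")
print(f"found {len(pairs)} input/target pairs")
```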
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
“I remember that, though of humble origin, the sea was always the living pantry. The memories of my uncles spear-fishing in the waters off Inch Marlowe are fond memories. Unfortunately, they are just that, memories!
My children love visiting Barbados. However, their ancestral waters do not have the abundance of life I recalled. They cannot live the childhood that I had and that saddens me.”
S. Antonio Hollingsworth, Founder BDCI Barbados
This dataset was created to give Caribbean developers in the field of artificial intelligence and machine learning a head start in training the next generation of A.I. and machine learning applications. We believe that to meet the challenges of reef collapse due to human activity, artificial intelligence will give small island developing states the edge needed to remain competitive and survive in a rapidly changing world.
This dataset contains image data of target fish species. It is categorical in nature and is intended for use in computer vision.
This dataset contains images of fish in different natural positions, lighting and water conditions.
The fish are presented in their natural environment.
Some images may contain more than one member of the target species, or another species that, while not dominant, may influence the training process.
Data collection period: August to November 2020.
Data collection location: Barbados.
General data coordinate: 13.1939° N, 59.5432° W.
Data collection depth range: 0 m to 5 m.
Data collection climate: tropical, marine, sea.
Average water temperature: 29 °C.
Data collector: S. Antonio Hollingsworth.
Camera used: BW Space Pro 4K Zoom.
Platform: underwater robot.
We wouldn't be here without the help of others.
Thanks to:
The UNDP Accelerator Labs for Barbados & the Eastern Caribbean for funding The Blue-Bot Project.
Stacy R. Phillips for project proposal presentations.
S. Antonio Hollingsworth for piloting the remote underwater robot and curating the images of this dataset.
Youcan Robotics for their technical and customer support.
Those dear to us who inspire us to dream of a better tomorrow.
tensorflow.org: MobileNet V2 pre-trained model used in the transfer learning process of BlueNet.
python.org
How can we improve the data collection process in the blue economy?
What is the best way to use A.I. in the blue economy?
Can we use computer vision and artificial intelligence to find and learn the complex patterns that exist on coral reefs?
How do we use this insight to create effective and long term conservation and resilience policies for small island developing states that depend on coral reefs for economic survival?
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There are four zip files in this data set:
PythonCode_OpenFAST: The code used to generate 32768 OpenFAST fst files to build the database.
ML_TrainingCode: The code used to train the TCN-FCNN and FCNN models for both the free stream and the wake.
Trained_Models: All the trained models are saved in Keras format. The models with max in their filenames were trained on maximum values. The models with XY in their naming were trained on wind in the X and Y directions.
data: It includes all the CSV files for training and testing.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Contains resources needed to train, test, and analyze performance of gradient boosting models used to predict venous thromboembolism (VTE) from electronic health record (EHR) data.
"Code for analyses" folder: Contains code we used for the analyses in our paper. Prediction.ipynb: Contains code needed to run trained models. Small, Medium, and Large.xlsx: Excel templates to correctly format data for prediction generation. Models.zip: Contains trained models. Note that this is 0.4 GB once unzipped. Analysis.ipynb: Contains code used to train the models.
Dependencies: Python 3.10.9; Pandas 1.5.1; LightGBM 3.3.2.
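A minimal, hedged sketch of scoring template-formatted data with a saved LightGBM model follows; the file names, model format, and column handling here are placeholders rather than the repository's actual interface (see Prediction.ipynb for the real workflow).

```python
# Hedged sketch only: score template-formatted EHR rows with a LightGBM booster.
# Paths and file names below are placeholders, not the repository's actual names.
import lightgbm as lgb
import pandas as pd

booster = lgb.Booster(model_file="vte_model.txt")       # hypothetical extracted model file
rows = pd.read_excel("Medium.xlsx")                     # data formatted with the provided template
scores = booster.predict(rows[booster.feature_name()])  # predicted VTE risk per row
print(scores[:5])
```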
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
scripts.zip
arcgisTools.atbx:
terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

makeChips.R: R function to break terrain derivatives and chips into image chips of a defined size.
makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
merge_logs.R: R script to merge training logs into a single file.
predictToExtents.ipynb: Python notebook to use a trained model to predict to new data.
trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
graphs_results.R: R code to make graphs with ggplot2 to summarize results.
makeChipsList.R: R code to generate lists of chips in a directory.
makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).
terraceDL.zip
dems: LiDAR DTM data partitioned into training, testing, and validation datasets based on HUC8 watershed boundaries. Original DTM data were provided by the Iowa BMP mapping project: https://www.gis.iastate.edu/BMPs.
extents: extents of the training, testing, and validation areas as defined by HUC8 watershed boundaries.
vectors: vector features representing agricultural terraces and partitioned into separate training, testing, and validation datasets. Original digitized features were provided by the Iowa BMP Mapping Project: https://www.gis.iastate.edu/BMPs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All files within this directory are used to generate the models, data, and plots used in the submissions. The software is available on Code Ocean: https://codeocean.com/capsule/8570224/tree
For the generation of the dataframes used for training the various data-driven models we use (input data not present in this folder are available from ERA5-Land):
This generates the following pickle files:
Once those pickle files are generated, the XGBoost models are then trained using:
This generates the model files, which are then used with the input data to generate ERA5 and forecast data:
The output modelled data are then available as:
Computation of the radar plot scores is done using:
Output of the statistical analysis is available as:
Region-specific observations and forecasts are provided in:
The plots are made using the following scripts:
For access to data not present in this capsule, please use the FTP site:
ftp server: ftp.ecmwf.int username: ecmwf_fire password: FhXekWMuy
Disclaimer: This is artificially generated data, produced by a Python script based on the arbitrary assumptions listed below.
The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.
----- Version 1 -------
trainingDataV1.csv, testDataV1.csv or trainingData.csv, testData.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. hour: The hour of the day (integer, 0-23)
6. weekend: A boolean indicating whether it is the weekend (True or False)
The data also includes a label for each user indicating whether they are likely to buy a smart watch or not (string, "yes" or "no"). The label is determined based on the following arbitrary conditions:
- If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
- If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (i.e., assuming sales are 30% more likely to occur on weekends).
- If the user is male and under 30 with an income over 75,000, the label is "yes".
- If the user is female and 30 or over with an income over 100,000, the label is "yes".
- Otherwise, the label is "no".
The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.
The following Python script was used to generate this dataset:
```python
import random
import csv

# Set the number of examples to generate
numExamples = 100000

# Generate the training data
with open("trainingData.csv", "w", newline="") as csvfile:
    fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])

        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
            buySmartWatch = "no"
        # assuming sales are 30% more likely to occur on weekends
        elif weekend == True and random.random() < 1.3:
            buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
            buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
            buySmartWatch = "yes"
        else:
            buySmartWatch = "no"

        writer.writerow({
            "age": age,
            "income": income,
            "gender": gender,
            "maritalStatus": maritalStatus,
            "hour": hour,
            "weekend": weekend,
            "buySmartWatch": buySmartWatch
        })
```
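As an illustrative sketch only (not part of the original generation script), the resulting CSV could be used to build a simple classification model, for example with pandas and scikit-learn:

```python
# Illustrative sketch: train a simple classifier on the generated CSV.
# Assumes trainingData.csv was produced by the script above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("trainingData.csv")
X = pd.get_dummies(df.drop(columns=["buySmartWatch"]), columns=["gender", "maritalStatus"])
y = df["buySmartWatch"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```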
----- Version 2 -------
trainingDataV2.csv, testDataV2.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
7. familySize: The number of people in the user's family (integer, 1-5)
8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
10. hour: The hour of the day when the user was surveyed (integer, 0-23)
11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)
Python script used to generate the data:
import random
import csv
# Set the number of examples to generate
numExamples = 100000
with open("t...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
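As an illustration of how such outputs might be scored (a minimal sketch, not the benchmark's official evaluation scripts in the evaluation folder), exact-match precision, recall, and F1 over triples can be computed as follows:

```python
# Minimal sketch: exact-match precision/recall/F1 over (sub, rel, obj) triples.
# Not the official Text2KGBench evaluation code.
def triple_f1(predicted, expected):
    pred = {(t["sub"], t["rel"], t["obj"]) for t in predicted}
    gold = {(t["sub"], t["rel"], t["obj"]) for t in expected}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = [{"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"}]
pred = [{"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}]
print(triple_f1(pred, gold))  # (0.5, 1.0, 0.666...)
```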
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as the following.
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under the CC BY-SA 2.0 license and the WebNLG 3.0 corpus [2] released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used to train and validate the RNAmigos model from "Augmented base pairing networks encode RNA-small molecule binding preferences".
This will give you a cleaned up version of the data used to train the RNAmigos 1.0 models.
If you run python make_nice.py, you will generate a CSV file rnamigos1_dataset.csv which contains all the info you need.
The script will also use DecoyFinder to generate the decoys for each pocket.
The CSV has one row for each binding pocket.
The columns are:
pdbid: the PDBID this pocket belongs to
model_num: the model number inside the PDB we took
chain: the chain the pocket belongs to
ligand_id: the 3-letter code of the ligand (e.g. ATP) which you can look up on RCSB.org
ligand_resnum: the residue number of the ligand in the PDB
nodelist: a list of nodes separated by ';' in the pocket as a string in the format ..-;...
edgelist: a list of edges separated by ';' in the pocket as a string in the format nodes are in the same format as above, and connected by a '-' char, with an additional label field. e.g. of a two edge list 1aju.A.1-1aju.A.5-CWW;1aju.A.1-1aju.A.2-B53
fp_native_maccs: bit string of the MACCS for the native ligand
split_{k}_train: one column for each of the splits we ran (k ∈ {0-9}); contains True if this pocket was in the train set for this split
split_{k}_test: one column for each of the splits we ran (k ∈ {0-9}); contains True if this pocket was in the test set for this split
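As a hedged sketch of how this table might be used (column names as described above; exact value types in the CSV are an assumption), the pockets of one split can be selected with pandas:

```python
# Hedged sketch: load the pocket table and select the pockets of split 0.
# Column names follow the description above; boolean encoding is assumed.
import pandas as pd

df = pd.read_csv("rnamigos1_dataset.csv")
train_0 = df[df["split_0_train"] == True]
test_0 = df[df["split_0_test"] == True]
print(len(train_0), "train pockets,", len(test_0), "test pockets")

# Each pocket's graph can be rebuilt from its node and edge strings.
example = df.iloc[0]
nodes = example["nodelist"].split(";")
edges = [e.split("-") for e in example["edgelist"].split(";")]  # [node1, node2, label]
```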
The folder decoy_db/
has the following structure:
decoy_db
|-- _{ligand_id}_{ligand_position}
|   |-- decoyfinder
|   |   |-- actives.txt
|   |   |-- decoys.txt
|   |-- pdb
|   |   |-- actives.txt
|   |   |-- decoys.txt
Each `actives.txt` and `decoys.txt` is a file with one SMILES per line.
`decoyfinder/` has decoys computed by DecoyFinder, and the actives are just the native ligands.
`pdb/` has decoys taken from other pockets in the PDB and actives are just the native ligands.
Residence time distribution (RTD) is a critically important characteristic of groundwater flow systems; however, it cannot be measured directly. RTD can be inferred from tracer data with analytical models (few parameters) or with numerical models (many parameters). The second approach permits more variation in system properties but is used less frequently than the first because large-scale numerical models can be resource intensive.

With the data and computer codes in this data release users can (1) reconstruct and run 115 General Simulation Models (GSMs) of groundwater flow, (2) calculate groundwater age metrics at selected GSM cells, (3) train a boosted regression tree model using the provided data, (4) predict three-dimensional continuous groundwater age metrics across the Glacial Principal Aquifer, and (5) predict tritium concentrations at wells for comparison with measured tritium concentrations.

The computer codes in this data release are in the form of Python scripts and Jupyter Notebooks. Users will need to have these Python resources installed on their computers to run the codes. Instructions for creating the Python environment can be found in the file Creating the Python environment.txt. Users who would rather not run the scripts but who wish to obtain the final data sets can do so by downloading the file Output--Predictions.7z. Users who wish to reproduce the data sets in this release can do so by downloading, unzipping, and running the data workflow in Starn_GW_Residence_Time_Data_and_Scripts.7z. The codes in this file use relative pathnames, so the directory structure within this file should not be changed. The ".7z" file extension indicates 7-Zip files, http://www.7-zip.org

Executables--MODFLOW and MODPATH executable files provided for convenience. These are Windows 64-bit versions.
Step 1--Create General Simulation Models--Codes to create 115 GSMs
Step 2--Data preparation--Calculate residence time distributions at selected GSM cells
Step 3--Metamodel training--Train a boosted regression tree metamodel (XGBoost)
Step 4--Metamodel prediction--Predict age metrics throughout the Glacial Aquifer
Step 5--Tritium simulation--Calculate tritium concentration at selected wells
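As a rough illustration of Step 3 (a hedged sketch only; the actual notebooks, feature names, and hyperparameters are those in the data release), a boosted regression tree metamodel could be trained with XGBoost along these lines:

```python
# Hedged sketch of training a boosted regression tree metamodel with XGBoost.
# File names, columns, and hyperparameters are placeholders, not the values
# used in the data release scripts.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

data = pd.read_csv("gsm_cell_features.csv")   # hypothetical feature table per GSM cell
X = data.drop(columns=["mean_age"])           # hypothetical age metric target
y = data["mean_age"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)
print("R^2 on held-out cells:", model.score(X_test, y_test))
```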
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code used in the paper entitled "MIU: Deep Embedded Building Cluster Model of Urban Functional Zoning". The compressed package contains 6 folders:
Building Footprint: building vector data used to extract geometric and compactness features.
Google Earth Image: VHR images used to extract spectral and textural features.
Luojia 1-01 Nighttime Light Image: nighttime data used to extract brightness features.
OSM Street: OSM road networks used to extract location features.
POI of Study Area: POI data used to generate labels for training the Word2Vec model.
Python Code: DEC code used to run the clustering that generates the MIU; Word2Vec code used to train the Word2Vec model.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat
Dataset Description
Dataset Summary
NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.
The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.
The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)
This dataset can be used to train models for intent classification, spans identification and examples generation.
This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.
In this repository you'll find the following items:
NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
NLUCat_dataset.json: the completed NLUCat dataset
NLUCat_stats.tsv: statistics about the NLUCat dataset
dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
reports: folder with the reports done as feedback to the annotators during the annotation process
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Supported Tasks and Leaderboards
Intent classification, spans identification and examples generation.
Languages
The dataset is in Catalan (ca-ES).
Dataset Structure
Data Instances
Three JSON files, one for each split.
Data Fields
example: str. Example.
annotation: dict. Annotation of the example.
intent: str. Intent tag.
slots: list. List of slots.
Tag: str. Tag of the slot.
Text: str. Text of the slot.
Start_char: int. First character of the span.
End_char: int. Last character of the span.
Example
An example looks as follows:
{ "example": "Demana una ambulància; la meva dona està de part.", "annotation": { "intent": "call_emergency", "slots": [ { "Tag": "service", "Text": "ambulància", "Start_char": 11, "End_char": 21 }, { "Tag": "situation", "Text": "la meva dona està de part", "Start_char": 23, "End_char": 48 } ] } },
Data Splits
NLUCat.train: 9128 examples
NLUCat.dev: 1441 examples
NLUCat.test: 1441 examples
Dataset Creation
Curation Rationale
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Source Data
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for the creation of this dataset.
Who are the source language producers?
We commissioned the writing of the examples to the company m47 labs.
Annotations
Annotation process
The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
Personal and Sensitive Information
No personal or sensitive information included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
Considerations for Using the Data
Social Impact of Dataset
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
Discussion of Biases
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population. Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record includes data for the paper "Learning to Grasp Unknown Objects in Domestic Environments", currently under review.

Simulation environment with pre-trained GP-net+ model

The paper presents a simulation environment for grasping objects in domestic environments. The presented objects and furniture units, as well as a pre-trained GP-net+ model, can be found in the "gpnetplus_simulation_data.zip" file. After this zip file is downloaded, it can be unpacked into the GP-net+ directory. It includes all necessary data to use the simulation environment, for example, for testing GP-net+ or other grasping models in simulated domestic environments.
ROS model
The paper additionally presents a ROS package that can be deployed for grasping unknown objects in domestic environments with simulated or real robots. We make a ROS-compatible model of GP-net+ available in the "ros_gpnet_plus.zip" file, which can be used with the ROS package.
Training dataset
We used the simulation environment in our paper to generate a training dataset and train GP-net+. This training dataset is included in this record and can be used to replicate our results or train modifications of GP-net+.
To improve handling of the training dataset (total size 25 GB+), we split the dataset into several .zip files, named val.zip (validation data) and train_[0-6].zip (training data). Download all files individually and extract them into a single folder. Combine the contents of all train_[0-6].zip files into a single directory called 'train', for example by using the provided 'move_train_data.sh' script. The final structure for the dataset should look similar to this:
gpnet_data
|-- val
|   |-- depth_image_0000000.npz
|   |-- depth_image_0000001.npz
|   ...
|   |-- segmask_image_0052346.npz
|-- train
|   |-- depth_image_0000000.npz
|   |-- depth_image_0000001.npz
|   ...
|   |-- segmask_image_0602506.npz
|   |-- segmask_image_0602507.npz
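A minimal, hedged sketch of inspecting one of these files is shown below; the array keys stored inside the .npz archives are not documented in this record, so the code only lists whatever it finds.

```python
# Hedged sketch: open one training sample and list the arrays it contains.
# The key names inside the .npz files are not specified in this record.
import numpy as np

sample = np.load("gpnet_data/train/depth_image_0000000.npz")
print(sample.files)  # names of the stored arrays
for name in sample.files:
    print(name, sample[name].shape, sample[name].dtype)
```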
For generation of the training and simulation data, the following mesh databases have been used:
B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, "Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set," IEEE Robotics and Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015.
A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, "BigBIRD: A large-scale 3D database of object instances," 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516, 2014.
A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An Information-Rich 3D Model Repository," Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
D. Morrison, P. Corke, and J. Leitner, "EGAD! An Evolved Grasping Analysis Dataset for Diversity and Reproducibility in Robotic Manipulation," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4368–4375, 2020