Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
`/Evaluation/evaluation_setup.sh` helps set up the programming language dependencies used in evaluation:

```bash
bash evaluation_setup.sh
```

## Dataset

The datasets include DevEval, MBJP, MBPP, MBCPP, and HumanEval. DevEval is a repository-level code generation dataset collected from real-world code repositories, and it aligns with real-world repositories in multiple dimensions, so we take DevEval as the example to demonstrate how to process the dataset. Take `../Dataset/DevEval` as an example.

`train.jsonl` and `test.jsonl`:
(1) We randomly select two domains to evaluate LAIL and the baselines: the scientific engineering domain and the text processing domain.
(2) We randomly split the tasks of the two domains into a training set and a test set, which yields 101 examples in the training set and 49 examples in the test set.
(3) Given a requirement from a repository, we use tree-sitter to parse the repository and acquire all of its functions.
(4) We treat the functions contained in the repository as the candidate pool. LAIL and the baselines then retrieve a few functions from the candidate pool as demonstration examples.

The `source data` and `test_source data` folders consist of the original code repositories collected from GitHub.
The `estimate_prompt` folder contains the constructed prompts used to estimate candidate examples.
The `generation_prompt` folder contains the constructed prompts whose demonstration examples are selected by LAIL and the different baselines. For example:
(1) The `ICL_LAIL` folder provides the ids of the examples selected by our LAIL in `LAIL_id`. Developers can directly use these provided prompts through `codellama_completion.py` to generate programs.
(2) After generating programs, developers need to process the generated programs with `process_generation.py`.
(3) Finally, developers evaluate the generated programs with the source code in the `Evaluation` folder.

## LAIL

### Estimate candidate examples by the LLMs themselves

We leverage the LLMs themselves to estimate candidate examples. The code is stored in the `LAIL/estimate_examples` package. Take DevEval as an example:
(1) The `/Dataset/DevEval/estimate_prompt` folder contains the constructed prompts used to estimate candidate examples.
(2) Developers run the following command to estimate candidate examples with CodeLlama-7B:

```bash
bash make_estimation_prompt.sh ../Dataset/DevEval/estimation_prompt
```

(3) Based on the probability feedback of the LLMs, we acquire the positive and negative examples.

### Train a neural retriever

(1) We use the labeled positive and negative examples to train a neural retriever with contrastive learning. The code is stored in the `/LAIL/LAIL/retriever/train` folder.

```bash
export CUDA_VISIBLE_DEVICES=0
nohup python run.py \
  --output_dir=/saved_models \
  --model_type=roberta \
  --config_name=microsoft/graphcodebert-base \
  --model_name_or_path=microsoft/graphcodebert-base \
  --tokenizer_name=microsoft/graphcodebert-base \
  --do_train \
  --train_data_file=/id.jsonl \
  --epoch 100 \
  --block_size 128 \
  --train_batch_size 16 \
  --learning_rate 1e-4 \
  --max_grad_norm 1.0 \
  --seed 123456 >mbpp.txt 2>&1 &
```

### Select a few demonstration examples using the trained retriever

(2) Given a test requirement, developers use the trained retriever to select a few demonstration examples. The code is stored in the `/LAIL/LAIL/retriever/train` folder.

```bash
bash run_inference.sh ../Dataset/DevEval
```

### Code Generation

(1) After acquiring the prompt context consisting of a few selected examples, developers input a test requirement and the prompt context into the LLMs and acquire the desired programs. For example, developers use CodeLlama (`../LAIL/ICL_LAIL/codellama_completion.py`) to generate programs:

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 --master_port=16665 codellama_completion.py Salesforce/CodeLlama-7b ../Dataset/DevEval/prompt_LAIL.jsonl --temperature=0.8 --max_batch_size=4 --output_base=output_random --get_logits=False
```

(2) After generating programs, developers need to process the generated programs with `../LAIL/ICL_LAIL/process_generation.py`:

```bash
python process_generation.py
```

### Baselines

This paper contains seven baselines that use different approaches to select demonstration examples for ICL-based code generation.
(1) The source code is in the `baselines` folder, and each baseline is in an individual folder. Developers can acquire the selected examples of all baselines by running the source code as follows:

```bash
python baselines.py
```

(2) Then, developers use `/baselines/make_prompt.py` to construct a prompt context using the selected candidate examples as follows:

```bash
python make_prompt.py ICLCoder ICLCoder -1
```

### Evaluation

In this paper, we use Pass@k to evaluate the performance of LAIL and the baselines with the source code in `LAIL/Evaluation`. Since DevEval is a repository-level code generation dataset that is complex to evaluate, developers can use the following pipeline to evaluate the different approaches with the source code in `/LAIL/Evaluation/`.
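For reference, Pass@k is typically computed with the unbiased estimator of Chen et al. (2021). The sketch below only illustrates that formula; it is not the evaluation code shipped in `LAIL/Evaluation`.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: generations sampled per task, c: generations that pass the tests,
    k: sampling budget. Returns the expected probability that at least one
    of k samples passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 generations for a task, 5 of which pass the unit tests.
print(pass_at_k(n=20, c=5, k=1))
print(pass_at_k(n=20, c=5, k=10))
```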
## Citation

If you have any questions or suggestions, please email us at lijiaa@pku.edu.cn.

https://choosealicense.com/licenses/odc-by/
CoSyn-point
CoSyn-point is a collection of diverse computer-generated images that are annotated with queries and answer points. It can be used to train models to return points in the image in response to a user query. The data was created by using the Claude large language model to generate code that can be executed to render an image. The code used to generate this data is open source. Synthetic question-answer data is also available in a separate repo.
See the full description on the dataset page: https://huggingface.co/datasets/allenai/CoSyn-point.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data contains the digital elevation models and polyline shapefiles with the location of channels from the 12 study areas used in this study. It also includes the code to generate the datasets used to train the deep learning models to detect channels, ditches, and streams, and to calculate the topographic indices. The code to train the models is included as well, along with the best-performing models at 0.5 m resolution. The channels were mapped differently based on their type: ditches were manually digitized based on visual analysis of topographic indices derived from the DEM and of orthophotos. Streams were mapped by first detecting all natural channel heads, then tracing the downstream channels, and finally manually editing them based on orthophotos.
CodeParrot 🦜 Dataset
What is it?
This is the full CodeParrot dataset. It contains Python files used to train the code generation model in Chapter 10: Training Transformers from Scratch in the NLP with Transformers book. You can find the full code in the accompanying Github repository.
Creation
It was created with the GitHub dataset available via Google's BigQuery. It contains approximately 22 million Python files and is 180 GB (50 GB compressed) in size. See the full description on the dataset page: https://huggingface.co/datasets/transformersbook/codeparrot.
https://cdla.io/sharing-1-0/
Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
File: train_essays_RDizzl3_seven_v2.csv
Human texts: 14247
LLM texts: 3004
See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts
Version 3: "**The RDizzl3 Seven**"
File: train_essays_RDizzl3_seven_v1.csv
"Car-free cities
"
"Does the electoral college work?
"
"Exploring Venus
"
"The Face on Mars
"
"Facial action coding system
"
"A Cowboy Who Rode the Waves
"
"Driverless cars
"
How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"
train_essays_7_prompts_v2.csv: This dataset is composed of 13,712 human texts and 1,638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts. Namely:
"Car-free cities"
"Does the electoral college work?"
"Exploring Venus"
"The Face on Mars"
"Facial action coding system"
"Seeking multiple opinions"
"Phones and driving"
This dataset is a derivative of the datasets as well as the original competition training dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
scripts.zip
arcgisTools.atbx:
terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

makeChips.R: R function to break terrain derivatives and chips into image chips of a defined size.
makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
merge_logs.R: R script to merge training logs into a single file.
predictToExtents.ipynb: Python notebook to use a trained model to predict to new data.
trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
graphs_results.R: R code to make graphs with ggplot2 to summarize results.
makeChipsList.R: R code to generate lists of chips in a directory.
makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).
vfillDL.zip
dems: LiDAR DTM data partitioned into training, three testing, and two validation datasets. Original DTM data were obtained from 3DEP (https://www.usgs.gov/3d-elevation-program) and the WV GIS Technical Center (https://wvgis.wvu.edu/).
extents: extents of the training, testing, and validation areas. These extents were defined by the researchers.
vectors: vector features representing valley fills and partitioned into separate training, testing, and validation datasets. Extents were created by the researchers.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code used to prepare data sets, train and evaluate new MS²PIP models, evaluate MS²Rescore for immunopeptidomics, and generate figures. See README.md for more information on how to use these files and reproduce the results reported in the manuscript titled "MS²Rescore: Data-driven rescoring dramatically boosts immunopeptide identification rates".
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets collected and used in the research project:
O. Mikkonen, A. Wright, E. Moliner and V. Välimäki, “Neural Modeling Of Magnetic Tape Recorders,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 4-7 September 2023.
A pre-print of the article is available on arXiv. The code is open source and published on GitHub. The accompanying web page can be found here.
Overview
The data is divided into various subsets, stored in separate directories. The data contains both toy data generated using a software emulation of a reel-to-reel tape recorder and real data collected from a physical device. The various subsets can be used for training, validating, and testing neural network behavior, as was done in the research article.
Toy and Real Data
The toy data was generated using CHOWTape, a physically modeled reel-to-reel tape recorder. The subsets generated with the software emulation are denoted with the string CHOWTAPE. Two variants of the toy data were produced: in the first variant, the fluctuating delay produced by the simulated tape transport was disabled, and in the second, the delay was enabled. The latter variant is denoted with the string WOWFLUTTER.
The real data is collected using an Akai 4000D reel-to-reel tape recorder. The corresponding subsets are denoted with the string AKAI. Two tape speeds were used during the recording: 3 3/4 IPS (inches per second) and 7 1/2 IPS, with the corresponding subsets denoted with '3.75IPS' and '7.5IPS' respectively. On top of this, two different brands of magnetic tape were used for capturing the datasets with different tape speeds: Maxell and Scotch, with the corresponding subsets denoted with 'MAXELL' and 'SCOTCH' respectively.
Directories
For training the models, a fraction of the inputs from SignalTrain LA2A Dataset was used. The training, validation, and testing can be replicated using the subsets:
ReelToReel_Dataset_MiniPulse100_AKAI_*/ (hysteretic nonlinearity, real data)
ReelToReel_Dataset_Mini192kHzPulse100_AKAI_*/ (delay generator, real data)
Silence_AKAI_*/ (noise generator, real data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE*/ (hysteretic nonlinearity, toy data)
ReelToReel_Dataset_MiniPulse100_CHOWTAPE_F[0.6]_SL[60]_TRAJECTORIES/ (delay generator, toy data)
For visualizing the model behavior, the following subsets can be used:
LogSweepsContinuousPulse100_*/ (nonlinear magnitude responses)
SinesFadedShortContinuousPulse100*/ (magnetic hysteresis curves)
Directory structure
Each directory/subset is made up of further subdirectories that are most often used to separate the training, validation, and test sets from each other. Thus, a typical directory will look like the following:
[DIRECTORY_NAME]
├── Train
│ ├── input_x_.wav
│ ...
│ ├── target_x_.wav
│ ...
└── Val
│ ├── input_y_.wav
│ ...
│ ├── target_y_.wav
│ ...
├── Test
│ ├── input_z_.wav
│ ...
│ ├── target_z_.wav
│ ...
While not all of the audio is used for training purposes, all of the subsets share part of this structure to make the corresponding datasets compatible with the dataloader that was used.
The input and target files denoted with the same number x, e.g. input_100_.wav and target_100_.wav, make up a pair, such that the target audio is the input audio processed with one of the used effects. In some cases, a third file named trajectory_x_.npy can be found, which consists of the corresponding pre-extracted delay trajectory in the NumPy binary file format.
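As a hedged illustration of this naming convention (a sketch only; the dataloader used in the article is not reproduced here), the input/target pairs and optional trajectories of one split could be collected like this:

```python
# Sketch: pair input/target WAVs (and optional delay trajectories) in one split.
# The directory name below is an example taken from the subset list above.
from pathlib import Path
import re

def collect_pairs(split_dir):
    split = Path(split_dir)
    pairs = []
    for inp in sorted(split.glob("input_*_.wav")):
        idx = re.search(r"input_(\d+)_", inp.name).group(1)
        target = split / f"target_{idx}_.wav"
        trajectory = split / f"trajectory_{idx}_.npy"  # only present in some subsets
        if target.exists():
            pairs.append((inp, target, trajectory if trajectory.exists() else None))
    return pairs

pairs = collect_pairs("ReelToReel_Dataset_MiniPulse100_AKAI_1/Train")
print(f"found {len(pairs)} input/target pairs")
```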
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
“I remember that, though of humble origin, the sea was always the living pantry. The memories of my uncles spear-fishing in the waters off Inch Marlowe are fond memories. Unfortunately, they are just that, memories!
My children love visiting Barbados. However, their ancestral waters do not have the abundance of life I recalled. They cannot live the childhood that I had and that saddens me.”
S. Antonio Hollingsworth, Founder BDCI Barbados
This dataset was created to give Caribbean developers in the field of artificial intelligence and machine learning a head start in training the next generation of A.I. and machine learning applications. We believe that to meet the challenges of reef collapse due to human activity, artificial intelligence will give small island developing states the edge needed to remain competitive and survive in a rapidly changing world.
This dataset contains image data of target fish species. It is categorical in nature and is intended for use in computer vision.
This dataset contains images of fish in different natural positions, lighting and water conditions.
The fish are presented in their natural environment.
Some images may contain more than one member of the target species, or another species that, while not dominant, may influence the training process.
Data collection period: August to November 2020.
Data collection location: Barbados.
General data coordinate: 13.1939° N, 59.5432° W.
Data collection depth range: 0 m to 5 m.
Data collection climate: tropical, marine, sea.
Average water temperature: 29 °C.
Data collector: S. Antonio Hollingsworth.
Camera used: BW Space Pro 4K Zoom.
Platform: underwater robot.
We wouldn't be here without the help of others.
Thanks to:
The UNDP Accelerator Labs for Barbados & the Eastern Caribbean for funding The Blue-Bot Project.
Stacy R. Phillips for project proposal presentations.
S. Antonio Hollingsworth for piloting the remote underwater robot and curating the images of this dataset.
Youcan Robotics for their technical and customer support.
Those dear to us who inspire us to dream of a better tomorrow.
tensorflow.org: MobileNet V2 pre-trained model used in the transfer learning process of BlueNet.
python.org
How can we improve the data collection process in the blue economy?
What is the best way to use A.I. in the blue economy?
Can we use computer vision and artificial intelligence to find and learn the complex patterns that exist on coral reefs?
How do we use this insight to create effective and long term conservation and resilience policies for small island developing states that depend on coral reefs for economic survival?
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
There are four zip files in this data set:
PythonCode_OpenFAST: The code used to generate 32768 OpenFAST fst files to build the database.
ML_TrainingCode: The code used to train the TCN-FCNN and FCNN models for both the free stream and the wake.
Trained_Models: All the trained models are saved in Keras format. The models with max in their filenames were trained on maximum values. The models with XY in their naming were trained on wind in the X and Y directions.
data: It includes all the CSV files for training and testing.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Contains resources needed to train, test, and analyze performance of gradient boosting models used to predict venous thromboembolism (VTE) from electronic health record (EHR) data.
"Code for analyses" folder: Contains code we used for the analyses in our paper. Prediction.ipynb: Contains code needed to run trained models. Small, Medium, and Large.xlsx: Excel templates to correctly format data for prediction generation. Models.zip: Contains trained models. Note that this is 0.4 GB once unzipped. Analysis.ipynb: Contains code used to train the models.
Dependencies: Python 3.10.9; Pandas 1.5.1; LightGBM 3.3.2.
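A minimal, hedged sketch of scoring template-formatted data with a saved LightGBM model follows; the file names, model format, and column handling here are placeholders rather than the repository's actual interface (see Prediction.ipynb for the real workflow).

```python
# Hedged sketch only: score template-formatted EHR rows with a LightGBM booster.
# Paths and file names below are placeholders, not the repository's actual names.
import lightgbm as lgb
import pandas as pd

booster = lgb.Booster(model_file="vte_model.txt")       # hypothetical extracted model file
rows = pd.read_excel("Medium.xlsx")                     # data formatted with the provided template
scores = booster.predict(rows[booster.feature_name()])  # predicted VTE risk per row
print(scores[:5])
```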
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
scripts.zip
arcgisTools.atbx:
terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

makeChips.R: R function to break terrain derivatives and chips into image chips of a defined size.
makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
merge_logs.R: R script to merge training logs into a single file.
predictToExtents.ipynb: Python notebook to use a trained model to predict to new data.
trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
graphs_results.R: R code to make graphs with ggplot2 to summarize results.
makeChipsList.R: R code to generate lists of chips in a directory.
makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).
terraceDL.zip
dems: LiDAR DTM data partitioned into training, testing, and validation datasets based on HUC8 watershed boundaries. Original DTM data were provided by the Iowa BMP mapping project: https://www.gis.iastate.edu/BMPs.
extents: extents of the training, testing, and validation areas as defined by HUC8 watershed boundaries.
vectors: vector features representing agricultural terraces and partitioned into separate training, testing, and validation datasets. Original digitized features were provided by the Iowa BMP Mapping Project: https://www.gis.iastate.edu/BMPs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All files within this directory are used to generate the models, data, and plots used in the submissions. The software is available on Code Ocean: https://codeocean.com/capsule/8570224/tree
For the generation of the dataframes used for training the various data-driven models we use (input data not present in this folder are available from ERA5-Land):
This generates the following pickle files:
Once those pickle files are generated, the XGBoost models are then trained using:
This generates the model files, which are then used with the input data to generate ERA5 and forecast data:
The output modelled data are then available as:
Computation of the radar plot scores is done using:
Output of the statistical analysis is available as:
Region-specific observations and forecasts are provided in:
The plots are made using the following scripts:
For access to data not present in this capsule, please use the FTP site:
ftp server: ftp.ecmwf.int username: ecmwf_fire password: FhXekWMuy
Disclaimer: This is artificially generated data, produced by a Python script based on the arbitrary assumptions listed below.
The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.
----- Version 1 -------
trainingDataV1.csv, testDataV1.csv or trainingData.csv, testData.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. hour: The hour of the day (integer, 0-23)
6. weekend: A boolean indicating whether it is the weekend (True or False)
The data also includes a label for each user indicating whether they are likely to buy a smart watch or not (string, "yes" or "no"). The label is determined based on the following arbitrary conditions:
- If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
- If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (i.e., assuming sales are 30% more likely to occur on weekends).
- If the user is male and under 30 with an income over 75,000, the label is "yes".
- If the user is female and 30 or over with an income over 100,000, the label is "yes".
- Otherwise, the label is "no".
The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.
The following Python script was used to generate this dataset:
```python
import random
import csv

# Set the number of examples to generate
numExamples = 100000

# Generate the training data
with open("trainingData.csv", "w", newline="") as csvfile:
    fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])

        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
            buySmartWatch = "no"
        # assuming sales are 30% more likely to occur on weekends
        elif weekend == True and random.random() < 1.3:
            buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
            buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
            buySmartWatch = "yes"
        else:
            buySmartWatch = "no"

        writer.writerow({
            "age": age,
            "income": income,
            "gender": gender,
            "maritalStatus": maritalStatus,
            "hour": hour,
            "weekend": weekend,
            "buySmartWatch": buySmartWatch
        })
```
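As an illustrative sketch only (not part of the original generation script), the resulting CSV could be used to build a simple classification model, for example with pandas and scikit-learn:

```python
# Illustrative sketch: train a simple classifier on the generated CSV.
# Assumes trainingData.csv was produced by the script above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("trainingData.csv")
X = pd.get_dummies(df.drop(columns=["buySmartWatch"]), columns=["gender", "maritalStatus"])
y = df["buySmartWatch"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```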
----- Version 2 -------
trainingDataV2.csv, testDataV2.csv
The data includes the following features for each user:
1. age: The age of the user (integer, 18-70)
2. income: The income of the user (integer, 25,000-200,000)
3. gender: The gender of the user (string, "male" or "female")
4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
7. familySize: The number of people in the user's family (integer, 1-5)
8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
10. hour: The hour of the day when the user was surveyed (integer, 0-23)
11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)
Python script used to generate the data:
import random
import csv
# Set the number of examples to generate
numExamples = 100000
with open("t...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
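As an illustration of how such outputs might be scored (a minimal sketch, not the benchmark's official evaluation scripts in the evaluation folder), exact-match precision, recall, and F1 over triples can be computed as follows:

```python
# Minimal sketch: exact-match precision/recall/F1 over (sub, rel, obj) triples.
# Not the official Text2KGBench evaluation code.
def triple_f1(predicted, expected):
    pred = {(t["sub"], t["rel"], t["obj"]) for t in predicted}
    gold = {(t["sub"], t["rel"], t["obj"]) for t in expected}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = [{"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"}]
pred = [{"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}]
print(triple_f1(pred, gold))  # (0.5, 1.0, 0.666...)
```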
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.
The structure of the repo is as the following.
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under the CC BY-SA 2.0 license and the WebNLG 3.0 corpus [2] released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used to train and validate the RNAmigos model from "Augmented base pairing networks encode RNA-small molecule binding preferences".
This will give you a cleaned up version of the data used to train the RNAmigos 1.0 models.
If you run python make_nice.py, you will generate a CSV file rnamigos1_dataset.csv which contains all the info you need.
The script will also use DecoyFinder to generate the decoys for each pocket.
The CSV has one row for each binding pocket.
The columns are:
pdbid: the PDBID this pocket belongs to
model_num: the model number inside the PDB we took
chain: the chain the pocket belongs to
ligand_id: the 3-letter code of the ligand (e.g. ATP) which you can look up on RCSB.org
ligand_resnum: the residue number of the ligand in the PDB
nodelist: a list of nodes separated by ';' in the pocket as a string in the format ..-;...
edgelist: a list of edges separated by ';' in the pocket as a string in the format nodes are in the same format as above, and connected by a '-' char, with an additional label field. e.g. of a two edge list 1aju.A.1-1aju.A.5-CWW;1aju.A.1-1aju.A.2-B53
fp_native_maccs: bit string of the MACCS for the native ligand
split_{k}_train: one column for each of the splits we ran (k ∈ {0-9}); contains True if this pocket was in the train set for this split
split_{k}_test: one column for each of the splits we ran (k ∈ {0-9}); contains True if this pocket was in the test set for this split
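As a hedged sketch of how this table might be used (column names as described above; exact value types in the CSV are an assumption), the pockets of one split can be selected with pandas:

```python
# Hedged sketch: load the pocket table and select the pockets of split 0.
# Column names follow the description above; boolean encoding is assumed.
import pandas as pd

df = pd.read_csv("rnamigos1_dataset.csv")
train_0 = df[df["split_0_train"] == True]
test_0 = df[df["split_0_test"] == True]
print(len(train_0), "train pockets,", len(test_0), "test pockets")

# Each pocket's graph can be rebuilt from its node and edge strings.
example = df.iloc[0]
nodes = example["nodelist"].split(";")
edges = [e.split("-") for e in example["edgelist"].split(";")]  # [node1, node2, label]
```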
The folder decoy_db/
has the following structure:
decoy_db
|-- _{ligand_id}_{ligand_position}
|   |-- decoyfinder
|   |   |-- actives.txt
|   |   |-- decoys.txt
|   |-- pdb
|   |   |-- actives.txt
|   |   |-- decoys.txt
Each `actives.txt` and `decoys.txt` is a file with one SMILES per line.
`decoyfinder/` has decoys computed by DecoyFinder, and the actives are just the native ligands.
`pdb/` has decoys taken from other pockets in the PDB and actives are just the native ligands.
Residence time distribution (RTD) is a critically important characteristic of groundwater flow systems; however, it cannot be measured directly. RTD can be inferred from tracer data with analytical models (few parameters) or with numerical models (many parameters). The second approach permits more variation in system properties but is used less frequently than the first because large-scale numerical models can be resource intensive.

With the data and computer codes in this data release users can (1) reconstruct and run 115 General Simulation Models (GSMs) of groundwater flow, (2) calculate groundwater age metrics at selected GSM cells, (3) train a boosted regression tree model using the provided data, (4) predict three-dimensional continuous groundwater age metrics across the Glacial Principal Aquifer, and (5) predict tritium concentrations at wells for comparison with measured tritium concentrations.

The computer codes in this data release are in the form of Python scripts and Jupyter Notebooks. Users will need to have these Python resources installed on their computers to run the codes. Instructions for creating the Python environment can be found in the file Creating the Python environment.txt. Users who would rather not run the scripts but who wish to obtain the final data sets can do so by downloading the file Output--Predictions.7z. Users who wish to reproduce the data sets in this release can do so by downloading, unzipping, and running the data workflow in Starn_GW_Residence_Time_Data_and_Scripts.7z. The codes in this file use relative pathnames, so the directory structure within this file should not be changed. The ".7z" file extension indicates 7-Zip files, http://www.7-zip.org

Executables--MODFLOW and MODPATH executable files provided for convenience. These are Windows 64-bit versions.
Step 1--Create General Simulation Models--Codes to create 115 GSMs
Step 2--Data preparation--Calculate residence time distributions at selected GSM cells
Step 3--Metamodel training--Train a boosted regression tree metamodel (XGBoost)
Step 4--Metamodel prediction--Predict age metrics throughout the Glacial Aquifer
Step 5--Tritium simulation--Calculate tritium concentration at selected wells
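As a rough illustration of Step 3 (a hedged sketch only; the actual notebooks, feature names, and hyperparameters are those in the data release), a boosted regression tree metamodel could be trained with XGBoost along these lines:

```python
# Hedged sketch of training a boosted regression tree metamodel with XGBoost.
# File names, columns, and hyperparameters are placeholders, not the values
# used in the data release scripts.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

data = pd.read_csv("gsm_cell_features.csv")   # hypothetical feature table per GSM cell
X = data.drop(columns=["mean_age"])           # hypothetical age metric target
y = data["mean_age"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = xgb.XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)
print("R^2 on held-out cells:", model.score(X_test, y_test))
```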
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code used in the paper entitled "MIU: Deep Embedded Building Cluster Model of Urban Functional Zoning". The compressed package contains 6 folders:
Building Footprint: building vector data used to extract geometric and compactness features.
Google Earth Image: VHR images used to extract spectral and textural features.
Luojia 1-01 Nighttime Light Image: nighttime data used to extract brightness features.
OSM Street: OSM road networks used to extract location features.
POI of Study Area: POI data used to generate labels for training the Word2Vec model.
Python Code: DEC code used to run the clustering that generates the MIU; Word2Vec code used to train the Word2Vec model.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat
Dataset Description
Dataset Summary
NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.
The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.
The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)
This dataset can be used to train models for intent classification, spans identification and examples generation.
This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.
In this repository you'll find the following items:
NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
NLUCat_dataset.json: the completed NLUCat dataset
NLUCat_stats.tsv: statistics about the NLUCat dataset
dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
reports: folder with the reports done as feedback to the annotators during the annotation process
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Supported Tasks and Leaderboards
Intent classification, spans identification and examples generation.
Languages
The dataset is in Catalan (ca-ES).
Dataset Structure
Data Instances
Three JSON files, one for each split.
Data Fields
example: str. Example.
annotation: dict. Annotation of the example.
intent: str. Intent tag.
slots: list. List of slots.
Tag: str. Tag of the slot.
Text: str. Text of the slot.
Start_char: int. First character of the span.
End_char: int. Last character of the span.
Example
An example looks as follows:
{ "example": "Demana una ambulància; la meva dona està de part.", "annotation": { "intent": "call_emergency", "slots": [ { "Tag": "service", "Text": "ambulància", "Start_char": 11, "End_char": 21 }, { "Tag": "situation", "Text": "la meva dona està de part", "Start_char": 23, "End_char": 48 } ] } },
Data Splits
NLUCat.train: 9128 examples
NLUCat.dev: 1441 examples
NLUCat.test: 1441 examples
Dataset Creation
Curation Rationale
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Source Data
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for the creation of this dataset.
Who are the source language producers?
We commissioned the writing of the examples to the company m47 labs.
Annotations
Annotation process
The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
Personal and Sensitive Information
No personal or sensitive information included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
Considerations for Using the Data
Social Impact of Dataset
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
Discussion of Biases
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population. Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record includes data for the paper "Learning to Grasp Unknown Objects in Domestic Environments", currently under review.

Simulation environment with pre-trained GP-net+ model

The paper presents a simulation environment for grasping objects in domestic environments. The presented objects and furniture units, as well as a pre-trained GP-net+ model, can be found in the "gpnetplus_simulation_data.zip" file. After this zip file is downloaded, it can be unpacked into the GP-net+ directory. It includes all necessary data to use the simulation environment, for example, for testing GP-net+ or other grasping models in simulated domestic environments.
ROS model
The paper additionally presents a ROS package that can be deployed for grasping unknown objects in domestic environments with simulated or real robots. We make a ROS-compatible model of GP-net+ available in the "ros_gpnet_plus.zip" file, which can be used with the ROS package.
Training dataset
We used the simulation environment in our paper to generate a training dataset and train GP-net+. This training dataset is included in this record and can be used to replicate our results or train modifications of GP-net+.
To improve handling of the training dataset (total size 25 GB+), we split the dataset into several .zip files, named val.zip (validation data) and train_[0-6].zip (training data). Download all files individually and extract them into a single folder. Combine the contents of all train_[0-6].zip files into a single directory called 'train', for example by using the provided 'move_train_data.sh' script. The final structure for the dataset should look similar to this:
gpnet_data
|-- val
|   |-- depth_image_0000000.npz
|   |-- depth_image_0000001.npz
|   ...
|   |-- segmask_image_0052346.npz
|-- train
|   |-- depth_image_0000000.npz
|   |-- depth_image_0000001.npz
|   ...
|   |-- segmask_image_0602506.npz
|   |-- segmask_image_0602507.npz
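A minimal, hedged sketch of inspecting one of these files is shown below; the array keys stored inside the .npz archives are not documented in this record, so the code only lists whatever it finds.

```python
# Hedged sketch: open one training sample and list the arrays it contains.
# The key names inside the .npz files are not specified in this record.
import numpy as np

sample = np.load("gpnet_data/train/depth_image_0000000.npz")
print(sample.files)  # names of the stored arrays
for name in sample.files:
    print(name, sample[name].shape, sample[name].dtype)
```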
For generation of the training and simulation data, the following mesh databases have been used:
B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, "Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set," IEEE Robotics and Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015.
A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, "BigBIRD: A large-scale 3D database of object instances," 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 509–516, 2014.
A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An Information-Rich 3D Model Repository," Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
D. Morrison, P. Corke, and J. Leitner, "EGAD! An Evolved Grasping Analysis Dataset for Diversity and Reproducibility in Robotic Manipulation," IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4368–4375, 2020