Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the splits of the original datasets.

Installation
pip install pandas pyarrow

Example
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset            AudioSet
filename           train/---2_BBVHAA.mp3
captions_visual    [a man in a black hat and glasses.]
captions_auditory  [a man speaks and dishes clank.]
tags               [Speech]

Description

The annotation file consists of the following fields:

- filename: name of the corresponding file (video or audio file)
- dataset: source dataset associated with the data point
- captions_visual: a list of captions related to the visual content of the video; can be NaN in case of no visual content
- captions_auditory: a list of captions related to the auditory content of the video
- tags: a list of tags classifying the sound of a file; can be NaN if no tags are provided

Data files

The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, in case files are missing at the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
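For example, rows can be filtered per source dataset, keeping in mind that captions_visual may be NaN for audio-only files; a minimal sketch using the fields above:

import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

# Select rows originating from one source dataset
audioset = df[df['dataset'] == 'AudioSet']

# captions_visual can be NaN for files without visual content
with_visual = audioset[audioset['captions_visual'].notna()]
print(with_visual[['filename', 'captions_auditory']].head())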
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software
This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.
GitHub: github.com/ChristianBirchler/sdc-scissor
This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.
Usage
Demo
Installation
The tool can either be run with Docker or locally using Poetry.
Running the simulations requires a working installation of BeamNG.research. Note that the simulation itself cannot run in a Docker container; it must run locally.
To install the application use one of the following approaches:
docker build --tag sdc-scissor .
poetry install
Using the Tool
The tool can be used with the following two commands:
docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS]
(this will write all files written to /out to the local folder results)
poetry run python sdc-scissor.py [COMMAND] [OPTIONS]
There are multiple commands available. To keep the documentation simple, only the commands and their options are described.
generate-tests --out-path /path/to/store/tests
label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
evaluate-models --dataset /path/to/train/set --save
split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib
The possible parameters are always documented with --help.
Linting
The tool is verified with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:
poetry run flake8 .
poetry run pylint **/*.py
License
The software we developed is distributed under the GNU GPL license. See the LICENSE.md file.
Contacts
Christian Birchler - Zurich University of Applied Sciences (ZHAW), Switzerland - birc@zhaw.ch
Nicolas Ganz - Zurich University of Applied Sciences (ZHAW), Switzerland - gann@zhaw.ch
Sajad Khatiri - Zurich University of Applied Sciences (ZHAW), Switzerland - mazr@zhaw.ch
Dr. Alessio Gambi - University of Passau, Germany - alessio.gambi@uni-passau.de
Dr. Sebastiano Panichella - Zurich University of Applied Sciences (ZHAW), Switzerland - panc@zhaw.ch
References
If you use this tool in your research, please cite the following paper:
@INPROCEEDINGS{Birchler2022,
  author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
  booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
  year={2022},
}
CodeParrot 🦜 Dataset Cleaned and filtered (train)
Dataset Description
A dataset of Python files from GitHub. It is a more filtered version of the train split codeparrot-clean-train of codeparrot-clean. The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:
files with a mention of "test file" or "configuration file" or… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
MSVD-CTN Dataset This dataset contains CTN annotations for the MSVD-CTN benchmark dataset in JSON format. It has three files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.
Dataset Structure Each JSON file contains a dictionary where the keys are the video IDs and the values are the corresponding Causal-Temporal Narrative (CTN) captions. The CTN captions are represented as a dictionary with two keys: "Cause" and "Effect", containing the cause and effect statements, respectively.
Example:
{
  "video_id_1": {
    "Cause": "a person performed an action",
    "Effect": "a specific outcome occurred"
  },
  "video_id_2": {
    "Cause": "another cause statement",
    "Effect": "another effect statement"
  }
}
Loading the Datasets To load the datasets, use a JSON parsing library in your preferred programming language. For example, in Python, you can use the json module:
import json
with open("msvd_CTN_train.json", "r") as f:
    msvd_train_data = json.load(f)
# Access the CTN captions
for video_id, ctn_caption in msvd_train_data.items():
    cause = ctn_caption["Cause"]
    effect = ctn_caption["Effect"]
    # Process the cause and effect statements as needed
License The MSVD-CTN benchmark dataset is licensed under the Creative Commons Attribution Non Commercial No Derivatives 4.0 International (CC BY-NC-ND 4.0) license.
A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
dataset | partition | split | refs | images
---|---|---|---|---
refcoco | google | train | 40000 | 19213
refcoco | google | val | 5000 | 4559
refcoco | google | test | 5000 | 4527
refcoco | unc | train | 42404 | 16994
refcoco | unc | val | 3811 | 1500
refcoco | unc | testA | 1975 | 750
refcoco | unc | testB | 1810 | 750
refcoco+ | unc | train | 42278 | 16992
refcoco+ | unc | val | 3805 | 1500
refcoco+ | unc | testA | 1975 | 750
refcoco+ | unc | testB | 1798 | 750
refcocog | google | train | 44822 | 24698
refcocog | google | val | 5000 | 4650
refcocog | umd | train | 42226 | 21899
refcocog | umd | val | 2573 | 1300
refcocog | umd | test | 5023 | 2600
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
  print(ex)
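Each partition above is exposed as its own TFDS config; a minimal sketch of loading one partition's split (the config name refcoco_unc is an assumption based on the TFDS catalog naming):

import tensorflow_datasets as tfds

ds = tfds.load('ref_coco/refcoco_unc', split='testA')  # config name assumed
for ex in ds.take(1):
  print(ex)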
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, which can be opened in Python with the `h5py` library.
The dataset includes noise files and earthquake train and test files. Each file is structured to support the training and evaluation of seismic denoising models.
The HDF5 files named noise contain two main datasets: traces and metadata. Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets.
To load these files in a Python environment, use the following approach:
```python
import h5py
import numpy as np
# Open the HDF5 file in read mode
with h5py.File('train_noise.hdf5', 'r') as file:
    # Print all the main keys in the file
    print("Keys in the HDF5 file:", list(file.keys()))

    if 'traces' in file:
        # Access the traces dataset
        data = file['traces'][:10]  # Load the first 10 traces

    if 'metadata' in file:
        # Access the metadata dataset
        trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
```
Ensure that the path to the file is correctly specified relative to your Python script.
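The names of the two additional datasets in the earthquake train and test files are not listed above; a quick way to discover them is to list the file's keys (the file name train.hdf5 is an assumption):

```python
import h5py

# List every dataset stored in the earthquake train file
with h5py.File('train.hdf5', 'r') as f:  # file name assumed
    print(list(f.keys()))
```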
To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:
```bash
pip install numpy
pip install h5py
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from an NIH HTS of 17K compounds screened for inhibition against five isozymes of cytochrome P450. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally, the balanced datasets are subjected to an 80/20 training/test split. Link to Python script of data manipulation...
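The balancing and split described above can be sketched as follows; the file and column names here are illustrative assumptions, not those of the linked script:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('cyp_isozyme_with_moe_descriptors.csv')  # file name assumed

# Balance actives and inactives by downsampling the majority class
actives = df[df['activity_score'] > 0]  # activity criterion assumed
inactives = df[df['activity_score'] == 0].sample(n=len(actives), random_state=0)
balanced = pd.concat([actives, inactives])

# 80/20 training/test split, as described above
train, test = train_test_split(balanced, test_size=0.2, random_state=0)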
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
EACL Hackashop Keyword Challenge Datasets
In this repository you can find ids of articles used for the keyword extraction challenge at
EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article ids can be used to generate the train-test split used in the paper:
Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
Train and test splits are provided for Latvian, Estonian, Russian and Croatian.
The articles with the corresponding ID-s can be extracted from the following datasets:
- Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408
- Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409
- Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410
dataset_ids folder is organized in the following way:
- latvian – contains latvian_train.json, a json file with ids of train articles to replicate the data used in Koloski et al. (2021), and latvian_test.json, a json file with ids of test articles
- estonian – contains estonian_train.json, a json file with ids of train articles to replicate the data used in Koloski et al. (2021), and estonian_test.json, a json file with ids of test articles
- russian – contains russian_train.json, a json file with ids of train articles to replicate the train data used in Koloski et al. (2021), and russian_test.json, a json file with ids of test articles
- croatian – contains croatian_id_train.tsv, a file with sites and ids of articles in the train set (note that ids alone are not unique across the dataset, therefore site information is also needed to obtain a unique article identifier), and croatian_id_test.tsv with sites and ids of articles in the test set.
In addition, scripts are provided for extracting the articles (see the folder parse, containing the scripts parse.py and build_croatian_dataset.py; both require the pandas and bs4 Python libraries):
parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:
Instructions:
ESTONIAN-RUSSIAN
1) Retrieve the data ee_articles_2015_2019.zip
2) Create a folder 'data' and subfolder 'ee'
3) Unzip them in the 'data/ee' folder
To extract train/test Estonian articles:
run function 'build_dataset(lang="ee", opt="nat")' in the parse.py script
To extract train/test Russian articles:
run function 'build_dataset(lang="ee", opt="rus")' in the parse.py script
LATVIAN:
1) Retrieve the latvian data
2) Unzip it in 'data/lv' folder
3) To extract train/test Latvian articles:
run function 'build_dataset(lang="lv", opt="nat")' in the parse.py script
build_croatian_dataset.py is used for extraction of Croatian train and test datasets:
Instructions:
CROATIAN:
1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')
2) Put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).
For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si
CodeInsight Dataset
The CodeInsight dataset is designed to help train and evaluate models for Python code generation and comprehension. It contains expertly curated examples sourced from real-world coding challenges, complete with natural language descriptions, code snippets, and unit tests.
Dataset Structure
Train split: 1,551 examples
Test split: 1,860 examples
Each example includes:
- problem_id: Example's ID
- Code: Python code snippet
- Natural Language: Description of…

See the full description on the dataset page: https://huggingface.co/datasets/Nbeau/CodeInsight.
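A minimal sketch of loading the splits with the Hugging Face datasets library (the repo id comes from the URL above; field access assumes the names listed):

from datasets import load_dataset

train = load_dataset("Nbeau/CodeInsight", split="train")
test = load_dataset("Nbeau/CodeInsight", split="test")
print(train[0]["problem_id"])  # field name as listed above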
Causal inference is one of the hallmarks of human intelligence.
Corr2cause is a large-scale dataset of more than 400K samples, on which seventeen existing LLMs are evaluated in the related paper.
Overall, Corr2cause contains 415,944 samples, of which 18.57% are valid. The average length of the premise is 424.11 tokens, and of the hypothesis 10.83 tokens. The data is split into 411,452 training samples and 2,246 samples each for development and test. Since the main purpose of the dataset is to benchmark the performance of LLMs, the test and development sets have been prioritized to have comprehensive coverage over all sizes of graphs.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('corr2cause', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Cleaned-up text for 40+ Wikipedia language editions of pages corresponding to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the wikidata id of the entity, and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus - including 41 monolingual models, and 2 multilingual models - can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wiki40b', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.
There are two features:
- text: WikiHow answer texts.
- headline: bold lines as summary.

There are two separate versions:
- all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
- sep: consisting of each paragraph and its summary.
Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikihow', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The Oxford-IIIT pet dataset is a 37-category pet image dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed and species. Additionally, head bounding boxes are provided for the training split, allowing this dataset to be used for simple object detection tasks. In the test split, the bounding boxes are empty.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('oxford_iiit_pet', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them (80,000+) are nouns. In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:
The resulting tar-ball may then be processed by TFDS.
To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file to be uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to make up to 2 submissions per week in order to prevent overfitting.
To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:
771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168
The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file consists of 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file; see labels.txt.
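As a sketch, exporting top-5 predictions in this format might look like the following (the predictions array, its shape and its label ordering are assumptions):

import numpy as np

def export_top5(predictions, out_path='submission.txt'):
  # predictions: (100000, num_labels) array of scores, one row per test image,
  # where column i corresponds to 1-indexed label i+1 (assumed ordering)
  top5 = np.argsort(-predictions, axis=1)[:, :5] + 1  # rank-ordered, 1-indexed
  with open(out_path, 'w') as f:
    for row in top5:
      f.write(' '.join(str(label) for label in row) + '\n')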
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets of articles and their associated quality assessment rating from the English Wikipedia. Each dataset is self-contained as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and 10% test set using a stratified random sampling strategy.

The 2017 dataset is the preferred dataset to use; it contains 32,460 articles and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference and contains 30,272 articles gathered on 2015/02/05.

The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets.

For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used to train the machine learner of the wikiclass Python library, also linked below.
Overview

nEMO is a simulated dataset of emotional speech in the Polish language. The corpus contains over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material used was carefully selected to represent the phonetics of the Polish language. The corpus is available for free under the Creative Commons license (CC BY-NC-SA 4.0).
The dataset is available on Hugging Face and GitHub.
Data Fields
file_id - filename, i.e. {speaker_id}_{emotion}_{sentence_id},
audio (audio) - dictionary containing audio array, path and sampling rate (available when accessed via datasets library),
emotion - label corresponding to emotional state,
raw_text - original (orthographic) transcription of the audio,
normalized_text - normalized transcription of the audio,
speaker_id - id of speaker,
gender - gender of the speaker,
age - age of the speaker.
Usage

The nEMO dataset can be loaded and processed using the datasets library:
from datasets import load_dataset
nemo = load_dataset("amu-cai/nEMO", split="train")
To work with the nEMO dataset on GitHub, you may clone the repository and access the files directly within the samples folder. Corresponding metadata can be found in the data.tsv file.
The nEMO dataset is provided as a whole, without predefined training and test splits. This gives researchers and developers the flexibility to create their own splits based on their specific needs.
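For instance, a simple random split can be derived with the datasets library (the 80/20 ratio and seed are arbitrary choices):

from datasets import load_dataset

nemo = load_dataset("amu-cai/nEMO", split="train")
splits = nemo.train_test_split(test_size=0.2, seed=42)
train_set, test_set = splits["train"], splits["test"]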
Supported Tasks
Audio classification: This dataset was mainly created for the task of speech emotion recognition. Each recording is labeled with one of six emotional states (anger, fear, happiness, sadness, surprise, and neutral). Additionally, each sample is labeled with speaker id and speaker gender, so the dataset can also be used for other audio classification tasks.

Automatic Speech Recognition: The dataset includes orthographic and normalized transcriptions for each audio recording, making it a useful resource for automatic speech recognition (ASR) tasks. The sentences were carefully selected to cover a wide range of phonemes in the Polish language.

Text-to-Speech: The dataset contains emotional audio recordings with transcriptions, which can be valuable for developing TTS systems that produce emotionally expressive speech.
Additional Information

Licensing Information

The dataset is available under the Creative Commons license (CC BY-NC-SA 4.0).
Citation Information

You can access the nEMO paper at arXiv. Please cite the paper when referencing the nEMO dataset as:
@misc{christop2024nemo,
  title={nEMO: Dataset of Emotional Speech in Polish},
  author={Iwona Christop},
  year={2024},
  eprint={2404.06292},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Contributions

Thanks to @iwonachristop for adding this dataset.
CodeParrot 🦜 Dataset Cleaned and filtered (validation)
Dataset Description
A dataset of Python files from GitHub. It is a more filtered version of the validation split codeparrot-clean-valid of codeparrot-clean. The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:
files with a mention of "test file" or "configuration… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data. The classification of point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response.

Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results. This model is designed to extract trees in both urban and rural areas in New Zealand. The training/testing/validation datasets were taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.

Licensing requirements

ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

Using the model

The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.

Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

Input

The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification.

Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives.

The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see the Training data section below).

Output

The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS):

0: Background
5: Trees / High-vegetation

Applicable geographies

The model is expected to work well in New Zealand and has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

Dataset | City
---|---
Training | Wellington
Testing | Tawa
Validation/Evaluation | Christchurch

Model architecture

This model uses the PointCNN model architecture implemented in the ArcGIS API for Python.

Accuracy metrics

The table below summarizes the accuracy of the predictions on the validation dataset.

Class | Precision | Recall | F1-score
---|---|---|---
Never Classified | 0.991200 | 0.975404 | 0.983239
High Vegetation | 0.933569 | 0.975559 | 0.954102

Training data

This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction.

Train-test split percentage: {Train: 80%, Test: 20%}. This ratio was chosen based on analysis of previous epoch statistics, which showed a decent improvement.

The training data used has the following characteristics:

Characteristic | Value
---|---
X, Y, and Z linear unit | Meter
Z range | -121.69 m to 26.84 m
Number of Returns | 1 to 5
Intensity | 16 to 65520
Point spacing | 0.2 ± 0.1
Scan angle | -15 to +15
Maximum points per block | 8192
Block size | 20 meters
Class structure | [0, 5]

Sample results

Model used to classify a dataset with 5 pts/m density: the Christchurch city dataset. The model's performance is directly proportional to the dataset's point density and to noise-excluded point clouds. To learn how to use this model, see this story.