100+ datasets found
  1. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Feb 28, 2021
    Cite
    Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
    Explore at:
    unknown (395470535). Available download formats
    Dataset updated
    Feb 28, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub; see the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset is de-duplicated using the CD4Py tool, and the list of duplicate files is provided in the duplicate_files.txt file. All Python projects are processed into JSON-formatted files, which contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation, and test sets by source code file; the list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
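    A minimal sketch of inspecting the split file and one processed project (file names come from the description above; the JSON path and the CSV column layout are assumptions, not part of the dataset spec):

    import json
    import pandas as pd

    # dataset_split.csv lists each source file and its train/validation/test assignment.
    split = pd.read_csv("dataset_split.csv")
    print(split.head())

    # Hypothetical path to one of the JSON-formatted project files described above.
    with open("processed_projects/example_project.json") as f:
        project = json.load(f)
    print(type(project))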

  2. Dataset_Python_Question_Answer

    • kaggle.com
    zip
    Updated Mar 29, 2024
    Cite
    Chinmaya (2024). Dataset_Python_Question_Answer [Dataset]. https://www.kaggle.com/datasets/chinmayadatt/dataset-python-question-answer
    Explore at:
    zip (189137 bytes). Available download formats
    Dataset updated
    Mar 29, 2024
    Authors
    Chinmaya
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    This dataset is about Python programming. Questions and answers were generated using Gemma. There are more than four hundred questions and their corresponding answers about Python programming.

    Questions range from concepts like data types, variables, and keywords to regular expressions and threading.

    I have used this dataset here

    The code used to generate the dataset is available here

  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • explore.openaire.eu
    • data.europa.eu
    Updated Apr 26, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    Description

    The dataset was gathered on Sep. 17, 2020 from GitHub. It contains more than 5.2K Python repositories and 4.2M type annotations. The dataset is de-duplicated using the CD4Py tool. See the README.MD file for a description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available in its GitHub repository.

  4. Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5244636?locale=lv
    Explore at:
    unknown (1052407809). Available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17, 2020 from GitHub. Since v0.7 it has clean and complete versions: the clean version has 5.1K type-checked Python repositories and 1.2M type annotations, while the complete version has 5.2K Python repositories and 3.3M type annotations. The source files of the clean version are type-checked using mypy. The dataset is de-duplicated using the CD4Py tool. See the README.MD file for a description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available in its GitHub repository.

  5. CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type...

    • data.niaid.nih.gov
    Updated Jan 28, 2022
    Cite
    Gruner, Bernd; Heinze, Thomas; Brust, Clemens-Alexander (2022). CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type Inference Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5747023
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    German Aerospace Center (DLR), Cooperative University Gera-Eisenach
    Authors
    Gruner, Bernd; Heinze, Thomas; Brust, Clemens-Alexander
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Python repositories mined from GitHub on January 20, 2021. It allows a cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, each containing only projects from the web domain or the scientific calculation domain, respectively; to that end, we searched for projects with dependencies on either Flask or NumPy. Furthermore, only projects with a dependency on mypy were considered, because this should ensure that at least parts of the projects have type annotations, which can later be used as ground truth. Further details about the dataset will be described in an upcoming paper; as soon as it is published, it will be linked here. The dataset consists of two files, one per sub-dataset: the web domain dataset contains 3129 repositories and the scientific calculation domain dataset contains 4783 repositories. The files have two columns with the URL of the GitHub repository and the used commit hash. It is thus possible to download the dataset using shell or Python scripts; for example, the pipeline provided by ManyTypes4Py can be used. If repositories no longer exist or are private, you can contact us via the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.
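    A minimal sketch of the download step (the file name and CSV column layout are assumptions based on the description above, not confirmed by the dataset):

    import subprocess
    import pandas as pd

    # Hypothetical file name; each row is assumed to hold a repository URL and a commit hash.
    repos = pd.read_csv("web_domain_dataset.csv", names=["url", "commit"])

    for url, commit in repos.head(3).itertuples(index=False):
        name = url.rstrip("/").split("/")[-1]
        subprocess.run(["git", "clone", url, name], check=True)                # clone the repository
        subprocess.run(["git", "-C", name, "checkout", commit], check=True)    # pin to the recorded commit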

  6. User Profiling and Segmentation Project

    • kaggle.com
    Updated Jul 9, 2024
    Cite
    Sanjana Murthy (2024). User Profiling and Segmentation Project [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/user-profiling-and-segmentation-project
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    About the dataset:
    • Domain: Marketing
    • Project: User Profiling and Segmentation
    • Dataset: user_profile_for_ads.csv
    • Dataset type: Excel data
    • Dataset size: 16k+ records

    KPIs:
    1. Distribution of key demographic variables: a. Count of Age b. Count of Gender c. Count of Education Level d. Count of Income Level e. Count of Device Usage
    2. Understanding online behavior: a. Count of Time Spent Online (hrs/Weekday) b. Count of Time Spent Online (hrs/Weekend)
    3. Ad interaction metrics: a. Count of Likes and Reactions b. Count of Click-Through Rates (CTR) c. Count of Conversion Rate d. Count of Ad Interaction Time (secs) e. Count of Ad Interaction Time by Top Interests

    Process: 1. Understanding the problem 2. Data collection 3. Exploring and analyzing the data 4. Interpreting the results

    The accompanying analysis code uses pandas, matplotlib, seaborn, isnull, set_style, suptitle, countplot, palette, tight_layout, figsize, histplot, barplot, sklearn, StandardScaler, OneHotEncoder, ColumnTransformer, Pipeline, KMeans, cluster_means, groupby, numpy, radar_df.
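    A minimal sketch of the segmentation step implied by the tools listed above (the column names are assumptions based on the listed KPIs; check user_profile_for_ads.csv for the exact schema):

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("user_profile_for_ads.csv")

    # Hypothetical column names, chosen to mirror the KPIs listed above.
    numeric = ["Time Spent Online (hrs/Weekday)", "Time Spent Online (hrs/Weekend)", "Likes and Reactions"]
    categorical = ["Gender", "Education Level", "Income Level", "Device Usage"]

    prep = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    model = Pipeline([("prep", prep), ("kmeans", KMeans(n_clusters=5, random_state=0))])

    df["segment"] = model.fit_predict(df)
    cluster_means = df.groupby("segment")[numeric].mean()  # per-segment profile
    print(cluster_means)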

  7. Code Similarity Dataset – Python Variants

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Hem Ajit Patel (2025). Code Similarity Dataset – Python Variants [Dataset]. https://www.kaggle.com/datasets/hemajitpatel/code-similarity-dataset-python-variants
    Explore at:
    zip (39806 bytes). Available download formats
    Dataset updated
    Jul 6, 2025
    Authors
    Hem Ajit Patel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Code Similarity Dataset – Python Variants

    A collection of code snippets solving common programming problems in multiple variations.

    Each problem has 20+ versions, written in different styles and logic patterns, making this dataset ideal for studying:

    • Code similarity
    • Plagiarism detection
    • AI-based code search
    • Code classification
    • Semantic code retrieval

    📚 What's Inside?

    The dataset includes the following tasks:
    • Reverse a String
    • Find Max in List
    • Check if a Number is Prime
    • Check if a String is a Palindrome
    • Generate Fibonacci Sequence

    Each task contains:
    • 20 variations of code
    • Metadata file describing method and notes
    • README with usage instructions

    Column Descriptions

    The full_metadata.csv file contains the following fields:

    • problem_type: The programming task solved (e.g., reverse_string, max_in_list)
    • id: Unique ID of the snippet within that problem group
    • filename: Filename of the code snippet (e.g., snip_01.py)
    • language: Programming language used (Python)
    • method: Type of approach used (e.g., Slicing, Recursive, While loop)
    • notes: Additional details about the logic or style used in the snippet

    🗂 Folder Structure

    CodeSimilarityDataset/
    ├── reverse_string/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── max_in_list/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_prime/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_palindrome/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── fibonacci/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    └── full_metadata.csv   ← Combined metadata across all problems

    🔍 Use Cases

    • Train models to detect similar code logic
    • Build plagiarism detection systems
    • Improve code recommendation engines
    • Teach students about code variation
    • Benchmark code search algorithms

    🧪 Sample Applications

    Visualize logic type distribution

    Compare structural similarity (AST/difflib/token matching)

    Cluster similar snippets using embeddings

    Train code-style-aware LLMs

    📦 File Formats

    All code snippets are .py files. Metadata is provided in CSV format for easy loading into pandas or other tools.

    🛠 How to Use

    You can load metadata easily with Python:

    import pandas as pd

    df = pd.read_csv('full_metadata.csv')
    print(df.sample(5))

    Then read any snippet:

    with open("reverse_string/snippets/snip_01.py") as f:
        code = f.read()
    print(code)
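    A minimal sketch of the similarity comparison mentioned under Sample Applications (the second snippet path is an assumption; each task is described as having 20 variations):

    import difflib

    # Load two variants of the same problem (paths follow the folder structure above).
    with open("reverse_string/snippets/snip_01.py") as f:
        a = f.read()
    with open("reverse_string/snippets/snip_02.py") as f:  # assumed to exist
        b = f.read()

    # SequenceMatcher gives a rough textual similarity ratio between 0 and 1.
    score = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"similarity: {score:.2f}")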

    📄 License

    This dataset is released under the MIT License — free to use, modify, and distribute with proper attribution.

  8. Date-Dataset

    • kaggle.com
    zip
    Updated Aug 18, 2021
    Cite
    nishtha kukreti (2021). Date-Dataset [Dataset]. https://www.kaggle.com/nishthakukreti/datedataset
    Explore at:
    zip (38657 bytes). Available download formats
    Dataset updated
    Aug 18, 2021
    Authors
    nishtha kukreti
    Description

    Context

    This is a random date dataset that I generated with a Python script, intended for building a machine learning model that tags dates in any given document.

    Content

    This dataset indicates whether a given word or sequence of words is a date or not.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Implement a machine learning or deep learning model, or train a custom spaCy pipeline, to tag dates and other parts of speech.
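    A minimal sketch of the kind of tagging this dataset targets, using spaCy's pretrained English model rather than a model trained on this data (it assumes en_core_web_sm is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The invoice was issued on 18 August 2021 and is due next Friday.")

    # Pretrained NER already emits DATE spans; a custom model trained on this
    # dataset could refine or replace this behavior.
    for ent in doc.ents:
        if ent.label_ == "DATE":
            print(ent.text, ent.start_char, ent.end_char)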

  9. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv. Available download formats
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

    Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

    The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files: snippets from kernels up to the year 2020 (code_blocks_upto_20.csv) and those from the year 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
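    A minimal sketch of loading the labeled snippets and resolving their semantic types (the file names come from the description above; the join column names are assumptions, so check the CSV headers):

    import pandas as pd

    snippets = pd.read_csv("markup_data_20220415.csv")
    vertices = pd.read_csv("actual_graph_2022-06-01.csv")

    # Map each block's numeric semantic-type id to its semantic type/subclass.
    # "graph_vertex_id" and "id" are assumed names for the join keys.
    labeled = snippets.merge(vertices, left_on="graph_vertex_id", right_on="id", how="left")
    print(labeled.head())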

  10. reason_code-search-net-python

    • huggingface.co
    • opendatalab.com
    Updated May 15, 2023
    Cite
    Fernando Tarin Morales (2023). reason_code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/reason_code-search-net-python
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 15, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "reason_code-search-net-python"

      Dataset Summary
    

    This dataset is an instructional dataset for Python. It contains five different kinds of tasks.
    Given a Python 3 function:

    Type 1: Generate a summary explaining what it does. (For example: This function counts the number of objects stored in the jsonl file passed as input.) Type 2: Generate a summary explaining what its input parameters represent (for example: infile: a file descriptor of a file… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/reason_code-search-net-python.

  11. Comprehensive Pokemon Dataset

    • kaggle.com
    Updated Jun 17, 2025
    Cite
    Tanishk Sharma (2025). Comprehensive Pokemon Dataset [Dataset]. https://www.kaggle.com/datasets/tanishksharma9905/pokemon-data-csv
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tanishk Sharma
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    📄 Dataset Description

    This dataset provides detailed information on all available Pokémon, sourced directly from the PokeAPI. It includes key attributes such as:

    • ID and Name
    • Base experience, height, and weight
    • Primary and secondary types
    • Abilities
    • Top 5 moves
    • Base stats (e.g., HP, Attack, Defense, Speed)

    The dataset is ideal for:

    • 📊 Exploratory data analysis
    • 🧠 Machine learning projects
    • 🎮 Game design and balance modeling
    • 📚 Educational or statistical learning

    All data is extracted programmatically via the official PokeAPI using Python and stored in a structured MySQL table before export.

    This dataset contains detailed information for all Pokémon fetched from the PokeAPI, including:

    • Basic attributes (ID, Name, Height, Weight)
    • Combat stats (Attack, Defense, HP, Speed, etc.)
    • Types (e.g. Grass, Poison, Fire)
    • Abilities (e.g. Overgrow, Blaze)
    • Top 5 Moves

    Data is fetched programmatically using Python and stored in a MySQL database.

    This dataset is ideal for:

    • Data analysis
    • Machine learning projects
    • Pokémon classification models
    • Power BI/Tableau visualizations
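    A minimal sketch of the extraction step described above (the endpoint and field names follow the public PokeAPI; the MySQL loading step is omitted):

    import requests

    resp = requests.get("https://pokeapi.co/api/v2/pokemon/bulbasaur", timeout=10)
    data = resp.json()

    # Flatten one Pokémon record into the attributes listed above.
    row = {
        "id": data["id"],
        "name": data["name"],
        "base_experience": data["base_experience"],
        "height": data["height"],
        "weight": data["weight"],
        "types": [t["type"]["name"] for t in data["types"]],
        "abilities": [a["ability"]["name"] for a in data["abilities"]],
        "top_moves": [m["move"]["name"] for m in data["moves"][:5]],
        "stats": {s["stat"]["name"]: s["base_stat"] for s in data["stats"]},
    }
    print(row)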

  12. Datasets for manuscript "Data engineering for tracking chemicals and...

    • catalog.data.gov
    • gimi9.com
    Updated Feb 10, 2021
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "Data engineering for tracking chemicals and releases at industrial end-of-life activities" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-data-engineering-for-tracking-chemicals-and-releases-at-industrial
    Explore at:
    Dataset updated
    Feb 10, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The GitHub repository contains a Python script (MC_Case_Study.py) to support and replicate the case study results shown in the manuscript entitled "Data engineering for tracking chemicals and releases at industrial end-of-life activities." It also indicates the freely available Python libraries that are required for running "MC_Case_Study.py." The dataset "EoL_database_for_MC.csv" contains all data needed to execute the Python code and obtain "Figure 6: 6-level Sankey diagram for the case study", "Figure 7: Box plot for the case study", and "Figure 8: Histogram for the case study." A table describing the data name entry and data type for the dataset "EoL_database_for_MC.csv" is included. This dataset information and Python code are also provided in the manuscript Supporting Info file (see supporting documents). This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, J.P. Abraham, M. Martin, W.W. Ingwersen, and R.L. Smith. Data engineering for tracking chemicals and releases at industrial end-of-life activities. JOURNAL OF HAZARDOUS MATERIALS. Elsevier Science Ltd, New York, NY, USA, 405: 124270, (2021).

  13. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt. Available download formats
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonimous authors; Anonimous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
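    A minimal sketch of the table linkage described above (file and column names follow the dataset description; the paths assume the extracted CSVs sit in the working directory):

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels = pd.read_csv("kernels_meta.csv")
    competitions = pd.read_csv("competitions_meta.csv")

    # code_blocks -> kernels via kernel_id, kernels -> competitions via comp_name.
    blocks = code_blocks.merge(kernels, on="kernel_id", how="inner")
    full = blocks.merge(competitions, on="comp_name", how="left")
    print(full[["kernel_id", "comp_name", "kaggle_score"]].head())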

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  14. chessbenchmate_1.0

    • huggingface.co
    Updated Nov 18, 2025
    + more versions
    Cite
    Joshua Gao (2025). chessbenchmate_1.0 [Dataset]. https://huggingface.co/datasets/joshuakgao/chessbenchmate_1.0
    Explore at:
    Dataset updated
    Nov 18, 2025
    Authors
    Joshua Gao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset is an extension of the Action-Value ChessBench dataset from Amortized Planning with Large-Scale Transformers: A Case Study on Chess. We add mating data to each datapoint: if a forced mate is possible, the number of moves until mate is computed with Stockfish and added.

      Usage
    

    hf download joshuakgao/chessbenchmate --repo-type dataset --local-dir . --max-workers 1
    python dataset.py

  15. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    application/x-sqlite3. Available download formats
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    • isogramy (int): The order of isogramy, e.g. "2" is a second order isogram
    • length (int): The length of the word in letters
    • word (text): The actual word/isogram in ASCII
    • source_pos (text): The Part of Speech tag from the original corpus
    • count (int): Token count (total number of occurrences)
    • vol_count (int): Volume count (number of different sources which contain the word)
    • count_per_million (int): Token count per million words
    • vol_count_as_percent (int): Volume count as percentage of the total number of volumes
    • is_palindrome (bool): Whether the word is a palindrome (1) or not (0)
    • is_tautonym (bool): Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label

    Data type

    Description

    !total_1grams

    int

    The total number of words in the corpus

    !total_volumes

    int

    The total number of volumes (individual sources) in the corpus

    !total_isograms

    int

    The total number of isograms found in the corpus (before compacting)

    !total_palindromes

    int

    How many of the isograms found are palindromes

    !total_tautonyms

    int

    How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".
    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
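    A minimal sketch of querying the resulting database from Python (the table name is a guess; check create-database.sql for the actual schema):

    import sqlite3

    conn = sqlite3.connect("isograms.db")
    # Column names follow the CSV description above; "ngrams_isograms" is an assumed table name.
    query = (
        "SELECT word, length, count FROM ngrams_isograms "
        "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10"
    )
    for row in conn.execute(query):
        print(row)
    conn.close()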

  16. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • dataverse.harvard.edu
    • figshare.com
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  17. Observational Large Ensemble

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 19, 2017
    Cite
    Karen McKinnon (2017). Observational Large Ensemble [Dataset]. http://doi.org/10.7910/DVN/7CPJPQ
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 19, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Karen McKinnon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These Python datasets contain the results presented in the above paper with regard to the variability in trends over North America during DJF due to sampling of internal variability. Two types of files are available.

    The NetCDF file contains samples from the synthetic ensemble of DJF temperatures over North America from 1966-2015. The synthetic ensemble is centered on the observed trend. Recentering the ensemble on the ensemble-mean trend from the NCAR CESM1 LENS will create the Observational Large Ensemble, in which each sample can be viewed as a temperature history that could have occurred given various samplings of internal variability. The synthetic ensemble can also be recentered on any other estimate of the forced response to climate change. While the dataset covers both land and ocean, it has only been validated over land.

    The second type of file, provided as Python datasets (.npz), contains the results presented in the McKinnon et al (2017) reference. In particular, it contains the 50-year trends for both the observations and the NCAR CESM1 Large Ensemble that actually occurred, and that could have occurred given a different sampling of internal variability. The bootstrap results can be compared to the true spread across the NCAR CESM1 Large Ensemble for validation, as was done in the manuscript. Each of these files is named based on the observational dataset, variable, time span, and spatial domain. They contain:

    • BETA: the empirical OLS trend
    • BOOTSAMPLES: the OLS trends estimated after bootstrapping
    • INTERANNUALVAR: the interannual variance in the data after modeling and removing the forced trend
    • empiricalAR1: the empirical AR(1) coefficient estimated from the residuals around the forced trend

    The first dimension of all variables is 42, which is a stack of the ensemble mean behavior (index 0), the forty members of the NCAR Large Ensemble (indices 1:40), and the observations (last index, -1). The second dimension is spatial; see latlon.npz for the latitude and longitude vectors. The third dimension, when present, is the bootstrap samples. We have saved 1000 bootstrap samples.
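    A minimal sketch of reading one of the .npz files (the file name below is hypothetical, following the naming scheme of observational dataset, variable, time span, and spatial domain described above):

    import numpy as np

    data = np.load("BEST_tas_1966-2015_NorthAmerica.npz")  # hypothetical file name
    beta = data["BETA"]         # empirical OLS trends; first axis stacks ensemble mean, 40 LENS members, observations
    boot = data["BOOTSAMPLES"]  # bootstrapped OLS trends
    latlon = np.load("latlon.npz")  # latitude / longitude vectors for the spatial dimension
    print(beta.shape, boot.shape, latlon.files)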

  18. VSAT-3D Example Dataset

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Apr 9, 2021
    Cite
    Méndez-Hernández, Hugo (2021). VSAT-3D Example Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4671100
    Explore at:
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Instituto de Física y Astronomía - Universidad de Valparaíso
    Authors
    Méndez-Hernández, Hugo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset to run the Example.py script of the Valparaíso Stacking Analysis Tool (VSAT-3D). VSAT-3D provides a series of tools for selecting, stacking, and analyzing 3D spectra. It is intended for stacking samples of datacubes extracted from interferometric datasets belonging to large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. It is also possible to use VSAT-3D on smaller datasets containing any type of astronomical object.

    VSAT-3D can be downloaded from the GitHub repository link.

  19. Dataset and Code for: Code problem similarity detection using code clones...

    • researchdata.ntu.edu.sg
    Updated May 8, 2023
    Cite
    Geremie Yun Siang Yeo; Geremie Yun Siang Yeo (2023). Dataset and Code for: Code problem similarity detection using code clones and pretrained models [Dataset]. http://doi.org/10.21979/N9/VPCR7H
    Explore at:
    text/x-python(17209), text/x-python(1832), application/x-ipynb+json(6814), zip(53320679), zip(57359817), application/x-ipynb+json(9307), text/x-python(13174), application/x-ipynb+json(65497). Available download formats
    Dataset updated
    May 8, 2023
    Dataset provided by
    DR-NTU (Data)
    Authors
    Geremie Yun Siang Yeo; Geremie Yun Siang Yeo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset complements the following study: Code problem similarity detection using code clones and pretrained models (SCSE22-0384). The study explores a new approach to detecting similar algorithmic-style code problems from websites such as LeetCode and Codeforces by comparing the similarity of the solution source codes, an application of type IV code clone detection. It is based on 107,000 submissions in 3 different languages (Python, C++ and Java) from 3,000 problems on Codeforces between 2020 and 2023. Experiments were carried out using 3 different pre-trained models on this dataset (C4-CodeBERT, GraphCodeBERT, UniXcoder). UniXcoder performed the best with an F1 score of 0.905. As such, UniXcoder was used as the backbone of the code problem similarity checker (CPSC), which identifies the problems in the dataset most similar to an input source code. Based on the tests conducted in this project, this approach achieves state-of-the-art results when it comes to detecting similarity between various code problems. More research can be done in domains where type IV code clone detection can be useful.
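    A minimal sketch of comparing two solutions with UniXcoder embeddings; this mirrors the general approach described above, not the authors' exact pipeline, and assumes the transformers and torch packages plus the public microsoft/unixcoder-base checkpoint:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
    model = AutoModel.from_pretrained("microsoft/unixcoder-base")

    def embed(code: str) -> torch.Tensor:
        inputs = tok(code, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)  # mean-pool token embeddings

    a = embed("def reverse(s):\n    return s[::-1]")
    b = embed("def rev(text):\n    return ''.join(reversed(text))")
    print(torch.cosine_similarity(a, b, dim=0).item())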

  20. Type Of Airplane Dataset

    • universe.roboflow.com
    zip
    Updated Sep 11, 2023
    + more versions
    Cite
    object detection yolov7 (2023). Type Of Airplane Dataset [Dataset]. https://universe.roboflow.com/object-detection-yolov7-0jmxy/type-of-airplane/model/1
    Explore at:
    zip. Available download formats
    Dataset updated
    Sep 11, 2023
    Dataset provided by
    Object detection
    Authors
    object detection yolov7
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Airplane Bounding Boxes
    Description

    Type Of Airplane

    ## Overview
    
    Type Of Airplane is a dataset for object detection tasks - it contains Airplane annotations for 580 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    