32 datasets found
  1. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

    • zenodo.org
    bin
    Updated Aug 24, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044636
    Available download formats: bin
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Check out the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset was gathered on Sep. 17th, 2020.
    • The dataset has more than 5.4K Python repositories that are hosted on GitHub.
    • It contains more than 1.1M type annotations.
    • Please note that this is the first version of the dataset. In the second version, we will provide processed Python projects in JSON files that contain relevant features and hints for the ML-based type inference task.
  2. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 24, 2023
    + more versions
    Cite
    (2023). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Dataset updated
    Oct 24, 2023
    Description

    Using Machine Learning techniques in general, and Deep Learning techniques in particular, requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, real-world datasets are needed to train and test models on.

    The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.

    One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences with 69 images each.

    Instructions for the dataset split: The authors of this dataset provide three types of dataset splits. To get a data split, run the Python script split_dataset.py.

    Script inputs: split-type (mandatory), output directory (mandatory)

    Split types:
    • train_test_split: splits the dataset into train and test data (80%/20%)
    • wear_dev_split: splits the dataset into 27 wear developments
    • type_split: splits the dataset into the different BSD types

    Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
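    A minimal sketch of what such an 80%/20% train_test_split over the image folder could look like (the dataset's own split_dataset.py uses a fixed split; directory names and file patterns here are assumptions):

    ```
    import argparse
    import random
    import shutil
    from pathlib import Path

    def train_test_split(data_dir: Path, output_dir: Path, test_fraction: float = 0.2, seed: int = 0) -> None:
        """Copy images into output_dir/train and output_dir/test (80%/20%)."""
        images = sorted(data_dir.glob("*.jpg"))
        random.Random(seed).shuffle(images)
        n_test = int(len(images) * test_fraction)
        for subset, files in (("test", images[:n_test]), ("train", images[n_test:])):
            target = output_dir / subset
            target.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, target / f.name)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--data_dir", required=True, type=Path)
        parser.add_argument("--output_dir", required=True, type=Path)
        args = parser.parse_args()
        train_test_split(args.data_dir, args.output_dir)
    ```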

  3. Observational Large Ensemble

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    McKinnon, Karen (2023). Observational Large Ensemble [Dataset]. https://search.dataone.org/view/sha256%3A5b1fd53e1fd12d09e5c7c8ff4656eb14d826ac6487fab323ceccae77f7421b6f
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    McKinnon, Karen
    Description

    These Python datasets contain the results presented in the above paper with regard to the variability in trends over North America during DJF due to sampling of internal variability. Two types of files are available.

    The netCDF file contains samples from the synthetic ensemble of DJF temperatures over North America from 1966-2015. The synthetic ensemble is centered on the observed trend. Recentering the ensemble on the ensemble-mean trend from the NCAR CESM1 LENS will create the Observational Large Ensemble, in which each sample can be viewed as a temperature history that could have occurred given various samplings of internal variability. The synthetic ensemble can also be recentered on any other estimate of the forced response to climate change. While the dataset covers both land and ocean, it has only been validated over land.

    The second type of file, presented as Python datasets (.npz), contains the results presented in the McKinnon et al (2017) reference. In particular, it contains the 50-year trends for both the observations and the NCAR CESM1 Large Ensemble that actually occurred, and that could have occurred given a different sampling of internal variability. The bootstrap results can be compared to the true spread across the NCAR CESM1 Large Ensemble for validation, as was done in the manuscript. Each of these files is named based on the observational dataset, variable, time span, and spatial domain. They contain:

    • BETA: the empirical OLS trend
    • BOOTSAMPLES: the OLS trends estimated after bootstrapping
    • INTERANNUALVAR: the interannual variance in the data after modeling and removing the forced trend
    • empiricalAR1: the empirical AR(1) coefficient estimated from the residuals around the forced trend

    The first dimension of all variables is 42, which is a stack of the ensemble mean behavior (index 0), the forty members of the NCAR Large Ensemble (indices 1:40), and the observations (last index, -1). The second dimension is spatial. See latlon.npz for the latitude and longitude vectors. The third dimension, when present, is the bootstrap samples. We have saved 1000 bootstrap samples.
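    As a quick illustration of working with the .npz result files described above (the file name and the key names inside latlon.npz are assumptions; the variable names and the stacking of the first dimension follow the description):

    ```
    import numpy as np

    results = np.load("obs_dataset_variable_timespan_domain.npz")  # hypothetical file name
    latlon = np.load("latlon.npz")

    beta = results["BETA"]            # empirical OLS trends, first dimension of length 42
    boot = results["BOOTSAMPLES"]     # bootstrapped OLS trends (extra bootstrap dimension)

    ensemble_mean_trend = beta[0]     # index 0: ensemble mean behavior
    lens_member_trends = beta[1:41]   # indices 1..40: the forty LENS members
    observed_trend = beta[-1]         # last index: observations

    print(beta.shape, boot.shape)
    print(latlon["lat"].shape, latlon["lon"].shape)  # assumed key names
    ```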

  4. CodeSyntax Dataset

    • paperswithcode.com
    Updated Nov 9, 2022
    Cite
    Da Shen; Xinyun Chen; Chenguang Wang; Koushik Sen; Dawn Song (2022). CodeSyntax Dataset [Dataset]. https://paperswithcode.com/dataset/codesyntax
    Dataset updated
    Nov 9, 2022
    Authors
    Da Shen; Xinyun Chen; Chenguang Wang; Koushik Sen; Dawn Song
    Description

    CodeSyntax is a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. It contains 18,701 code samples annotated with 1,342,050 relation edges in 43 relation types for Python, and 13,711 code samples annotated with 864,411 relation edges in 39 relation types for Java. It is designed to evaluate the performance of language models on code syntax understanding.
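    For intuition, a small sketch of how labelled parent-child edges can be read off a Python abstract syntax tree (an illustration with the standard library only; CodeSyntax defines its own relation types and token-level edge annotations):

    ```
    import ast

    def relation_edges(source: str):
        """List (relation, parent, child) edges of a Python AST."""
        tree = ast.parse(source)
        edges = []
        for parent in ast.walk(tree):
            for field, value in ast.iter_fields(parent):
                children = value if isinstance(value, list) else [value]
                for child in children:
                    if isinstance(child, ast.AST):
                        relation = f"{type(parent).__name__}.{field}"
                        edges.append((relation, parent, child))
        return edges

    edges = relation_edges("def add(a, b):\n    return a + b\n")
    print(len(edges), sorted({r for r, _, _ in edges}))
    ```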

  5. VSAT-2D Example Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 9, 2021
    Cite
    Méndez-Hernández, Hugo (2021). VSAT-2D Example Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4671109
    Dataset updated
    Apr 9, 2021
    Dataset authored and provided by
    Méndez-Hernández, Hugo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset to run the Example.py script of the Valparaíso Stacking Analysis Tool (VSAT-2D). The Valparaíso Stacking Analysis Tool (VSAT-2D) provides a series of tools for selecting, stacking, and analyzing moment-0 intensity maps from interferometric datasets. It is intended for stacking samples of moment-0 maps extracted from interferometric datasets belonging to large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. However, it is also possible to use VSAT-2D on smaller datasets containing any type of astronomical object.

    VSAT-2D can be downloaded from the github repository link.
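    For a rough sense of what "stacking" means here, a numpy sketch of the median / average / weighted-average combination options (toy data, not VSAT-2D's own API):

    ```
    import numpy as np

    # 50 cut-out moment-0 maps resampled onto the same 64x64 grid (toy data)
    maps = np.random.rand(50, 64, 64)
    weights = np.random.rand(50)           # e.g. inverse-variance weights per map

    median_stack = np.median(maps, axis=0)
    mean_stack = np.mean(maps, axis=0)
    weighted_stack = np.average(maps, axis=0, weights=weights)
    ```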

  6. Project CodeNet Dataset

    • paperswithcode.com
    Updated Jun 10, 2022
    Cite
    Ruchir Puri; David S. Kung; Geert Janssen; Wei zhang; Giacomo Domeniconi; Vladimir Zolotov; Julian Dolby; Jie Chen; Mihir Choudhury; Lindsey Decker; Veronika Thost; Luca Buratti; Saurabh Pujar; Shyam Ramji; Ulrich Finkler; Susan Malaika; Frederick Reiss (2022). Project CodeNet Dataset [Dataset]. https://paperswithcode.com/dataset/project-codenet
    Dataset updated
    Jun 10, 2022
    Authors
    Ruchir Puri; David S. Kung; Geert Janssen; Wei zhang; Giacomo Domeniconi; Vladimir Zolotov; Julian Dolby; Jie Chen; Mihir Choudhury; Lindsey Decker; Veronika Thost; Luca Buratti; Saurabh Pujar; Shyam Ramji; Ulrich Finkler; Susan Malaika; Frederick Reiss
    Description

    Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as code size, memory footprint, CPU run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate code samples based on user criteria and to transform code samples into token sequences, simplified parse trees, and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
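    As an illustration of the token-sequence transformation mentioned above (a standard-library sketch for Python submissions only; Project CodeNet ships its own multi-language tooling):

    ```
    import io
    import tokenize

    def to_token_sequence(code: str) -> list[str]:
        """Flatten a Python code sample into a sequence of token strings."""
        tokens = []
        for tok in tokenize.generate_tokens(io.StringIO(code).readline):
            # drop purely structural tokens
            if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                            tokenize.DEDENT, tokenize.ENDMARKER):
                continue
            tokens.append(tok.string)
        return tokens

    print(to_token_sequence("n = int(input())\nprint(n * 2)\n"))
    ```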

    The rich annotation of Project CodeNet enables research in code search, code completion, code-code translation, and a myriad of other use cases. We also extracted several benchmarks in Python, Java and C++ to drive innovation in deep learning and machine learning models in code classification and code similarity.

    Citation

    @inproceedings{puri2021codenet,
      author = {Ruchir Puri and David Kung and Geert Janssen and Wei Zhang and Giacomo Domeniconi and Vladimir Zolotov and Julian Dolby and Jie Chen and Mihir Choudhury and Lindsey Decker and Veronika Thost and Luca Buratti and Saurabh Pujar and Ulrich Finkler},
      title = {Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
      year = {2021},
    }

  7. Dataset for Generation of multiple true false questions

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Nov 8, 2022
    Cite
    Regina Kasakowskij; Thomas Kasakowskij; Niels Seidel (2022). Dataset for Generation of multiple true false questions [Dataset]. http://doi.org/10.5281/zenodo.7303300
    Available download formats: zip
    Dataset updated
    Nov 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Regina Kasakowskij; Thomas Kasakowskij; Niels Seidel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generation of multiple true-false questions

    This project provides a natural language pipeline that takes German textbook sections as input and generates multiple true-false questions using GPT-2.

    Assessments are an important part of the learning cycle and enable the development and promotion of competencies. However, the manual creation of assessments is very time-consuming. Therefore, the number of tasks in learning systems is often limited. In this repository, we provide an algorithm that can automatically generate an arbitrary number of German True False statements from a textbook using the GPT-2 model. The algorithm was evaluated with a selection of textbook chapters from four academic disciplines (see `data` folder) and rated by individual domain experts. One-third of the generated MTF Questions are suitable for learning. The algorithm provides instructors with an easier way to create assessments on chapters of textbooks to test factual knowledge.

    As a type of Multiple-Choice question, Multiple True False (MTF) Questions are, among other question types, a simple and efficient way to objectively test factual knowledge. The learner is challenged to distinguish between true and false statements. MTF questions can be presented differently, e.g. by locating a true statement from a series of false statements, identifying false statements among a list of true statements, or separately evaluating each statement as either true or false. Learners must evaluate each statement individually because a question stem can contain both incorrect and correct statements. Thus, MTF Questions as a machine-gradable format have the potential to identify learners’ misconceptions and knowledge gaps.

    Example MTF question:

    Check the correct statements:

    [ ] All trees have green leaves.

    [ ] Trees grow towards the sky.

    [ ] Leaves can fall from a tree.

    Features

    - generation of false statements

    - automatic selection of true statements

    - selection of an arbitrary similarity for true and false statements as well as the number of false statements

    - generating false statements by adding or deleting negations as well as using a German GPT-2

    Setup

    Installation

    1. Create a new environment: `conda create -n mtfenv python=3.9`

    2. Activate the environment: `conda activate mtfenv`

    3. Install dependencies using anaconda:

    ```

    conda install -y -c conda-forge pdfplumber

    conda install -y -c conda-forge nltk

    conda install -y -c conda-forge pypdf2

    conda install -y -c conda-forge pylatexenc

    conda install -y -c conda-forge packaging

    conda install -y -c conda-forge transformers

    conda install -y -c conda-forge essential_generators

    conda install -y -c conda-forge xlsxwriter

    ```

    4. Download the spaCy German model: `python3.9 -m spacy download de_core_news_lg`

    Getting started

    After installation, you can execute the bash script `bash run.sh` in the terminal to compile MTF questions for the provided textbook chapters.

    To create MTF questions for your own texts use the following command:

    `python3 main.py --answers 1 --similarity 0.66 --input ./`

    The parameter `answers` indicates how many false answers should be generated.

    By configuring the parameter `similarity` you can determine what portion of a sentence should remain the same. The remaining portion will be extracted and used to generate a false part of the sentence.
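    To make the idea concrete, a hypothetical sketch of generating a distractor by keeping a `similarity` fraction of a true sentence and letting a German GPT-2 complete the rest (the model name and decoding settings are assumptions, not this repository's exact configuration):

    ```
    from transformers import pipeline

    # assumed publicly available German GPT-2 checkpoint
    generator = pipeline("text-generation", model="dbmdz/german-gpt2")

    def false_statement(sentence: str, similarity: float = 0.66) -> str:
        """Keep the first `similarity` portion of the sentence, regenerate the rest."""
        words = sentence.split()
        keep = max(1, int(len(words) * similarity))
        prompt = " ".join(words[:keep])
        out = generator(prompt, max_new_tokens=20, num_return_sequences=1)
        return out[0]["generated_text"]

    print(false_statement("Bäume wachsen in Richtung Himmel."))
    ```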

    History and roadmap

    * Outlook third iteration: Automatic augmentation of text chapters with generated questions

    * Second iteration: Generation of multiple true-false questions with improved text summarizer and German GPT2 sentence generator

    * First iteration: Generation of multiple true false questions in the Bachelor thesis of Mirjam Wiemeler

    Publications, citations, license

    Publications

    • Kasakowskij, R., Kasakowskij, T. & Seidel, N., (2022). Generation of Multiple True False Questions. In: Henning, P. A., Striewe, M. & Wölfel, M. (Hrsg.), 20. Fachtagung Bildungstechnologien (DELFI). Bonn: Gesellschaft für Informatik e.V.. (S. 147-152). DOI: [10.18420/delfi2022-026](https://dl.gi.de/handle/20.500.12116/38826)

    Citation of the Dataset

    The source code and data are maintained at GitHub: https://github.com/D2L2/multiple-true-false-question-generation

    Contact

    • Regina Kasakowskij (M.A.) - regina.kasakowskij@fernuni-hagen.de
    • Dr. Niels Seidel - niels.seidel@fernuni-hagen.de

    License Distributed under the MIT License. See [LICENSE.txt](https://gitlab.pi6.fernuni-hagen.de/la-diva/adaptive-assessment/generationofmultipletruefalsequestions/-/blob/master/LICENSE.txt) for more information.

    Acknowledgments This research was supported by CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics of the FernUniversität in Hagen, Germany.

    This project was carried out as part of research in the CATALPA project [LA DIVA](https://www.fernuni-hagen.de/forschung/schwerpunkte/catalpa/forschung/projekte/la-diva.shtml)

  8. VSAT-3D Example Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Apr 9, 2021
    Cite
    Hugo Méndez-Hernández (2021). VSAT-3D Example Dataset [Dataset]. http://doi.org/10.5281/zenodo.4671101
    Available download formats: application/gzip
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Méndez-Hernández
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset to run the Example.py script of the Valparaíso Stacking Analysis Tool (VSAT-3D). The Valparaíso Stacking Analysis Tool (VSAT-3D) provides a series of tools for selecting, stacking, and analyzing 3D spectra. It is intended for stacking samples of datacubes extracted from interferometric datasets belonging to large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. However, it is also possible to use VSAT-3D on smaller datasets containing any type of astronomical object.

    VSAT-3D can be downloaded from the github repository link.

  9. Aerial Semantic Drone Dataset

    • kaggle.com
    Updated May 25, 2021
    + more versions
    Cite
    Lalu Erfandi Maula Yusnu (2021). Aerial Semantic Drone Dataset [Dataset]. https://www.kaggle.com/nunenuh/semantic-drone/discussion
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 25, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lalu Erfandi Maula Yusnu
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Aerial Semantic Drone Dataset

    The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above the ground. A high-resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.

    This dataset is taken from https://www.kaggle.com/awsaf49/semantic-drone-dataset. We removed and added files and information as needed for our research purposes. We created TIFF files with a resolution of 1200x800 pixels and 24 channels, where each channel represents a class preprocessed from the PNG label files. We reduced the resolution and compressed the TIFF files with the tifffile Python library.
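    A small sketch of reading one of these 24-channel label TIFFs and collapsing it into a single class-index mask (the file path is hypothetical; the channel-per-class layout follows the description above):

    ```
    import numpy as np
    import tifffile

    label = tifffile.imread("aerial_semantic_drone/labels/tiff/000.tif")  # hypothetical file
    if label.shape[-1] == 24:                    # move channels first if stored channels-last
        label = np.moveaxis(label, -1, 0)

    class_index_mask = np.argmax(label, axis=0)  # one class id per pixel
    print(class_index_mask.shape, np.unique(class_index_mask))
    ```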

    If you have any problems with the modified TIFF dataset, you can contact nunenuh@gmail.com or gaungalif@gmail.com.

    This dataset is a copy of the original dataset (link below); we provide some improvements to the semantic data and classes. Semantic data is available in PNG and TIFF format at a smaller size as needed.

    Semantic Annotation

    The images are labelled densely using polygons and contain the following 24 classes:

    unlabeled, paved-area, dirt, grass, gravel, water, rocks, pool, vegetation, roof, wall, window, door, fence, fence-pole, person, dog, car, bicycle, tree, bald-tree, ar-marker, obstacle, conflicting

    Directory Structure and Files

    > images
    > labels/png
    > labels/tiff
     - class_to_idx.json
     - classes.csv
     - classes.json
     - idx_to_class.json
    

    Included Data

    • 400 training images in jpg format can be found in "aerial_semantic_drone/images"
    • Dense semantic annotations in png format can be found in "aerial_semantic_drone/labels/png"
    • Dense semantic annotations in tiff format can be found in "aerial_semantic_drone/labels/tiff"
    • Semantic class definition in csv format can be found in "aerial_semantic_drone/classes.csv"
    • Semantic class definition in json can be found in "aerial_semantic_drone/classes.json"
    • Index to class name file can be found in "aerial_semantic_drone/idx_to_class.json"
    • Class name to index file can be found in "aerial_semantic_drone/class_to_idx.json"

    Contact

    aerial@icg.tugraz.at

    Citation

    If you use this dataset in your research, please cite the following URL: www.dronedataset.icg.tugraz.at

    License

    The Drone Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use the data given that you agree:

    That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (Graz University of Technology) do not accept any responsibility for errors or omissions. That you include a reference to the Semantic Drone Dataset in any work that makes use of the dataset. For research papers or other media link to the Semantic Drone Dataset webpage.

    That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in as far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow to recover the dataset or something similar in character. That you may not use the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain. That all rights not expressly granted to you are reserved by us (Graz University of Technology).

  10. wine_quality

    • tensorflow.org
    • beta.dataverse.org
    • +1more
    Updated Nov 23, 2022
    Cite
    (2022). wine_quality [Dataset]. https://www.tensorflow.org/datasets/catalog/wine_quality
    Dataset updated
    Nov 23, 2022
    Description

    Two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    Number of Instances: red wine - 1599; white wine - 4898

    Input variables (based on physicochemical tests):

    1. fixed acidity
    2. volatile acidity
    3. citric acid
    4. residual sugar
    5. chlorides
    6. free sulfur dioxide
    7. total sulfur dioxide
    8. density
    9. pH
    10. sulphates
    11. alcohol

    Output variable (based on sensory data):

    1. quality (score between 0 and 10)

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wine_quality', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  11. Database of Stream Crossings in the United States

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Database of Stream Crossings in the United States [Dataset]. https://catalog.data.gov/dataset/database-of-stream-crossings-in-the-united-states
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    This USGS data release is intended to provide a base layer of information on likely stream crossings throughout the United States. The geopackage provides likely crossings of infrastructure and streams and provides observed information that helps validate modeled crossings and build knowledge about associated conditions through time (e.g. crossing type, crossing condition). Stream crossings were developed by intersecting the 2020 United States Census Bureau Topologically Integrated Geographic Encoding and Referencing (TIGER) U.S. road lines with the National Hydrography Dataset High Resolution flowlines. The current version of this data release specifically focuses on road stream crossings (i.e. TIGER2020 Roads) but is designed to support additions of other crossing types that may be included in future iterations (e.g. rail). In total, 6,608,268 crossings are included in the dataset, and 496,564 observations from the U.S. Department of Transportation, Federal Highway Administration's 2019 National Bridge Inventory (NBI) are included to help identify crossing types of bridges and culverts. This data release also contains Python code that documents methods of data development.
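    A conceptual sketch of the intersection step described above, deriving candidate crossing points where road lines cross stream flowlines (file names and the exact workflow are assumptions; the data release ships its own Python code):

    ```
    import geopandas as gpd

    roads = gpd.read_file("tiger2020_roads.shp")                        # hypothetical extract
    streams = gpd.read_file("nhd_hr_flowlines.shp").to_crs(roads.crs)   # hypothetical extract

    # find road/stream pairs whose geometries intersect, then compute the crossing points
    pairs = gpd.sjoin(roads, streams, predicate="intersects")
    points = [
        row.geometry.intersection(streams.geometry.loc[row["index_right"]])
        for _, row in pairs.iterrows()
    ]

    crossings = gpd.GeoDataFrame(geometry=points, crs=roads.crs)
    crossings.to_file("stream_crossings.gpkg", driver="GPKG")
    ```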

  12. VSAT-1D Example Dataset

    • zenodo.org
    application/gzip
    Updated Mar 20, 2021
    Cite
    Hugo Méndez-Hernández (2021). VSAT-1D Example Dataset [Dataset]. http://doi.org/10.5281/zenodo.4624030
    Available download formats: application/gzip
    Dataset updated
    Mar 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Méndez-Hernández
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset to run the example script of the Valparaíso Stacking Analysis Tool (VSAT-1D). The Valparaíso Stacking Analysis Tool (VSAT) provides a series of tools for selecting, stacking, and analyzing 1D spectra. It is intended for stacking samples of spectra belonging to large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. However, it is also possible to use VSAT on smaller datasets containing any type of astronomical object.

    VSAT can be downloaded from the github repository link.

  13. HUN GW Model code v01

    • demo.dev.magda.io
    • researchdata.edu.au
    • +2more
    zip
    Updated Dec 4, 2022
    + more versions
    Cite
    Bioregional Assessment Program (2022). HUN GW Model code v01 [Dataset]. https://demo.dev.magda.io/dataset/ds-dga-266195fc-4c99-4bd5-9788-f540b72c5b15
    Available download formats: zip
    Dataset updated
    Dec 4, 2022
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme without the use of source datasets. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    Computer code and templates used to create the Hunter groundwater model. Broadly speaking, there are two types of files: those in templates_and_inputs, which are template files used by the code, and everything else, which is the computer code itself. An example of the files in templates_and_inputs are the uaXXXX.txt files, which describe the parameters used in uncertainty analysis XXXX. Much of the computer code is in the form of Python scripts, and most of these are run using either preprocess.py or postprocess.py (using subprocess.call). Each of the Python scripts employs optparse, and so is largely self-documenting. Each of the Python scripts also requires an index file as an input, which is an XML file containing all metadata associated with the model-building process, so that the scripts can discover where the raw data needed to build the model is located. The HUN GW Model v01 contains the index file (index.xml) used to build the Hunter groundwater model. Finally, the "code" directory contains a snapshot of the MOOSE C++ code used to run the model.

    Dataset History: Computer code and templates were written by hand.

    Dataset Citation: Bioregional Assessment Programme (2016) HUN GW Model code v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/e54a1246-0076-4799-9ecf-6d673cf5b1da.
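    A skeleton of how such an optparse-driven script might look (option names and the XML structure are assumptions for illustration; the actual scripts are part of the dataset):

    ```
    import xml.etree.ElementTree as ET
    from optparse import OptionParser

    parser = OptionParser(usage="usage: %prog [options]")
    parser.add_option("--index", dest="index", default="index.xml",
                      help="XML index file with model-building metadata")
    (options, args) = parser.parse_args()

    # walk the metadata index to discover where the raw inputs live
    root = ET.parse(options.index).getroot()
    for entry in root:
        print(entry.tag, entry.attrib)
    ```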

  14. Data from: IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Paul Groth (2023). IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7824817
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Paul Groth
    Emile van Krieken
    Peter Bloem
    Thiviyan Thanapalasingam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IntelliGraphs is a collection of datasets for benchmarking Knowledge Graph Generation models. It consists of three synthetic datasets (syn-paths, syn-tipr, syn-types) and two real-world datasets (wd-movies, wd-articles). There is also a Python package available that loads these datasets and verifies new graphs using semantics that were pre-defined for each dataset. It can also be used as a testbed for developing new generative models.

  15. Vegetation type of China; Python code for calculating ecosystem resilience and early warning signals

    • figshare.com
    text/x-python
    Updated Aug 2, 2024
    Cite
    Yu Zhang (2024). Vegetation type of China; Python code for calculating ecosystem resilience and early warning signals [Dataset]. http://doi.org/10.6084/m9.figshare.24999290.v1
    Available download formats: text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    figshare
    Authors
    Yu Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    The vegetation type dataset is derived from the 1:1,000,000 Atlas of the Vegetation of China. This atlas is a summarizing achievement of vegetation ecologists in China over the past 40 years, following the publication of monographs such as Vegetation of China, and is a basic map of the country's natural resources and natural conditions. It was prepared by more than 250 experts from 53 units, including relevant institutes of the Chinese Academy of Sciences, relevant ministries and departments of provinces and districts, and institutions of higher learning, and officially published by the Science Press for domestic and international public distribution.

    The Bayesian dynamic linear model proposed by Liu et al. (2019, https://doi.org/10.1038/s41558-019-0583-9) was used to calculate the time-varying measure of resilience. We have modified the parameters of the code to be more suitable for the Loess Plateau and Qinba Mountains in China. From the resilience results, we can obtain early warning signals for forests.
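    For orientation only, a simple rolling-window early-warning indicator (rising lag-1 autocorrelation in a detrended series) is sketched below; this is not the Bayesian dynamic linear model of Liu et al. (2019) used in the dataset's code:

    ```
    import numpy as np
    import pandas as pd

    def rolling_lag1_autocorr(series: pd.Series, window: int = 60) -> pd.Series:
        """Lag-1 autocorrelation in a rolling window after removing a slow trend."""
        trend = series.rolling(window, center=True, min_periods=1).mean()
        detrended = series - trend
        return detrended.rolling(window).apply(lambda x: x.autocorr(lag=1), raw=False)

    ndvi = pd.Series(np.random.rand(480))   # stand-in for a monthly vegetation index
    ews = rolling_lag1_autocorr(ndvi)
    print(ews.dropna().tail())
    ```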

  16. Python Annotated Code Search (PACS) Datasets & Pretrained Models

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 27, 2020
    + more versions
    Cite
    Tom Van Cutsem (2020). Python Annotated Code Search (PACS) Datasets & Pretrained Models [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4001601
    Dataset updated
    Aug 27, 2020
    Dataset provided by
    Geert Heyman
    Tom Van Cutsem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains datasets and pre-trained models used for the paper Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent. The code for easily loading these datasets and models will be made available here: http://github.com/nokia/codesearch

    Datasets

    There are three types of datasets:

    snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated

    code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test

    training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20

    The staqc-py-cleaned snippet collection, and the conala-curated datasets were derived from existing corpora:

    staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, LICENSE.

    conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/ , LICENSE

    The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).

    Pre-trained models

    Each model can embed queries and (annotated) code snippets in the same space; a generic retrieval sketch follows the model list below. The models are released under a BSD 3-Clause License.

    ncs-embedder-so-ds-feb20

    ncs-embedder-staqc-py

    tnbow-embedder-so-ds-feb20

    use-embedder-pacs

    ensemble-embedder-pacs
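    A generic sketch of how such embedders are typically used for retrieval once queries and snippets have been embedded into the same space (the embedding step itself is not shown; vector sizes here are made up and this is not the nokia/codesearch API):

    ```
    import numpy as np

    def cosine_rank(query_vec: np.ndarray, snippet_vecs: np.ndarray, top_k: int = 5):
        """Rank snippet vectors by cosine similarity to the query vector."""
        q = query_vec / np.linalg.norm(query_vec)
        s = snippet_vecs / np.linalg.norm(snippet_vecs, axis=1, keepdims=True)
        scores = s @ q
        order = np.argsort(-scores)[:top_k]
        return list(zip(order.tolist(), scores[order].tolist()))

    # toy vectors standing in for model output
    print(cosine_rank(np.random.rand(256), np.random.rand(1000, 256)))
    ```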

  17. UWB Ranging and Localization Dataset for "High-Accuracy Ranging and Localization with Ultra-Wideband Communication for Energy-Constrained Devices"

    • zenodo.org
    zip
    Updated Nov 3, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Flueratoru Laura (2021). UWB Ranging and Localization Dataset for "High-Accuracy Ranging and Localization with Ultra-Wideband Communication for Energy-Constrained Devices" [Dataset]. http://doi.org/10.5281/zenodo.4686379
    Available download formats: zip
    Dataset updated
    Nov 3, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Flueratoru Laura
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    UWB Ranging and Localization Dataset for "High-Accuracy Ranging and Localization 
    with Ultra-Wideband Communication for Energy-Constrained Devices"
    
    This dataset accompanies the paper "High-Accuracy Ranging and Localization with
    Ultra-Wideband Communication for Energy-Constrained Devices," by L. Flueratoru, S.
    Wehrli, M. Magno, S. Lohan, D. Niculescu, accepted for publication in the IEEE
    Internet of Things Journal. Please refer to the paper for more information
    about analyzing the data. If you find this dataset useful, please consider citing 
    our paper in your work.
    
    This dataset is split into two parts: "ranging" and "localization." Both parts
    contain measurements acquired with 3db Access and Decawave MDEK1001 UWB devices. 
    In the "3db" and "decawave" datasets, when a recording has the same name, it means
    that the measurements were acquired at the exact same locations with the two
    types of devices. The "3db" ranging dataset contains, apart from these, more
    measurements acquired in various LOS and NLOS scenarios. In the directory
    "images" you can find photos of some of the setups.
    
    The "ranging" and "localization" directories both contain a "data" directory
    which holds the datasets and a "code" directory with Python scripts that show
    how to read and analyze the data.
    
    The 3db Access ranging recordings contain the following data:
    - True distance
    - Measured distance
    - Channel on which the measurements were acquired (can be 6.5, 7, or 7.5 GHz)
    - Time of arrival as identified by the chipset
    - Channel impulse response (CIR)
    - Line-of-sight (LOS)/non-line-of-sight (NLOS) scenario (encoded as 0 and 1, respectively)
    - If NLOS, the type of NLOS obstruction and its thickness.
    
    The Decawave ranging recordings contain the following data:
    - True distance
    - Measured distance
    - Line-of-sight (LOS)/non-line-of-sight (NLOS) scenario (encoded as 0 and 1, respectively)
    - If NLOS, the type of NLOS obstruction and its thickness.
    
    The MDEK kit operates only on the 6.5 GHz channel and cannot output the CIR 
    without further code modifications, which is why this data is not available 
    for the Decawave dataset.
    
    The localization dataset includes the following data:
    - True location as measured by an HTC Vive system
    - Estimated location using a Gauss-Newton trilateration algorithm (please refer 
    to the paper for more details; a minimal sketch follows this list)
    - Distance measurements between each anchor and the tag.
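    A minimal Gauss-Newton trilateration sketch matching the localization estimate described above (anchor positions and distances below are made up; the paper's implementation may differ in details such as weighting and initialization):

    ```
    import numpy as np

    def trilaterate(anchors: np.ndarray, distances: np.ndarray, iters: int = 20) -> np.ndarray:
        """Estimate a tag position from anchor positions and measured ranges."""
        x = anchors.mean(axis=0)                 # initial guess: anchor centroid
        for _ in range(iters):
            diff = x - anchors                   # (n_anchors, dim)
            est = np.linalg.norm(diff, axis=1)   # predicted ranges at the current guess
            residuals = est - distances
            J = diff / est[:, None]              # Jacobian of the ranges w.r.t. x
            step, *_ = np.linalg.lstsq(J, residuals, rcond=None)
            x = x - step                         # Gauss-Newton update
        return x

    anchors = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
    true_pos = np.array([1.5, 2.0])
    dists = np.linalg.norm(anchors - true_pos, axis=1)
    print(trilaterate(anchors, dists))           # should be close to true_pos
    ```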
    
    
    
  18. conll2003

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Dec 22, 2022
    Cite
    (2022). conll2003 [Dataset]. https://www.tensorflow.org/datasets/catalog/conll2003
    Dataset updated
    Dec 22, 2022
    Description

    The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('conll2003', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. ‘COVID-19 dataset in Japan’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/beaf3665/?iid=011-326&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    1. Context

    This is a COVID-19 dataset in Japan. This does not include the cases in the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) and the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture). It covers:
    • Total number of cases in Japan
    • The number of vaccinated people (New/experimental)
    • The number of cases at prefecture level
    • Metadata of each prefecture

    Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

    This dataset can be retrieved with CovsirPhy (Python library).

    pip install covsirphy --upgrade
    
    import covsirphy as cs
    data_loader = cs.DataLoader()
    japan_data = data_loader.japan()
    # The number of cases (Total/each province)
    clean_df = japan_data.cleaned()
    # Metadata
    meta_df = japan_data.meta()
    

    Please refer to CovsirPhy Documentation: Japan-specific dataset.

    Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

    1.1 Total number of cases in Japan

    covid_jpn_total.csv Cumulative number of cases: - PCR-tested / PCR-tested and positive - with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020) - discharged - fatal

    The number of cases: - requiring hospitalization (from 09May2020) - hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020) - requiring hospitalization, but waiting in hotels or at home (to 08May2020)

    In primary source, some variables were removed on 09May2020. Values are NA in this dataset from 09May2020.

    Manually collected the data from Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    The number of vaccinated people: - Vaccinated_1st: the number of vaccinated persons for the first time on the date - Vaccinated_2nd: the number of vaccinated persons with the second dose on the date - Vaccinated_3rd: the number of vaccinated persons with the third dose on the date

    Data sources for vaccination: - To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (Ministry of Health, Labour and Welfare: COVID-19 vaccination records, in Japanese) - 首相官邸 新型コロナワクチンについて (Prime Minister's Office: About the COVID-19 vaccine) - From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報) (Prime Minister's Office: COVID-19 vaccine information)

    1.2 The number of cases at prefecture level

    covid_jpn_prefecture.csv Cumulative number of cases: - PCR-tested / PCR-tested and positive - discharged - fatal

    The number of cases: - requiring hospitalization (from 09May2020) - hospitalized with severe symptoms (from 09May2020)

    Using pdf-excel converter, manually collected the data from Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.

    1.3 Metadata of each prefecture

    covid_jpn_metadata.csv - Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表 - Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)

    2. Acknowledgements

    To create this dataset, edited and transformed data of the following sites was used.

    厚生労働省 Ministry of Health, Labour and Welfare, Japan:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English) 厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan: 国土交通省 HP (in Japanese) 国土交通省 HP (in English) 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    Code for Japan / COVID-19 Japan: Code for Japan COVID-19 Japan Dashboard (CC BY 4.0) COVID-19 Japan 都道府県別 感染症病床数 (CC BY)

    Wikipedia: Wikipedia

    LinkData: LinkData (Public Domain)

    Inspiration

    1. Changes in number of cases over time
    2. Percentage of patients without symptoms / mild or severe symptoms
    3. What to do next to prevent outbreak

    License and how to cite

    Kindly cite this dataset under CC BY-4.0 license as follows. - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

    --- Original source retains full ownership of the source dataset ---

  20. Examples of tweets texts (Portuguese).

    • figshare.com
    • plos.figshare.com
    xls
    Updated Feb 3, 2025
    + more versions
    Cite
    Sylvia Iasulaitis; Alan Demétrius Baria Valejo; Bruno Cardoso Greco; Vinicius Gonçalves Perillo; Guilherme Henrique Messias; Isabella Vicari (2025). Examples of tweets texts (Portuguese). [Dataset]. http://doi.org/10.1371/journal.pone.0316626.t007
    Available download formats: xls
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Sylvia Iasulaitis; Alan Demétrius Baria Valejo; Bruno Cardoso Greco; Vinicius Gonçalves Perillo; Guilherme Henrique Messias; Isabella Vicari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The main objective of this study is to describe the process of collecting data extracted from Twitter (X) during the Brazilian presidential elections in 2022, encompassing the post-election period and the attack on the buildings of the executive, legislative, and judiciary branches in January 2023. The work of collecting data took one year. Additionally, the study provides an overview of the general characteristics of the dataset created from 282 million tweets, named “The Interfaces Twitter Elections Dataset” (ITED-Br), the third most extensive dataset of tweets with political purposes. The process of collecting and creating the database for this study went through three major stages, subdivided into several processes: (1) a preliminary analysis of the platform and its operation; (2) contextual analysis, creation of the conceptual model, and definition of keywords; and (3) implementation of the data collection strategy. Python algorithms were developed to model each primary collection type. The “token farm” algorithm was employed to iterate over available API keys. While Twitter is generally a “public” access platform and fits into big data standards, extracting valuable information is not trivial due to the volume, speed, and heterogeneity of data. This study concludes that acquiring informational value requires expertise not only in sociopolitical areas but also in computational and informational studies, highlighting the interdisciplinary nature of such research.
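    A conceptual sketch of the “token farm” idea described above, rotating through a pool of API keys when one hits its rate limit (the request function, exception type, and key handling are placeholders, not the study's actual collection code or a specific Twitter/X client):

    ```
    import itertools
    import time

    class RateLimitError(Exception):
        """Placeholder for whatever rate-limit error the API client raises."""

    API_KEYS = ["KEY_1", "KEY_2", "KEY_3"]   # hypothetical credentials
    key_cycle = itertools.cycle(API_KEYS)

    def collect(query, fetch_page, max_pages=1000):
        """Yield pages of results, switching API keys whenever a rate limit is hit."""
        key = next(key_cycle)
        for _ in range(max_pages):
            try:
                yield fetch_page(query, api_key=key)   # fetch_page: user-supplied callable
            except RateLimitError:
                key = next(key_cycle)                  # move to the next key instead of waiting
                time.sleep(1)
    ```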
