U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, the Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public-supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:
PS_HUC12_Tot_2000_2020.csv - a CSV file with estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public su ...
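As a minimal illustration (not part of the data release), the monthly HUC12 CSV files listed above can be read with pandas; their column layout is not documented here, so this sketch only inspects it.

import pandas as pd

# Read everything as strings first so HUC12 codes keep any leading zeros;
# the actual column names are not documented in this description.
ps_total = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype=str)
print(ps_total.columns.tolist())
print(ps_total.head())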
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains code-quality and source-code metrics for 60 versions of 10 different repositories. The dataset is extracted at 3 levels: (1) class, (2) method, and (3) package. It was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to convey each repository's size, popularity, and maintainability. The file versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and how each repository grows over time. The file attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the main dataset as a unique identifier to record values for packages, classes, and methods.
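As a minimal sketch (not part of the dataset), the three CSV files can be inspected with pandas; the exact headers and any join keys are assumptions, so the join is only indicated in a comment.

import pandas as pd

repos = pd.read_csv("repositories.csv")
versions = pd.read_csv("versions.csv")
attributes = pd.read_csv("attribute-details.csv")

# Inspect the actual headers before attempting any joins.
print(repos.columns.tolist())
print(versions.columns.tolist())
print(attributes.columns.tolist())

# Example (hypothetical key): count versions per repository once the shared
# column name is known, e.g. versions.groupby("repository").size()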
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AI4Arctic / ASIP Sea Ice Dataset - version 2 (ASID-v2) contains 461 Sentinel-1 Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute in 2018-2019. The ice charts contain sea ice concentration, stage of development and form of ice, provided as manually drawn polygons. The ice charts have been projected into the S1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the Sentinel-1 data. Details are described in the manual that is published together with the dataset. The manual has been revised; the latest is the 30-09-2020 version.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SynSpeech Dataset (Small Version) is an English-language synthetic speech dataset created using OpenVoice and LibriSpeech-100 for benchmarking disentangled speech representation learning methods. It includes 50 unique speakers, each with 500 distinct sentences spoken in a "default" style at a 16 kHz sampling rate. Data is organized by speaker ID, with a synspeech_Small_Metadata.csv file detailing speaker information, gender, speaking style, text, and file paths. This dataset is ideal for tasks in representation learning, speaker and content factorization, and text-to-speech (TTS) synthesis.
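As a minimal sketch (not provided with the dataset), the metadata file can be inspected with pandas; the exact column headers are assumptions, so only the header is printed and the audio-loading step is shown in comments.

import pandas as pd

meta = pd.read_csv("synspeech_Small_Metadata.csv")
print(meta.columns.tolist())
print(meta.head())

# Audio files could then be loaded with a library such as soundfile, e.g.:
#   import soundfile as sf
#   audio, sr = sf.read(meta.iloc[0]["file_path"])   # hypothetical column name
#   assert sr == 16000                               # 16 kHz sampling rate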
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities and the English gazetteer approximately 23M named entities.
By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result, we provide three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences per version (the exact number varies between versions), while the English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained label. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is reduced compared to the "Fine-Grained NER" versions.
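For illustration only, such a reduction can be expressed as a lookup from fine-grained types to the four coarse labels; the fine-grained type names below are hypothetical examples, not the dataset's actual label inventory, and types the dataset could not map were dropped rather than kept as "misc".

# Hypothetical fine-grained type names, for illustration only.
COARSE_MAP = {
    "politician": "person",
    "football_player": "person",
    "company": "organization",
    "university": "organization",
    "city": "location",
    "river": "location",
}

def to_coarse(fine_type: str) -> str:
    # Here unknown types fall back to "misc"; in the dataset such cases were removed.
    return COARSE_MAP.get(fine_type, "misc")

print(to_coarse("politician"))  # -> person
print(to_coarse("award"))       # -> misc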
All processes are explained in our published white paper for Turkish; the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) are unchanged for English.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ASIP Sea Ice Dataset - version 1 contains 912 Sentinel-1 (S1) Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute from 2014-2017. The ice charts contain sea ice concentrations provided as manually drawn polygons over the scene, and have been projected into the S1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the S1 data. Details are described in the manual that is published together with the dataset.
https://creativecommons.org/publicdomain/zero/1.0/
The MNIST dataset is one of the best known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of the English letters A - J.
This dataset is used extensively in the Udacity Deep Learning course, and is available in the TensorFlow GitHub repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.
notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version of the dataset, with 18,726 images. The dataset was assembled by Yaroslav Bulatov, and can be obtained on his blog. According to that blog entry, there is about a 6.5% label error rate on the large uncleaned dataset, and a 0.5% label error rate on the small hand-cleaned dataset.
The two files each contain 28x28 grayscale images of letters A - J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
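As a hedged sketch (not provided with the dataset), the extracted notMNIST_small directory tree can be loaded into NumPy arrays roughly as follows; the path is a placeholder and occasional unreadable files are skipped.

import numpy as np
from pathlib import Path
from PIL import Image

root = Path("notMNIST_small")   # placeholder: extracted notMNIST_small.zip
images, labels = [], []
for letter_dir in sorted(root.iterdir()):
    if not letter_dir.is_dir():
        continue
    for png in letter_dir.glob("*.png"):
        try:
            images.append(np.asarray(Image.open(png), dtype=np.float32) / 255.0)
            labels.append(letter_dir.name)   # directory name is the letter label
        except OSError:
            pass   # skip the occasional unreadable file

X = np.stack(images)            # shape: (n_samples, 28, 28)
print(X.shape, len(labels))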
Thanks to Yaroslav Bulatov for putting together the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LSD4WSD V2.0
Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.
The aim of this dataset is to provide a basis for automatic learning to detect wet snow. It is based on Sentinel-1 SAR GRD satellite images acquired between August 2020 and August 2021 over the French Alps. The new version of this dataset is no longer simply restricted to a classification task, and provides a set of metadata for each sample.
Modifications and improvements of version 2.0.0:
* Additional per-sample information and statistics (see info.pdf).
* Metadata for each sample, organised into three groups: topography, metadata and physics.
* physics: addition of direct information from the CROCUS model for 3 simulations: Liquid Water Content, snow height and minimum snowpack temperature.
* topography: information on the slope, altitude and average orientation of the sample.
* metadata: information on the date of the sample, the mountain massif and the run (ascending or descending).
We leave it up to the user to use the Group KFold method to validate the models using the alpine massif information.
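As an illustrative sketch, such massif-grouped validation can be set up with scikit-learn's GroupKFold; the arrays below are placeholders, and in practice the features, labels and massif identifiers would come from the dataset's metadata.

import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: 100 flattened 15x15x9 samples, binary labels, 5 massifs.
X = np.random.rand(100, 15 * 15 * 9)
y = np.random.randint(0, 2, size=100)
massifs = np.random.randint(0, 5, size=100)   # one group id per alpine massif

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=massifs)):
    # Samples from the same massif never appear in both train and validation.
    print(fold, len(train_idx), len(val_idx))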
Finally, the dataset consists of 2,467,516 samples of size 15 by 15 by 9. For each sample, 9 metadata fields are provided, drawing in particular on the Crocus physical model:
The 9 channels are in the following order:
* The reference image selected is that of August 9th, 2020, a reference image without snow (cf. Nagler et al.)
An overview of the distribution and a summary of the sample statistics can be found in the file info.pdf.
The data is stored in .hdf5 format with gzip compression. We provide a Python script, dataset_load.py, to read and query the data. It is based on the h5py, numpy and pandas libraries, and allows selecting part or all of the dataset via requests on the metadata. The script is documented and can be used as described in the README.md file.
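For a quick look without the provided loader, a minimal h5py sketch follows; the file name and internal group names used here are assumptions, and dataset_load.py remains the reference way to access the data.

import h5py

with h5py.File("lsd4wsd_v2.hdf5", "r") as f:   # hypothetical file name
    f.visit(print)                             # list the file's internal structure
    # first_key = next(iter(f.keys()))
    # sample = f[first_key][...]               # read one array into memory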
The processing chain is available on GitHub.
The authors would like to acknowledge the support from the National Centre for Space Studies (CNES) in providing computing facilities and access to SAR images via the PEPS platform.
The authors would like to deeply thank Mathieu Fructus for running the Crocus simulations.
Erratum:
In the dataloader file, the name of the "aquisition" column must be added twice; see the correction below:
dtst_ld = Dataset_loader(path_dataset, shuffle=False, descrp=["date", "massif", "aquisition", "aquisition", "elevation", "slope", "orientation", "tmin", "hsnow", "tel"],)
If you have any comments, questions or suggestions, please contact the authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
Here's an example of how the data looks (each class takes three rows):
[Visualization of the Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png]
The dataset is provided as a train set (86% of images; 60,000 images) and a test set (14% of images; 10,000 images) only. The train set is further split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
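As an illustrative sketch (not part of the dataset itself), the data can be loaded through the Keras API and the 80/20 train/validation split described above can be reproduced; the seed and the use of Keras are our own choices.

import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Shuffle once with a fixed seed, then hold out 20% of the train set for validation.
rng = np.random.default_rng(0)
idx = rng.permutation(len(x_train))
n_val = int(0.2 * len(x_train))                 # 12,000 validation images
val_idx, tr_idx = idx[:n_val], idx[n_val:]
x_val, y_val = x_train[val_idx], y_train[val_idx]
x_tr, y_tr = x_train[tr_idx], y_train[tr_idx]
print(x_tr.shape, x_val.shape, x_test.shape)    # (48000, 28, 28) (12000, 28, 28) (10000, 28, 28)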
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is supplementary data for: Waldron, A., Pecci, F., Stoianov, I. (2020). Regularization of an Inverse Problem for Parameter Estimation in Water Distribution Networks. Journal of Water Resources and Planning Management, 146(9):04020076 (https://doi.org/10.1061/(ASCE)WR.1943-5452.0001273).
The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence. Any use of this dataset must credit the authors by citing the above paper.
BWFLnet is an operational network in Bristol, UK, operated by Bristol Water in collaboration with the InfraSense Labs at Imperial College London and Cla-Val Ltd. The data provided is the product of a long-term research partnership between Bristol Water, InfraSense Labs at Imperial College London and Cla-Val on the design and control of dynamically adaptive networks. We acknowledge the financial support of EPSRC (EP/P004229/1, Dynamically Adaptive and Resilient Water Supply Networks for a Sustainable Future).
All data provided is recorded hydraulic data with locations and names anonymised. The authors hope that the publication of this dataset will facilitate the reproducibility of research in hydraulic model calibration as well as broader research in the water distribution sector.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets used in the article "Implementation and empirical evaluation of a quantum machine learning pipeline for local classification". The original versions of these datasets were taken from the UCI Machine Learning Repository; the versions provided here have undergone a preprocessing procedure, as described in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains additional data for the publication "A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry". Its goal is to enable interested people to reproduce the citation analysis carried out in the aforementioned publication.
Prerequisites
The following software versions were used for the Python version of this dataset:
Python: 3.8.6
Scholarly: 1.2.0
Pyzotero: 1.4.24
Numpy: 1.20.1
Contents
results/ : Contains the .csv files that were the results of the citation analysis. Paper groupings follow the ones outlined in the publication.
scripts/ : Contains scripts to perform the citation analysis.
Zotero.cached.pkl : Contains the cached Zotero library.
Usage
In order to reproduce the results of the citation analysis, you can use citation_analysis.py in conjunction with the cached Zotero library. Manual additions can be verified using the check_consistency script.
Please note that you will need a Tor key for the citation analysis, and access to our Zotero library if you don't want to use the cached version. If you need this access, simply contact us.
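For those granted access to the Zotero library rather than using the cached version, a minimal Pyzotero sketch might look as follows; the library ID, library type and API key are placeholders, not values from this dataset.

from pyzotero import zotero

LIBRARY_ID = "0000000"        # placeholder: provided by the authors on request
API_KEY = "your-api-key"      # placeholder: personal Zotero API key

zot = zotero.Zotero(LIBRARY_ID, "group", API_KEY)
items = zot.top(limit=5)      # fetch a few top-level library items
for item in items:
    print(item["data"].get("title"))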
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RTAnews dataset is a collection of multi-label Arabic texts, collected from the Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories, and is divided into 15,001 texts for training and 8,836 texts for testing.
The original dataset (without preprocessing), a preprocessed version of the dataset, versions of the dataset in MEKA and Mulan formats, a single-label version, and a WEKA version are all available.
For any enquiry or support regarding the dataset, please feel free to contact us via bassalemi at gmail dot com
The dataset was gathered on September 17th, 2020 from GitHub. It has clean and complete versions (from v0.7): the clean version has 5.1K type-checked Python repositories and 1.2M type annotations; the complete version has 5.2K Python repositories and 3.3M type annotations. The dataset's source files are type-checked using mypy (clean version). The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Reference: A. Mir, E. Latoskinas and G. Gousios, "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 585-589. doi: 10.1109/MSR52588.2021.00079
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This includes all 3 different versions of the COALA dataset. The COALA100 dataset is a collection of antibiotic resistance genes from 15 databases, along with metadata from these databases including the respective antibiotic class. The COALA70 dataset is the COALA100 dataset clustered with CD-HIT at a 70% threshold. The COALA40 dataset is the COALA100 dataset clustered with CD-HIT at a 40% threshold. All three datasets are in FASTA format. The last section of the description line holds the antibiotic label the gene confers resistance to. The second-to-last section is the name of the database from which the gene was collected. All other sections convey information about the gene.
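As an illustrative sketch, the antibiotic label and source database could be extracted from the FASTA headers as shown below; the "|" field separator and the file name are assumptions, and Biopython is only one convenient option.

from Bio import SeqIO

for record in SeqIO.parse("COALA40.fasta", "fasta"):   # hypothetical file name
    fields = record.description.split("|")             # separator is an assumption
    antibiotic_class = fields[-1].strip()               # last section: antibiotic label
    source_db = fields[-2].strip() if len(fields) > 1 else ""  # second-to-last: database
    print(record.id, source_db, antibiotic_class)
    break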
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "TeaLeafAgeQuality" dataset is curated for tea leaf classification, detection and quality prediction based on leaf age. This dataset encompasses a comprehensive collection of tea leaf images categorized into four classes corresponding to their age-based quality:
Category T1: Age 1 and 2 days, representing the highest quality tea leaves. (562 Raw Images)
Category T2: Age 3 to 4 days, indicating good quality tea leaves. (615 Raw Images)
Category T3: Age 5 to 7 days, indicating average or below-average quality tea leaves. (508 Raw Images)
Category T4: Age 7+ days, denoting tea leaves unsuitable for brewing drinkable tea. (523 Raw Images)
Each category includes images depicting tea leaves at various stages of their age progression, facilitating research and analysis into the relationship between leaf age and tea quality. The dataset aims to contribute to the advancement of deep learning models for tea leaf classification and quality assessment.
This dataset comprises three versions: the first is raw, unannotated data, offering a pure, unmodified collection of tea leaf images collected from different tea gardens located in Sylhet, Bangladesh. The second version includes precise annotations, classified into four categories (T1, T2, T3, and T4) for targeted analysis. Finally, the third version contains both annotated and augmented data, enhancing the dataset for more advanced research applications. Each version caters to a different level of data analysis, from basic to complex.
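As a hedged sketch, assuming the raw version is laid out as one folder per category (T1-T4), the images could be loaded with Keras; the directory path and image size below are placeholders.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "TeaLeafAgeQuality/raw",    # placeholder path; assumes one sub-folder per category
    validation_split=0.2,
    subset="training",
    seed=42,
    image_size=(224, 224),      # resize target is our own choice
    batch_size=32,
)
print(train_ds.class_names)     # expected under this layout: ['T1', 'T2', 'T3', 'T4']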
The Mechanical MNIST Crack Path dataset contains Finite Element simulation results from phase-field models of quasi-static brittle fracture in heterogeneous material domains subjected to prescribed loading and boundary conditions. For all samples, the material domain is a square with a side length of 1. There is an initial crack of fixed length (0.25) on the left edge of each domain. The bottom edge of the domain is fixed in x (horizontal) and y (vertical), the right edge of the domain is fixed in x and free in y, and the left edge is free in both x and y. The top edge is free in x, and in y it is displaced such that, at each step, the displacement increases linearly from zero at the top right corner to the maximum displacement on the top left corner. Maximum displacement starts at 0.0 and increases to 0.02 by increments of 0.0001 (200 simulation steps in total). The heterogeneous material distribution is obtained by adding rigid circular inclusions to the domain using the Fashion MNIST...
This is the replication package for the paper "A Machine Learning Based Ensemble Method for Automatic Classification of Decisions". It contains the source code and dataset of our experiment for replication by other researchers. A brief description of the files in the replication package follows.
experiment.py contains the source code for our experiment, which is conducted on Windows 10 and Python 3.7.0. Note that you may get slightly different experiment results when conducting the experiments on different environment configurations.
requirements.txt records all the installation packages and their version numbers needed for the current program to run. You can use "pip install -r requirements.txt" to rebuild the project and install all dependencies. Note that you may get slightly different experiment results when using different packages or versions.
decisions.xlsx contains 848 labelled sentence-level decisions from the Hibernate developer mailing list.
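As a minimal sketch (column names are not documented here, and reading .xlsx with pandas requires openpyxl), the labelled decisions can be inspected like this:

import pandas as pd

decisions = pd.read_excel("decisions.xlsx")
print(len(decisions))              # expected: 848 labelled sentence-level decisions
print(decisions.columns.tolist())  # inspect the actual column names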