U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public-supply water use for the period 2000-2020. This data release contains model input feature datasets, the Python code used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public-supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public-supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:
PS_HUC12_Tot_2000_2020.csv - a CSV file with estimated monthly public-supply total water use from 2000-2020 by HUC12, in million gallons per day
PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public su ...
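As a minimal illustration (not part of the data release), the monthly HUC12 CSV files listed above can be read with pandas; their column layout is not documented here, so this sketch only inspects it.

import pandas as pd

# Read everything as strings first so HUC12 codes keep any leading zeros;
# the actual column names are not documented in this description.
ps_total = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype=str)
print(ps_total.columns.tolist())
print(ps_total.head())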
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains code-quality and source-code metrics for 60 versions of 10 different repositories. The dataset is extracted at 3 levels: (1) class, (2) method, and (3) package. It was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to convey each repository's size, popularity, and maintainability. The file versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, version repository link) to provide an overview of the versions and how each repository grows over time. The file attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the main dataset as a unique identifier to record values for packages, classes, and methods.
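As a minimal sketch (not part of the dataset), the three CSV files can be inspected with pandas; the exact headers and any join keys are assumptions, so the join is only indicated in a comment.

import pandas as pd

repos = pd.read_csv("repositories.csv")
versions = pd.read_csv("versions.csv")
attributes = pd.read_csv("attribute-details.csv")

# Inspect the actual headers before attempting any joins.
print(repos.columns.tolist())
print(versions.columns.tolist())
print(attributes.columns.tolist())

# Example (hypothetical key): count versions per repository once the shared
# column name is known, e.g. versions.groupby("repository").size()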
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AI4Arctic / ASIP Sea Ice Dataset - version 2 (ASID-v2) contains 461 Sentinel-1 Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute in 2018-2019. The ice charts contain sea ice concentration, stage of development and form of ice, provided as manually drawn polygons. The ice charts have been projected into the S1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the Sentinel-1 data. Details are described in the manual that is published together with the dataset. The manual has been revised; the latest is the 30-09-2020 version.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SynSpeech Dataset (Small Version) is an English-language synthetic speech dataset created using OpenVoice and LibriSpeech-100 for benchmarking disentangled speech representation learning methods. It includes 50 unique speakers, each with 500 distinct sentences spoken in a "default" style at a 16 kHz sampling rate. Data is organized by speaker ID, with a synspeech_Small_Metadata.csv file detailing speaker information, gender, speaking style, text, and file paths. This dataset is ideal for tasks in representation learning, speaker and content factorization, and text-to-speech (TTS) synthesis.
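As a minimal sketch (not provided with the dataset), the metadata file can be inspected with pandas; the exact column headers are assumptions, so only the header is printed and the audio-loading step is shown in comments.

import pandas as pd

meta = pd.read_csv("synspeech_Small_Metadata.csv")
print(meta.columns.tolist())
print(meta.head())

# Audio files could then be loaded with a library such as soundfile, e.g.:
#   import soundfile as sf
#   audio, sr = sf.read(meta.iloc[0]["file_path"])   # hypothetical column name
#   assert sr == 16000                               # 16 kHz sampling rate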
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.
Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities and the English gazetteer approximately 23M named entities.
By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result, we provide three versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. The Turkish collections have approximately 700K sentences per version (the exact number varies between versions), while the English collections contain more than 7M sentences.
We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained label. Note that this process also eliminated many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is reduced compared to the "Fine-Grained NER" versions.
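For illustration only, such a reduction can be expressed as a lookup from fine-grained types to the four coarse labels; the fine-grained type names below are hypothetical examples, not the dataset's actual label inventory, and types the dataset could not map were dropped rather than kept as "misc".

# Hypothetical fine-grained type names, for illustration only.
COARSE_MAP = {
    "politician": "person",
    "football_player": "person",
    "company": "organization",
    "university": "organization",
    "city": "location",
    "river": "location",
}

def to_coarse(fine_type: str) -> str:
    # Here unknown types fall back to "misc"; in the dataset such cases were removed.
    return COARSE_MAP.get(fine_type, "misc")

print(to_coarse("politician"))  # -> person
print(to_coarse("award"))       # -> misc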
All processes are explained in our published white paper for Turkish; the major methods (gazetteer creation, automatic categorization/annotation, noise reduction) are unchanged for English.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ASIP Sea Ice Dataset - version 1 contains 912 Sentinel-1 (S1) Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute from 2014-2017. The ice charts contain sea ice concentrations provided as manually drawn polygons over the scene, and have been projected into the S1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the S1 data. Details are described in the manual that is published together with the dataset.
https://creativecommons.org/publicdomain/zero/1.0/
The MNIST dataset is one of the best known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of the English letters A - J.
This dataset is used extensively in the Udacity Deep Learning course, and is available in the TensorFlow GitHub repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.
notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version of the dataset, with 18,726 images. The dataset was assembled by Yaroslav Bulatov, and can be obtained on his blog. According to that blog entry, there is about a 6.5% label error rate on the large uncleaned dataset, and a 0.5% label error rate on the small hand-cleaned dataset.
The two files each contain 28x28 grayscale images of letters A - J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
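As a hedged sketch (not provided with the dataset), the extracted notMNIST_small directory tree can be loaded into NumPy arrays roughly as follows; the path is a placeholder and occasional unreadable files are skipped.

import numpy as np
from pathlib import Path
from PIL import Image

root = Path("notMNIST_small")   # placeholder: extracted notMNIST_small.zip
images, labels = [], []
for letter_dir in sorted(root.iterdir()):
    if not letter_dir.is_dir():
        continue
    for png in letter_dir.glob("*.png"):
        try:
            images.append(np.asarray(Image.open(png), dtype=np.float32) / 255.0)
            labels.append(letter_dir.name)   # directory name is the letter label
        except OSError:
            pass   # skip the occasional unreadable file

X = np.stack(images)            # shape: (n_samples, 28, 28)
print(X.shape, len(labels))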
Thanks to Yaroslav Bulatov for putting together the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LSD4WSD V2.0
Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.
The aim of this dataset is to provide a basis for automatic learning to detect wet snow. It is based on Sentinel-1 SAR GRD satellite images acquired between August 2020 and August 2021 over the French Alps. The new version of this dataset is no longer simply restricted to a classification task, and provides a set of metadata for each sample.
Modifications and improvements of version 2.0.0:
* Additional per-sample information and statistics (see info.pdf).
* Metadata for each sample, organised into three groups: topography, metadata and physics.
* physics: addition of direct information from the CROCUS model for 3 simulations: Liquid Water Content, snow height and minimum snowpack temperature.
* topography: information on the slope, altitude and average orientation of the sample.
* metadata: information on the date of the sample, the mountain massif and the run (ascending or descending).
We leave it up to the user to use the Group KFold method to validate the models using the alpine massif information.
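As an illustrative sketch, such massif-grouped validation can be set up with scikit-learn's GroupKFold; the arrays below are placeholders, and in practice the features, labels and massif identifiers would come from the dataset's metadata.

import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: 100 flattened 15x15x9 samples, binary labels, 5 massifs.
X = np.random.rand(100, 15 * 15 * 9)
y = np.random.randint(0, 2, size=100)
massifs = np.random.randint(0, 5, size=100)   # one group id per alpine massif

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=massifs)):
    # Samples from the same massif never appear in both train and validation.
    print(fold, len(train_idx), len(val_idx))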
Finally, the dataset consists of 2,467,516 samples of size 15 by 15 by 9. For each sample, 9 metadata fields are provided, drawing in particular on the Crocus physical model:
The 9 channels are in the following order:
* The reference image selected is that of August 9th, 2020, a reference image without snow (cf. Nagler et al.)
An overview of the distribution and a summary of the sample statistics can be found in the file info.pdf.
The data is stored in .hdf5 format with gzip compression. We provide a Python script, dataset_load.py, to read and query the data. It is based on the h5py, numpy and pandas libraries, and allows selecting part or all of the dataset via requests on the metadata. The script is documented and can be used as described in the README.md file.
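For a quick look without the provided loader, a minimal h5py sketch follows; the file name and internal group names used here are assumptions, and dataset_load.py remains the reference way to access the data.

import h5py

with h5py.File("lsd4wsd_v2.hdf5", "r") as f:   # hypothetical file name
    f.visit(print)                             # list the file's internal structure
    # first_key = next(iter(f.keys()))
    # sample = f[first_key][...]               # read one array into memory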
The processing chain is available on GitHub.
The authors would like to acknowledge the support from the National Centre for Space Studies (CNES) in providing computing facilities and access to SAR images via the PEPS platform.
The authors would like to deeply thank Mathieu Fructus for running the Crocus simulations.
Erratum:
In the dataloader file, the name of the "aquisition" column must be added twice; see the correction below:
dtst_ld = Dataset_loader(path_dataset, shuffle=False, descrp=["date", "massif", "aquisition", "aquisition", "elevation", "slope", "orientation", "tmin", "hsnow", "tel"],)
If you have any comments, questions or suggestions, please contact the authors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
Here's an example of how the data looks (each class takes three rows):
[Visualization of the Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png]
The dataset is provided as a train set (86% of images; 60,000 images) and a test set (14% of images; 10,000 images) only. The train set is further split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
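As an illustrative sketch (not part of the dataset itself), the data can be loaded through the Keras API and the 80/20 train/validation split described above can be reproduced; the seed and the use of Keras are our own choices.

import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Shuffle once with a fixed seed, then hold out 20% of the train set for validation.
rng = np.random.default_rng(0)
idx = rng.permutation(len(x_train))
n_val = int(0.2 * len(x_train))                 # 12,000 validation images
val_idx, tr_idx = idx[:n_val], idx[n_val:]
x_val, y_val = x_train[val_idx], y_train[val_idx]
x_tr, y_tr = x_train[tr_idx], y_train[tr_idx]
print(x_tr.shape, x_val.shape, x_test.shape)    # (48000, 28, 28) (12000, 28, 28) (10000, 28, 28)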
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is supplementary data for: Waldron, A., Pecci, F., Stoianov, I. (2020). Regularization of an Inverse Problem for Parameter Estimation in Water Distribution Networks. Journal of Water Resources and Planning Management, 146(9):04020076 (https://doi.org/10.1061/(ASCE)WR.1943-5452.0001273).
The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence. Any use of this dataset must credit the authors by citing the above paper.
BWFLnet is an operational network in Bristol, UK, operated by Bristol Water in collaboration with the InfraSense Labs at Imperial College London and Cla-Val Ltd. The data provided is the product of a long-term research partnership between Bristol Water, InfraSense Labs at Imperial College London and Cla-Val on the design and control of dynamically adaptive networks. We acknowledge the financial support of EPSRC (EP/P004229/1, Dynamically Adaptive and Resilient Water Supply Networks for a Sustainable Future).
All data provided is recorded hydraulic data with locations and names anonymised. The authors hope that the publication of this dataset will facilitate the reproducibility of research in hydraulic model calibration as well as broader research in the water distribution sector.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets used in the article "Implementation and empirical evaluation of a quantum machine learning pipeline for local classification". The original versions of these datasets were taken from the UCI Machine Learning Repository; the versions provided here have undergone a preprocessing procedure, as described in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains additional data for the publication "A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry". Its goal is to enable interested people to reproduce the citation analysis carried out in the aforementioned publication.
Prerequisites
The following software versions were used for the Python version of this dataset:
Python: 3.8.6
Scholarly: 1.2.0
Pyzotero: 1.4.24
Numpy: 1.20.1
Contents
results/ : Contains the .csv files that were the results of the citation analysis. Paper groupings follow the ones outlined in the publication.
scripts/ : Contains scripts to perform the citation analysis.
Zotero.cached.pkl : Contains the cached Zotero library.
Usage
In order to reproduce the results of the citation analysis, you can use citation_analysis.py in conjunction with the cached Zotero library. Manual additions can be verified using the check_consistency script.
Please note that you will need a Tor key for the citation analysis, and access to our Zotero library if you don't want to use the cached version. If you need this access, simply contact us.
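For those granted access to the Zotero library rather than using the cached version, a minimal Pyzotero sketch might look as follows; the library ID, library type and API key are placeholders, not values from this dataset.

from pyzotero import zotero

LIBRARY_ID = "0000000"        # placeholder: provided by the authors on request
API_KEY = "your-api-key"      # placeholder: personal Zotero API key

zot = zotero.Zotero(LIBRARY_ID, "group", API_KEY)
items = zot.top(limit=5)      # fetch a few top-level library items
for item in items:
    print(item["data"].get("title"))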
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The RTAnews dataset is a collection of multi-label Arabic texts, collected from the Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories, and is divided into 15,001 texts for training and 8,836 texts for testing.
The original dataset (without preprocessing), a preprocessed version of the dataset, versions of the dataset in MEKA and Mulan formats, a single-label version, and a WEKA version are all available.
For any enquiry or support regarding the dataset, please feel free to contact us via bassalemi at gmail dot com
The dataset was gathered on September 17th, 2020 from GitHub. It has clean and complete versions (from v0.7): the clean version has 5.1K type-checked Python repositories and 1.2M type annotations; the complete version has 5.2K Python repositories and 3.3M type annotations. The dataset's source files are type-checked using mypy (clean version). The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Reference: A. Mir, E. Latoskinas and G. Gousios, "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 585-589. doi: 10.1109/MSR52588.2021.00079
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This includes all 3 different versions of the COALA dataset. The COALA100 dataset is a collection of antibiotic resistance genes from 15 databases, along with metadata from these databases including the respective antibiotic class. The COALA70 dataset is the COALA100 dataset clustered with CD-HIT at a 70% threshold. The COALA40 dataset is the COALA100 dataset clustered with CD-HIT at a 40% threshold. All three datasets are in FASTA format. The last section of the description line holds the antibiotic label the gene confers resistance to. The second-to-last section is the name of the database from which the gene was collected. All other sections convey information about the gene.
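As an illustrative sketch, the antibiotic label and source database could be extracted from the FASTA headers as shown below; the "|" field separator and the file name are assumptions, and Biopython is only one convenient option.

from Bio import SeqIO

for record in SeqIO.parse("COALA40.fasta", "fasta"):   # hypothetical file name
    fields = record.description.split("|")             # separator is an assumption
    antibiotic_class = fields[-1].strip()               # last section: antibiotic label
    source_db = fields[-2].strip() if len(fields) > 1 else ""  # second-to-last: database
    print(record.id, source_db, antibiotic_class)
    break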
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "TeaLeafAgeQuality" dataset is curated for tea leaf classification, detection and quality prediction based on leaf age. This dataset encompasses a comprehensive collection of tea leaf images categorized into four classes corresponding to their age-based quality:
Category T1: Age 1 and 2 days, representing the highest quality tea leaves. (562 Raw Images)
Category T2: Age 3 to 4 days, indicating good quality tea leaves. (615 Raw Images)
Category T3: Age 5 to 7 days, indicating average or below-average quality tea leaves. (508 Raw Images)
Category T4: Age 7+ days, denoting tea leaves unsuitable for brewing drinkable tea. (523 Raw Images)
Each category includes images depicting tea leaves at various stages of their age progression, facilitating research and analysis into the relationship between leaf age and tea quality. The dataset aims to contribute to the advancement of deep learning models for tea leaf classification and quality assessment.
This dataset comprises three versions: the first is raw, unannotated data, offering a pure, unmodified collection of tea leaf images collected from different tea gardens located in Sylhet, Bangladesh. The second version includes precise annotations, classified into four categories (T1, T2, T3, and T4) for targeted analysis. Finally, the third version contains both annotated and augmented data, enhancing the dataset for more advanced research applications. Each version caters to a different level of data analysis, from basic to complex.
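As a hedged sketch, assuming the raw version is laid out as one folder per category (T1-T4), the images could be loaded with Keras; the directory path and image size below are placeholders.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "TeaLeafAgeQuality/raw",    # placeholder path; assumes one sub-folder per category
    validation_split=0.2,
    subset="training",
    seed=42,
    image_size=(224, 224),      # resize target is our own choice
    batch_size=32,
)
print(train_ds.class_names)     # expected under this layout: ['T1', 'T2', 'T3', 'T4']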
The Mechanical MNIST Crack Path dataset contains Finite Element simulation results from phase-field models of quasi-static brittle fracture in heterogeneous material domains subjected to prescribed loading and boundary conditions. For all samples, the material domain is a square with a side length of 1. There is an initial crack of fixed length (0.25) on the left edge of each domain. The bottom edge of the domain is fixed in x (horizontal) and y (vertical), the right edge of the domain is fixed in x and free in y, and the left edge is free in both x and y. The top edge is free in x, and in y it is displaced such that, at each step, the displacement increases linearly from zero at the top right corner to the maximum displacement on the top left corner. Maximum displacement starts at 0.0 and increases to 0.02 by increments of 0.0001 (200 simulation steps in total). The heterogeneous material distribution is obtained by adding rigid circular inclusions to the domain using the Fashion MNIST...
This is the replication package for the paper "A Machine Learning Based Ensemble Method for Automatic Classification of Decisions". It contains the source code and dataset of our experiment for replication by other researchers. A brief description of the files in the replication package follows.
experiment.py contains the source code for our experiment, which is conducted on Windows 10 and Python 3.7.0. Note that you may get slightly different experiment results when conducting the experiments on different environment configurations.
requirements.txt records all the installation packages and their version numbers needed for the current program to run. You can use "pip install -r requirements.txt" to rebuild the project and install all dependencies. Note that you may get slightly different experiment results when using different packages or versions.
decisions.xlsx contains 848 labelled sentence-level decisions from the Hibernate developer mailing list.
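As a minimal sketch (column names are not documented here, and reading .xlsx with pandas requires openpyxl), the labelled decisions can be inspected like this:

import pandas as pd

decisions = pd.read_excel("decisions.xlsx")
print(len(decisions))              # expected: 848 labelled sentence-level decisions
print(decisions.columns.tolist())  # inspect the actual column names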