100+ datasets found
  1. Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0)

    • data.usgs.gov
    • datasets.ai
    • +2 more
    Updated Sep 17, 2024
    Cite
    Ayman Alzraiee; Carol Luukkonen; Richard Niswonger; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Lisa Miller; Kristen Valseth; Joshua Larsen; Donald Martin; Cheryl Dieter; Jana Stewart; Scott Paulinski (2024). Machine learning model that estimates total monthly and annual per capita public-supply water use (version 2.0) [Dataset]. http://doi.org/10.5066/P9FUL880
    Explore at:
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Ayman Alzraiee; Carol Luukkonen; Richard Niswonger; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Lisa Miller; Kristen Valseth; Joshua Larsen; Donald Martin; Cheryl Dieter; Jana Stewart; Scott Paulinski
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 31, 2020
    Description

    This child item describes a machine learning model that was developed to estimate public-supply water use by water service area (WSA) boundary and 12-digit hydrologic unit code (HUC12) for the conterminous United States. This model was used to develop an annual and monthly reanalysis of public supply water use for the period 2000-2020. This data release contains model input feature datasets, python codes used to develop and train the water use machine learning model, and output water use predictions by HUC12 and WSA. Public supply water use estimates and statistics files for HUC12s are available on this child item landing page. Public supply water use estimates and statistics for WSAs are available in public_water_use_model.zip. This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public su ...
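    To make the file structure concrete, here is a minimal pandas sketch for reading the monthly HUC12 totals. The file name comes from the listing above; the HUC12 column name is an assumption, so inspect the actual header after download.

        import pandas as pd

        # File name from this data release. Reading the HUC12 code as a string
        # (column name assumed here) preserves leading zeros in the 12-digit codes.
        df = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"HUC12": str})
        print(df.shape)
        print(df.columns.tolist())  # verify the real schema before any analysis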

  2. Software code quality and source code metrics dataset

    • data.mendeley.com
    • narcis.nl
    Updated Feb 17, 2021
    Cite
    Sayed Mohsin Reza (2021). Software code quality and source code metrics dataset [Dataset]. http://doi.org/10.17632/77p6rzb73n.2
    Explore at:
    Dataset updated
    Feb 17, 2021
    Authors
    Sayed Mohsin Reza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains quality and source code metrics information for 60 versions under 10 different repositories. The dataset is extracted at 3 levels: (1) Class, (2) Method, (3) Package. It was created by analyzing 9,420,246 lines of code and 173,237 classes. The provided dataset contains one quality_attributes folder and three associated files: repositories.csv, versions.csv, and attribute-details.csv. The first file (repositories.csv) contains general information (repository name, repository URL, number of commits, stars, forks, etc.) to help understand each repository's size, popularity, and maintainability. versions.csv contains general information (version unique ID, number of classes, packages, external classes, external packages, and version repository link) to provide an overview of the versions and how the repository grows over time. attribute-details.csv contains detailed information (attribute name, attribute short form, category, and description) about the extracted static analysis metrics and code quality attributes. The short form is used in the real dataset as a unique identifier to show values for packages, classes, and methods.
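    As a quick orientation to the three CSV files, the sketch below loads them with pandas; the file names are from the description above, while the join column is hypothetical since the exact headers are not listed here.

        import pandas as pd

        repos = pd.read_csv("repositories.csv")       # name, URL, commits, stars, forks, ...
        versions = pd.read_csv("versions.csv")        # version ID, class/package counts, repo link
        attrs = pd.read_csv("attribute-details.csv")  # metric name, short form, category, description

        # Hypothetical join: relate each version to its repository metadata.
        # Replace "repository" with the actual shared column after inspecting the headers.
        merged = versions.merge(repos, on="repository", how="left")
        print(merged.head())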

  3. AI4Arctic / ASIP Sea Ice Dataset - version 2

    • data.dtu.dk
    pdf
    Updated Jul 12, 2023
    Cite
    Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler; Leif Toudal Pedersen; David Malmgren-Hansen; Allan Aasbjerg Nielsen; Henning Skriver (2023). AI4Arctic / ASIP Sea Ice Dataset - version 2 [Dataset]. http://doi.org/10.11583/DTU.13011134.v3
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler; Leif Toudal Pedersen; David Malmgren-Hansen; Allan Aasbjerg Nielsen; Henning Skriver
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The AI4Arctic / ASIP Sea Ice Dataset - version 2 (ASID-v2) contains 461 Sentinel-1 Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute in 2018-2019. Ice charts contain sea ice concentration, stage of development, and form of ice, provided as manually drawn polygons. The ice charts have been projected into the Sentinel-1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the Sentinel-1 data. Details are described in the manual published together with the dataset. The manual has been revised; the latest version is dated 30-09-2020.

  4. SynSpeech Dataset (Small Version)

    • figshare.com
    csv
    Updated Nov 7, 2024
    Cite
    Yusuf Brima (2024). SynSpeech Dataset (Small Version) [Dataset]. http://doi.org/10.6084/m9.figshare.27627840.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    figshare
    Authors
    Yusuf Brima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SynSpeech Dataset (Small Version) is an English-language synthetic speech dataset created using OpenVoice and LibriSpeech-100 for benchmarking disentangled speech representation learning methods. It includes 50 unique speakers, each with 500 distinct sentences spoken in a “default” style at a 16kHz sampling rate. Data is organized by speaker ID, with a synspeech_Small_Metadata.csv file detailing speaker information, gender, speaking style, text, and file paths. This dataset is ideal for tasks in representation learning, speaker and content factorization, and TTS synthesis.

  5. English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset

    • data.mendeley.com
    Updated Feb 9, 2017
    Cite
    H. Bahadir Sahin (2017). English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset [Dataset]. http://doi.org/10.17632/cdcztymf4k.1
    Explore at:
    Dataset updated
    Feb 9, 2017
    Authors
    H. Bahadir Sahin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWNERTC and EWNERTC are collections of automatically categorized and annotated sentences obtained from Turkish and English Wikipedia for named-entity recognition and text categorization.

    Firstly, we construct large-scale gazetteers by using a graph crawler algorithm to extract relevant entity and domain information from a semantic knowledge base, Freebase. The final gazetteers have 77 domains (categories) and more than 1,000 fine-grained entity types for both languages. The Turkish gazetteer contains approximately 300K named entities and the English gazetteer approximately 23M.

    By leveraging the large-scale gazetteers and linked Wikipedia articles, we construct TWNERTC and EWNERTC. Since the categorization and annotation processes are automated, the raw collections are prone to ambiguity. Hence, we introduce two noise reduction methodologies: (a) domain-dependent and (b) domain-independent. We produce two different versions by post-processing the raw collections. As a result, we introduce 3 versions of TWNERTC and EWNERTC: (a) raw, (b) domain-dependent post-processed, and (c) domain-independent post-processed. Turkish collections have approximately 700K sentences for each version (the exact count varies between versions), while English collections contain more than 7M sentences.

    We also introduce "Coarse-Grained NER" versions of the same datasets. We reduce the fine-grained types to "organization", "person", "location" and "misc" by mapping each fine-grained type to the most similar coarse-grained one. Note that this process also eliminates many domains and fine-grained annotations due to a lack of information for coarse-grained NER. Hence, the "Coarse-Grained NER" labelled datasets contain only 25 domains, and the number of sentences is reduced compared to the "Fine-Grained NER" versions.

    All processes are explained in our published white paper for Turkish; however, major methods (gazetteers creation, automatic categorization/annotation, noise reduction) do not change for English.

  6. ASIP Sea Ice Dataset - version 1

    • data.dtu.dk
    bin
    Updated Mar 6, 2020
    Cite
    David Malmgren-Hansen; Leif Toudal Pedersen; Allan Aasbjerg Nielsen; Henning Skriver; Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler (2020). ASIP Sea Ice Dataset - version 1 [Dataset]. http://doi.org/10.11583/DTU.11920416.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 6, 2020
    Dataset provided by
    Technical University of Denmark
    Authors
    David Malmgren-Hansen; Leif Toudal Pedersen; Allan Aasbjerg Nielsen; Henning Skriver; Roberto Saldo; Matilde Brandt Kreiner; Jørgen Buus-Hinkler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ASIP Sea Ice Dataset - version 1 contains 912 Sentinel-1 (S1) Synthetic Aperture Radar (SAR) scenes matched with sea ice charts produced by the Danish Meteorological Institute from 2014-2017. Ice charts containing sea ice concentrations, provided as manually drawn polygons over the scene, have been projected into the S1 geometry for easy use as labels in deep learning or other machine learning training processes. The dataset also includes AMSR2 microwave radiometer measurements to complement the learning of sea ice concentrations, although at a much lower resolution than the S1 data. Details are described in the manual published together with the dataset.

  7. notMNIST

    • kaggle.com
    • opendatalab.com
    • +3 more
    Updated Feb 14, 2018
    Cite
    jwjohnson314 (2018). notMNIST [Dataset]. https://www.kaggle.com/datasets/jwjohnson314/notmnist/code
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 14, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    jwjohnson314
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The MNIST dataset is one of the best known image classification problems out there, and a veritable classic of the field of machine learning. This dataset is a more challenging version of the same root problem: classifying letters from images. This is a multiclass classification dataset of glyphs of English letters A - J.

    This dataset is used extensively in the Udacity Deep Learning course, and is available in the Tensorflow Github repo (under Examples). I'm not aware of any license governing the use of this data, so I'm posting it here so that the community can use it with Kaggle kernels.

    Content

    notMNIST_large.zip is a large but dirty version of the dataset with 529,119 images, and notMNIST_small.zip is a small hand-cleaned version with 18,726 images. The dataset was assembled by Yaroslav Bulatov and can be obtained on his blog. According to this blog entry there is about a 6.5% label error rate on the large uncleaned dataset, and a 0.5% label error rate on the small hand-cleaned dataset.

    The two files each contain 28x28 grayscale images of letters A - J, organized into directories by letter. notMNIST_large.zip contains 529,119 images and notMNIST_small.zip contains 18,726 images.
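    Given the letter-per-directory layout described above, a minimal loading sketch with Pillow and NumPy might look like the following; the extracted folder name and the .png extension are assumptions.

        import numpy as np
        from pathlib import Path
        from PIL import Image

        root = Path("notMNIST_small")  # one subdirectory per letter, A through J
        images, labels = [], []
        for letter_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            for img_path in letter_dir.glob("*.png"):
                try:
                    images.append(np.asarray(Image.open(img_path), dtype=np.float32) / 255.0)
                    labels.append(letter_dir.name)  # the directory name is the label, 'A'..'J'
                except OSError:
                    pass  # skip unreadable files, which the dirty large version may contain

        X = np.stack(images)  # shape: (n_samples, 28, 28)
        print(X.shape, len(labels))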

    Acknowledgements

    Thanks to Yaroslav Bulatov for putting together the dataset.

  8. LSD4WSD: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling

    • zenodo.org
    • explore.openaire.eu
    bin, pdf +1
    Updated Jul 11, 2024
    Cite
    Matthieu Gallet; Matthieu Gallet; Abdourrahmane Atto; Abdourrahmane Atto; Fatima Karbou; Fatima Karbou; Emmanuel Trouvé; Emmanuel Trouvé (2024). LSD4WSD : An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling [Dataset]. http://doi.org/10.5281/zenodo.10046730
    Explore at:
    Available download formats: text/x-python, bin, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matthieu Gallet; Matthieu Gallet; Abdourrahmane Atto; Abdourrahmane Atto; Fatima Karbou; Fatima Karbou; Emmanuel Trouvé; Emmanuel Trouvé
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LSD4WSD V2.0

    Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.

    The aim of this dataset is to provide a basis for machine learning to detect wet snow. It is based on Sentinel-1 SAR GRD satellite images acquired between August 2020 and August 2021 over the French Alps. The new version of this dataset is no longer restricted to a classification task, and provides a set of metadata for each sample.

    Modifications and improvements in version 2.0.0:

    • Number of massifs: added 7 new massifs to cover all the Sentinel-1 images (cf. info.pdf).
    • Acquisition: added images from the descending pass in addition to those originally used from the ascending pass.
    • Sample: reduced the size of the samples to 15 by 15 to facilitate evaluation at the central pixel.
    • Sample: increased the density of extracted windows, with a distance of approximately 500 meters between window centers.
    • Sample: removed the pre-processing step involving logarithms.
    • Sample: removed the pre-processing step involving normalisation.
    • Labels: new structure for the labels: a dictionary with keys topography, metadata, and physics.
    • Labels: physics: added direct information from the CROCUS model for 3 simulated quantities: liquid water content, snow height, and minimum snowpack temperature.
    • Labels: topography: information on the slope, altitude, and average orientation of the sample.
    • Labels: metadata: information on the date of the sample, the mountain massif, and the run (ascending or descending).
    • Dataset: removed the train/test split*

    We leave it to the user to validate the models with the Group K-Fold method, using the alpine massif information as groups.

    Finally, the dataset consists of 2,467,516 samples of size 15 by 15 by 9. For each sample, the following 9 metadata fields are provided, drawing in particular on the Crocus physical model:

    • topography:
      • elevation (meters) (average),
      • orientation (degrees) (average),
      • slope (degrees) (average),
    • metadata:
      • name of the alpine massif,
      • date of acquisition,
      • type of acquisition (ascending/descending),
    • physics:
      • Liquid Water Content (kg/m2),
      • snow height (m),
      • minimum snowpack temperature (degrees Celsius).

    The 9 channels are in the following order:

    • Sentinel-1 polarimetric channels: VV, VH, and the combination C = VV/VH in linear scale,
    • Topographical features: altitude, orientation, slope,
    • Polarimetric ratios with a reference summer image: VV/VVref, VH/VHref, C/Cref.

    * The selected reference image is from August 9th, 2020, chosen as a snow-free reference (cf. Nagler et al.)

    An overview of the distribution and a summary of the sample statistics can be found in the file info.pdf.

    The data is stored in .hdf5 format with gzip compression. We provide a python script, dataset_load.py, to read and query the data. It is based on the h5py, numpy, and pandas libraries and allows selecting part or all of the dataset via requests on the metadata. The script is documented and can be used as described in the README.md file.
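    For a first look without dataset_load.py, the sketch below opens the archive directly with h5py and prints its layout; the file name is an assumption, and the authoritative key structure is the one documented in the README.md.

        import h5py

        # File name assumed; use the actual .hdf5 file from the data release.
        with h5py.File("lsd4wsd_v2.hdf5", "r") as f:
            f.visit(print)  # walk and print every group/dataset name
            # Samples are 15 x 15 x 9 windows; labels follow the
            # topography / metadata / physics structure described above.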

    The processing chain is available at the following Github address.

    The authors would like to acknowledge the support from the National Centre for Space Studies (CNES) in providing computing facilities and access to SAR images via the PEPS platform.

    The authors would like to deeply thank Mathieu Fructus for running the Crocus simulations.

    Erratum:

    In the dataloader file, the name of the "aquisition" column must be added twice; see the correction below:

    dtst_ld = Dataset_loader(
        path_dataset,
        shuffle=False,
        descrp=[
            "date", "massif", "aquisition", "aquisition",
            "elevation", "slope", "orientation", "tmin", "hsnow", "tel",
        ],
    )

    If you have any comments, questions or suggestions, please contact the authors:

    • matthieu.gallet@univ-smb.fr
    • fatima.karbou@meteo.fr
    • abdourrahmane.atto@univ-smb.fr
    • emmanuel.trouve@univ-smb.fr

  9. Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +3 more
    zip
    Updated Aug 10, 2022
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Authors: Han Xiao, Kashif Rasul, and Roland Vollgraf

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. (Source)

    Here's an example of how the data looks (each class takes three rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png (visualized Fashion-MNIST dataset)

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original Fashion-MNIST splits: train (86% of images; 60,000 images) and test (14% of images; 10,000 images) only.
    • This version was not trained

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • Train/test split guidance: https://blog.roboflow.com/train-test-split/ (Train/Valid/Test Split Rebalancing illustration: https://i.imgur.com/angfheJ.png)

    Citation:

    @online{xiao2017/online,
      author      = {Han Xiao and Kashif Rasul and Roland Vollgraf},
      title       = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
      date        = {2017-08-28},
      year        = {2017},
      eprintclass = {cs.LG},
      eprinttype  = {arXiv},
      eprint      = {cs.LG/1708.07747},
    }
    
  10. BWFLnet + data

    • data.mendeley.com
    Updated Jul 11, 2020
    Cite
    Alexander Waldron (2020). BWFLnet + data [Dataset]. http://doi.org/10.17632/srt4vr5k38.3
    Explore at:
    Dataset updated
    Jul 11, 2020
    Authors
    Alexander Waldron
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is supplementary data for: Waldron, A., Pecci, F., Stoianov, I. (2020). Regularization of an Inverse Problem for Parameter Estimation in Water Distribution Networks. Journal of Water Resources and Planning Management, 146(9):04020076 (https://doi.org/10.1061/(ASCE)WR.1943-5452.0001273).

    The files associated with this dataset are licensed under a Creative Commons Attribution 4.0 International licence. Any use of this dataset must credit the authors by citing the above paper.

    BWFLnet is an operational network in Bristol, UK, operated by Bristol Water in collaboration with InfraSense Labs at Imperial College London and Cla-Val Ltd. The data provided is the product of a long-term research partnership between Bristol Water, InfraSense Labs at Imperial College London, and Cla-Val on the design and control of dynamically adaptive networks. We acknowledge the financial support of EPSRC (EP/P004229/1, Dynamically Adaptive and Resilient Water Supply Networks for a Sustainable Future).

    All data provided is recorded hydraulic data with locations and names anonymised. The authors hope that the publication of this dataset will facilitate the reproducibility of research in hydraulic model calibration as well as broader research in the water distribution sector.

  11. LSD4WSD VX: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling - Full Analysis Version

    • ieee-dataport.org
    Updated Oct 30, 2023
    Cite
    matthieu gallet (2023). LSD4WSD VX: An Open Dataset for Wet Snow Detection with SAR Data and Physical Labelling - Full Analysis Version [Dataset]. https://ieee-dataport.org/documents/lsd4wsd-vx-open-dataset-wet-snow-detection-sar-data-and-physical-labelling-full-analysis
    Explore at:
    Dataset updated
    Oct 30, 2023
    Authors
    matthieu gallet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Learning SAR Dataset for Wet Snow Detection - Full Analysis Version.

  12. QML Pipeline Datasets

    • figshare.com
    application/x-gzip
    Updated Mar 30, 2023
    Cite
    Enrico Zardini (2023). QML Pipeline Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.22333102.v1
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Mar 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Enrico Zardini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets used in the article "Implementation and empirical evaluation of a quantum machine learning pipeline for local classification". These are preprocessed versions (as described in the paper) of the original datasets, which were taken from the UCI Machine Learning Repository.

  13. Dataset and scripts for A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry

    • rodare.hzdr.de
    zip
    Updated Oct 1, 2021
    Cite
    Fiedler, Lenz; Shah, Karan; Cangi, Attila; Bussmann, Michael (2021). Dataset and scripts for A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry [Dataset]. http://doi.org/10.14278/rodare.1197
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    HZDR / CASUS
    Authors
    Fiedler, Lenz; Shah, Karan; Cangi, Attila; Bussmann, Michael
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains additional data for the publication "A Deep Dive into Machine Learning Density Functional Theory for Materials Science and Chemistry". Its goal is to enable interested people to reproduce the citation analysis carried out in the aforementioned publication.

    Prerequisites

    The following software versions were used for the Python scripts in this dataset:

    Python: 3.8.6

    Scholarly: 1.2.0

    Pyzotero: 1.4.24

    Numpy: 1.20.1

    Contents

    results/ : Contains the .csv files that were the results of the citation analysis. Paper groupings follow the ones outlined in the publication.

    scripts/ : Contains scripts to perform the citation analysis.

    Zotero.cached.pkl : Contains the cached Zotero library.

    Usage

    To reproduce the results of the citation analysis, use citation_analysis.py in conjunction with the cached Zotero library. Manual additions can be verified using the check_consistency script.
    Please note that you will need a Tor key for the citation analysis, and access to our Zotero library if you do not want to use the cached version. If you need this access, simply contact us.
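    As a small illustration, the sketch below just unpickles Zotero.cached.pkl and reports what it holds; the object's internal structure is not documented in this description, so treat this as exploratory.

        import pickle

        with open("Zotero.cached.pkl", "rb") as f:
            library = pickle.load(f)

        # Inspect the cached object before wiring it into citation_analysis.py;
        # its exact type and layout are not specified here.
        print(type(library))
        if hasattr(library, "__len__"):
            print(len(library))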

  14. RTAnews: A Benchmark for Multi-label Arabic Text Categorization

    • data.mendeley.com
    • semantichub.ijs.si
    Updated Aug 18, 2018
    Cite
    Bassam Al-Salemi (2018). RTAnews: A Benchmark for Multi-label Arabic Text Categorization [Dataset]. http://doi.org/10.17632/322pzsdxwy.1
    Explore at:
    Dataset updated
    Aug 18, 2018
    Authors
    Bassam Al-Salemi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The RTAnews dataset is a collection of multi-label Arabic texts collected from the Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories, divided into 15,001 texts for training and 8,836 texts for testing.

    The original dataset (without preprocessing), a preprocessed version, versions in MEKA and Mulan formats, a single-label version, and a WEAK version are all available.

    For any enquiry or support regarding the dataset, please feel free to contact us via bassalemi at gmail dot com

  15. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference

    • explore.openaire.eu
    Updated Sep 22, 2020
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2020). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4601051
    Explore at:
    Dataset updated
    Sep 22, 2020
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    Description

    The dataset was gathered on Sep. 17th, 2020 from GitHub. It has clean and complete versions (from v0.7): the clean version has 5.1K type-checked Python repositories and 1.2M type annotations; the complete version has 5.2K Python repositories and 3.3M type annotations. The source files of the clean version are type-checked using mypy. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for a description of the dataset. Notable changes in each version are documented in CHANGELOG.md. The dataset's scripts and utilities are available in its GitHub repository.

    Reference: A. Mir, E. Latoskinas and G. Gousios, "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference," in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 585-589. doi: 10.1109/MSR52588.2021.00079

  16. COALA dataset from 'Transfer learning improves antibiotic resistance class prediction'

    • figshare.com
    zip
    Updated Dec 19, 2019
    Cite
    Nafiz Hamid (2019). COALA dataset from 'Transfer learning improves antibiotic resistance class prediction' [Dataset]. http://doi.org/10.6084/m9.figshare.11413302.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 19, 2019
    Dataset provided by
    figshare
    Authors
    Nafiz Hamid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This includes all 3 versions of the COALA dataset. The COALA100 dataset is a collection of antibiotic resistance genes from 15 databases, along with metadata from these databases including the respective antibiotic class. The COALA70 dataset is the COALA100 dataset clustered with CD-HIT at a 70% threshold. The COALA40 dataset is the COALA100 dataset clustered with CD-HIT at a 40% threshold. All three datasets are in FASTA format. The last section of the description line holds the label for the antibiotic class the gene confers resistance to. The second-to-last section is the name of the database from which the gene was collected. All other sections convey information about the gene.

  17. CK4Gen, High Utility Synthetic Survival Datasets

    • figshare.com
    zip
    Updated Nov 5, 2024
    Cite
    Nicholas Kuo (2024). CK4Gen, High Utility Synthetic Survival Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27611388.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    figshare
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This repository provides high-utility synthetic survival datasets generated using the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and educational settings. Each dataset is based on a carefully curated ground truth dataset, processed with standardised variable definitions and analytical approaches, ensuring a consistent baseline for survival analysis.

    Description: The repository includes synthetic versions of four widely utilised and publicly accessible survival analysis datasets, each anchored in foundational studies and aligned with established ground truth variations to support robust clinical research and training.

    GBSG2: Based on Schumacher et al. [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the lifelines package [2], formatted to match the descriptions in Sauerbrei et al. [3], which we treat as the ground truth.

    ACTG320: Based on Hammer et al. [4]. The study investigates the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic dataset is derived from a variation of the ACTG320 dataset available in the sksurv package [5], which we treat as the ground truth dataset.

    WHAS500: Based on Goldberg et al. [6]. The study follows 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic data replicates a ground truth variation from the sksurv package, which we treat as the ground truth dataset.

    FLChain: Based on Dispenzieri et al. [7]. The study assesses the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is based on a variation available in the sksurv package, which we treat as the ground truth dataset.

    Notes: Please find an in-depth discussion of these datasets, as well as their generation process, in our paper: https://arxiv.org/abs/2410.16872. Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." arXiv preprint arXiv:2410.16872 (2024).

    References:
    [1]: Schumacher, et al. “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German breast cancer study group.”, Journal of Clinical Oncology, 1994.
    [2]: Davidson-Pilon. “lifelines: Survival Analysis in Python”, Journal of Open Source Software, 2019.
    [3]: Sauerbrei, et al. “Modelling the effects of standard prognostic factors in node-positive breast cancer”, British Journal of Cancer, 1999.
    [4]: Hammer, et al. “A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less”, New England Journal of Medicine, 1997.
    [5]: Pölsterl. “scikit-survival: A library for time-to-event analysis built on top of scikit-learn”, Journal of Machine Learning Research, 2020.
    [6]: Goldberg, et al. “Incidence and case fatality rates of acute myocardial infarction (1975-1984): the Worcester heart attack study”, American Heart Journal, 1988.
    [7]: Dispenzieri, et al. “Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population”, Mayo Clinic Proceedings, 2012.
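    For orientation, the ground-truth variations named above can be pulled straight from the cited packages; a minimal sketch, assuming lifelines and scikit-survival (sksurv) are installed:

        from lifelines.datasets import load_gbsg2
        from sksurv.datasets import load_flchain, load_whas500

        gbsg2 = load_gbsg2()             # GBSG2 variation shipped with lifelines
        X_whas, y_whas = load_whas500()  # WHAS500 features plus structured (event, time) outcome
        X_flc, y_flc = load_flchain()    # FLChain features plus structured outcome

        print(gbsg2.shape, X_whas.shape, X_flc.shape)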

  18. TeaLeafAgeQuality: Age-Stratified Tea Leaf Quality Classification Dataset

    • data.mendeley.com
    Updated Jan 2, 2024
    Cite
    Md Mohsin Kabir (2024). TeaLeafAgeQuality: Age-Stratified Tea Leaf Quality Classification Dataset [Dataset]. http://doi.org/10.17632/7t964jmmy3.1
    Explore at:
    Dataset updated
    Jan 2, 2024
    Authors
    Md Mohsin Kabir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "TeaLeafAgeQuality" dataset is curated for tea leaf classification, detection and quality prediction based on leaf age. This dataset encompasses a comprehensive collection of tea leaf images categorized into four classes corresponding to their age-based quality:

    • Category T1: Age 1 to 2 days, representing the highest quality tea leaves (562 raw images)
    • Category T2: Age 3 to 4 days, indicating good quality tea leaves (615 raw images)
    • Category T3: Age 5 to 7 days, indicating average or below-average quality tea leaves (508 raw images)
    • Category T4: Age 7+ days, denoting tea leaves unsuitable for brewing drinkable tea (523 raw images)

    Each category includes images depicting tea leaves at various stages of their age progression, facilitating research and analysis into the relationship between leaf age and tea quality. The dataset aims to contribute to the advancement of deep learning models for tea leaf classification and quality assessment.

    This dataset comprises three versions: the first is raw, unannotated data, offering a pure, unmodified collection of tea leaf images gathered from different tea gardens located in Sylhet, Bangladesh. The second version includes precise annotations, classified into the four categories T1, T2, T3, and T4 for targeted analysis. The third version contains both annotated and augmented data, enhancing the dataset for more advanced research applications. Each version caters to a different level of data analysis, from basic to complex.
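    If the annotated version keeps one folder per category (T1-T4), which the description suggests but does not guarantee, a minimal torchvision sketch for loading it would be:

        from torchvision import datasets, transforms

        # Assumed layout: <root>/T1/*.jpg, <root>/T2/*.jpg, ... - verify after download.
        transform = transforms.Compose([
            transforms.Resize((224, 224)),  # arbitrary illustration size
            transforms.ToTensor(),
        ])
        dataset = datasets.ImageFolder("TeaLeafAgeQuality/annotated", transform=transform)
        print(dataset.classes)  # expected: ['T1', 'T2', 'T3', 'T4']
        print(len(dataset))     # total images across the four categories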

  19. Mechanical MNIST crack path extended version

    • search.dataone.org
    • datadryad.org
    • +1 more
    Updated May 3, 2025
    Cite
    Saeed Mohammadzadeh; Emma Lejeune (2025). Mechanical MNIST crack path extended version [Dataset]. http://doi.org/10.5061/dryad.rv15dv486
    Explore at:
    Dataset updated
    May 3, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Saeed Mohammadzadeh; Emma Lejeune
    Time period covered
    Jan 1, 2021
    Description

    The Mechanical MNIST Crack Path dataset contains Finite Element simulation results from phase-field models of quasi-static brittle fracture in heterogeneous material domains subjected to prescribed loading and boundary conditions. For all samples, the material domain is a square with a side length of 1. There is an initial crack of fixed length (0.25) on the left edge of each domain. The bottom edge of the domain is fixed in x (horizontal) and y (vertical), the right edge of the domain is fixed in x and free in y, and the left edge is free in both x and y. The top edge is free in x, and in y it is displaced such that, at each step, the displacement increases linearly from zero at the top right corner to the maximum displacement on the top left corner. Maximum displacement starts at 0.0 and increases to 0.02 by increments of 0.0001 (200 simulation steps in total). The heterogeneous material distribution is obtained by adding rigid circular inclusions to the domain using the Fashion MNIST...

  20. Replication Package for the Paper: "A Machine Learning Based Ensemble Method for Automatic Classification of Decisions"

    • data.niaid.nih.gov
    Updated Jul 25, 2020
    Cite
    Anonymous (2020). Replication Package for the Paper: "A Machine Learning Based Ensemble Method for Automatic Classification of Decisions" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3842756
    Explore at:
    Dataset updated
    Jul 25, 2020
    Dataset authored and provided by
    Anonymous
    Description

    This is the replication package for the paper "A Machine Learning Based Ensemble Method for Automatic Classification of Decisions". It contains the source code and dataset of our experiment so that other researchers can replicate it. A brief description of the files in the replication package follows.

    1. code folder

    experiment.py contains the source code for our experiment, which was conducted on Windows 10 with Python 3.7.0. Note that you may get slightly different experiment results when running the experiments under different environment configurations.

    requirements.txt records all required packages and their version numbers. You can run "pip install -r requirements.txt" to rebuild the project and install all dependencies. Note that you may get slightly different experiment results when using different package versions.

    2. dataset folder

    decisions.xlsx contains 848 labelled sentence-level decisions from the Hibernate developer mailing list.
