69 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as a CSV file.
    • ger_validation.csv – The German validation set as a CSV file.
    • en_test.csv – The English test set as a CSV file.
    • en_train.csv – The English training set as a CSV file.
    • en_validation.csv – The English validation set as a CSV file.
    • splitting.py – The Python code for splitting a dataset into train, test and validation sets.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The Python code for translating the cleaned dataset.
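
    The listing does not show what splitting.py contains, so the following is only a minimal sketch of a train/validation/test split using the standard library. The function name, the 80/10/10 proportions and the seed are assumptions for illustration, not the author's settings.

```python
import random


def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows and cut them into train, validation and test lists.

    The fractions and the seed are illustrative assumptions.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


# Toy stand-in for the rows of a cleaned CSV file.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

    Fixing the seed keeps the three sets reproducible across runs, which matters when the split files are published alongside the data.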
  2. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Mar 12, 2021
    + more versions
    Cite
    Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://data.europa.eu/88u/dataset/oai-zenodo-org-4601051
    Explore at:
    Available download formats: unknown (393755141)
    Dataset updated
    Mar 12, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17th, 2020. It contains more than 5.4K Python repositories hosted on GitHub. See the file ManyTypes4PyDataset.spec for the repository URLs and their commit SHAs. The dataset is de-duplicated using the CD4Py tool; the list of duplicate files is provided in duplicate_files.txt. All of its Python projects are processed into JSON-formatted files, which contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in JSONOutput.md. The dataset is split into train, validation and test sets by source code file; the list of files and their corresponding sets is provided in dataset_split.csv. Name-based visible type hints for processed projects are stored in the extracted_visible_types folder. Notable changes to each version of the dataset are documented in CHANGELOG.md.
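
    A file-level split list such as dataset_split.csv can be turned into a lookup of files per set with a few lines of Python. The sketch below assumes a header-less two-column layout (set name, file path) and uses made-up paths; the real layout of the file may differ and should be checked first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt; the real layout of dataset_split.csv may differ.
SAMPLE = """train,repos/a/x.py
valid,repos/a/y.py
test,repos/b/z.py
train,repos/b/w.py
"""


def group_files_by_set(csv_text):
    """Map each set name to the list of source files assigned to it."""
    groups = defaultdict(list)
    for set_name, path in csv.reader(io.StringIO(csv_text)):
        groups[set_name].append(path)
    return dict(groups)


splits = group_files_by_set(SAMPLE)
print(sorted(splits))
```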

  3. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 2, 2024
    + more versions
    Cite
    (2024). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Explore at:
    Dataset updated
    Jun 2, 2024
    Description

    Machine Learning techniques in general, and Deep Learning techniques in particular, need a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test models on.

    The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type “pitting”. The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into the folders data (all images as JPEG), label (all annotations), and saved_model (a baseline model). The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits the images into the same train and test split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.

    One of the two BSD types is represented with 69 images and 55 different image sizes. All images of this BSD type come either in a clean or soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences with every 69 images.

    Instruction dataset split

    The authors of this dataset provide three types of dataset splits. To get a data split, run the Python script split_dataset.py.

    Script inputs:
    • split-type (mandatory)
    • output directory (mandatory)

    Different split-types:
    • train_test_split: splits the dataset into train and test data (80%/20%)
    • wear_dev_split: splits the dataset into 27 wear developments
    • type_split: splits the dataset into the different BSD types

    Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
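
    The command-line interface described above (two mandatory options, three allowed split types) can be mirrored with argparse. This is an illustrative stand-in, not the authors' split_dataset.py; only the flag names and split-type values come from the description, and everything else is an assumption.

```python
import argparse

# Split types named in the dataset description.
SPLIT_TYPES = ("train_test_split", "wear_dev_split", "type_split")


def build_parser():
    """Build a parser with the two mandatory options from the description."""
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for split_dataset.py"
    )
    parser.add_argument("--split_type", required=True, choices=SPLIT_TYPES)
    parser.add_argument("--output_dir", required=True)
    return parser


# Parse the same arguments as the example command line.
args = build_parser().parse_args(
    ["--split_type=train_test_split", "--output_dir=BSD_split_folder"]
)
print(args.split_type, args.output_dir)
```

    Restricting --split_type with choices makes a mistyped split name fail immediately instead of silently producing an empty output folder.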

  4. ref_coco

    • tensorflow.org
    • opendatalab.com
    Updated May 31, 2024
    Cite
    (2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco
    Explore at:
    Dataset updated
    May 31, 2024
    Description

    A collection of 3 referring expression datasets based off images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

    RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which the authors enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016 and has richer descriptions of objects than RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

    Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
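
    The difference between partitioning objects and partitioning images can be made concrete with a toy example. The sketch below uses made-up (image_id, object_id) pairs and hypothetical helper names; it is not how the published splits were generated, only an illustration of why an object-level split can place the same image on both sides while an image-level split cannot.

```python
import random

# Hypothetical toy data: (image_id, object_id) pairs, two objects per image.
objects = [(img, obj) for img in range(6) for obj in range(2)]


def split_by_object(objs, train_frac=0.5, seed=0):
    """Partition individual objects; an image may end up on both sides."""
    objs = list(objs)
    random.Random(seed).shuffle(objs)
    cut = int(len(objs) * train_frac)
    return objs[:cut], objs[cut:]


def split_by_image(objs, train_frac=0.5, seed=0):
    """Partition images; all objects of an image stay on the same side."""
    images = sorted({img for img, _ in objs})
    random.Random(seed).shuffle(images)
    train_images = set(images[:int(len(images) * train_frac)])
    train = [o for o in objs if o[0] in train_images]
    rest = [o for o in objs if o[0] not in train_images]
    return train, rest


def shared_images(a, b):
    """Return the image ids that occur in both partitions."""
    return {img for img, _ in a} & {img for img, _ in b}


obj_train, obj_rest = split_by_object(objects)
img_train, img_rest = split_by_image(objects)
print("object-level split shares images:", sorted(shared_images(obj_train, obj_rest)))
print("image-level split shares images:", sorted(shared_images(img_train, img_rest)))
```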

    Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

    dataset    partition  split  refs   images
    refcoco    google     train  40000  19213
    refcoco    google     val    5000   4559
    refcoco    google     test   5000   4527
    refcoco    unc        train  42404  16994
    refcoco    unc        val    3811   1500
    refcoco    unc        testA  1975   750
    refcoco    unc        testB  1810   750
    refcoco+   unc        train  42278  16992
    refcoco+   unc        val    3805   1500
    refcoco+   unc        testA  1975   750
    refcoco+   unc        testB  1798   750
    refcocog   google     train  44822  24698
    refcocog   google     val    5000   4650
    refcocog   umd        train  42226  21899
    refcocog   umd        val    2573   1300
    refcocog   umd        test   5023   2600

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('ref_coco', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

  5. sft-python-q-problems

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    Morgan Stanley (2025). sft-python-q-problems [Dataset]. https://huggingface.co/datasets/morganstanley/sft-python-q-problems
    Explore at:
    Dataset updated
    Aug 31, 2025
    Dataset authored and provided by
    Morgan Stanley
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SFT Python-Q Programming Problems Dataset

    This dataset contains programming problems with solutions in both Python and Q programming languages, designed for supervised fine-tuning of code generation models.

    📊 Dataset Overview

    • Total Problems: 678 unique programming problems
    • Train Split: 542 problems
    • Test Split: 136 problems
    • Languages: Python and Q
    • Source: LeetCode-style algorithmic problems
    • Format: Multiple data formats for different use cases

    🎯 Key Features… See the full description on the dataset page: https://huggingface.co/datasets/morganstanley/sft-python-q-problems.
    
  6. codeparrot-train-more-filtering

    • huggingface.co
    Updated Apr 29, 2022
    Cite
    CodeParrot (2022). codeparrot-train-more-filtering [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 29, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    CodeParrot 🦜 Dataset Cleaned and filtered (train)

    Dataset Description

    A dataset of Python files from GitHub. It is a more filtered version of the train split codeparrot-clean-train of codeparrot-clean. The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:

    files with a mention of "test file" or "configuration file" or… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering.
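
    Applying a filter "with a probability of 0.7" means that a file matching the filter is dropped in roughly 70% of cases and kept otherwise. The sketch below illustrates that idea only; the keyword list, the matching over the whole file text and the function name are assumptions, since the full filter definition is truncated in this listing.

```python
import random

# Keywords taken from the visible part of the description.
KEYWORDS = ("test file", "configuration file")


def keep_file(text, rng, drop_prob=0.7):
    """Return False with probability drop_prob when the text mentions a keyword."""
    if any(keyword in text.lower() for keyword in KEYWORDS):
        return rng.random() >= drop_prob
    return True


rng = random.Random(0)
flagged = ["# this is a test file\nassert True"] * 1000
regular = ["def add(a, b):\n    return a + b"] * 10
kept = [f for f in flagged + regular if keep_file(f, rng)]
kept_flagged = sum(1 for f in kept if "test file" in f)
print(kept_flagged, "of", len(flagged), "flagged files kept")
```

    Dropping flagged files only probabilistically keeps some test and configuration code in the corpus instead of removing that category entirely.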

  7. CYP450 80/20 splits

    • figshare.com
    txt
    Updated Jan 19, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Daniel Siegle (2016). CYP450 80/20 splits [Dataset]. http://doi.org/10.6084/m9.figshare.1066108.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    figshare
    Authors
    Daniel Siegle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from an NIH HTS of 17K compounds against five isozymes of cytochrome P450 screening for inhibition. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally the balanced datasets are subject to an 80/20 training/test split. Link to python script of data manipulation...
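
    The description does not say how the actives and inactives were balanced. One common approach is to downsample the larger class to the size of the smaller one before splitting; the sketch below shows that approach followed by an 80/20 split. The record layout, the "active" key and the downsampling strategy are all assumptions, not the author's script.

```python
import random


def balance_and_split(records, label_key="active", train_frac=0.8, seed=0):
    """Downsample the larger class, shuffle, and cut into train and test."""
    rng = random.Random(seed)
    actives = [r for r in records if r[label_key]]
    inactives = [r for r in records if not r[label_key]]
    n = min(len(actives), len(inactives))
    balanced = rng.sample(actives, n) + rng.sample(inactives, n)
    rng.shuffle(balanced)
    cut = round(len(balanced) * train_frac)
    return balanced[:cut], balanced[cut:]


# Hypothetical records for one isozyme: 30 actives and 70 inactives.
records = [{"id": i, "active": i < 30} for i in range(100)]
train, test = balance_and_split(records)
print(len(train), len(test))
```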

  8. Data from: Keyword extraction datasets for Croatian, Estonian, Latvian and...

    • live.european-language-grid.eu
    binary format
    Updated Jun 3, 2021
    + more versions
    Cite
    (2021). Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8369
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jun 3, 2021
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    EACL Hackashop Keyword Challenge Datasets

    In this repository you can find the ids of articles used for the keyword extraction challenge at the EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article ids can be used to generate the train-test split used in the paper:

    Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.

    Train and test splits are provided for Latvian, Estonian, Russian and Croatian.

    The articles with the corresponding ID-s can be extracted from the following datasets:

    - Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408

    - Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409

    - Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410

    dataset_ids folder is organized in the following way:

    - latvian – containing latvian_train.json: a json file with ids from train articles to replicate the data used in Koloski et al. (2020), the latvian_test.json: a json file with ids from test articles to replicate the data

    - estonian – containing estonian_train.json: a json file with ids from train articles to replicate the data used in Koloski et al. (2020), the estonian_test.json: a json file with ids from test articles to replicate the data

    - russian – containing russian_train.json: a json file with ids from train articles to replicate the train data used in Koloski et al. (2020), the russian_test.json: a json file with ids from test articles to replicate the data

    - croatian - containing croatian_id_train.tsv file with sites and ids (note that just ids are not unique across dataset, therefore site information also needs to be included to obtain a unique article identifier) of articles in the train set, and the croatian_id_test.tsv file with sites and ids of articles in the test set.
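
    Because Croatian article ids are only unique together with the site, any lookup has to use the (site, id) pair as the key. The sketch below shows this with made-up rows; the column names "site" and "id" and the sample values are assumptions, and the real layout of croatian_id_train.tsv may differ.

```python
import csv
import io

# Hypothetical rows; the real column names in croatian_id_train.tsv may differ.
TRAIN_TSV = "site\tid\nsite_a\t101\nsite_b\t101\nsite_a\t102\n"


def load_keys(tsv_text):
    """Read (site, id) pairs so that ids repeated across sites stay distinct."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["site"], row["id"]) for row in reader}


train_keys = load_keys(TRAIN_TSV)

# Hypothetical articles; id 101 occurs on two different sites.
articles = [
    {"site": "site_a", "id": "101", "title": "A"},
    {"site": "site_b", "id": "101", "title": "B"},
    {"site": "site_b", "id": "102", "title": "C"},
]
train_articles = [a for a in articles if (a["site"], a["id"]) in train_keys]
print([a["title"] for a in train_articles])
```

    Matching on the id alone would wrongly pull article C into the train set here, because 102 is a train id on a different site.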

    In addition, scripts are provided for extracting articles (see folder parse containing scripts parse.py and build_croatian_dataset.py, requirements for scripts are pandas and bs4 Python libraries):

    parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:

    Instructions:

    ESTONIAN-RUSSIAN

    1) Retrieve the data ee_articles_2015_2019.zip

    2) Create a folder 'data' and subfolder 'ee'

    3) Unzip them in the 'data/ee' folder

    To extract train/test Estonian articles:

    run function 'build_dataset(lang="ee", opt="nat")' in the parse.py script

    To extract train/test Russian articles:

    run function 'build_dataset(lang="ee", opt="rus")' in the parse.py script

    LATVIAN:

    1) Retrieve the latvian data

    2) Unzip it in 'data/lv' folder

    3) To extract train/test Latvian articles:

    run function 'build_dataset(lang="lv", opt="nat")' in the parse.py script

    build_croatian_dataset.py is used for extraction of Croatian train and test datasets:

    Instructions:

    CROATIAN:

    1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')

    2) put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).

    For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si

  9. English Wikipedia Quality Asssessment Dataset

    • figshare.com
    application/bzip2
    Updated May 31, 2023
    Cite
    Morten Warncke-Wang (2023). English Wikipedia Quality Asssessment Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.1375406.v2
    Explore at:
    Available download formats: application/bzip2
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Morten Warncke-Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets of articles and their associated quality assessment ratings from the English Wikipedia. Each dataset is self-contained, as it also includes all content (wiki markup) associated with a given revision. The datasets have been split into a 90% training set and a 10% test set using a stratified random sampling strategy. The 2017 dataset is the preferred dataset to use; it contains 32,460 articles and was gathered on 2017/09/10. The 2015 dataset is maintained for historic reference and contains 30,272 articles gathered on 2015/02/05. The articles were sampled from six of English Wikipedia's seven assessment classes, with the exception of the Featured Article class, which contains all (2015 dataset) or almost all (2017 dataset) articles in that class at the time. Articles are assumed to belong to the highest quality class they are rated as, and article history has been mined to find the appropriate revision associated with a given quality rating. Due to the low usage of A-class articles, this class is not part of the datasets. For more details, see "The Success and Failure of Quality Improvement Projects in Peer Production Communities" by Warncke-Wang et al. (CSCW 2015), linked below. These datasets have been used in training the wikiclass Python library machine learner, also linked below.
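
    A stratified 90/10 split draws the test portion separately within each quality class, so every class keeps the same proportion in both sets. The sketch below is a minimal standard-library illustration; the function name, the toy article names and the equal class sizes are assumptions, and it is not the procedure used to build the published files.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, test_frac=0.1, seed=0):
    """Split items so that each label contributes test_frac of its members to test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Illustrative class labels (A-class is excluded, as in the description).
CLASSES = ["FA", "GA", "B", "C", "Start", "Stub"]
articles = [(f"article_{c}_{i}", c) for c in CLASSES for i in range(50)]
train, test = stratified_split(articles, label_of=lambda a: a[1])
print(len(train), len(test))
```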

  10. wiki40b

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Aug 30, 2023
    + more versions
    Cite
    (2023). wiki40b [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki40b
    Explore at:
    Dataset updated
    Aug 30, 2023
    Description

    Cleaned-up text for 40+ Wikipedia language editions of pages that correspond to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus, including 41 monolingual models and 2 multilingual models, can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('wiki40b', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  11. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim, MD, In Sik Lee, MD, PhD, Jin Kook Kim, MD, Wakako Ando, CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect to the data in your Google Drive (Google Colab only)
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preop measurement.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA
    # 7. MSE 8. Age 9. Pupil size 10. WTW 11. CCT 12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, we can split the data 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
    # In our study, the training (B&VIIT Eye Center, n=1455) and test
    # (Kitasato University, n=290) datasets were already defined, so this
    # split was not necessary to perform our analysis.

    # Optimal parameter search could be performed in this section.
    # Note: scikit-learn 1.2 and later spell the 'mae' criterion 'absolute_error'.
    parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500,
                  'criterion': 'mae', 'min_samples_split': 10, 'max_features': 'sqrt',
                  'max_depth': 6, 'max_leaf_nodes': None}

    # Fit the random forest, predict on the test data, and read the feature importances.
    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  12. Monkeypox Skin Lesion Dataset

    • kaggle.com
    Updated Jul 5, 2022
    Cite
    TensorKitty (2022). Monkeypox Skin Lesion Dataset [Dataset]. https://www.kaggle.com/datasets/nafin59/monkeypox-skin-lesion-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    TensorKitty
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    An updated version of the MSLD dataset, MSLD v2.0 has been released after being verified by an expert dermatologist!

    For details, check our GitHub repo!

    Context

    The recent monkeypox outbreak has become a global healthcare concern owing to its rapid spread in more than 65 countries around the globe. To obstruct its expeditious pace, early diagnosis is a must. But the confirmatory Polymerase Chain Reaction (PCR) tests and other biochemical assays are not readily available in sufficient quantities. In this scenario, computer-aided monkeypox identification from skin lesion images can be a beneficial measure. Nevertheless, so far, such datasets are not available. Hence, the "Monkeypox Skin Lesion Dataset (MSLD)" was created by collecting and processing images obtained through web scraping, i.e., from news portals, websites and publicly accessible case reports.

    The creation of the "Monkeypox Image Lesion Dataset" is primarily focused on distinguishing monkeypox cases from similar non-monkeypox cases. Therefore, along with the 'Monkeypox' class, we included skin lesion images of 'Chickenpox' and 'Measles', because of their resemblance to the monkeypox rash and pustules in the initial state, in another class named 'Others' to perform binary classification.

    Content

    There are 3 folders in the dataset.

    1) Original Images: It contains a total of 228 images, among which 102 belong to the 'Monkeypox' class and the remaining 126 represent the 'Others' class, i.e., non-monkeypox (chickenpox and measles) cases.

    2) Augmented Images: To aid the classification task, several data augmentation methods such as rotation, translation, reflection, shear, hue, saturation, contrast and brightness jitter, noise, scaling, etc. have been applied using MATLAB R2020a. Although this can be readily done using ImageGenerator/other image augmentors, the augmented images are provided in this folder to ensure reproducibility of the results. Post-augmentation, the number of images increased approximately 14-fold. The classes 'Monkeypox' and 'Others' have 1428 and 1764 images, respectively.

    3) Fold1: One of the three-fold cross validation datasets. To avoid any sort of bias in training, three-fold cross validation was performed. The original images were split into training, validation and test set(s) with the approximate proportion of 70 : 10 : 20 while maintaining patient independence. According to the commonly perceived data preparation practice, only the training and validation images were augmented while the test set contained only the original images. Users have the option of using the folds directly or using the original data and employing other algorithms to augment it.
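
    Maintaining patient independence means that all images of one patient land in the same set. A simple way to get this is to split the patient ids rather than the images. The sketch below illustrates the idea for an approximate 70 : 10 : 20 proportion; the image and patient identifiers and the function name are made up, and this is not the authors' fold-generation code.

```python
import random


def split_by_patient(image_to_patient, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole patients to train/val/test so no patient spans two sets."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * fractions[0])
    n_val = round(len(patients) * fractions[1])
    assignment = {}
    for index, patient in enumerate(patients):
        if index < n_train:
            assignment[patient] = "train"
        elif index < n_train + n_val:
            assignment[patient] = "val"
        else:
            assignment[patient] = "test"
    splits = {"train": [], "val": [], "test": []}
    for image, patient in sorted(image_to_patient.items()):
        splits[assignment[patient]].append(image)
    return splits


# Hypothetical ids: 20 patients with 3 images each.
image_to_patient = {
    f"img_{p}_{k}": f"patient_{p}" for p in range(20) for k in range(3)
}
splits = split_by_patient(image_to_patient)
print({name: len(images) for name, images in splits.items()})
```

    Splitting at the patient level makes the image proportions only approximate when patients contribute different numbers of images, which matches the "approximate proportion" wording above.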

    Additionally, a CSV file is provided that has 228 rows and two columns. The table contains the list of all the ImageID(s) with their corresponding label.

    Web Application

    Since monkeypox is demonstrating a very rapid community transmission pattern, a consumer-level software is truly necessary to increase awareness and encourage people to take rapid action. We have developed an easy-to-use web application named Monkey Pox Detector using the open-source python streamlit framework that uses our trained model to address this issue. It makes predictions on whether or not to see a specialist along with the prediction accuracy. Future updates will benefit from the user data we continue to collect and use to improve our model. The web app has a flask core, so that it can be deployed cross-platform in the future.

    Learn more at our GitHub repo!

    Citation

    If this dataset helped your research, please cite the following articles:

    Ali, S. N., Ahmed, M. T., Paul, J., Jahan, T., Sani, S. M. Sakeef, Noor, N., & Hasan, T. (2022). Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study. arXiv preprint arXiv:2207.03342.

    @article{Nafisa2022, title={Monkeypox Skin Lesion Detection Using Deep Learning Models: A Preliminary Feasibility Study}, author={Ali, Shams Nafisa and Ahmed, Md. Tazuddin and Paul, Joydip and Jahan, Tasnim and Sani, S. M. Sakeef and Noor, Nawshaba and Hasan, Taufiq}, journal={arXiv preprint arXiv:2207.03342}, year={2022} }

    Ali, S. N., Ahmed, M. T., Jahan, T., Paul, J., Sani, S. M. Sakeef, Noor, N., Asma, A. N., & Hasan, T. (2023). A Web-based Mpox Skin Lesion Detection System Using State-of-the-art Deep Learning Models Considering Racial Diversity. arXiv preprint arXiv:2306.14169.

    @article{Nafisa2023, title={A Web-base...

  13. VegeNet - Image datasets and Codes

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 27, 2022
    Cite
    Tan, Jo Yen (2022). VegeNet - Image datasets and Codes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7254507
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    Tan, Jo Yen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original : Images of vegetables captured manually in the data acquisition stage

    2. vege_cropped_renamed : Images in (1) cropped to remove background areas, with image labels renamed

    3. non-vege images : Images of non-vegetable foods for the CNN network to recognize other-than-vegetable foods

    4. food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building

    5. food_image_dataset_split : Image dataset (4) split into train and test sets

    6. process : Images created when cropping (pre-processing step) to create dataset (2)

  14. cifar10

    • tensorflow.org
    • opendatalab.com
    • +3more
    Updated Jun 1, 2024
    Cite
    (2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('cifar10', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png

  15. Tree Point Classification - New Zealand

    • pacificgeoportal.com
    • geoportal-pacificcore.hub.arcgis.com
    Updated Jul 26, 2022
    Cite
    Eagle Technology Group Ltd (2022). Tree Point Classification - New Zealand [Dataset]. https://www.pacificgeoportal.com/content/0e2e3d0d0ef843e690169cac2f5620f9
    Explore at:
    Dataset updated
    Jul 26, 2022
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This New Zealand Point Cloud Classification Deep Learning Package classifies point clouds into tree and background classes. The model is optimized to work with New Zealand aerial LiDAR data.

    Classifying point cloud datasets to identify trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response.

    Trees can have a complex, irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results.

    This model is designed to extract trees in both urban and rural areas in New Zealand.

    The training, testing, and validation datasets were all taken within New Zealand, resulting in high reliability in recognizing the patterns of common NZ building architecture.

    Licensing requirements

    ArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS Pro

    Using the model

    The model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning framework libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.

    Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.

    Input

    The model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification.

    Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc., similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting, allowing it to better discriminate points of the 'class of interest' from background points. It is recommended to use the 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives.

    The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see Training data section below).

    Output

    The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS):

    0   Background
    5   Trees / High-vegetation

    Applicable geographies

    The model is expected to work well in New Zealand. It has produced favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

    Dataset      City
    Training     Wellington
    Testing      Tawa
    Validating   Christchurch

    Model architecture

    This model uses the PointCNN model architecture implemented in ArcGIS API for Python.

    Accuracy metrics

    The table below summarizes the accuracy of the predictions on the validation dataset.

    Class              Precision   Recall     F1-score
    Never Classified   0.991200    0.975404   0.983239
    High Vegetation    0.933569    0.975559   0.954102

    Training data

    This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction.

    Train-test split: {Train: 80%, Test: 20%}. This ratio was chosen based on analysis of statistics from previous epochs, which appeared to show a decent improvement.

    The training data used has the following characteristics:

    X, Y, and Z linear unit    Meter
    Z range                    -121.69 m to 26.84 m
    Number of Returns          1 to 5
    Intensity                  16 to 65520
    Point spacing              0.2 ± 0.1
    Scan angle                 -15 to +15
    Maximum points per block   8192
    Block Size                 20 Meters
    Class structure            [0, 5]

    Sample results

    The model was used to classify the Christchurch city dataset, which has a density of 5 pts/m. The model's performance is directly proportional to the dataset's point density and is best on point clouds with noise excluded.

    To learn how to use this model, see this story.
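
    The F1-scores reported in the accuracy metrics are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The short check below recomputes them from the listed precision and recall values; f1_score is a helper name chosen here for illustration and is not part of the ArcGIS tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the accuracy metrics.
reported = {
    "Never Classified": (0.991200, 0.975404, 0.983239),
    "High Vegetation": (0.933569, 0.975559, 0.954102),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.6f}, reported {f1:.6f}")
```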

  16. Data for binary classification experiments - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Aug 16, 2025
    Cite
    (2025). Data for binary classification experiments - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/d06f6716-2ae6-56d5-abbe-e8526da23582
    Dataset updated
    Aug 16, 2025
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Predicting spatial familiarity by exploiting head and eye movements during pedestrian navigation in the real world

    This paper will be published in Springer Nature Scientific Reports.

    File overview

    The structure of the archive is the following:

    Folder "01_data" contains all the data files needed and a readme file describing the structure of each of these data files. These data files are:

    • lsp.csv [contains demographic data about participants]
    • matched_gaze_imu.csv [contains the segmented behavioral data, i.e. both gaze features and imu features]
    • matched_gaze_imu_feature_description.pdf [contains a description of the features contained in matched_gaze_imu.csv]
    • walking_dates.csv [contains an overview of the dates on which participants walked the familiar and unfamiliar routes]
    • users_polygons.csv [contains one or more polygons per participant within which they are familiar]
    • polygons_markers.csv [contains the locations of POIs per polygon with which participants reported being familiar]
    • user_routes.csv [contains the route participants provided between a randomly selected pair of POIs they have provided for a given polygon]

    Folder "02_scripts" contains the data analysis scripts; they are organized in two subfolders:

    • 01_ml_scripts: these are the scripts for the XGBoost classification; they are organized as two python files in which further instructions for use are given.
      • 80_20_code.py is the python file which runs the ML experiments using an 80/20 train/test split.
      • L5O4T_code.py is the python file which runs the ML experiments leaving the full data of five different participants per condition as unseen data for the test.
      • requirements.txt states the used Python package versions.
    • 02_r_scripts:
      • cleaned_script.Rmd is an R notebook which can be easily opened in RStudio and provides the analysis scripts for the descriptive statistics presented in the paper.
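
    The second evaluation setup described above holds out the full data of five participants instead of a random 20% of rows, so no test participant is ever seen in training. The sketch below illustrates that idea only; it is not the code in L5O4T_code.py. The function name, the "participant" field and the toy rows are assumptions made for this example, and it ignores the "per condition" aspect for brevity.

```python
import random


def hold_out_participants(rows, n_held_out=5, seed=0):
    """Split rows so that every row of a held-out participant lands in the test set."""
    participants = sorted({row["participant"] for row in rows})
    rng = random.Random(seed)
    held_out = set(rng.sample(participants, n_held_out))
    train = [row for row in rows if row["participant"] not in held_out]
    test = [row for row in rows if row["participant"] in held_out]
    return train, test


# Toy data: 20 hypothetical participants with 3 segments each.
rows = [{"participant": p, "segment": s} for p in range(20) for s in range(3)]
train, test = hold_out_participants(rows)
print(len(train), len(test))
```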

  17. Dataset for Cost-effective Simulation-based Test Selection in Self-driving...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 17, 2024
    Cite
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella (2024). Dataset for Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor [Dataset]. http://doi.org/10.5281/zenodo.5914130
    Explore at:
    zip, pdf (available download formats)
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software

    This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.

    GitHub: github.com/ChristianBirchler/sdc-scissor

    This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.

    Usage

    Demo

    YouTube Link

    Installation

    The tool can either be run with Docker or locally using Poetry.

    When running the simulations a working installation of BeamNG.research is required. Additionally, this simulation cannot be run in a Docker container but must run locally.

    To install the application use one of the following approaches:

    • Docker: docker build --tag sdc-scissor .
    • Poetry: poetry install

    Using the Tool

    The tool can be used with the following two commands:

    • Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (files that the tool writes to /out in the container end up in the local folder results)
    • Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]

    There are multiple commands to use. To keep the documentation simple, only the commands and their options are described.

    • Generation of tests:
      • generate-tests --out-path /path/to/store/tests
    • Automated labeling of Tests:
      • label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
      • Note: This only works locally with BeamNG.research installed
    • Model evaluation:
      • evaluate-models --dataset /path/to/train/set --save
    • Split train and test data:
      • split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
    • Test outcome prediction:
      • predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
    • Evaluation based on random strategy:
      • evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib

    The possible parameters are always documented with --help.
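
    When several of these commands are chained in a script, it can help to assemble the argument lists programmatically. The sketch below only builds and prints the Poetry command line using the command and option names listed above; poetry_command is a hypothetical helper written for this example and is not part of SDC-Scissor.

```python
import shlex


def poetry_command(command, **options):
    """Build the argument list for `poetry run python sdc-scissor.py COMMAND [OPTIONS]`."""
    args = ["poetry", "run", "python", "sdc-scissor.py", command]
    for name, value in options.items():
        # Keyword train_ratio becomes the flag --train-ratio.
        args.append("--" + name.replace("_", "-"))
        if value is not True:  # True marks a bare flag such as --save
            args.append(str(value))
    return args


split_cmd = poetry_command(
    "split-train-test-data",
    scenarios="/path/to/scenarios",
    train_dir="/path/for/train/data",
    test_dir="/path/for/test/data",
    train_ratio=0.8,
)
print(shlex.join(split_cmd))
```

    The resulting list could be handed to subprocess.run, which requires a working local installation of the tool.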

    Linting

    The tool is checked with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:

    poetry run flake8 .
    poetry run pylint **/*.py

    License

    The software we developed is distributed under the GNU GPL license. See the LICENSE.md file.

    Contacts

    Christian Birchler - Zurich University of Applied Sciences (ZHAW), Switzerland - birc@zhaw.ch

    Nicolas Ganz - Zurich University of Applied Sciences (ZHAW), Switzerland - gann@zhaw.ch

    Sajad Khatiri - Zurich University of Applied Sciences (ZHAW), Switzerland - mazr@zhaw.ch

    Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de

    Dr. Sebastiano Panichella - Zurich University of Applied Sciences (ZHAW), Switzerland - panc@zhaw.ch

    References

    • Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor. In 2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE.

    If you use this tool in your research, please cite the following paper:

    @INPROCEEDINGS{Birchler2022,
     author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
     booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
     title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
     year={2022},
    }
  18. cf-cpp-to-python-code-generation

    • huggingface.co
    Updated Jul 20, 2025
    Cite
    Hesam Haddad (2025). cf-cpp-to-python-code-generation [Dataset]. https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation
    Dataset updated
    Jul 20, 2025
    Authors
    Hesam Haddad
    Description

    Dataset

    The cf-llm-finetune uses a synthetic parallel dataset built from the Codeforces submissions and problems. C++ ICPC-style solutions are filtered, cleaned, and paired with problem statements to generate Python translations using GPT-4.1, creating a fine-tuning dataset for code translation. The final dataset consists of C++ solutions from 2,000 unique problems, and synthetic Python answers, split into train (1,400), validation (300), and test (300) sets. For details on dataset… See the full description on the dataset page: https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation.
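
    The stated split sizes (1,400 / 300 / 300 out of 2,000 problems) correspond to a 70/15/15 partition. The sketch below shows one way such a fixed-size partition could be produced; it is an assumption-laden illustration, not the dataset author's procedure. The function name, the seed and the placeholder problem IDs are invented here, and splitting at the level of problems is itself an assumption.

```python
import random


def three_way_split(items, n_train, n_val, n_test, seed=0):
    """Shuffle items and cut them into train/validation/test lists of fixed sizes."""
    shuffled = list(items)
    if n_train + n_val + n_test != len(shuffled):
        raise ValueError("split sizes must add up to the number of items")
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


problem_ids = range(2000)  # placeholder IDs, one per unique problem
train, val, test = three_way_split(problem_ids, 1400, 300, 300)
print(len(train), len(val), len(test))
```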

  19. Europe PMC Full Text Corpus

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre (2023). Europe PMC Full Text Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.22848380.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article contains 3 core entity types, manually annotated by curators: Gene/Protein, Disease and Organism.

    Corpus Directory Structure

    annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.

    • hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
      • GROUP0/: contains raw manual annotations made by curator GROUP0.
      • GROUP1/: contains raw manual annotations made by curator GROUP1.
      • GROUP2/: contains raw manual annotations made by curator GROUP2.
    • IOB/: contains annotations automatically extracted from the raw manual annotations in hypothesis/csv/, in Inside–Outside–Beginning (IOB) tagging format.
      • dev/: contains IOB format annotations of 45 articles, intended to be used as a dev set in machine learning tasks.
      • test/: contains IOB format annotations of 45 articles, intended to be used as a test set in machine learning tasks.
      • train/: contains IOB format annotations of 210 articles, intended to be used as a training set in machine learning tasks.
    • JSON/: contains annotations automatically extracted from the raw manual annotations in hypothesis/csv/, in JSON format.
    • README.md: a detailed description of all the annotation formats.

    articles/: contains the full-text articles annotated in Europe PMC corpus.

    • Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
    • XML/: contains XML articles directly fetched using the Europe PMC Article RESTful API.
    • README.md: a detailed description of the sentencising and fetching of XML articles.

    docs/: contains related documents that were used for generating the corpus.

    • Annotation guideline.pdf: annotation guideline that is provided to curators to assist the manual annotation.
    • demo to molecular conenctions.pdf: annotation platform guideline that is provided to curators to help them get familiar with the Hypothes.is platform.
    • Training set development.pdf: initial document that details the paper selection procedures.

    pilot/: contains annotations and articles that were used in a pilot study.

    • annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    • articles/: contains the full-text articles annotated in the pilot study.
      • Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
      • XML/: contains XML articles directly fetched using the Europe PMC Article RESTful API.
    • README.md: a detailed description of the sentencising and fetching of XML articles.

    src/: source code for cleaning annotations and generating IOB files.

    • metrics/ner_metrics.py: Python script containing SemEval evaluation metrics.
    • annotations.py: Python script used to extract annotations from raw Hypothes.is annotations.
    • generate_IOB_dataset.py: Python script used to convert JSON format annotations to IOB tagging format.
    • generate_json_dataset.py: Python script used to extract annotations to JSON format.
    • hypothesis.py: Python script used to fetch raw Hypothes.is annotations.
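
    For readers unfamiliar with the Inside–Outside–Beginning format used in the IOB/ folder, the sketch below shows the general idea of turning entity spans into per-token tags. It is not the corpus's generate_IOB_dataset.py; the function, the example sentence and the label names (GP, DS, OG) are made up for illustration, and the tag names used in the actual files may differ.

```python
def spans_to_iob(tokens, spans):
    """Convert entity spans (start, end_exclusive, label) over token indices to IOB tags."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = "B-" + label  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining tokens of the entity
    return tags


tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "in", "humans"]
spans = [(2, 3, "GP"), (4, 6, "DS"), (7, 8, "OG")]  # hypothetical label names
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(token, tag)
```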

    License

    CC BY

    Feedback

    For any comments, questions, or suggestions, please contact us through helpdesk@europepmc.org or the Europe PMC contact page.

  20. MountainScape Segmentation Dataset

    • search.dataone.org
    Updated Dec 11, 2024
    Cite
    Mountain Legacy Project (2024). MountainScape Segmentation Dataset [Dataset]. http://doi.org/10.5683/SP3/CEYU10
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Borealis
    Authors
    Mountain Legacy Project
    Time period covered
    Jan 1, 1870 - Aug 30, 2023
    Description

    This dataset contains the MountainScape Segmentation Dataset (MS2D), a collection of oblique mountain images from the Mountain Legacy Project and corresponding manually annotated land cover masks. The dataset is split into 144 historic grayscale images collected by early phototopographic surveyors and 140 modern repeat images captured by the Mountain Legacy Project. The image resolutions range from 16 to 80 megapixels and the corresponding masks are RGB images with 8 landcover classes. The image dataset was used to train and test the Python Landscape Classifier (PyLC), a trainable segmentation network and land cover classification tool for oblique landscape photography. The dataset also contains PyTorch models trained with PyLC using the collection of images and masks.
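
    Because the masks are stored as RGB images, a segmentation pipeline typically has to map each mask colour to an integer class index before training. The sketch below illustrates that step on a tiny nested-list mask. The palette colours are invented for this example (the dataset defines 8 land cover classes whose colours are not given here), and the function is not part of PyLC.

```python
def mask_to_class_indices(mask, palette):
    """Map each RGB pixel of a mask to the index of its colour in the palette."""
    lookup = {colour: index for index, colour in enumerate(palette)}
    return [[lookup[tuple(pixel)] for pixel in row] for row in mask]


# Hypothetical 3-colour palette; the real dataset uses 8 land cover classes.
palette = [(0, 0, 0), (0, 128, 0), (0, 0, 255)]
mask = [
    [(0, 0, 0), (0, 128, 0)],
    [(0, 0, 255), (0, 128, 0)],
]
print(mask_to_class_indices(mask, palette))
```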
