19 datasets found
  1. PAQ_pairs

    • huggingface.co
    Updated Nov 5, 2024
    Cite
    Embedding Training Data (2024). PAQ_pairs [Dataset]. https://huggingface.co/datasets/embedding-data/PAQ_pairs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2024
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "PAQ_pairs"

      Dataset Summary
    

    Pairs of questions and answers obtained from Wikipedia. Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/PAQ_pairs.
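
    For reference, a minimal sketch for loading the pairs with the Hugging Face datasets library (the exact example fields are described on the dataset page):

      from datasets import load_dataset

      dataset = load_dataset("embedding-data/PAQ_pairs", split="train")
      print(dataset[0])  # one question-answer pair; see the card for the field layout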

  2. Parsimonious machine learning for the global mapping of aboveground biomass...

    • data.niaid.nih.gov
    Updated Nov 6, 2024
    Cite
    Anonymous (2024). Parsimonious machine learning for the global mapping of aboveground biomass density [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11580413
    Explore at:
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository hosts the data and code presented in the article "Parsimonious machine learning for the global mapping of aboveground biomass potential". The repository contains a compressed file with all the code needed to reproduce the methodology we developed and to analyse its results. We did not upload the temporary and intermediate data files that are created while the method runs; instead, we uploaded "milestone" data, i.e. final results or important intermediate ones. This includes the final training dataset, model calibration data, the final trained model, the global data for prediction, the final global map of potential aboveground biomass density (AGBD) at present times (raster files at 1 km² and 10 km² resolution), maps depicting regions where climatic conditions are outside the training range of positive AGBD instances, and maps depicting world regions without trees.

    Files:

    code.zip : Compressed directory with all the code needed to reproduce the methodology presented in the manuscript. Contains a README file. Also contains temporary data generated in the process, the training dataset, the trained model, and model calibration data.

    potential_AGBD_Mgha_1km_present_climate_1980_2010.tif : the predicted global potential AGBD under contemporary climate conditions at a resolution of 1 square kilometer.

    potential_AGBD_Mgha_10km_present_climate_1980_2010.tif : the predicted global potential AGBD under contemporary climate conditions downsampled to a resolution of 10 square kilometers.

    potential_AGBD_Mgha_10km_model_difference.tif : the difference between our prediction of potential AGBD and the prediction from a complex state-of-the-art model from Walker et al. (2022).

    potential_AGB_Mg_1km_present_climate_1980_2010.tif : the predicted global potential pixel-level AGB under contemporary climate conditions downsampled to a resolution of 1 square kilometer.

    number_predictors_out_of_range.zip : tiled maps representing the number of climatic predictors outside of the training range before including 0 AGBD instances in the training dataset.

    tree_absence_map.zip : tiled maps representing world regions without trees. Based on Crowther et al. (2015) (https://elischolar.library.yale.edu/yale_fes_data/1/).

    inference_pipeline_potential_agbd_Mgha_climate.pkl : Calibrated model for the prediction of potential AGBD given bioclimatic conditions.

    predictors_data_global.zip : Global predictors data to apply the model on.
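
    The calibrated pipeline can be loaded directly in Python. A hedged sketch, assuming a pickled scikit-learn-style estimator (the 19-column predictor shape is a placeholder; the real feature layout is documented in code.zip's README):

      import pickle
      import numpy as np

      with open("inference_pipeline_potential_agbd_Mgha_climate.pkl", "rb") as f:
          pipeline = pickle.load(f)

      # One row of bioclimatic predictors, in the order given in the README;
      # the shape below is a placeholder assumption.
      X = np.zeros((1, 19))
      print(pipeline.predict(X))  # potential AGBD in Mg/ha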

  3. QQP_triplets

    • huggingface.co
    Updated Sep 21, 2022
    Cite
    Embedding Training Data (2022). QQP_triplets [Dataset]. https://huggingface.co/datasets/embedding-data/QQP_triplets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2022
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "QQP_triplets"

      Dataset Summary
    

    This dataset will give anyone the opportunity to train and test models of semantic equivalence, based on actual Quora data. The data is organized as triplets (anchor, positive, negative). Disclaimer: The team releasing Quora data did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/QQP_triplets.
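
    As an illustration of how such triplets are typically consumed, a hedged sentence-transformers training sketch (the "set"/"query"/"pos"/"neg" field names are assumptions; check the dataset card for the exact layout):

      from datasets import load_dataset
      from sentence_transformers import SentenceTransformer, InputExample, losses
      from torch.utils.data import DataLoader

      ds = load_dataset("embedding-data/QQP_triplets", split="train")

      # Build (anchor, positive, negative) training examples from a small slice.
      examples = [
          InputExample(texts=[r["set"]["query"], r["set"]["pos"][0], r["set"]["neg"][0]])
          for r in ds.select(range(1000))
      ]

      model = SentenceTransformer("all-MiniLM-L6-v2")
      loader = DataLoader(examples, shuffle=True, batch_size=32)
      model.fit(train_objectives=[(loader, losses.TripletLoss(model))], epochs=1)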

  4. 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 25, 2023
    + more versions
    Cite
    Maximilian B. Kiss; Sophia Bethany Coban; K. Joost Batenburg; Tristan van Leeuwen; Felix Lucka (2023). 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning: Slices 1-1,000 [Dataset]. http://doi.org/10.5281/zenodo.8014758
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian B. Kiss; Sophia Bethany Coban; K. Joost Batenburg; Tristan van Leeuwen; Felix Lucka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains slices 1 – 1,000 from the data collection described in

    Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023), or arXiv:2306.05907 (2023)

    Abstract:
    "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."

    The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with a CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, 74.8 µm² each. To create a 2D dataset, a fan-beam geometry was mimicked by reading out only the central row of the detector. Between source and detector there is a rotation stage upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow moving the components independently of one another.

    Please refer to the paper for all further technical details.

    The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
    The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.

    The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.
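
    For a quick look at a single file without the official scripts, a hedged sketch using the tifffile package (the path is hypothetical; the loaders on GitHub are authoritative):

      import tifffile

      # Hypothetical file name; the actual layout is documented on GitHub.
      projections = tifffile.imread("slice00001/mode1/sinogram.tif")
      print(projections.shape, projections.dtype)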

    Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, e.g. by adding export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE to your .bashrc.

    For more information or guidance in using the data collection, please get in touch with

    Maximilian.Kiss [at] cwi.nl

    Felix.Lucka [at] cwi.nl

  5. Selected MRI datasets for training, validation, and testing.

    • plos.figshare.com
    xls
    Updated May 9, 2025
    Cite
    Yuki Wong; Eileen Lee Ming Su; Che Fai Yeong; William Holderbaum; Chenguang Yang (2025). Selected MRI datasets for training, validation, and testing. [Dataset]. http://doi.org/10.1371/journal.pone.0322624.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 9, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yuki Wong; Eileen Lee Ming Su; Che Fai Yeong; William Holderbaum; Chenguang Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Selected MRI datasets for training, validation, and testing.

  6. TreeSatAI Benchmark Archive for Deep Learning in Forest Applications

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf, zip
    Updated Jul 16, 2024
    Cite
    Christian Schulz; Steve Ahlswede; Christiano Gava; Patrick Helber; Benjamin Bischke; Florencia Arias; Michael Förster; Jörn Hees; Begüm Demir; Birgit Kleinschmit (2024). TreeSatAI Benchmark Archive for Deep Learning in Forest Applications [Dataset]. http://doi.org/10.5281/zenodo.6598391
    Explore at:
    Available download formats: pdf, zip, bin
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christian Schulz; Steve Ahlswede; Christiano Gava; Patrick Helber; Benjamin Bischke; Florencia Arias; Michael Förster; Jörn Hees; Begüm Demir; Birgit Kleinschmit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context and Aim

    Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.

    We found that the federal forest inventory of Lower Saxony, Germany, represents an untapped treasure of annotated samples for training data generation. The respective 20-cm color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.

    Description

    The data archive is highly suitable for benchmarking, as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples supported by high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.

    The TreeSatAI Benchmark Archive contains:

    • 50,381 image triplets (aerial, Sentinel-1, Sentinel-2)

    • synchronized time steps and locations

    • all original spectral bands/polarizations from the sensors

    • 20 species classes (single labels)

    • 12 age classes (single labels)

    • 15 genus classes (multi labels)

    • 60 m and 200 m patches

    • fixed split for train (90%) and test (10%) data

    • additional single labels such as English species name, genus, forest stand type, foliage type, land cover

    The GeoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and the publications in the reference section.

    Version history

    v1.0.0 - First release

    Citation

    Ahlswede et al. (in prep.)

    GitHub

    Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.

    Folder structure

    We refer to the proposed folder structure in the PDF file.

    • Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.

    • Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.

    • Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.

    • The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. An example entry for an image sample with proportions of roughly 94% Abies and 6% Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]] (see the loading sketch after this list).

    • The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.

    • The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
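
    A hedged sketch of reading one label entry and the matching aerial patch (the JSON file name is illustrative; rasterio is one of several GeoTIFF readers):

      import json
      import rasterio

      with open("labels/multi_labels.json") as f:  # illustrative file name
          labels = json.load(f)

      name = "Abies_alba_3_834_WEFL_NLF.tif"  # example from the card
      print(labels[name])  # [["Abies", 0.93771], ["Larix", 0.06229]]

      with rasterio.open(f"aerial/60m/{name}") as src:
          patch = src.read()  # band order: near-infrared, red, green, blue
      print(patch.shape)  # (4, 304, 304) at 20 cm resolution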

    CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single-species files (aerial_60m_…zip) separately. Then unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).

    Join the archive

    Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community develop better deep learning and machine learning models for forest applications. Do you have questions, or want to share code, results or publications using the archive? Feel free to contact the authors.

    Project description

    This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).

    Publications

    Ahlswede et al. (2022, in prep.): TreeSatAI Dataset Publication

    Ahlswede S., Nimisha, T.M., and Demir, B. (2022, in revision): Embedded Self-Enhancement Maps for Weakly Supervised Tree Species Mapping in Remote Sensing Images. IEEE Trans Geosci Remote Sens

    Schulz et al. (2022, in prep.): Phenoprofiling

    Conference contributions

    S. Ahlswede, N. T. Madam, C. Schulz, B. Kleinschmit and B. Demir, "Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods", IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.

    C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, “Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series”, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.

    C. Schulz, M. Förster, S. Vulova and B. Kleinschmit, “The temporal fingerprints of common European forest types from SAR and optical remote sensing data”, AGU Fall Meeting, New Orleans, USA, 2021.

    B. Kleinschmit, M. Förster, C. Schulz, F. Arias, B. Demir, S. Ahlswede, A. K. Aksoy, T. Ha Minh, J. Hees, C. Gava, P. Helber, B. Bischke, P. Habelitz, A. Frick, R. Klinke, S. Gey, D. Seidel, S. Przywarra, R. Zondag and B. Odermatt, “Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests”, Living Planet Symposium, Bonn, Germany, 2022.

    C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, (2022, submitted): “Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series”, ForestSAT, Berlin, Germany, 2022.

  7. Synthetically Spoken STAIR

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Jean-Pierre Chevrot (2020). Synthetically Spoken STAIR [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1495069
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    William N. Havard
    Laurent Besacier
    Jean-Pierre Chevrot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (see article | dataset | code), we generated speech for each caption of the STAIR dataset using Google's Text-to-Speech API.

    This dataset was used for visually grounded speech experiments (see article accepted at ICASSP2019).

    @INPROCEEDINGS{8683069,
      author={W. N. {Havard} and J. {Chevrot} and L. {Besacier}},
      booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      title={Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese},
      year={2019},
      pages={8618-8622},
      keywords={information retrieval;natural language processing;neural nets;speech processing;word processing;artificial neural attention;human attention;monolingual models;part-of-speech tags;nouns;neural models;visually grounded speech signal;English language;Japanese language;word endings;cross-lingual speech-to-speech retrieval;grounded language learning;attention mechanism;cross-lingual speech retrieval;recurrent neural networks.},
      doi={10.1109/ICASSP.2019.8683069},
      ISSN={2379-190X},
      month={May},
    }

    The dataset comprises the following files:

    mp3-stair.tar.gz : MP3 files of each caption in the STAIR dataset. Filenames have the following pattern imageID_captionID, where both imageID and captionID correspond to those provided in the original dataset (see annotation format here)

    dataset.mfcc.npy : Numpy array with MFCC vectors for each caption. MFCCs were extracted using python_speech_features with the default configuration. To know which caption the MFCC vectors belong to, use the files dataset.words.txt and dataset.ids.txt.

    dataset.words.txt : Captions corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)

    dataset.ids.txt : IDs of the captions (imageID_captionID) corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)
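
    A minimal sketch for lining up the three files (row i of the array corresponds to line i of both text files, as described above):

      import numpy as np

      mfcc = np.load("dataset.mfcc.npy", allow_pickle=True)
      words = open("dataset.words.txt").read().splitlines()
      ids = open("dataset.ids.txt").read().splitlines()

      i = 0
      # python_speech_features' default configuration yields 13 coefficients per frame.
      print(ids[i], words[i], np.asarray(mfcc[i]).shape)  # e.g. (n_frames, 13)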

    Splits

    test

    test.txt : captions comprising the test split

    test_ids.txt: IDs of the captions in the test split

    test_tagged.txt : tagged version of the test split

    test-alignments.json.zip : Forced alignments of all the captions in the test split. (dictionary where the key corresponds to the caption ID in the STAIR dataset). Due to an unknown error during upload, the JSON file had to be zipped...

    train

    train.txt : captions comprising the train split

    train_ids.txt : IDs of the captions in the train split

    train_tagged.txt : tagged version of the train split

    val

    val.txt : captions comprising the val split

    val_ids.txt : IDs of the captions in the val split

    val_tagged.txt : tagged version of the val split

  8. MaleBin: Malware Binary Greyscale Images

    • kaggle.com
    Updated May 4, 2025
    Cite
    tashie (2025). MaleBin: Malware Binary Greyscale Images [Dataset]. http://doi.org/10.34740/kaggle/dsv/11674648
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 4, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    tashie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    New dataset link: https://www.kaggle.com/datasets/tashiee/malebin-2-0-rgb-malware-binary-images

    Important Notice (PLEASE READ): A more comprehensive dataset has been developed, featuring improved preprocessing steps and yielding more accurate classification results. The model trained on this dataset performs poorly on current malware variants, and there are resizing issues that lead to distorted images.

    Due to current time constraints, I am unable to upload the new datasets and accompanying notebooks along with detailed documentation. If you require access to the updated resources, please feel free to contact me at tashvin.raj56@gmail.com — I will be happy to share them personally or update the dataset as soon as possible.

    Additionally, while the Malimg dataset performs reliably within a closed-set environment, note that its malware samples are outdated. As a result, it may not generalize well to modern, real-world malware threats.

    I would therefore advise against using this dataset for model training; please contact me during office hours instead. Thanks.

    This MaleBin Dataset contains 12,464 malware binary images across 39 families. The dataset is compiled from two separate sources:

    1. Malimg Dataset by Nataraj et al. (2011)

    2. A portion of samples from https://www.kaggle.com/datasets/walt30/malware-images. Full credits to: https://www.kaggle.com/walt30.

    The first source, the Malimg dataset, is widely recognized in the field of malware detection and consists of malware images generated by transforming binaries into greyscale images via byte-to-pixel mapping. For the second source, the malicious files were downloaded from MalwareBazaar and, as stated by the author, visualized following the approach presented by Nataraj et al.
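
    A hedged sketch of that byte-to-pixel visualization (file names are hypothetical; this illustrates the approach, it is not the exact script used to build the dataset):

      import numpy as np
      from PIL import Image

      # Read raw bytes and map each byte to one greyscale pixel.
      data = np.fromfile("malware_sample.bin", dtype=np.uint8)  # hypothetical input
      width = 256
      rows = len(data) // width
      img = Image.fromarray(data[: rows * width].reshape(rows, width), mode="L")
      img = img.resize((256, 256))  # this dataset standardizes samples to 256x256
      img.save("malware_sample.png")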


    This new dataset was compiled to address a few challenges:

    1. To balance the number of samples across each family.

    2. To resize all samples to 256x256.

    3. To overcome the lack of suitable datasets (most existing ones, such as Malimg, are outdated, and newer ones contain a mix of greyscale and RGB).

    Note that some samples were omitted to maintain balance, which helps avoid overfitting and reduces the overall workload.

    Also, please note that I do not take credit for the original datasets. Full credits are due to the respective owners.

    Please do contact me if there are any oversights regarding the dataset.

  9. coco_captions_quintets

    • huggingface.co
    Updated Aug 11, 2022
    Cite
    Embedding Training Data (2022). coco_captions_quintets [Dataset]. https://huggingface.co/datasets/embedding-data/coco_captions_quintets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "coco_captions"

      Dataset Summary
    

    COCO is a large-scale object detection, segmentation, and captioning dataset. This repo contains five captions per image; useful for sentence similarity tasks. Disclaimer: The team releasing COCO did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/coco_captions_quintets.

  10. Invoice Management Dataset

    • universe.roboflow.com
    zip
    Updated Dec 28, 2024
    Cite
    CVIP Workspace (2024). Invoice Management Dataset [Dataset]. https://universe.roboflow.com/cvip-workspace/invoice-management/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 28, 2024
    Dataset authored and provided by
    CVIP Workspace
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    Intelligent Invoice Management System

    Project Description:
    The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.

    Problem Statement:
    Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.

    Proposed Solution:
    The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
    1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data.
    2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
    3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields.
    4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
    5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
    - Total sales within a specified duration.
    - Total sales tax paid during a given timeframe.
    - Detailed invoice information in tabular form for specific date ranges.

    Key Features and Deliverables:
    1. Invoice Generation:
    - Generate 20,000 invoices using an automated script.
    - Include dummy logos, company details, and itemized tables for four items per invoice.

    2. Label Definition and Format:

      • Define structured labels (TBLR, CLASS Name, Recognized Text).
      • Provide labels in both XML and JSON formats for seamless integration.
    3. OCR and AI Training (see the sketch after this list):

      • Automate labeling using Tesseract OCR for high-accuracy text recognition.
      • Train and test YOLO to detect and classify invoice fields (TBLR and CLASS).
    4. Database Management:

      • Store OCR-extracted labels and field data in a database.
      • Enable efficient search and aggregation of invoice data.
    5. Web-Based Interface:

      • Build a responsive system for users to upload invoices and retrieve data based on company name or NTN.
      • Display metrics and reports for total sales, tax paid, and invoice details over custom date ranges.
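
    A hedged sketch of the OCR step from item 3, using pytesseract's word-level boxes (the invoice path is hypothetical):

      from PIL import Image
      import pytesseract
      from pytesseract import Output

      img = Image.open("invoices/invoice_00001.png")  # hypothetical path
      data = pytesseract.image_to_data(img, output_type=Output.DICT)

      # Each detected token comes with a left/top/width/height box, which maps
      # onto the TBLR-style field labels described above.
      for text, left, top, w, h in zip(data["text"], data["left"], data["top"],
                                       data["width"], data["height"]):
          if text.strip():
              print(text, (left, top, w, h))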

    Expected Outcomes:
    - Reduction in manual effort and operational costs.
    - Improved accuracy in invoice processing and financial reporting.
    - Enhanced scalability and adaptability for diverse invoice formats.
    - Faster turnaround time for invoice-related tasks.

    By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.

  11. SPECTER

    • huggingface.co
    Updated Jul 12, 2023
    Cite
    Embedding Training Data (2023). SPECTER [Dataset]. https://huggingface.co/datasets/embedding-data/SPECTER
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 12, 2023
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "SPECTER"

      Dataset Summary
    

    Dataset containing triplets (three sentences): anchor, positive, and negative. Contains titles of papers. Disclaimer: The team releasing SPECTER did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Dataset Structure
    

    Each example in the dataset contains triplets of equivalent sentences and is formatted as a dictionary with the key "set" and a list with… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/SPECTER.

  12. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Cite
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Explore at:
    Available download formats: zip (44,486,490 bytes)
    Dataset updated
    Sep 11, 2019
    Authors
    Ajay
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners start with curated datasets, but it is well known that in a real data science project, most of the time is spent collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its images. In my search for images I found that approximately 10 percent of the pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

    Content

    There are 4000 images of two popular Maruti Suzuki car models in India (Swift and WagonR), with 2000 pictures per model. The data is divided into a training set of 2400 images, a validation set of 800 images and a test set of 800 images, and was randomized before splitting.

    A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization and transfer learning.
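
    A minimal Keras baseline sketch along these lines (directory names and image size are assumptions; match them to the dataset's folders):

      import tensorflow as tf

      train_ds = tf.keras.utils.image_dataset_from_directory(
          "cars_wagonr_swift/train", image_size=(150, 150), batch_size=32)
      val_ds = tf.keras.utils.image_dataset_from_directory(
          "cars_wagonr_swift/val", image_size=(150, 150), batch_size=32)

      model = tf.keras.Sequential([
          tf.keras.layers.Rescaling(1.0 / 255),
          tf.keras.layers.Conv2D(32, 3, activation="relu"),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: Swift vs WagonR
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy",
                    metrics=["accuracy"])
      model.fit(train_ds, validation_data=val_ds, epochs=5)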

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline Keras model achieves 88% accuracy on the validation set; can we do better, and by how much?

    2. Is the data collected for the two car models representative of such cars across the country, or is there sample bias?

    3. I would also like someone to extend the concept into a use case where, if a user uploads an incorrect car picture, the ML model automatically flags it, for example when the user uploads the wrong model or an image that is not a car.

  13. Pretraining data for PeptideCLM (UPDATED)

    • zenodo.org
    bin, csv
    Updated Mar 18, 2025
    Cite
    Aaron Feller; Aaron Feller (2025). Pretraining data for PeptideCLM (UPDATED) [Dataset]. http://doi.org/10.5281/zenodo.15042141
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aaron Feller; Aaron Feller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 20, 2024
    Description

    This version update includes changes to Generated_peptides.csv to fix cyclization: the prior upload did not generate ring closures correctly in the SMILES strings. Although the model in the publication was trained on the dataset containing errors, to support the community we decided it would be best to release a corrected 10M-peptide SMILES dataset for use in future pretraining applications. All strings should now load correctly into mol objects with RDKit.
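
    A quick way to verify the fix, as a hedged sketch (the CSV column layout is an assumption; check the header first):

      import pandas as pd
      from rdkit import Chem

      df = pd.read_csv("Generated_peptides.csv")
      col = df.columns[0]  # assumed to be the SMILES column
      bad = [s for s in df[col].astype(str) if Chem.MolFromSmiles(s) is None]
      print(f"{len(bad)} of {len(df)} SMILES strings failed to parse")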

  14. Modified Versions of Diving48: Shape and Texture

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 12, 2023
    Cite
    Broomé, Sofia (2023). Modified Versions of Diving48: Shape and Texture [Dataset]. http://doi.org/10.7910/DVN/MXJPIZ
    Explore at:
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Broomé, Sofia
    Description

    We modify the Diving48 dataset ("RESOUND: Towards Action Recognition without Representation Bias", Li et al., ECCV 2020) into three new domains: two based on shape and one based on texture (following Geirhos et al., ICLR 2019). Note that the Statistical Visual Computing Lab in San Diego (http://www.svcl.ucsd.edu) holds the copyright to the Diving48 dataset. Please cite the RESOUND paper if you use any data related to the Diving48 dataset, including our modified versions here.

    In the shape domains, we blur the background and keep only the segmented diver(s) (S1) or their bounding boxes (S2). In the texture domain (T), we conversely mask out the bounding boxes containing the diver(s) and keep only the background; the masked boxes are filled with the average ImageNet pixel value (following Choi et al., NeurIPS 2019). The class evidence should lie only in the divers' movement; hence the texture version should contain no relevant signal, and accuracy should drop to random performance. We can thus study how different models drop in score when tested on the shape or texture domain, indicating both cross-domain robustness (for S1 and S2) and texture bias (for T).

    This modified dataset was introduced in "Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition", Broomé et al., arXiv 2112.12175. Only the test set of Diving48 was used there; we did not train on these modified domains, they were only used for evaluation. The files are .mp4 videos of 32 frames each, regardless of the length of the original clip (clips are typically around 5 seconds long). We may consider uploading the training set as well; please contact us if you need it urgently. Otherwise, the trained diver-segmentation model is released at https://github.com/sofiabroome/diver-segmentation if you want to perform the cropping and saving yourself, at your own desired frame rate.
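
    A minimal sketch for reading one clip's frames (the path is hypothetical; OpenCV is one of several options):

      import cv2

      cap = cv2.VideoCapture("diving48_S1/clip_0001.mp4")  # hypothetical path
      frames = []
      ok, frame = cap.read()
      while ok:
          frames.append(frame)
          ok, frame = cap.read()
      cap.release()
      print(len(frames))  # expected: 32 frames per clip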

  15. GTSDB German Traffic Sign Detection Benchmark Dataset

    • universe.roboflow.com
    • kaggle.com
    zip
    Updated Jul 6, 2022
    Cite
    Mohamed Traore (2022). Gtsdb German Traffic Sign Detection Benchmark Dataset [Dataset]. https://universe.roboflow.com/mohamed-traore-2ekkp/gtsdb---german-traffic-sign-detection-benchmark/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 6, 2022
    Dataset authored and provided by
    Mohamed Traore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Signs Bounding Boxes
    Description

    This project was created by downloading the GTSDB German Traffic Sign Detection Benchmark dataset from Kaggle and importing the annotated training set files (images and annotation files) to Roboflow.

    https://www.kaggle.com/datasets/safabouguezzi/german-traffic-sign-detection-benchmark-gtsdb

    The annotation files were adjusted to conform to the YOLO Keras TXT format prior to upload, as the original format did not include a label map file.

    v1 contains the original imported images, without augmentations. This is the version to download and import to your own project if you'd like to add your own augmentations.

    v2 contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "FAST" model.

    v3 contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "ACCURATE" model.

  16. altlex

    • huggingface.co
    Cite
    Embedding Training Data, altlex [Dataset]. https://huggingface.co/datasets/embedding-data/altlex
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "altlex"

      Dataset Summary
    

    Git repository for software associated with the 2016 ACL paper "Identifying Causal Relations Using Parallel Wikipedia Articles." Disclaimer: The team releasing altlex did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/altlex.

  17. Titanic Dataset

    • kaggle.com
    Updated Apr 25, 2025
    Cite
    Muhammad Mudasar Sabir (2025). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/mudasarsabir/titanic-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Mudasar Sabir
    Description

    👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

    If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle

    The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

    Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the “Join Competition” button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial that walks you through step by step how to make your first submission!

    The Challenge

    The sinking of the Titanic is one of the most infamous shipwrecks in history.

    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

    While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

    In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

    Recommended Tutorial

    We highly recommend Alexis Cook’s Titanic Tutorial, which walks you through making your very first submission step by step, and this starter notebook to get started.

    How Kaggle’s Competitions Work

    Join the Competition: Read about the challenge description, accept the Competition Rules and gain access to the competition dataset.

    Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file.

    Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.

    Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.

    Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.

    Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!

    What Data Will I Use in This Competition?

    In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

    Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.

    The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

    Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.

    Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

    How to Submit your Prediction to Kaggle

    Once you’re ready to make a submission and get on the leaderboard:

    Click on the “Submit Predictions” button

    Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.

    Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

    The file should have exactly 2 columns:

    PassengerId (sorted in any order)

    Survived (contains your binary predictions: 1 for survived, 0 for deceased)

    Got it! I’m ready to get started. Where do I get help if I need it?

    For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
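
    As a concrete illustration of the submission format above, a minimal pandas sketch (the all-zeros predictions are placeholders for your model's output):

      import pandas as pd

      test = pd.read_csv("test.csv")
      submission = pd.DataFrame({
          "PassengerId": test["PassengerId"],
          "Survived": 0,  # placeholder; substitute your model's 0/1 predictions
      })
      submission.to_csv("submission.csv", index=False)  # 418 rows + header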

    A Last Word on Kaggle Notebooks

    As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...

  18. sentence-compression

    • huggingface.co
    • opendatalab.com
    Updated Feb 3, 2012
    Cite
    Embedding Training Data (2012). sentence-compression [Dataset]. https://huggingface.co/datasets/embedding-data/sentence-compression
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2012
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "sentence-compression"

      Dataset Summary
    

    Dataset with pairs of equivalent sentences. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from using the dataset. Disclaimer: The team releasing sentence-compression did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/sentence-compression.
    
  19. Amazon-QA

    • huggingface.co
    Updated Nov 5, 2024
    Cite
    Embedding Training Data (2024). Amazon-QA [Dataset]. https://huggingface.co/datasets/embedding-data/Amazon-QA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2024
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "Amazon-QA"

      Dataset Summary
    

    This dataset contains Question and Answer data from Amazon. Disclaimer: The team releasing Amazon-QA did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/Amazon-QA.
