19 datasets found
  1. PAQ_pairs

    • huggingface.co
    Updated Nov 5, 2024
    Cite
    Embedding Training Data (2024). PAQ_pairs [Dataset]. https://huggingface.co/datasets/embedding-data/PAQ_pairs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2024
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "PAQ_pairs"

      Dataset Summary
    

    Pairs of questions and answers obtained from Wikipedia. Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/PAQ_pairs.
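
    For reference, a minimal sketch for loading the pairs with the Hugging Face datasets library (the exact example fields are described on the dataset page):

      from datasets import load_dataset

      dataset = load_dataset("embedding-data/PAQ_pairs", split="train")
      print(dataset[0])  # one question-answer pair; see the card for the field layout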

  2. Parsimonious machine learning for the global mapping of aboveground biomass...

    • data.niaid.nih.gov
    Updated Nov 6, 2024
    Cite
    Anonymous (2024). Parsimonious machine learning for the global mapping of aboveground biomass density [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11580413
    Explore at:
    Dataset updated
    Nov 6, 2024
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository hosts the data and code presented in the article "Parsimonious machine learning for the global mapping of aboveground biomass potential". The repository contains a compressed file with all the code needed to reproduce the methodology we developed and to analyse its results. We did not upload the temporary and intermediate data files that are created while the method runs; instead, we uploaded "milestone" data, i.e. final results or important intermediate ones. This includes the final training dataset, model calibration data, the final trained model, the global data for prediction, the final global map of potential aboveground biomass density (AGBD) at present times (raster files at 1 km² and 10 km² resolution), maps depicting regions where climatic conditions are outside the training range of positive AGBD instances, and maps depicting world regions without trees.

    Files:

    code.zip : Compressed directory with all the code needed to reproduce the methodology presented in the manuscript. Contains a README file. Also contains temporary data generated in the process, the training dataset, the trained model, and model calibration data.

    potential_AGBD_Mgha_1km_present_climate_1980_2010.tif : the predicted global potential AGBD under contemporary climate conditions at a resolution of 1 square kilometer.

    potential_AGBD_Mgha_10km_present_climate_1980_2010.tif : the predicted global potential AGBD under contemporary climate conditions downsampled to a resolution of 10 square kilometers.

    potential_AGBD_Mgha_10km_model_difference.tif : the difference between our prediction of potential AGBD and the prediction from a complex state-of-the-art model from Walker et al. (2022).

    potential_AGB_Mg_1km_present_climate_1980_2010.tif : the predicted global potential pixel-level AGB under contemporary climate conditions downsampled to a resolution of 1 square kilometer.

    number_predictors_out_of_range.zip : tiled maps representing the number of climatic predictors outside of the training range before including 0 AGBD instances in the training dataset.

    tree_absence_map.zip : tiled maps representing world regions without trees. Based on Crowther et al. (2015) (https://elischolar.library.yale.edu/yale_fes_data/1/).

    inference_pipeline_potential_agbd_Mgha_climate.pkl : Calibrated model for the prediction of potential AGBD given bioclimatic conditions.

    predictors_data_global.zip : Global predictors data to apply the model on.
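
    The calibrated pipeline can be loaded directly in Python. A hedged sketch, assuming a pickled scikit-learn-style estimator (the 19-column predictor shape is a placeholder; the real feature layout is documented in code.zip's README):

      import pickle
      import numpy as np

      with open("inference_pipeline_potential_agbd_Mgha_climate.pkl", "rb") as f:
          pipeline = pickle.load(f)

      # One row of bioclimatic predictors, in the order given in the README;
      # the shape below is a placeholder assumption.
      X = np.zeros((1, 19))
      print(pipeline.predict(X))  # potential AGBD in Mg/ha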

  3. QQP_triplets

    • huggingface.co
    Updated Sep 21, 2022
    Cite
    Embedding Training Data (2022). QQP_triplets [Dataset]. https://huggingface.co/datasets/embedding-data/QQP_triplets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2022
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "QQP_triplets"

      Dataset Summary
    

    This dataset will give anyone the opportunity to train and test models of semantic equivalence, based on actual Quora data. The data is organized as triplets (anchor, positive, negative). Disclaimer: The team releasing Quora data did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/QQP_triplets.
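
    As an illustration of how such triplets are typically consumed, a hedged sentence-transformers training sketch (the "set"/"query"/"pos"/"neg" field names are assumptions; check the dataset card for the exact layout):

      from datasets import load_dataset
      from sentence_transformers import SentenceTransformer, InputExample, losses
      from torch.utils.data import DataLoader

      ds = load_dataset("embedding-data/QQP_triplets", split="train")

      # Build (anchor, positive, negative) training examples from a small slice.
      examples = [
          InputExample(texts=[r["set"]["query"], r["set"]["pos"][0], r["set"]["neg"][0]])
          for r in ds.select(range(1000))
      ]

      model = SentenceTransformer("all-MiniLM-L6-v2")
      loader = DataLoader(examples, shuffle=True, batch_size=32)
      model.fit(train_objectives=[(loader, losses.TripletLoss(model))], epochs=1)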

  4. 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 25, 2023
    + more versions
    Cite
    Maximilian B. Kiss; Sophia Bethany Coban; K. Joost Batenburg; Tristan van Leeuwen; Felix Lucka (2023). 2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning: Slices 1-1,000 [Dataset]. http://doi.org/10.5281/zenodo.8014758
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian B. Kiss; Sophia Bethany Coban; K. Joost Batenburg; Tristan van Leeuwen; Felix Lucka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains slices 1 – 1,000 from the data collection described in

    Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023), or arXiv:2306.05907 (2023)

    Abstract:
    "Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."

    The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with a CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, 74.8 µm² each. To create a 2D dataset, a fan-beam geometry was mimicked by reading out only the central row of the detector. Between source and detector there is a rotation stage upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow moving the components independently of one another.

    Please refer to the paper for all further technical details.

    The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
    The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.

    The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on github. A machine-readable file with the used scanning parameters and instrument data for each acquisition mode as well as a script loading it can be found on the GitHub repository as well.
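
    For a quick look at a single file without the official scripts, a hedged sketch using the tifffile package (the path is hypothetical; the loaders on GitHub are authoritative):

      import tifffile

      # Hypothetical file name; the actual layout is documented on GitHub.
      projections = tifffile.imread("slice00001/mode1/sinogram.tif")
      print(projections.shape, projections.dtype)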

    Note: It is advisable to use the graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, e.g. by adding export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE to your .bashrc.

    For more information or guidance in using the data collection, please get in touch with

    Maximilian.Kiss [at] cwi.nl

    Felix.Lucka [at] cwi.nl

  5. Selected MRI datasets for training, validation, and testing.

    • plos.figshare.com
    xls
    Updated May 9, 2025
    Cite
    Yuki Wong; Eileen Lee Ming Su; Che Fai Yeong; William Holderbaum; Chenguang Yang (2025). Selected MRI datasets for training, validation, and testing. [Dataset]. http://doi.org/10.1371/journal.pone.0322624.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 9, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yuki Wong; Eileen Lee Ming Su; Che Fai Yeong; William Holderbaum; Chenguang Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Selected MRI datasets for training, validation, and testing.

  6. TreeSatAI Benchmark Archive for Deep Learning in Forest Applications

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf, zip
    Updated Jul 16, 2024
    Cite
    Christian Schulz; Steve Ahlswede; Christiano Gava; Patrick Helber; Benjamin Bischke; Florencia Arias; Michael Förster; Jörn Hees; Begüm Demir; Birgit Kleinschmit (2024). TreeSatAI Benchmark Archive for Deep Learning in Forest Applications [Dataset]. http://doi.org/10.5281/zenodo.6598391
    Explore at:
    Available download formats: pdf, zip, bin
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christian Schulz; Steve Ahlswede; Christiano Gava; Patrick Helber; Benjamin Bischke; Florencia Arias; Michael Förster; Jörn Hees; Begüm Demir; Birgit Kleinschmit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context and Aim

    Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.

    We found that the federal forest inventory of Lower Saxony, Germany, represents an untapped treasure of annotated samples for training data generation. The respective 20-cm color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.

    Description

    The data archive is highly suitable for benchmarking, as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples supported by high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.

    The TreeSatAI Benchmark Archive contains:

    • 50,381 image triplets (aerial, Sentinel-1, Sentinel-2)

    • synchronized time steps and locations

    • all original spectral bands/polarizations from the sensors

    • 20 species classes (single labels)

    • 12 age classes (single labels)

    • 15 genus classes (multi labels)

    • 60 m and 200 m patches

    • fixed split for train (90%) and test (10%) data

    • additional single labels such as English species name, genus, forest stand type, foliage type, land cover

    The GeoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and the publications in the reference section.

    Version history

    v1.0.0 - First release

    Citation

    Ahlswede et al. (in prep.)

    GitHub

    Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.

    Folder structure

    We refer to the proposed folder structure in the PDF file.

    • Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.

    • Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.

    • Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.

    • The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. An example entry for an image sample with proportions of roughly 94% Abies and 6% Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]] (see the loading sketch after this list).

    • The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.

    • The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
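
    A hedged sketch of reading one label entry and the matching aerial patch (the JSON file name is illustrative; rasterio is one of several GeoTIFF readers):

      import json
      import rasterio

      with open("labels/multi_labels.json") as f:  # illustrative file name
          labels = json.load(f)

      name = "Abies_alba_3_834_WEFL_NLF.tif"  # example from the card
      print(labels[name])  # [["Abies", 0.93771], ["Larix", 0.06229]]

      with rasterio.open(f"aerial/60m/{name}") as src:
          patch = src.read()  # band order: near-infrared, red, green, blue
      print(patch.shape)  # (4, 304, 304) at 20 cm resolution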

    CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single-species files (aerial_60m_…zip) separately. Then unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).

    Join the archive

    Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community develop better deep learning and machine learning models for forest applications. Do you have questions, or want to share code, results or publications using the archive? Feel free to contact the authors.

    Project description

    This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).

    Publications

    Ahlswede et al. (2022, in prep.): TreeSatAI Dataset Publication

    Ahlswede S., Nimisha, T.M., and Demir, B. (2022, in revision): Embedded Self-Enhancement Maps for Weakly Supervised Tree Species Mapping in Remote Sensing Images. IEEE Trans Geosci Remote Sens

    Schulz et al. (2022, in prep.): Phenoprofiling

    Conference contributions

    S. Ahlswede, N. T. Madam, C. Schulz, B. Kleinschmit and B. Demir, "Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods", IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.

    C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, “Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series”, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.

    C. Schulz, M. Förster, S. Vulova and B. Kleinschmit, “The temporal fingerprints of common European forest types from SAR and optical remote sensing data”, AGU Fall Meeting, New Orleans, USA, 2021.

    B. Kleinschmit, M. Förster, C. Schulz, F. Arias, B. Demir, S. Ahlswede, A. K. Aksoy, T. Ha Minh, J. Hees, C. Gava, P. Helber, B. Bischke, P. Habelitz, A. Frick, R. Klinke, S. Gey, D. Seidel, S. Przywarra, R. Zondag and B. Odermatt, “Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests”, Living Planet Symposium, Bonn, Germany, 2022.

    C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, (2022, submitted): “Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series”, ForestSAT, Berlin, Germany, 2022.

  7. Synthetically Spoken STAIR

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Jean-Pierre Chevrot (2020). Synthetically Spoken STAIR [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1495069
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    William N. Havard
    Laurent Besacier
    Jean-Pierre Chevrot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (see article | dataset | code), we generated speech for each caption of the STAIR dataset using Google's Text-to-Speech API.

    This dataset was used for visually grounded speech experiments (see article accepted at ICASSP2019).

    @INPROCEEDINGS{8683069,
      author={W. N. {Havard} and J. {Chevrot} and L. {Besacier}},
      booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      title={Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese},
      year={2019},
      pages={8618-8622},
      keywords={information retrieval;natural language processing;neural nets;speech processing;word processing;artificial neural attention;human attention;monolingual models;part-of-speech tags;nouns;neural models;visually grounded speech signal;English language;Japanese language;word endings;cross-lingual speech-to-speech retrieval;grounded language learning;attention mechanism;cross-lingual speech retrieval;recurrent neural networks.},
      doi={10.1109/ICASSP.2019.8683069},
      ISSN={2379-190X},
      month={May},
    }

    The dataset comprises the following files:

    mp3-stair.tar.gz : MP3 files of each caption in the STAIR dataset. Filenames have the following pattern imageID_captionID, where both imageID and captionID correspond to those provided in the original dataset (see annotation format here)

    dataset.mfcc.npy : Numpy array with MFCC vectors for each caption. MFCCs were extracted using python_speech_features with the default configuration. To know which caption the MFCC vectors belong to, use the files dataset.words.txt and dataset.ids.txt.

    dataset.words.txt : Captions corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)

    dataset.ids.txt : IDs of the captions (imageID_captionID) corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)
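
    A minimal sketch for lining up the three files (row i of the array corresponds to line i of both text files, as described above):

      import numpy as np

      mfcc = np.load("dataset.mfcc.npy", allow_pickle=True)
      words = open("dataset.words.txt").read().splitlines()
      ids = open("dataset.ids.txt").read().splitlines()

      i = 0
      # python_speech_features' default configuration yields 13 coefficients per frame.
      print(ids[i], words[i], np.asarray(mfcc[i]).shape)  # e.g. (n_frames, 13)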

    Splits

    test

    test.txt : captions comprising the test split

    test_ids.txt: IDs of the captions in the test split

    test_tagged.txt : tagged version of the test split

    test-alignments.json.zip : Forced alignments of all the captions in the test split. (dictionary where the key corresponds to the caption ID in the STAIR dataset). Due to an unknown error during upload, the JSON file had to be zipped...

    train

    train.txt : captions comprising the train split

    train_ids.txt : IDs of the captions in the train split

    train_tagged.txt : tagged version of the train split

    val

    val.txt : captions comprising the val split

    val_ids.txt : IDs of the captions in the val split

    val_tagged.txt : tagged version of the val split

  8. MaleBin: Malware Binary Greyscale Images

    • kaggle.com
    Updated May 4, 2025
    Cite
    tashie (2025). MaleBin: Malware Binary Greyscale Images [Dataset]. http://doi.org/10.34740/kaggle/dsv/11674648
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 4, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    tashie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    New dataset link: https://www.kaggle.com/datasets/tashiee/malebin-2-0-rgb-malware-binary-images

    Important Notice (PLEASE READ): A more comprehensive dataset has been developed, featuring improved preprocessing steps and yielding more accurate classification results. The model trained on this dataset performs poorly on current malware variants, and there are resizing issues that lead to distorted images.

    Due to current time constraints, I am unable to upload the new datasets and accompanying notebooks along with detailed documentation. If you require access to the updated resources, please feel free to contact me at tashvin.raj56@gmail.com — I will be happy to share them personally or update the dataset as soon as possible.

    Additionally, while the Malimg dataset performs reliably within a closed-set environment, note that its malware samples are outdated. As a result, it may not generalize well to modern, real-world malware threats.

    I would therefore advise against using this dataset for model training; please contact me during office hours instead. Thanks.

    This MaleBin Dataset contains 12,464 malware binary images across 39 families. The dataset is compiled from two separate sources:

    1. Malimg Dataset by Nataraj et al. (2011)

    2. A portion of samples from https://www.kaggle.com/datasets/walt30/malware-images. Full credits to: https://www.kaggle.com/walt30.

    The first source, the Malimg dataset, is widely recognized in the field of malware detection and consists of malware images generated by transforming binaries into greyscale images via byte-to-pixel mapping. For the second source, the malicious files were downloaded from MalwareBazaar and, as stated by the author, visualized following the approach presented by Nataraj et al.
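
    A hedged sketch of that byte-to-pixel visualization (file names are hypothetical; this illustrates the approach, it is not the exact script used to build the dataset):

      import numpy as np
      from PIL import Image

      # Read raw bytes and map each byte to one greyscale pixel.
      data = np.fromfile("malware_sample.bin", dtype=np.uint8)  # hypothetical input
      width = 256
      rows = len(data) // width
      img = Image.fromarray(data[: rows * width].reshape(rows, width), mode="L")
      img = img.resize((256, 256))  # this dataset standardizes samples to 256x256
      img.save("malware_sample.png")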


    This new dataset was compiled to address a few challenges:

    1. To balance the number of samples across each family.

    2. To resize all samples to 256x256.

    3. To overcome the lack of suitable datasets (most existing ones, such as Malimg, are outdated, and newer ones contain a mix of greyscale and RGB).

    Note that some samples were omitted to maintain balance, which helps avoid overfitting and reduces the overall workload.

    Also, please note that I do not take credit for the original datasets. Full credits are due to the respective owners.

    Please do contact me if there are any oversights regarding the dataset.

  9. coco_captions_quintets

    • huggingface.co
    Updated Aug 11, 2022
    Cite
    Embedding Training Data (2022). coco_captions_quintets [Dataset]. https://huggingface.co/datasets/embedding-data/coco_captions_quintets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "coco_captions"

      Dataset Summary
    

    COCO is a large-scale object detection, segmentation, and captioning dataset. This repo contains five captions per image; useful for sentence similarity tasks. Disclaimer: The team releasing COCO did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/coco_captions_quintets.

  10. Invoice Management Dataset

    • universe.roboflow.com
    zip
    Updated Dec 28, 2024
    Cite
    CVIP Workspace (2024). Invoice Management Dataset [Dataset]. https://universe.roboflow.com/cvip-workspace/invoice-management/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 28, 2024
    Dataset authored and provided by
    CVIP Workspace
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Text Bounding Boxes
    Description

    Intelligent Invoice Management System

    Project Description:
    The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.

    Problem Statement:
    Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.

    Proposed Solution:
    The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
    1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data.
    2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
    3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields.
    4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
    5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
    - Total sales within a specified duration.
    - Total sales tax paid during a given timeframe.
    - Detailed invoice information in tabular form for specific date ranges.

    Key Features and Deliverables:
    1. Invoice Generation:
    - Generate 20,000 invoices using an automated script.
    - Include dummy logos, company details, and itemized tables for four items per invoice.

    2. Label Definition and Format:

      • Define structured labels (TBLR, CLASS Name, Recognized Text).
      • Provide labels in both XML and JSON formats for seamless integration.
    3. OCR and AI Training (see the sketch after this list):

      • Automate labeling using Tesseract OCR for high-accuracy text recognition.
      • Train and test YOLO to detect and classify invoice fields (TBLR and CLASS).
    4. Database Management:

      • Store OCR-extracted labels and field data in a database.
      • Enable efficient search and aggregation of invoice data.
    5. Web-Based Interface:

      • Build a responsive system for users to upload invoices and retrieve data based on company name or NTN.
      • Display metrics and reports for total sales, tax paid, and invoice details over custom date ranges.
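
    A hedged sketch of the OCR step from item 3, using pytesseract's word-level boxes (the invoice path is hypothetical):

      from PIL import Image
      import pytesseract
      from pytesseract import Output

      img = Image.open("invoices/invoice_00001.png")  # hypothetical path
      data = pytesseract.image_to_data(img, output_type=Output.DICT)

      # Each detected token comes with a left/top/width/height box, which maps
      # onto the TBLR-style field labels described above.
      for text, left, top, w, h in zip(data["text"], data["left"], data["top"],
                                       data["width"], data["height"]):
          if text.strip():
              print(text, (left, top, w, h))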

    Expected Outcomes:
    - Reduction in manual effort and operational costs.
    - Improved accuracy in invoice processing and financial reporting.
    - Enhanced scalability and adaptability for diverse invoice formats.
    - Faster turnaround time for invoice-related tasks.

    By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.

  11. SPECTER

    • huggingface.co
    Updated Jul 12, 2023
    Cite
    Embedding Training Data (2023). SPECTER [Dataset]. https://huggingface.co/datasets/embedding-data/SPECTER
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 12, 2023
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "SPECTER"

      Dataset Summary
    

    Dataset containing triplets (three sentences): anchor, positive, and negative. Contains titles of papers. Disclaimer: The team releasing SPECTER did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Dataset Structure
    

    Each example in the dataset contains triplets of equivalent sentences and is formatted as a dictionary with the key "set" and a list with… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/SPECTER.

  12. cars_wagonr_swift

    • kaggle.com
    zip
    Updated Sep 11, 2019
    Cite
    Ajay (2019). cars_wagonr_swift [Dataset]. https://www.kaggle.com/ajaykgp12/cars-wagonr-swift
    Explore at:
    Available download formats: zip (44,486,490 bytes)
    Dataset updated
    Sep 11, 2019
    Authors
    Ajay
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data science beginners start with curated datasets, but it is well known that in a real data science project, most of the time is spent collecting, cleaning and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its images. In my search for images I found that approximately 10 percent of the pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.

    Content

    There are 4000 images of two popular Maruti Suzuki car models in India (Swift and WagonR), with 2000 pictures per model. The data is divided into a training set of 2400 images, a validation set of 800 images and a test set of 800 images, and was randomized before splitting.

    A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization and transfer learning.
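
    A minimal Keras baseline sketch along these lines (directory names and image size are assumptions; match them to the dataset's folders):

      import tensorflow as tf

      train_ds = tf.keras.utils.image_dataset_from_directory(
          "cars_wagonr_swift/train", image_size=(150, 150), batch_size=32)
      val_ds = tf.keras.utils.image_dataset_from_directory(
          "cars_wagonr_swift/val", image_size=(150, 150), batch_size=32)

      model = tf.keras.Sequential([
          tf.keras.layers.Rescaling(1.0 / 255),
          tf.keras.layers.Conv2D(32, 3, activation="relu"),
          tf.keras.layers.MaxPooling2D(),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: Swift vs WagonR
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy",
                    metrics=["accuracy"])
      model.fit(train_ds, validation_data=val_ds, epochs=5)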

    Inspiration

    1. With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline Keras model achieves 88% accuracy on the validation set; can we do better, and by how much?

    2. Is the data collected for the two car models representative of such cars across the country, or is there sample bias?

    3. I would also like someone to extend the concept into a use case where, if a user uploads an incorrect car picture, the ML model automatically flags it, for example when the user uploads the wrong model or an image that is not a car.

  13. Pretraining data for PeptideCLM (UPDATED)

    • zenodo.org
    bin, csv
    Updated Mar 18, 2025
    Cite
    Aaron Feller; Aaron Feller (2025). Pretraining data for PeptideCLM (UPDATED) [Dataset]. http://doi.org/10.5281/zenodo.15042141
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Mar 18, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aaron Feller; Aaron Feller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 20, 2024
    Description

    This version update includes changes to Generated_peptides.csv to fix cyclization: the prior upload did not generate ring closures correctly in the SMILES strings. Although the model in the publication was trained on the dataset containing errors, to support the community we decided it would be best to release a corrected 10M-peptide SMILES dataset for use in future pretraining applications. All strings should now load correctly into mol objects with RDKit.
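
    A quick way to verify the fix, as a hedged sketch (the CSV column layout is an assumption; check the header first):

      import pandas as pd
      from rdkit import Chem

      df = pd.read_csv("Generated_peptides.csv")
      col = df.columns[0]  # assumed to be the SMILES column
      bad = [s for s in df[col].astype(str) if Chem.MolFromSmiles(s) is None]
      print(f"{len(bad)} of {len(df)} SMILES strings failed to parse")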

  14. Modified Versions of Diving48: Shape and Texture

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 12, 2023
    Cite
    Broomé, Sofia (2023). Modified Versions of Diving48: Shape and Texture [Dataset]. http://doi.org/10.7910/DVN/MXJPIZ
    Explore at:
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Broomé, Sofia
    Description

    We modify the Diving48 dataset ("RESOUND: Towards Action Recognition without Representation Bias", Li et al., ECCV 2020) into three new domains: two based on shape and one based on texture (following Geirhos et al., ICLR 2019). Note that the Statistical Visual Computing Lab in San Diego (http://www.svcl.ucsd.edu) holds the copyright to the Diving48 dataset. Please cite the RESOUND paper if you use any data related to the Diving48 dataset, including our modified versions here.

    In the shape domains, we blur the background and keep only the segmented diver(s) (S1) or their bounding boxes (S2). In the texture domain (T), we conversely mask out the bounding boxes containing the diver(s) and keep only the background; the masked boxes are filled with the average ImageNet pixel value (following Choi et al., NeurIPS 2019). The class evidence should lie only in the divers' movement; hence the texture version should contain no relevant signal, and accuracy should drop to random performance. We can thus study how different models drop in score when tested on the shape or texture domain, indicating both cross-domain robustness (for S1 and S2) and texture bias (for T).

    This modified dataset was introduced in "Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition", Broomé et al., arXiv 2112.12175. Only the test set of Diving48 was used there; we did not train on these modified domains, they were only used for evaluation. The files are .mp4 videos of 32 frames each, regardless of the length of the original clip (clips are typically around 5 seconds long). We may consider uploading the training set as well; please contact us if you need it urgently. Otherwise, the trained diver-segmentation model is released at https://github.com/sofiabroome/diver-segmentation if you want to perform the cropping and saving yourself, at your own desired frame rate.
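
    A minimal sketch for reading one clip's frames (the path is hypothetical; OpenCV is one of several options):

      import cv2

      cap = cv2.VideoCapture("diving48_S1/clip_0001.mp4")  # hypothetical path
      frames = []
      ok, frame = cap.read()
      while ok:
          frames.append(frame)
          ok, frame = cap.read()
      cap.release()
      print(len(frames))  # expected: 32 frames per clip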

  15. GTSDB German Traffic Sign Detection Benchmark Dataset

    • universe.roboflow.com
    • kaggle.com
    zip
    Updated Jul 6, 2022
    Cite
    Mohamed Traore (2022). Gtsdb German Traffic Sign Detection Benchmark Dataset [Dataset]. https://universe.roboflow.com/mohamed-traore-2ekkp/gtsdb---german-traffic-sign-detection-benchmark/model/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 6, 2022
    Dataset authored and provided by
    Mohamed Traore
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Signs Bounding Boxes
    Description

    This project was created by downloading the GTSDB German Traffic Sign Detection Benchmark dataset from Kaggle and importing the annotated training set files (images and annotation files) to Roboflow.

    https://www.kaggle.com/datasets/safabouguezzi/german-traffic-sign-detection-benchmark-gtsdb

    The annotation files were adjusted to conform to the YOLO Keras TXT format prior to upload, as the original format did not include a label map file.

    v1 contains the original imported images, without augmentations. This is the version to download and import to your own project if you'd like to add your own augmentations.

    v2 contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "FAST" model.

    v3 contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "ACCURATE" model.

  16. altlex

    • huggingface.co
    Cite
    Embedding Training Data, altlex [Dataset]. https://huggingface.co/datasets/embedding-data/altlex
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "altlex"

      Dataset Summary
    

    Git repository for software associated with the 2016 ACL paper "Identifying Causal Relations Using Parallel Wikipedia Articles." Disclaimer: The team releasing altlex did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/altlex.

  17. Titanic Dataset

    • kaggle.com
    Updated Apr 25, 2025
    Cite
    Muhammad Mudasar Sabir (2025). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/mudasarsabir/titanic-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Mudasar Sabir
    Description

    👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

    If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle

    The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

    Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the “Join Competition” button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial that walks you through step by step how to make your first submission!

    The Challenge

    The sinking of the Titanic is one of the most infamous shipwrecks in history.

    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

    While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

    In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

    Recommended Tutorial

    We highly recommend Alexis Cook’s Titanic Tutorial, which walks you through making your very first submission step by step, and this starter notebook to get started.

    How Kaggle’s Competitions Work

    Join the Competition: Read about the challenge description, accept the Competition Rules and gain access to the competition dataset.

    Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file.

    Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.

    Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.

    Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.

    Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!

    What Data Will I Use in This Competition?

    In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

    Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.

    The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

    Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.

    Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

    How to Submit your Prediction to Kaggle

    Once you’re ready to make a submission and get on the leaderboard:

    Click on the “Submit Predictions” button

    Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.

    Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

    The file should have exactly 2 columns:

    PassengerId (sorted in any order)

    Survived (contains your binary predictions: 1 for survived, 0 for deceased)

    Got it! I’m ready to get started. Where do I get help if I need it?

    For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
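
    As a concrete illustration of the submission format above, a minimal pandas sketch (the all-zeros predictions are placeholders for your model's output):

      import pandas as pd

      test = pd.read_csv("test.csv")
      submission = pd.DataFrame({
          "PassengerId": test["PassengerId"],
          "Survived": 0,  # placeholder; substitute your model's 0/1 predictions
      })
      submission.to_csv("submission.csv", index=False)  # 418 rows + header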

    A Last Word on Kaggle Notebooks

    As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...

  18. sentence-compression

    • huggingface.co
    • opendatalab.com
    Updated Feb 3, 2012
    Cite
    Embedding Training Data (2012). sentence-compression [Dataset]. https://huggingface.co/datasets/embedding-data/sentence-compression
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2012
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "sentence-compression"

      Dataset Summary
    

    Dataset with pairs of equivalent sentences. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from using the dataset. Disclaimer: The team releasing sentence-compression did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/sentence-compression.
    
  19. Amazon-QA

    • huggingface.co
    Updated Nov 5, 2024
    Cite
    Embedding Training Data (2024). Amazon-QA [Dataset]. https://huggingface.co/datasets/embedding-data/Amazon-QA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2024
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "Amazon-QA"

      Dataset Summary
    

    This dataset contains Question and Answer data from Amazon. Disclaimer: The team releasing Amazon-QA did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/Amazon-QA.
