This image database contains 200 million high-quality images that have undergone professional review. The resources are diverse in type, featuring high resolution and clarity, excellent color accuracy, and rich detail. All materials have been legally obtained through authorized channels, with clear indications of copyright ownership and usage authorization scope. The entire collection provides commercial-grade usage rights and has been granted permission for scientific research use, ensuring clear and traceable intellectual property attribution. The vast and high-quality image resources offer robust support for a wide range of applications, including research in the field of computer vision, training of image recognition algorithms, and sourcing materials for creative design, thereby facilitating efficient progress in related areas.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes monthly data for eight water quality parameters for lakes and reservoirs in China from 2000 to 2023. The data were simulated using random forest models, taking into account the impacts of climate, soil properties, and anthropogenic activities. The water quality parameters are pH, dissolved oxygen (DO; mg/L), total nitrogen (TN; mg/L), total phosphorus (TP; mg/L), permanganate index (CODMn; mg/L), turbidity (Tur; JTU), electrical conductivity (EC; S/m), and dissolved organic carbon (DOC; mg/L). The data are stored in CSV format, organized by lake or reservoir; each CSV file contains the monthly water quality data for one lake or reservoir together with its coordinates.
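A minimal sketch of how one such per-lake CSV might be loaded and summarized with pandas; the file name and column names used here (e.g., date, pH, DO, TN) are assumptions, not the dataset's documented schema:

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual CSV schema.
df = pd.read_csv("lake_example.csv", parse_dates=["date"])

# One row per month: the eight parameters plus the lake's coordinates.
print(df[["date", "pH", "DO", "TN", "TP", "CODMn", "Tur", "EC", "DOC"]].head())

# Example summary: long-term mean total nitrogen for this lake.
print("Mean TN (mg/L):", df["TN"].mean())
```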
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CQ100 is a diverse and high-quality dataset of color images that can be used to develop, test, and compare color quantization algorithms. The dataset can also be used in other color image processing tasks, including filtering and segmentation.
If you find CQ100 useful, please cite the following publication: M. E. Celebi and M. L. Perez-Delgado, “CQ100: A High-Quality Image Dataset for Color Quantization Research,” Journal of Electronic Imaging, vol. 32, no. 3, 033019, 2023.
You may download the above publication free of charge from: https://www.spiedigitallibrary.org/journals/journal-of-electronic-imaging/volume-32/issue-3/033019/cq100--a-high-quality-image-dataset-for-color-quantization/10.1117/1.JEI.32.3.033019.full?SSO=1
This data contains X-ray computed tomography (XCT) reconstructed slices of additively manufactured cobalt chrome samples produced with varying laser powder bed fusion (LPBF) processing parameters (scan speed and hatch spacing). A constant laser power of 195 W and a layer thickness of 20 µm were used. Unoptimized processing parameters created defects in these parts. The as-built CoCr disks were 40 mm in diameter and 10 mm in height, with no post-processing step (e.g., heat treatment or hot isostatic pressing) used. Five-mm-diameter cylinders were cored out of each disk, and regions of interest (ROIs) within the cylinders were measured with XCT. The voxel size is approximately 2.5 µm, and approximately 1000 x 1000 x 1000 voxel three-dimensional images were obtained, corresponding to an imaged volume of about (pi/4) x (2.5 mm)^3 for the approximately 2.5 µm voxel data sets. The data set contains two folders ('raw' and 'segmented') with five zipped TIFF image folders, one for each sample. The images in the 'raw' folder are the original 16-bit XCT reconstructed images. The images in the 'segmented' folder are the segmented images. 'setn' in the file name represents the sample set and 'samplen' represents the sample number. The final trailing -n is the index of the image in the stack; higher numbers are toward the top of the sample.
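A minimal sketch of assembling one reconstructed slice stack into a 3-D volume with the tifffile package; the folder path and file pattern shown are assumptions based on the layout described above:

```python
import glob
import numpy as np
import tifffile

# Hypothetical path following the described 'raw'/'segmented' layout and
# setn/samplen/-n naming; real file names may need numeric (not lexical) sorting.
slice_paths = sorted(glob.glob("raw/set1_sample1/*.tif"))

# Stack the 16-bit reconstructed slices into a (z, y, x) volume.
volume = np.stack([tifffile.imread(p) for p in slice_paths])
print(volume.shape, volume.dtype)  # roughly (1000, 1000, 1000), uint16
```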
https://dataintelo.com/privacy-and-policy
The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.
One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.
Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.
The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.
As the demand for AI applications continues to grow, the role of AI Data Resource Services becomes increasingly vital. These services provide the necessary infrastructure and tools to manage, curate, and distribute datasets efficiently. By leveraging AI Data Resource Services, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. These services act as a bridge between raw data and AI applications, streamlining the process of data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.
Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.
The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.
Image data is critical for computer vision applications.
A project that contains data and analysis pipelines for a set of 53 subjects in a cross-sectional Parkinson's disease (PD) study. The dataset contains diffusion-weighted images (DWI) of 27 PD patients and 26 age-, sex-, and education-matched control subjects. The DWIs were acquired with 120 unique gradient directions, b = 1000 and b = 2500 s/mm², and isotropic 2.4 mm³ voxels. The acquisition used a twice-refocused spin echo sequence in order to avoid distortions induced by eddy currents.
https://data.gov.tw/license
High-resolution satellite cloud image data. *Note: the download URL changed as of September 15, 2023; please switch to the new link by December 31, 2023, after which the old link will expire. For those who need to download a large amount of data, please apply for membership on the open platform for meteorological data: https://opendata.cwa.gov.tw/index
https://creativecommons.org/publicdomain/zero/1.0/
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, and out-of-focus objects, making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.
Despite the rapid advancement in image deblurring, finding and pre-processing a number of datasets for training and testing purposes has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.
To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code. A minimal example of loading paired blurred/sharp images follows the dataset list below.
Following is a list of the datasets that are currently provided:
- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2,025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1,900 camera JPEG outputs. The second is RealBlur-R, consisting of 1,900 images generated from the RAW captures by applying white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created using the CelebA dataset, which consists of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Helen: A face deblurring dataset created using the Helen dataset, which consists of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Wider-Face: A face deblurring dataset created using the Wider-Face dataset, which consists of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- TextOCR: A text deblurring dataset created using the TextOCR dataset, which consists of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
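All of the datasets above share the same paired structure: a blurred input and its sharp ground truth. As a minimal, hedged sketch of wrapping such pairs for training (the directory layout and matching file names are assumptions, not the layout shipped by any particular dataset):

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class PairedDeblurDataset(Dataset):
    """Loads (blurred, sharp) image pairs from two parallel folders."""

    def __init__(self, blur_dir, sharp_dir, transform=None):
        # Assumes the blurred and sharp folders contain identically named files;
        # adjust to the actual layout of the dataset you download.
        self.blur_dir, self.sharp_dir = blur_dir, sharp_dir
        self.names = sorted(os.listdir(blur_dir))
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        blur = Image.open(os.path.join(self.blur_dir, name)).convert("RGB")
        sharp = Image.open(os.path.join(self.sharp_dir, name)).convert("RGB")
        if self.transform is not None:
            blur, sharp = self.transform(blur), self.transform(sharp)
        return blur, sharp
```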
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is a synthetic dataset for intrinsic decomposition, providing photorealistic rendered images along with ground-truth albedo and shading maps. It contains approximately 20K samples, each consisting of an RGB image, a ground-truth albedo image, and a ground-truth shading image.
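Intrinsic decomposition assumes each image factorizes, pixel-wise, into albedo times shading, so a quick sanity check on a sample is to compare the RGB image with the product of its ground-truth maps. A minimal sketch under assumed file names:

```python
import numpy as np
from PIL import Image

def load(path):
    # Hypothetical file names/format; adjust to the dataset's actual layout.
    return np.asarray(Image.open(path), dtype=np.float32) / 255.0

rgb, albedo, shading = load("sample_rgb.png"), load("sample_albedo.png"), load("sample_shading.png")
if shading.ndim == 2:            # shading may be stored as a single channel
    shading = shading[..., None]

# Under a Lambertian model, rgb ≈ albedo * shading (up to tone mapping / gamma).
reconstruction = albedo * shading
print("Mean absolute error:", np.abs(rgb - reconstruction).mean())
```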
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The DeepFlood dataset provides high-resolution georeferenced images from both manned and unmanned aerial platforms, featuring detailed labels that go beyond simple binary distinctions. These labels include inundated vegetation, dry vegetation, open water, and others, making the dataset highly applicable for flood mapping across various landscapes. It uniquely incorporates SAR imagery alongside optical and UAV images, enabling a multi-modal approach to accurately delineate flooded areas.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GlobalHighPM2.5 is part of a series of long-term, seamless, global, high-resolution, and high-quality datasets of air pollutants over land (i.e., GlobalHighAirPollutants, GHAP). It is generated from big data sources (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence, taking into account the spatiotemporal heterogeneity of air pollution.
This dataset contains the input data, analysis code, and generated dataset used for the following article. If you use the GlobalHighPM2.5 dataset in your scientific research, please cite the following reference (Wei et al., NC, 2023):
Wei, J., Li, Z., Lyapustin, A., Wang, J., Dubovik, O., Schwartz, J., Sun, L., Li, C., Liu, S., and Zhu, T. First close insight into global daily gapless 1 km PM2.5 pollution, variability, and health impact. Nature Communications, 2023, 14, 8349. https://doi.org/10.1038/s41467-023-43862-3
Input Data
Relevant raw data for each figure (compiled into a single sheet within an Excel document) in the manuscript.
Code
Relevant Python scripts for replicating and plotting the analysis results in the manuscript, as well as code for converting data formats.
Generated Dataset
Here is the first big data-derived seamless (spatial coverage = 100%) daily, monthly, and yearly 1 km (i.e., D1K, M1K, and Y1K) global ground-level PM2.5 dataset over land from 2017 to the present. This dataset exhibits high quality, with cross-validation coefficients of determination (CV-R2) of 0.91, 0.97, and 0.98, and root-mean-square errors (RMSEs) of 9.20, 4.15, and 2.77 µg m⁻³ on the daily, monthly, and annual bases, respectively.
Due to data volume limitations, all (including daily) data for each year are accessible at:
- GlobalHighPM2.5 (2022)
- GlobalHighPM2.5 (2021)
- GlobalHighPM2.5 (2020)
- GlobalHighPM2.5 (2019)
- GlobalHighPM2.5 (2018)
- GlobalHighPM2.5 (2017)
continuously updated...
More GHAP datasets for different air pollutants are available at: https://weijing-rs.github.io/product.html
In advance of design, permitting, and construction of a pipeline to deliver North Slope natural gas to out-of-state customers and Alaska communities, the Division of Geological & Geophysical Surveys (DGGS) has acquired lidar (Light Detection and Ranging) data along proposed pipeline routes, nearby areas of infrastructure, and regions where significant geologic hazards have been identified. Lidar data will serve multiple purposes, but have primarily been collected to (1) evaluate active faulting, slope instability, thaw settlement, erosion, and other engineering constraints along proposed pipeline routes, and (2) provide a base layer for the state-federal GIS database that will be used to evaluate permit applications and construction plans. The dataset represents all classified laser returns from the lidar survey and their associated geospatial coordinates.
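As a hedged sketch of inspecting classified lidar returns of this kind with the laspy package (the file name is a placeholder, and the class code shown follows the common ASPRS convention rather than anything documented for this survey):

```python
import numpy as np
import laspy

# Placeholder file name; lidar point-cloud deliverables are typically LAS/LAZ files.
las = laspy.read("pipeline_corridor_tile.las")

# Each return carries x/y/z coordinates plus a classification code
# (e.g., 2 = ground in the widely used ASPRS scheme).
ground_mask = np.asarray(las.classification) == 2
print(f"{len(las.points)} returns total, {np.count_nonzero(ground_mask)} classified as ground")

z = np.asarray(las.z)
print("Ground elevation range:", z[ground_mask].min(), "to", z[ground_mask].max())
```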
Calibrated fluxgate data acquired by the Fast Auroral SnapshoT Small Explorer (FAST) magnetometer instrument. The data have been calibrated, despun, and detrended against the International Geomagnetic Reference Field (IGRF), using IGRF coefficients for the date of acquisition. Data are provided in several coordinate systems; non-detrended data in spacecraft and geocentric equatorial inertial coordinates are provided, and ephemeris data are also included.
Dataset link: https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr
Overview
High-Quality Invoice Images for OCR is a curated dataset containing professionally scanned and digitally captured invoice documents. It is designed for training, fine-tuning, and evaluating OCR models, machine learning pipelines, and data extraction systems.
This dataset focuses on clean, structured invoices to simulate real-world scenarios in financial document automation.
What's Inside
📄 Variety of invoice templates from multiple industries (e.g., retail, manufacturing, services)
🖋️ Different currencies, tax formats, and layouts
📸 High-resolution scanned and photographed invoices
🏷️ Optional field annotations (e.g., invoice number, date, total amount, vendor name) for supervised training
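As a purely illustrative sketch (not the dataset's documented schema), one field-level annotation record for supervised training might look like this:

```python
# Hypothetical annotation record; file path and field names are illustrative only.
annotation = {
    "image": "invoices/0001.png",
    "fields": {
        "invoice_number": "INV-2024-0001",
        "date": "2024-03-15",
        "total_amount": "1,250.00",
        "vendor_name": "Acme Supplies Ltd.",
    },
}
```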
Key Applications
Training and fine-tuning OCR and Document AI models
Machine learning for structured and semi-structured data extraction
Intelligent Document Processing (IDP) and Robotic Process Automation (RPA)
Benchmarking table detection, key-value extraction, and layout analysis models
Why Use This Dataset?
✅ High-quality images optimized for OCR and data extraction tasks
✅ Real-world invoice variations to improve model robustness
✅ Ideal for machine learning workflows in finance, ERP, and accounting systems
✅ Supports rapid prototyping for invoice understanding models
Ideal For
Researchers working on OCR and document understanding
Developers building invoice processing systems
Machine learning engineers fine-tuning models for data extraction
Startups and enterprises automating financial workflows
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The High Resolution Digital Elevation Model Mosaic provides a unique and continuous representation of the high resolution elevation data available across the country. The High Resolution Digital Elevation Model (HRDEM) product used is derived from airborne LiDAR data (mainly in the south) and satellite images in the north. The mosaic is available for both the Digital Terrain Model (DTM) and the Digital Surface Model (DSM) from web mapping services. It is part of the CanElevation Series created to support the National Elevation Data Strategy implemented by NRCan. This strategy aims to increase Canada's coverage of high-resolution elevation data and increase the accessibility of the products. Unlike the HRDEM product in the same series, which is distributed by acquisition project without integration between projects, the mosaic is created to provide a single, continuous representation of strategy data. The most recent datasets for a given territory are used to generate the mosaic. This mosaic is disseminated through the Data Cube Platform, implemented by NRCan using geospatial big data management technologies. These technologies enable the rapid and efficient visualization of high-resolution geospatial data and allow for the rapid generation of dynamically derived products. The mosaic is available from Web Map Services (WMS), Web Coverage Services (WCS) and SpatioTemporal Asset Catalog (STAC) collections. Accessible data includes the Digital Terrain Model (DTM), the Digital Surface Model (DSM) and derived products such as shaded relief and slope. The mosaic is referenced to the Canadian Height Reference System 2013 (CGVD2013) which is the reference standard for orthometric heights across Canada. Source data for HRDEM datasets used to create the mosaic is acquired through multiple projects with different partners. Collaboration is a key factor to the success of the National Elevation Strategy. Refer to the “Supporting Document” section to access the list of the different partners including links to their respective data.
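As a hedged sketch of one access path mentioned above, querying a STAC collection with pystac-client; the API URL and collection name below are placeholders, not the official CanElevation endpoints:

```python
from pystac_client import Client

# Placeholder endpoint and collection id; consult the CanElevation Series
# documentation for the actual STAC API URL and collection identifiers.
catalog = Client.open("https://example.gc.ca/stac/api")
search = catalog.search(
    collections=["hrdem-mosaic-dtm"],
    bbox=[-75.8, 45.3, -75.5, 45.5],  # lon/lat bounding box (WGS84)
)
for item in search.items():
    print(item.id, list(item.assets))
```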
This data set consists of daily, global grayscale TIFF images derived from radiative temperatures measured in the 3.4 to 4.2 µm window. These data were detected by the High Resolution Infrared Radiometer (HRIR) on board the Nimbus 1, Nimbus 2, and Nimbus 3 satellites during 1964, 1966, and 1969-1970. The Nimbus HRIR sensor was used to map the earth's nighttime cloud cover and to measure cloud top temperatures or surface temperatures. Note: This data set is not georeferenced and contains some gaps in temporal coverage because of missing data.
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global AI Training Dataset Market size will be USD 2962.4 million in 2025. It will expand at a compound annual growth rate (CAGR) of 28.60% from 2025 to 2033.
North America held the largest market share, accounting for more than 37% of global revenue, with a market size of USD 1096.09 million in 2025, and will grow at a compound annual growth rate (CAGR) of 26.4% from 2025 to 2033.
Europe accounted for a market share of over 29% of the global revenue, with a market size of USD 859.10 million.
APAC held a market share of around 24% of the global revenue with a market size of USD 710.98 million in 2025 and will grow at a compound annual growth rate (CAGR) of 30.6% from 2025 to 2033.
South America has a market share of more than 3.8% of the global revenue, with a market size of USD 112.57 million in 2025 and will grow at a compound annual growth rate (CAGR) of 27.6% from 2025 to 2033.
Middle East had a market share of around 4% of the global revenue and was estimated at a market size of USD 118.50 million in 2025 and will grow at a compound annual growth rate (CAGR) of 27.9% from 2025 to 2033.
Africa had a market share of around 2.20% of the global revenue and was estimated at a market size of USD 65.17 million in 2025 and will grow at a compound annual growth rate (CAGR) of 28.3% from 2025 to 2033.
Data Annotation category is the fastest growing segment of the AI Training Dataset Market
Market Dynamics of AI Training Dataset Market
Key Drivers for AI Training Dataset Market
Government-Led Open Data Initiatives Fueling AI Training Dataset Market Growth
In recent years, government-led open data initiatives have strongly driven the growth of the AI Training Dataset Market by offering accessible, high-quality datasets that are vital for training sound AI models. For instance, the U.S. government's drive for openness and innovation can be seen through portals such as Data.gov, which provides an enormous collection of datasets from many industries, including healthcare, finance, and transportation. Such datasets are basic building blocks for constructing AI applications and training models on real-world data. In the same way, the data.gov.uk platform, run by the U.K. government, offers ample datasets to aid AI research and development, creating an environment that is supportive of technological growth. By releasing such information into the public domain, governments not only enhance transparency but also encourage innovation in the AI industry, resulting in greater demand for training datasets and helping to drive the market's growth.
India's IndiaAI Datasets Platform Accelerates AI Training Dataset Market Growth
India's upcoming launch of the IndiaAI Datasets Platform in January 2025 is likely to greatly boost the AI Training Dataset Market. The project, which is part of the government's ₹10,000 crore IndiaAI Mission, will establish an open-source repository, similar to platforms such as Hugging Face, to enable developers to create, train, and deploy AI models. The platform will collect datasets from central and state governments and private sector organizations to provide a wide and rich data pool. By improving access to high-quality, non-personal data, the platform addresses an important requirement for training AI models, thus driving innovation and development in the AI industry. This public initiative reflects India's determination to become a global AI hub, offering the infrastructure required to support startups, researchers, and businesses in creating cutting-edge AI solutions. The initiative not only simplifies data access but also creates a model for public-private partnerships in AI development.
Restraint Factor for the AI Training Dataset Market
Data Privacy Regulations Impeding AI Training Dataset Market Growth
Strict data privacy laws are emerging as a major constraint on the AI Training Dataset Market as governments across the globe establish legislation to safeguard personal data. In the European Union, the General Data Protection Regulation (GDPR) requires explicit consent for the use of personal data, reducing the availability of datasets for AI training. Likewise, the data protection regulator in Brazil ordered Meta and others to stop using Brazilian personal data to train AI models due to dangers to individuals' funda...
Global Brightness Temperature imagery from the Cloud Archive User Service (CLAUS) project. This project produced a long time series of global thermal infrared imagery of the Earth using data from operational meteorological satellites, which was used in validating atmospheric General Circulation Models. The source data used in CLAUS are the level B3 (reduced resolution) 10 micron radiances from operational meteorological satellites participating in the International Satellite Cloud Climatology Programme (ISCCP) and were obtained from the NASA Langley Atmospheric Sciences Data Center (LASDC). During the CLAUS project the B3 data were first processed to create a uniform latitude-longitude grid (or image) of Brightness Temperature (BT) values at a spatial resolution of 0.5 by 0.5 degrees and a temporal resolution of three hours. The B3 data were also rigorously quality controlled to remove residual noise and navigation/calibration errors that were noticed in the original processing. The 0.5 degree resolution data were updated and supplemented by a new product at one-third degree spatial resolution for use in process studies. The CLAUS Lo-res data archive spans the period 1983-2009 and the files are stored in the Portable Grey Map (PGM) format. This is a simple flat binary format preceded by an ASCII (readable) header that contains information such as the image dimensions and version number. For detailed information about the CLAUS data (processing, quality, etc.), please see the available documentation (Docs).
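As a minimal, hedged sketch of reading one such PGM image into an array (the file name is a placeholder; PGM is widely supported, so common imaging libraries read it directly):

```python
import numpy as np
from PIL import Image

# Placeholder file name; each CLAUS file holds one global brightness-temperature grid.
img = Image.open("claus_bt_example.pgm")
bt = np.asarray(img)

print(bt.shape, bt.dtype)                    # grid dimensions come from the PGM header
print("min/max grey values:", bt.min(), bt.max())
```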
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
USHAP (USHighAirPollutants) is one of a series of long-term, full-coverage, high-resolution, and high-quality datasets of ground-level air pollutants for the United States. It is generated from big data sources (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence, considering the spatiotemporal heterogeneity of air pollution. This is the big data-derived seamless (spatial coverage = 100%) daily, monthly, and yearly 1 km (i.e., D1K, M1K, and Y1K) ground-level PM2.5 dataset in the United States from 2000 to 2020. Our daily PM2.5 estimates agree well with ground measurements, with an average cross-validation coefficient of determination (CV-R2) of 0.82 and a normalized root-mean-square error (NRMSE) of 0.40. All the data will be made public online once our paper is accepted; if you want to use the USHighPM2.5 dataset for related scientific research, please contact us (Email: weijing_rs@163.com; weijing@umd.edu).
Wei, J., Wang, J., Li, Z., Kondragunta, S., Anenberg, S., Wang, Y., Zhang, H., Diner, D., Hand, J., Lyapustin, A., Kahn, R., Colarco, P., da Silva, A., and Ichoku, C. Long-term mortality burden trends attributed to black carbon and PM2.5 from wildfire emissions across the continental USA from 2000 to 2020: a deep learning modelling study. The Lancet Planetary Health, 2023, 7, e963–e975. https://doi.org/10.1016/S2542-5196(23)00235-8
More air quality datasets of different air pollutants can be found at: https://weijing-rs.github.io/product.html
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for HQ-EDIT
HQ-Edit is a high-quality instruction-based image editing dataset with a total of 197,350 edits. Unlike prior approaches that rely on attribute guidance or human feedback to build datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing… See the full description on the dataset page: https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit.
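As a hedged sketch of loading the dataset with the Hugging Face datasets library; the split name and exact column names are assumptions, so check the dataset page for the actual schema:

```python
from datasets import load_dataset

# Dataset ID taken from the URL above; the split and column names may differ.
ds = load_dataset("UCSC-VLAA/HQ-Edit", split="train")

sample = ds[0]
print(sample.keys())  # expected to include an input image, an edited image, and the edit instruction
```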