This image database contains 200 million high-quality images that have undergone professional review. The resources are diverse in type, featuring high resolution and clarity, excellent color accuracy, and rich detail. All materials have been legally obtained through authorized channels, with clear indications of copyright ownership and usage authorization scope. The entire collection provides commercial-grade usage rights and has been granted permission for scientific research use, ensuring clear and traceable intellectual property attribution. The vast and high-quality image resources offer robust support for a wide range of applications, including research in the field of computer vision, training of image recognition algorithms, and sourcing materials for creative design, thereby facilitating efficient progress in related areas.
The High Quality Image-Text Pairs (HQITP) dataset contains 134M high-quality image-caption pairs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CQ100 is a diverse and high-quality dataset of color images that can be used to develop, test, and compare color quantization algorithms. The dataset can also be used in other color image processing tasks, including filtering and segmentation.
If you find CQ100 useful, please cite the following publication: M. E. Celebi and M. L. Perez-Delgado, “CQ100: A High-Quality Image Dataset for Color Quantization Research,” Journal of Electronic Imaging, vol. 32, no. 3, 033019, 2023.
You may download the above publication free of charge from: https://www.spiedigitallibrary.org/journals/journal-of-electronic-imaging/volume-32/issue-3/033019/cq100--a-high-quality-image-dataset-for-color-quantization/10.1117/1.JEI.32.3.033019.full?SSO=1
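As a sketch of the kind of algorithm CQ100 is designed to benchmark, the toy quantizer below snaps pixels to a fixed 4×4×4 RGB palette. This is only a fixed-palette baseline under assumed toy data; real quantization methods (median cut, octree, k-means) adapt the palette to the image.

```python
# Toy color quantizer: uniform quantization to a fixed 4x4x4 RGB
# palette (64 colors). CQ100 is meant for comparing real, adaptive
# algorithms; this baseline just illustrates the task.

def quantize_pixel(rgb, levels=4):
    """Snap each 8-bit channel to the nearest of `levels` values."""
    step = 255 / (levels - 1)
    return tuple(int(round(round(c / step) * step)) for c in rgb)

image = [(12, 200, 33), (13, 201, 34), (250, 4, 90)]  # toy pixel list
quantized = [quantize_pixel(p) for p in image]
print(len(set(quantized)))  # nearby colors collapse to shared palette entries
```

Note how the two near-identical greens map to the same palette entry, which is exactly the distortion-vs-palette-size trade-off that quantization research measures.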
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes monthly data of eight water quality parameters for lakes and reservoirs in China from 2000 to 2023. The data were simulated using random forest models, taking into account the impacts of climate, soil properties, and anthropogenic activities. These water quality parameters are pH, dissolved oxygen (DO; mg/L), total nitrogen (TN; mg/L), total phosphorus (TP; mg/L), permanganate index (CODMn; mg/L), turbidity (Tur; JTU), electrical conductivity (EC; S/m) and dissolved organic carbon (DOC; mg/L). The data is stored in CSV format, sorted by lake and reservoir, and each CSV file contains monthly water quality data for the lake or reservoir and corresponding coordinates.
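As an illustration of how the per-lake CSV files might be consumed, the sketch below parses a hypothetical file with the eight parameters listed above and averages one parameter across months; the lake name, coordinates, and column headers here are assumptions, and the actual release may use different names.

```python
import csv
import io

# Hypothetical per-lake CSV layout (column names are assumed, not
# taken from the actual release).
csv_text = """lake,lon,lat,year,month,pH,DO,TN,TP,CODMn,Tur,EC,DOC
Taihu,120.2,31.2,2000,1,7.9,9.1,2.10,0.08,4.2,25.0,0.045,3.1
Taihu,120.2,31.2,2000,2,8.0,8.7,2.05,0.09,4.0,23.5,0.046,3.0
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Average pH across the available months for one lake.
ph_values = [float(r["pH"]) for r in rows if r["lake"] == "Taihu"]
mean_ph = sum(ph_values) / len(ph_values)
print(round(mean_ph, 2))  # 7.95
```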
A project containing data and analysis pipelines for a set of 53 subjects in a cross-sectional Parkinson's disease (PD) study. The dataset contains diffusion-weighted images (DWI) of 27 PD patients and 26 age-, sex-, and education-matched control subjects. The DWIs were acquired with 120 unique gradient directions, b = 1000 and b = 2500 s/mm², and isotropic 2.4 mm³ voxels. The acquisition used a twice-refocused spin echo sequence to avoid distortions induced by eddy currents.
BIG is a high-resolution semantic segmentation dataset with 50 validation and 100 test objects. Image resolution in BIG ranges from 2048×1600 to 5000×3600. Every image in the dataset has been carefully labeled by a professional, following the same guidelines as PASCAL VOC 2012 but without the void region.
PartImageNet is a large, high-quality dataset with part segmentation annotations. It consists of 158 classes from ImageNet with approximately 24,000 images. PartImageNet offers part-level annotations on a general set of classes with non-rigid, articulated objects, while having an order of magnitude larger size compared to existing datasets. It can be utilized in multiple vision tasks including but not limited to: Part Discovery, Semantic Segmentation, Few-shot Learning.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explore the DIV2K Dataset, a comprehensive collection of 1000 high-resolution RGB images designed for NTIRE and PIRM challenges.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Mobile Icon | Mobile Screenshot Dataset is a meticulously curated collection of 9,000+ high-quality mobile screenshots, categorized across 13 diverse application types. This dataset is designed to support AI/ML researchers, UI/UX analysts, and developers in advancing mobile interface understanding, image classification, and content recognition.
Each image has been manually reviewed and verified by computer vision professionals at DataCluster Labs, ensuring high-quality and reliable data for research and development purposes.
The images in this dataset are exclusively owned by Data Cluster Labs and were not downloaded from the internet. To access a larger portion of the training dataset for research and commercial purposes, a license can be purchased. Contact us at sales@datacluster.ai or visit www.datacluster.ai to learn more.
The President believes we need to equip every child with the skills and education they need to be on a clear path to a good job and the middle class. To ensure these opportunities are available to all, President Obama has put forward a comprehensive early learning proposal to build a strong foundation for success in the first five years of life. These investments will help close America's school readiness gap and ensure that America's children enter kindergarten ready to succeed.
Dataset link: https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr
Overview High-Quality Invoice Images for OCR is a curated dataset containing professionally scanned and digitally captured invoice documents. It is designed for training, fine-tuning, and evaluating OCR models, machine learning pipelines, and data extraction systems.
This dataset focuses on clean, structured invoices to simulate real-world scenarios in financial document automation.
What's Inside 📄 Variety of invoice templates from multiple industries (e.g., retail, manufacturing, services)
🖋️ Different currencies, tax formats, and layouts
📸 High-resolution scanned and photographed invoices
🏷️ Optional field annotations (e.g., invoice number, date, total amount, vendor name) for supervised training
Key Applications Training and fine-tuning OCR and Document AI models
Machine learning for structured and semi-structured data extraction
Intelligent Document Processing (IDP) and Robotic Process Automation (RPA)
Benchmarking table detection, key-value extraction, and layout analysis models
Why Use This Dataset? ✅ High-quality images optimized for OCR and data extraction tasks
✅ Real-world invoice variations to improve model robustness
✅ Ideal for machine learning workflows in finance, ERP, and accounting systems
✅ Supports rapid prototyping for invoice understanding models
Ideal For Researchers working on OCR and document understanding
Developers building invoice processing systems
Machine learning engineers fine-tuning models for data extraction
Startups and enterprises automating financial workflows
https://www.datainsightsmarket.com/privacy-policy
The data labeling market is experiencing robust growth, projected to reach $3.84 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 28.13% from 2025 to 2033. This expansion is fueled by the increasing demand for high-quality training data across various sectors, including healthcare, automotive, and finance, which heavily rely on machine learning and artificial intelligence (AI). The surge in AI adoption, particularly in areas like autonomous vehicles, medical image analysis, and fraud detection, necessitates vast quantities of accurately labeled data. The market is segmented by sourcing type (in-house vs. outsourced), data type (text, image, audio), labeling method (manual, automatic, semi-supervised), and end-user industry. Outsourcing is expected to dominate the sourcing segment due to cost-effectiveness and access to specialized expertise. Similarly, image data labeling is likely to hold a significant share, given the visual nature of many AI applications. The shift towards automation and semi-supervised techniques aims to improve efficiency and reduce labeling costs, though manual labeling will remain crucial for tasks requiring high accuracy and nuanced understanding. Geographical distribution shows strong potential across North America and Europe, with Asia-Pacific emerging as a key growth region driven by increasing technological advancements and digital transformation. Competition in the data labeling market is intense, with a mix of established players like Amazon Mechanical Turk and Appen, alongside emerging specialized companies. The market's future trajectory will likely be shaped by advancements in automation technologies, the development of more efficient labeling techniques, and the increasing need for specialized data labeling services catering to niche applications. Companies are focusing on improving the accuracy and speed of data labeling through innovations in AI-powered tools and techniques. 
Furthermore, the rise of synthetic data generation offers a promising avenue for supplementing real-world data, potentially addressing data scarcity challenges and reducing labeling costs in certain applications. This will, however, require careful attention to ensure that the synthetic data generated is representative of real-world data to maintain model accuracy. This comprehensive report provides an in-depth analysis of the global data labeling market, offering invaluable insights for businesses, investors, and researchers. The study period covers 2019-2033, with 2025 as the base and estimated year, and a forecast period of 2025-2033. We delve into market size, segmentation, growth drivers, challenges, and emerging trends, examining the impact of technological advancements and regulatory changes on this rapidly evolving sector. The market is projected to reach multi-billion dollar valuations by 2033, fueled by the increasing demand for high-quality data to train sophisticated machine learning models. Recent developments include: September 2024: The National Geospatial-Intelligence Agency (NGA) is poised to invest heavily in artificial intelligence, earmarking up to USD 700 million for data labeling services over the next five years. This initiative aims to enhance NGA's machine-learning capabilities, particularly in analyzing satellite imagery and other geospatial data. The agency has opted for a multi-vendor indefinite-delivery/indefinite-quantity (IDIQ) contract, emphasizing the importance of annotating raw data, whether images or videos, to render it understandable for machine learning models. For instance, when dealing with satellite imagery, the focus could be on labeling distinct entities such as buildings, roads, or patches of vegetation. October 2023: Refuel.ai unveiled a new platform, Refuel Cloud, and a specialized large language model (LLM) for data labeling.
Refuel Cloud harnesses advanced LLMs, including its proprietary model, to automate data cleaning, labeling, and enrichment at scale, catering to diverse industry use cases. Recognizing that clean data underpins modern AI and data-centric software, Refuel Cloud addresses the historical challenge of human labor bottlenecks in data production. With Refuel Cloud, enterprises can swiftly generate the expansive, precise datasets they require in mere minutes, a task that traditionally spanned weeks. Key drivers for this market are: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology, Advances in Big Data Analytics based on AI and ML. Potential restraints include: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology, Advances in Big Data Analytics based on AI and ML. Notable trends are: Healthcare is Expected to Witness Remarkable Growth.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GlobalHighPM2.5 is part of a series of long-term, seamless, global, high-resolution, and high-quality datasets of air pollutants over land (i.e., GlobalHighAirPollutants, GHAP). It is generated from big data sources (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence, taking into account the spatiotemporal heterogeneity of air pollution.
This dataset contains the input data, analysis code, and generated dataset used for the following article. If you use the GlobalHighPM2.5 dataset in your scientific research, please cite the following reference (Wei et al., NC, 2023):
Wei, J., Li, Z., Lyapustin, A., Wang, J., Dubovik, O., Schwartz, J., Sun, L., Li, C., Liu, S., and Zhu, T. First close insight into global daily gapless 1 km PM2.5 pollution, variability, and health impact. Nature Communications, 2023, 14, 8349. https://doi.org/10.1038/s41467-023-43862-3
Input Data
Relevant raw data for each figure (compiled into a single sheet within an Excel document) in the manuscript.
Code
Relevant Python scripts for replicating and plotting the analysis results in the manuscript, as well as code for converting data formats.
Generated Dataset
This is the first big data-derived seamless (spatial coverage = 100%) daily, monthly, and yearly 1 km (i.e., D1K, M1K, and Y1K) global ground-level PM2.5 dataset over land from 2017 to the present. This dataset exhibits high quality, with cross-validation coefficients of determination (CV-R²) of 0.91, 0.97, and 0.98, and root-mean-square errors (RMSEs) of 9.20, 4.15, and 2.77 µg/m³ on the daily, monthly, and annual bases, respectively.
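For reference, CV-R² and RMSE of the kind reported above can be computed from any set of paired observed/predicted values; the sketch below uses synthetic numbers, not the actual validation data.

```python
import math

# Synthetic observed/predicted pairs standing in for the real
# cross-validation data (which is not reproduced here).
obs = [12.0, 35.0, 8.0, 50.0, 22.0]
pred = [14.0, 33.0, 9.5, 47.0, 24.0]

n = len(obs)
ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))  # residual sum of squares
rmse = math.sqrt(ss_res / n)

mean_obs = sum(obs) / n
ss_tot = sum((o - mean_obs) ** 2 for o in obs)         # total sum of squares
r2 = 1 - ss_res / ss_tot

print(round(rmse, 2), round(r2, 2))  # 2.16 0.98
```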
Due to data volume limitations,
all (including daily) data for the year 2022 is accessible at: GlobalHighPM2.5 (2022)
all (including daily) data for the year 2021 is accessible at: GlobalHighPM2.5 (2021)
all (including daily) data for the year 2020 is accessible at: GlobalHighPM2.5 (2020)
all (including daily) data for the year 2019 is accessible at: GlobalHighPM2.5 (2019)
all (including daily) data for the year 2018 is accessible at: GlobalHighPM2.5 (2018)
all (including daily) data for the year 2017 is accessible at: GlobalHighPM2.5 (2017)
continuously updated...
More GHAP datasets for different air pollutants are available at: https://weijing-rs.github.io/product.html
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[Paper] https://arxiv.org/abs/2506.19848 [GitHub] https://github.com/Cooperx521/ScaleCap
ScaleCap-450k: hyper-detailed and high-quality image captions
Dataset details
This dataset contains 450k image-caption pairs, where the captions are annotated using the ScaleCap pipeline. For more details, please refer to the paper. In collecting images for our dataset, we primarily focus on two aspects: diversity and richness of image content. Given that the ShareGPT4V-100k already… See the full description on the dataset page: https://huggingface.co/datasets/long-xing1/ScaleCap-450k.
GlobalHighPM2.5 is one of a series of long-term, full-coverage, global, high-resolution, and high-quality datasets of ground-level air pollutants over land (i.e., GlobalHighAirPollutants, GHAP). It is generated from big data (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence, taking into account the spatiotemporal heterogeneity of air pollution. The ten-fold cross-validation coefficient of determination (CV-R²) is 0.91, and the root-mean-square error (RMSE) is 9.2 µg/m³. The dataset covers the entire global land area, with a spatial resolution of 1 km, a temporal resolution of day, month, and year, and units of µg/m³. Attention: this dataset is recorded in Universal Time (UTC, GMT+0) and is continuously updated. If you need more data, please contact the author by email (weijing_rs@163.com; weijing@umd.edu). The data file includes nc2geotiff conversion code in four languages (Python, MATLAB, IDL, and R) for converting the NetCDF files to GeoTIFF.
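Converters like the bundled nc2geotiff code ultimately reduce to writing the NetCDF grid out with a six-parameter affine geotransform. The sketch below illustrates that mapping for an assumed north-up 0.01° (~1 km) global grid; the resolution and origin here are illustrative assumptions, not the dataset's actual metadata.

```python
# GDAL-style affine geotransform for a north-up global grid:
# (x_origin, pixel_width, row_rotation, y_origin, col_rotation, -pixel_height)
res = 0.01                  # degrees per pixel (assumed, ~1 km at the equator)
west, north = -180.0, 90.0  # upper-left corner of the grid (assumed)
geotransform = (west, res, 0.0, north, 0.0, -res)

def pixel_to_lonlat(col, row, gt):
    """Center coordinates of pixel (col, row) under geotransform gt."""
    lon = gt[0] + (col + 0.5) * gt[1]
    lat = gt[3] + (row + 0.5) * gt[5]
    return lon, lat

print(pixel_to_lonlat(0, 0, geotransform))  # center of the upper-left pixel
```

A real converter would attach this tuple (plus a CRS) to the PM2.5 array when writing the GeoTIFF, e.g. via GDAL's SetGeoTransform.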
https://creativecommons.org/publicdomain/zero/1.0/
About
We provide a comprehensive talking-head video dataset with 50,000+ videos, totaling more than 500 hours of footage and featuring 20,841 unique identities from around the world.
Distribution
Detailing the format, size, and structure of the dataset:
Data Volume:
-Total Size: 2.7TB
-Total Videos: 47,547
-Identities Covered: 20,841
-Resolution: 60% 4K (1980), 33% Full HD (1080)
-Formats: MP4
-Full-length videos with visible mouth movements in every frame.
-Minimum face size of 400 pixels.
-Video durations range from 20 seconds to 5 minutes.
-Faces have not been cropped out; videos are full-screen and include backgrounds.
Usage
This dataset is ideal for a variety of applications:
Face Recognition & Verification: Training and benchmarking facial recognition models.
Action Recognition: Identifying human activities and behaviors.
Re-Identification (Re-ID): Tracking identities across different videos and environments.
Deepfake Detection: Developing methods to detect manipulated videos.
Generative AI: Training high-resolution video generation models.
Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.
Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.
Coverage
Explaining the scope and coverage of the dataset:
Geographic Coverage: Worldwide
Time Range: the time range and size of each video are noted in the CSV file.
Demographics: Includes information about age, gender, ethnicity, format, resolution, and file size.
Languages Covered (Videos):
English: 23,038 videos
Portuguese: 1,346 videos
Spanish: 677 videos
Norwegian: 1,266 videos
Swedish: 1,056 videos
Korean: 848 videos
Polish: 1,807 videos
Indonesian: 1,163 videos
French: 1,102 videos
German: 1,276 videos
Japanese: 1,433 videos
Dutch: 1,666 videos
Indian: 1,163 videos
Czech: 590 videos
Chinese: 685 videos
Italian: 975 videos
Filipino: 920 videos
Bulgarian: 340 videos
Romanian: 1,144 videos
Arabic: 1,691 videos
Who Can Use It
List examples of intended users and their use cases:
Data Scientists: Training machine learning models for video-based AI applications.
Researchers: Studying human behavior, facial analysis, or video AI advancements.
Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.
Additional Notes
Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Preprocessing (cropping, downsampling) may be needed for some use cases. The dataset is not yet complete and expands daily; please contact us for the most up-to-date CSV file. The dataset has been divided into 20 GB zipped files and is hosted on a private server (with the option to upload to the cloud if needed). To verify the dataset's quality, please contact me for the full CSV file; I'd be happy to provide example videos selected by the potential buyer.
We processed 41 Voyager 2 images of Neptune's moon Triton with pixel scales < 2 km/pixel from their raw, compressed, archived state to more usable cloud-optimized GeoTIFFs, which can easily be used within spatial analysis software such as GIS. Processing was done using the USGS' ISIS software and included geometric and radiometric calibration, as well as the removal of image reseaux and corner markers (originally used for geometric calibration). The images were also photogrammetrically controlled relative to one another to improve their locations on the surface, which were initially inaccurate by up to 200 km. After performing a least-squares bundle adjustment, the root-mean-square (RMS) uncertainty in image locations was 0.50, 0.52, and 0.51 pixels in latitude, longitude, and radius, respectively, with minimum and maximum residuals of -4.21 and +3.197 pixels, respectively. Each individual image was warped to an orthographic projection centered at 15°W and 18°N at the native image resolution. Because reseau removal introduces interpolated data, two versions of each image are provided: a fully processed version with reseaux removed, and a partially processed version that retains the reseaux (i.e., no interpolation). We also generated a mosaic of Triton images that is spatially consistent with the entire dataset and provides context for the individual images (not every image is included in the mosaic). The mosaic uses the same orthographic projection as the individual images, but a consistent scale of 600 m/pixel. Three versions of the mosaic are included: a fully processed version (reseaux removed, some interpolated pixel values), a partially processed version (reseaux retained, no interpolation), and a fully processed, sharpened version (high-pass filter with 100% albedo add-back) to enhance surface features. This data release improves the usability and accessibility of this singular dataset and enables new scientific investigations of Triton.
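For context, an orthographic projection of the kind used for these products follows the standard spherical map-projection formulas. The sketch below assumes a Triton radius of 1353.4 km and the 15°W, 18°N center described above; check the product metadata for the values actually used.

```python
import math

R = 1353.4                 # Triton mean radius in km (assumed value)
lon0, lat0 = -15.0, 18.0   # projection center: 15°W, 18°N (east-positive)

def orthographic(lon, lat):
    """Forward spherical orthographic projection; returns (x, y) in km."""
    lam = math.radians(lon - lon0)
    phi = math.radians(lat)
    phi0 = math.radians(lat0)
    x = R * math.cos(phi) * math.sin(lam)
    y = R * (math.cos(phi0) * math.sin(phi)
             - math.sin(phi0) * math.cos(phi) * math.cos(lam))
    return x, y

print(orthographic(-15.0, 18.0))  # projection center maps to (0.0, 0.0)
```

Dividing x and y by the pixel scale (600 m/pixel for the mosaic) would then give image coordinates relative to the projection center.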
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States - 61.5-Year High Quality Market (HQM) Corporate Bond Spot Rate was 5.95% in March of 2025, according to the United States Federal Reserve. Historically, United States - 61.5-Year High Quality Market (HQM) Corporate Bond Spot Rate reached a record high of 12.50 in June of 1984 and a record low of 3.09 in December of 2020. Trading Economics provides the current actual value, a historical data chart and related indicators for United States - 61.5-Year High Quality Market (HQM) Corporate Bond Spot Rate - last updated from the United States Federal Reserve in May of 2025.
High-Quality Wetland points displayed in the DNR Watershed Restoration and Protection Viewer. These are unique wetlands and those wetlands with least disturbed or reference conditions. Points represent a generalized area, for legal and privacy reasons. All points are in HUCs that fall mostly within Wisconsin.
https://data.gov.tw/license
High-resolution satellite cloud image data. *The download URL changed as of September 15, 2023; please switch to the new link by December 31, 2023, after which the old link will expire. If you need to download a large amount of data, please apply for membership on the open platform for meteorological data: https://opendata.cwa.gov.tw/index