100+ datasets found

Statistics Interface Province-Level Data Collection - Datasets - This...
store.smartdatahub.io
Updated Nov 11, 2024
Cite
(2024). Statistics Interface Province-Level Data Collection - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tilastointialueet_maakunta1000k
Dataset updated
Nov 11, 2024
Description
The dataset collection in question is a compilation of related data tables sourced from the website of Tilastokeskus (Statistics Finland) in Finland. The data present in the collection is organized in a tabular format comprising rows and columns, each holding related data. The collection includes several tables, each of which represents different years, providing a temporal view of the data. The description provided by the data source, Tilastokeskuksen palvelurajapinta (Statistics Finland's service interface), suggests that the data is likely to be statistical in nature and could be related to regional statistics, given the nature of the source. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
Intelligent Monitor
kaggle.com
Updated Apr 12, 2024
Cite
ptdevsecops (2024). Intelligent Monitor [Dataset]. http://doi.org/10.34740/kaggle/ds/4383210
Unique identifier
https://doi.org/10.34740/kaggle/ds/4383210
Dataset updated
Apr 12, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
ptdevsecops
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
IntelligentMonitor: Empowering DevOps Environments With Advanced Monitoring and Observability aims to improve monitoring and observability in complex, distributed DevOps environments by leveraging machine learning and data analytics. This repository contains a sample implementation of the IntelligentMonitor system proposed in the research paper, presented and published as part of the 11th International Conference on Information Technology (ICIT 2023).

If you use this dataset and code, or any modified part of it, in any publication, please cite this paper:

P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.

For any questions and research queries - please reach out via Email.

Abstract - In the dynamic field of software development, DevOps has become a critical tool for enhancing collaboration, streamlining processes, and accelerating delivery. However, monitoring and observability within DevOps environments pose significant challenges, often leading to delayed issue detection, inefficient troubleshooting, and compromised service quality. These issues stem from DevOps environments' complex and ever-changing nature, where traditional monitoring tools often fall short, creating blind spots that can conceal performance issues or system failures. This research addresses these challenges by proposing an innovative approach to improve monitoring and observability in DevOps environments. Our solution, IntelligentMonitor, leverages real-time data collection, intelligent analytics, and automated anomaly detection powered by advanced technologies such as machine learning and artificial intelligence. The experimental results demonstrate that IntelligentMonitor effectively manages data overload, reduces alert fatigue, and improves system visibility, thereby enhancing performance and reliability. For instance, the average CPU usage across all components showed a decrease of 9.10%, indicating improved CPU efficiency. Similarly, memory utilization and network traffic showed an average increase of 7.33% and 0.49%, respectively, suggesting more efficient use of resources. By providing deep insights into system performance and facilitating rapid issue resolution, this research contributes to the DevOps community by offering a comprehensive solution to one of its most pressing challenges. This fosters more efficient, reliable, and resilient software development and delivery processes.

Components

The key components that would need to be implemented are:

Data Collection - Collect performance metrics and log data from the distributed system components. Could use technology like Kafka or telemetry libraries.

Data Processing - Preprocess and aggregate the collected data into an analyzable format. Could use Spark for distributed data processing.

Anomaly Detection - Apply machine learning algorithms to detect anomalies in the performance metrics. Could use isolation forest or LSTM models.

Alerting - Generate alerts when anomalies are detected. It could integrate with tools like PagerDuty.

Visualization - Create dashboards to visualize system health and key metrics. Could use Grafana or Kibana.

Data Storage - Store the collected metrics and log data. Could use Elasticsearch or InfluxDB.
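
The anomaly detection component above is described only at a high level. As a minimal sketch of the idea, the following flags outliers in a series of metric samples. It deliberately substitutes a simple robust rule (the modified z-score based on the median absolute deviation) for the isolation forest or LSTM models mentioned above, and it is not taken from the IntelligentMonitor code. The function name, the threshold of 3.5, and the sample values are all assumptions made for illustration.

```python
from statistics import median


def robust_anomalies(values, threshold=3.5):
    """Return indices of samples whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. This is a stand-in for the ML models named
    in the text, not the method used by the dataset authors.
    """
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All samples (or most of them) are identical; nothing to flag.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical CPU-usage samples in percent; the spike at index 5 is synthetic.
cpu_usage = [41.0, 39.5, 40.2, 42.1, 40.8, 97.3, 41.5, 40.1, 39.9, 40.6]
anomalies = robust_anomalies(cpu_usage)
print(anomalies)
```

In a pipeline like the one outlined above, the returned indices would feed the alerting component; tuning `threshold` is one way to trade missed anomalies against false positives.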

Implementation Details

The core of the implementation would involve the following:

- Setting up the data collection pipelines.
- Building and training anomaly detection ML models on historical data.
- Developing a real-time data processing pipeline.
- Creating an alerting framework that ties into the ML models.
- Building visualizations and dashboards.

The code would need to handle scaled-out, distributed execution for production environments.

Proper code documentation, logging, and testing would be added throughout the implementation.

Usage Examples

Usage examples could include:

Running the data collection agents on each system component.

Visualizing system metrics through Grafana dashboards.

Investigating anomalies detected by the ML models.

Tuning the alerting rules to minimize false positives.

Correlating metrics with log data to troubleshoot issues.

References

The implementation would follow the details provided in the original research paper: P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.

Any additional external libraries or sources used would be properly cited.

Tags - DevOps, Software Development, Collaboration, Streamlini...
CRUMB: the Collected Radiogalaxies Using MiraBest dataset
zenodo.org
data.niaid.nih.gov
text/x-python
Updated Sep 30, 2023
Cite
Fiona Alice May Porter (2023). CRUMB: the Collected Radiogalaxies Using MiraBest dataset [Dataset]. http://doi.org/10.5281/zenodo.7746094
Available download formats: text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.7746094
Dataset updated
Sep 30, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Fiona Alice May Porter
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The CRUMB dataset is a machine learning image dataset of Fanaroff-Riley galaxies, constructed by combining the datasets of MiraBest, FR-DEEP, AT17 and a supplementary MiraBest hybrid dataset.

Sources are labelled using a unified "basic" label system with the following classes:

0: FRI

1: FRII

2: Hybrid source

The original labels of each of the parent datasets are also retained as a "complete" label which is accessible using the built-in "complete_labels" method on the dataloader. These labels are as follows:

MiraBest: 0 (confidently-classified FRI), 1 (confidently-classified wide-angle-tail), 2 (confidently-classified head-tail), 3 (uncertainly-classified FRI), 4 (uncertainly-classified wide-angle-tail), 5 (confidently-classified FRII), 6 (confidently-classified double-double), 7 (uncertainly-classified FRII), 8 (confidently-classified hybrid), 9 (uncertainly-classified hybrid)

FR-DEEP: 0 (FRI), 1 (FRII)

AT17: 0 (FRI), 1 (FRII), 2 (bent)

MiraBest Hybrid: 0 (confidently-classified hybrid), 1 (uncertainly-classified hybrid)
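
The label scheme above can be captured as plain lookup tables, which may be handy when decoding label integers outside the dataloader. This is only a sketch: the dictionary and function names are hypothetical and are not part of the CRUMB dataloader or its "complete_labels" method; the label texts are copied from the lists above.

```python
# "Basic" labels shared by all sources in CRUMB.
BASIC_LABELS = {0: "FRI", 1: "FRII", 2: "Hybrid source"}

# "Complete" labels, keyed by parent dataset name.
COMPLETE_LABELS = {
    "MiraBest": {
        0: "confidently-classified FRI",
        1: "confidently-classified wide-angle-tail",
        2: "confidently-classified head-tail",
        3: "uncertainly-classified FRI",
        4: "uncertainly-classified wide-angle-tail",
        5: "confidently-classified FRII",
        6: "confidently-classified double-double",
        7: "uncertainly-classified FRII",
        8: "confidently-classified hybrid",
        9: "uncertainly-classified hybrid",
    },
    "FR-DEEP": {0: "FRI", 1: "FRII"},
    "AT17": {0: "FRI", 1: "FRII", 2: "bent"},
    "MiraBest Hybrid": {
        0: "confidently-classified hybrid",
        1: "uncertainly-classified hybrid",
    },
}


def describe_complete_label(parent_dataset, label):
    """Return the text description for a parent dataset's original label."""
    return COMPLETE_LABELS[parent_dataset][label]


print(BASIC_LABELS[1])
print(describe_complete_label("MiraBest", 6))
```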

For examples of how to use CRUMB, please see its GitHub repository.
Dataset of book subjects that contain Howard the Duck : the complete...
workwithdata.com
Updated Nov 7, 2024
Cite
Work With Data (2024). Dataset of book subjects that contain Howard the Duck : the complete collection. Vol. 2 [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Howard+the+Duck+:+the+complete+collection.+Vol.+2&j=1&j0=books
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is about book subjects. It has 2 rows and is filtered where the book is Howard the Duck : the complete collection. Vol. 2. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
EK500 Water Column Sonar Data Collected During DE0407
catalog.data.gov
datasets.ai
+1 more
Updated Sep 17, 2021
Cite
NOAA National Centers for Environmental Information (Point of Contact) (2021). EK500 Water Column Sonar Data Collected During DE0407 [Dataset]. https://catalog.data.gov/dataset/ek500-water-column-sonar-data-collected-during-de0407
Dataset updated
Sep 17, 2021
Dataset provided by
National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
National Centers for Environmental Information (https://www.ncei.noaa.gov/)
Description
NEFSC Apex Predators Longline Survey (DE0407, EK500). The fishery independent survey of Atlantic large and small coastal sharks is conducted bi-annually in U.S. waters. Its primary objective is to conduct a standardized, systematic survey of the shark populations off the U.S. Atlantic coast to provide unbiased indices of relative abundance for species inhabiting the waters from Florida to the Mid-Atlantic. This survey also provides an opportunity to tag sharks with conventional and electronic tags as part of the NEFSC Cooperative Shark Tagging Program, inject with OTC for age validation studies, and to collect biological samples and data used in analyses of life history characteristics (age, growth, reproductive biology, trophic ecology, etc.) and other research of sharks in U.S. coastal waters including the collection of morphometric data for size conversions. The time series of abundance indices from this survey is critical to the evaluation of coastal Atlantic shark species.
Clinical Questions Collection
s.cnmilf.com
healthdata.gov
+4 more
Updated Jun 19, 2025
Cite
National Library of Medicine (2025). Clinical Questions Collection [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/clinical-questions-collection-665af
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
The Clinical Questions Collection is a repository of questions collected between 1991 and 2003 from healthcare providers in clinical settings across the country. The questions have been submitted by investigators who wish to share their data with other researchers. This dataset is no longer updated with new content. The collection is used in developing approaches to clinical and consumer-health question answering, as well as in researching the information needs of clinicians and the language they use to express those needs. All files are formatted in XML.
Volume of recyclable waste collected from municipalities Japan FY 2023, by...
statista.com
Updated Oct 1, 2024
Cite
Statista (2024). Volume of recyclable waste collected from municipalities Japan FY 2023, by type [Dataset]. https://www.statista.com/statistics/1183178/japan-recyclable-waste-volume-municipalities/
Dataset updated
Oct 1, 2024
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Japan
Description
In fiscal year 2023, the Japan Containers and Packaging Recycling Association received approximately 655.8 thousand metric tons of plastic waste collected by municipalities in Japan. That fiscal year, plastics accounted for around 55 percent of the collected recyclable waste. The consumption volume of plastic products in Japan has been rising continuously in recent years.
Dataset of books in the Complete Guides series
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books in the Complete Guides series [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=j0-book_series&fop0=%3D&fval0=Complete+Guides&j=1&j0=book_series
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is about books. It has 6 rows and is filtered where the book series is Complete Guides. It features 9 columns including author, publication date, language, and book publisher.
Dataset of books called The Bill : the complete low-down on 20 years at Sun...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called The Bill : the complete low-down on 20 years at Sun Hill [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+Bill+%3A+the+complete+low-down+on+20+years+at+Sun+Hill
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is The Bill : the complete low-down on 20 years at Sun Hill. It features 7 columns including author, publication date, language, and book publisher.
USGS Southwest Repeat Photography Collection: Kanab Creek, southern Utah and...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). USGS Southwest Repeat Photography Collection: Kanab Creek, southern Utah and northern Arizona, 1872-2010 [Dataset]. https://catalog.data.gov/dataset/usgs-southwest-repeat-photography-collection-kanab-creek-southern-utah-and-northern-a-1872
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Kanab Creek, Utah, Arizona
Description
The USGS Southwest Repeat Photography Collection (‘Collection’), formerly named the Desert Laboratory Repeat Photography Collection, is now housed by the Southwest Biological Science Center (SBSC) in Flagstaff, Arizona. It contains images from the late 1800s to mid-2000s, and was assembled over decades by now retired USGS scientists Drs. Robert H. Webb and Raymond M. Turner. There are 80 camera points, or stakes, along Kanab Creek in the Collection, with images and field notes taken between 1872 and 2010 (a 138-year span). About one-fourth of the Kanab Creek film had been previously digitized, but none of the associated materials, including field notes, were digitized. The goal of the Fiscal Year (FY) 2016 Kanab Creek data preservation project was to preserve all film and materials for the Kanab Creek stretch, which represents a small subset of the entire USGS Southwest Repeat Photography Collection. For the purposes of this project, we will be using assessments made during the digital preservation of this subset in order to estimate the time and protocols necessary to digitize and release the entire Collection, thus using this project as an example to model the preservation efforts for the rest of the Collection. As the Collection was compiled, every camera point or stake location (‘stake’) was assigned a stake number denoted by the letter ‘s’ in front of a number (e.g., s1234). It is important to note that since images were taken by various repeat photography expeditions at different times and for different research purposes, stake numbers do not follow a logical numerical or geographical order. Each stake has a physical folder where all materials collected over time have been consolidated, with the exception of film. The folders contain print photographs, field notes, Record of Repeat Photography note sheets and other paper materials. The photographic film (negative and positive) is stored within archival envelopes within fire safes at SBSC for security.
The goal of this preservation project was to digitize the best quality film and other materials for each date at all stakes along Kanab Creek in order to preserve the long term visual record. The digitization process was slightly different depending on the type of material; all processes were documented and a detailed protocol has been provided as an attachment on the ScienceBase page. Geospatial data are also included in this release to provide users with a map representation of where the stake locations are situated along Kanab Creek. This data release contains 80 child items, each representing a different stake location along Kanab Creek. Each child item provides digital copies of images and field notes from all photographed years at each stake, and has an associated metadata record that describes the contents of the page. This main landing page includes all child items, geospatial references for all stake locations (SHP file), a spreadsheet with supplemental information for each stake (CSV file), a project level metadata record, a metadata record that describes both the SHP file and the CSV, and a document describing the scanning protocols used for this project. Description of film materials: Film is ideal for preservation at a high resolution because it is the closest representation of the source image captured by the camera. The images in this release include digital scans of 182 film negatives or prints. Most of these are from distinct stake locations and dates, however there are some duplicates (from the same stake and date) that were scanned in order to preserve unique details they included. The film types vary depending on the time of data collection and type of camera used. The film types include black and white negatives, black and white positives, color negatives and color positives, and are found in either 4x5" size or 120mm film size. To digitize all film types we used a Hasselblad Flextight scanner with associated FlexLight software. 
To edit film we used a combination of FlexLight software and Adobe Photoshop editing tools. All adjustments were made in order to achieve maximum clarity of landscape features, debris flows, vegetation patterns and human settlements in the photos so these aspects can be easily observed and studied. Description of photographic prints: The stake folders in the Kanab Creek Collection contain print photographs of various sizes and types, with possible repeats. The print sizes included 8x10", 5x8", 4x6" prints and 3x5" Polaroid photos. The type of photographic equipment used to take the original photographs varied depending on the year the data were captured, the photographic technology available at that time, and the photographer’s choice of film size and camera. Photographs were digitized using a flatbed scanner and associated software. Then, the scans were edited in Adobe Photoshop. Similar to the process used for film, image adjustments were made in order to achieve maximum clarity of landscape features and change; particularly, channel changes resulting from various geomorphic processes (including stream flow, floods and debris flows), vegetation patterns and human settlements so that these aspects could be easily observed and studied. Description of paper materials: The images in the repeat photography collection are accompanied by field notes that provide valuable information about the repeat imagery and data collection process. All paper materials in the stake folders were digitized to preserve this important information. These included hand written field notes, Record of Repeat Photography data sheets, field notes on vegetation, and film envelopes with written camera metadata and camera settings. Some folder documents are originals, while others are printed photocopies of field notebooks or other documents. A flatbed EPSON scanner was used to digitize these documents. 
Minor adjustments were made to digital quality in order to maximize readability of the information. The digitization process was slightly different depending on the type of material; all processes are documented in the metadata, and a detailed digitizing protocol has been provided as an attachment on the project's ScienceBase page.
Dataset of books called The complete wine course
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called The complete wine course [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+complete+wine+course
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is The complete wine course. It features 7 columns including author, publication date, language, and book publisher.
Data from: UCM Bird Collection (Arctos)
gbif.org
Updated Jul 5, 2025
Cite
Emily Braker (2025). UCM Bird Collection (Arctos) [Dataset]. http://doi.org/10.15468/ajtd6v
Unique identifier
https://doi.org/10.15468/ajtd6v
Dataset updated
Jul 5, 2025
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
University of Colorado Museum of Natural History
Authors
Emily Braker
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The UCM Bird collection numbers over 12,000 specimens and documents changes in biodiversity over the last 200 years. With global coverage and strengths in Colorado avifauna, the collection dates from the early 1800s to the present. Nearly 6,000 specimens were donated by the Colorado College Museum in Colorado Springs in the 1960s. This material includes the collection of Charles E. Aiken, a pioneer ornithologist in Colorado, and dates back to 1805. The Bird collection is also home to several specimens of iconic extinct species such as Passenger Pigeons and Carolina Parakeets.
University of Georgia Collection of Arthropods
gbif.org
Updated Jun 4, 2025
Cite
University of Georgia Collection of Arthropods (2025). University of Georgia Collection of Arthropods [Dataset]. http://doi.org/10.15468/rsljus
Unique identifier
https://doi.org/10.15468/rsljus
Dataset updated
Jun 4, 2025
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
University of Georgia Collection of Arthropods
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The University of Georgia Collection of Arthropods (UGCA) serves as the official state repository of insects and other non-marine arthropods. The Collection is part of the UGA Department of Entomology and the Georgia Museum of Natural History (GMNH). The UGCA includes approximately 2,000,000 pinned specimens. In addition the collection houses significant alcohol-preserved and slide-mounted collections. Approximately 60% of the holdings are from the southeastern United States as is consistent with our mission to serve as the primary systematics reference for the state. More than 70% of that regional material is identified to the species level.
Dataset of books called The complete servant
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called The complete servant [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=The+complete+servant
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is The complete servant. It features 7 columns including author, publication date, language, and book publisher.
Research Artefact: An Empirical Study of React-Library Related Issues via...
data.niaid.nih.gov
zenodo.org
Updated Aug 4, 2023
Cite
Kurniaji, Ganno Tribuana (2023). Research Artefact: An Empirical Study of React-Library Related Issues via Stack Overflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6420714
Dataset updated
Aug 4, 2023
Dataset provided by
Nugroho, Yusuf Sulistyo
Kurniaji, Ganno Tribuana
Islam, Syful
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This research artifact accompanies the paper titled "An Empirical Study of React-Library Related Issues via Stack Overflow." It is a comprehensive repository that includes the collected dataset containing 447,542 React-related Stack Overflow question posts, as well as 384 representative samples obtained randomly. The primary objective of this artifact is to facilitate the replication of our dataset for researchers and allow them to utilize it for further investigations and research purposes.
English Newspaper, Magazine, and Books OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/english-newspaper-book-magazine-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the English Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the English language.
Dataset Content & Diversity:
Containing a total of 5000 images, this English OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to support building robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image at least 80% of the space contains visible English text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in portrait and landscape modes.
All images were captured by native English speakers to ensure text quality and to avoid toxic content and PII. To maintain image quality, all images were taken with recent iOS and Android mobile devices with cameras above 5 MP. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
Along with the image data you will also receive detailed structured metadata in CSV format. For each image it includes metadata like device information, source type like newspaper, magazine or book image, and image type like portrait or landscape etc. Each image is properly renamed corresponding to the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English text recognition models.
Update & Custom Collection:
We're committed to expanding this dataset by continuously adding more images with the assistance of our native English language crowd community.
If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, using our crowd community, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific requirements.
License:
This Image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the English language. Your journey to enhanced language understanding and processing starts here.
PixBox Landsat 8 pixel collection for CMIX
data.niaid.nih.gov
zenodo.org
Updated Dec 20, 2021
Cite
Lebreton, Carole (2021). PixBox Landsat 8 pixel collection for CMIX [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5040270
Dataset updated
Dec 20, 2021
Dataset provided by
Wevers, Jan
Brockmann, Carsten
Stelzer, Kerstin
Lebreton, Carole
Paperin, Michael
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The PixBox-L8-CMIX dataset was used as a validation reference within the first Cloud Masking Inter-comparison eXercise (CMIX), conducted within the Committee on Earth Observation Satellites (CEOS) Working Group on Calibration & Validation (WGCV) in 2019. The PixBox-L8-CMIX pixel collection existed prior to CMIX; it was carried out in 2015.

The overarching idea of PixBox is a quantitative assessment of the quality of a pixel classification which is the result of an automated algorithm/procedure. Pixel classification is defined as assigning a certain number of attributes to an image pixel, such as cloud, clear sky, water, land, inland water, flooded, snow etc. Such pixel classification attributes are typically used to further guide higher level processing.

PixBox dataset production: trained, experienced experts manually classify pixels of an image sensor into a pre-defined, detailed set of classes. These are typically different cloud transparencies, cloud shadow, and the condition of the underlying surface (“semi-transparent clouds over snow”, “clouds over bright scattering water”). An average collected dataset includes several tens of thousands of pixels because it has to be representative of all classes and of various observation and environmental conditions, such as climate zones, sun illumination etc. Quality control of the collected pixels is important in order to detect misclassifications and systematic errors. An auto-associative neural network is trained for this purpose.

The PixBox-L8-CMIX dataset is a pixel collection containing 18,830 pixels manually collected from 11 Landsat 8 Level 1 products. The dataset is temporally well distributed. Spatially it is focused on coastal areas, mainly in Europe. Thematically it is focused on coastal zones, but it still represents land and water surfaces.

PixBox-L8-CMIX dataset

The PixBox-L8-CMIX dataset consists of two main ZIP files: one holding the pixel collection and its description, and another with all the Landsat 8 L1 data used. The dataset is structured as follows:

PixBox-L8-CMIX.zip

The collected features (CSV file).

A description of all categories and classes, including links to the Landsat 8 L1 products used.

Landsat8_L1.zip

11 zipped Landsat 8 Level 1 products [1], used to produce the dataset.

Files

pixbox_landsat8_cmix_20150527.csv - This file contains all collected pixel information in CSV format. All collected classes are stored as integer values. A description of the categories and the mapping from integers to class names is given in the additional description file.

pixbox_landsat8_cmix_20150527_description.txt - This file describes the categories and classes. It can be used to convert the class ID numbers stored in the CSV to class strings. Additionally, it links the satellite product ID given in the CSV to the Landsat 8 L1 product names.

11 Landsat 8 L1 products in ZIP format.
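
As a minimal sketch of the ID-to-name conversion described for the two files above, the following Python snippet replaces integer class IDs in CSV rows with class names. The column names (PRODUCT_ID, CLOUD), the comma delimiter, and the ID-to-name mapping are hypothetical placeholders, not taken from the dataset; the real values have to be read from pixbox_landsat8_cmix_20150527_description.txt.

```python
import csv
import io

# Hypothetical mapping; the real IDs and names are defined in
# pixbox_landsat8_cmix_20150527_description.txt.
CLASS_NAMES = {
    "CLOUD": {0: "clear", 1: "semi-transparent cloud", 2: "opaque cloud"},
}


def decode_classes(rows, class_names):
    """Yield copies of rows with integer class IDs replaced by class names."""
    for row in rows:
        decoded = dict(row)
        for column, mapping in class_names.items():
            if column in decoded:
                class_id = int(decoded[column])
                # Keep the raw ID as a string when it has no known name.
                decoded[column] = mapping.get(class_id, str(class_id))
        yield decoded


# Stand-in for the real CSV file (assumed column names and delimiter).
sample = io.StringIO("PRODUCT_ID,CLOUD\n3,2\n3,0\n7,9\n")
decoded_rows = list(decode_classes(csv.DictReader(sample), CLASS_NAMES))
for row in decoded_rows:
    print(row["PRODUCT_ID"], row["CLOUD"])
```

For the real file, the in-memory sample would be replaced by an opened pixbox_landsat8_cmix_20150527.csv, with one mapping per category column.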

References

[1] Landsat 8 products courtesy of the U.S. Geological Survey
Leading data collection methods among UK consumers 2023
statista.com
Updated Jun 26, 2025
Cite
Statista (2025). Leading data collection methods among UK consumers 2023 [Dataset]. https://www.statista.com/statistics/1453941/data-collection-method-consumers-uk/
Dataset updated
Jun 26, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Nov 2023 - Dec 2023
Area covered
United Kingdom
Description
During a late 2023 survey among working-age consumers in the United Kingdom, **** percent of respondents stated that they preferred for their data to be collected via interactive surveys. Meanwhile, **** percent of respondents mentioned loyalty cards/programs as their favored data collection method.
TEMPHUM_KalternEddyCov
data.europa.eu
Updated Jul 4, 2021
+ more versions
Cite
(2021). TEMPHUM_KalternEddyCov [Dataset]. https://data.europa.eu/data/datasets/76d862a3b83b88bcc77e37f45f4d3c04ae2d799b?locale=bg
Dataset updated
Jul 4, 2021
Description
Offering (timeseries group) of the Sensor Observation Service (SOS) collected within the MONALISA project, belonging to a network of 31 measuring stations in the Bolzano province. The offering groups the timeseries from a specific sensor installed at the station site. Timeseries are identified by the observed parameter names, which are listed in the keywords list. The title's naming convention defines the offering identification name: “Offering”_“Station name and altitude”. The complete list of the timeseries provided by the web service is available in JSON format through the API: http://monalisasos.eurac.edu/sos/api/v1/timeseries/ . Further information can be found on the project website: http://monalisasos.eurac.edu/sos/. To browse and/or download the timeseries data, a map viewer is available: http://monalisasos.eurac.edu/sos/static/client/helgoland/index.html#/map
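
The naming convention in the description can be illustrated with a short Python sketch that splits a dataset title into its offering and station parts. The helper name split_offering_title is hypothetical, and the sketch assumes that the offering part never contains an underscore, which the description does not state.

```python
def split_offering_title(title):
    """Split an '<Offering>_<Station>' title at the first underscore."""
    offering, separator, station = title.partition("_")
    if not separator or not offering or not station:
        raise ValueError(f"title does not follow the convention: {title!r}")
    return offering, station


offering, station = split_offering_title("TEMPHUM_KalternEddyCov")
print(offering, station)
```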
Dataset relating a study on Geospatial Open Data usage and metadata quality
zenodo.org
data.niaid.nih.gov
Updated Jun 19, 2023
+ more versions
Cite
Alfonso Quarati; Monica De Martino (2023). Dataset relating a study on Geospatial Open Data usage and metadata quality [Dataset]. http://doi.org/10.5281/zenodo.4280594
Unique identifier
https://doi.org/10.5281/zenodo.4280594
Dataset updated
Jun 19, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alfonso Quarati; Monica De Martino
Description
Open Government Data (OGD) portals, thanks to the presence of thousands of geo-referenced datasets containing spatial information, are of great interest for any analysis or process relating to the territory. For this to happen, users must be able to access these datasets and reuse them. The quality of metadata is often considered an obstacle to the full dissemination of OGD data. Starting from an experimental investigation conducted on over 160,000 geospatial datasets belonging to six national and international OGD portals, the first objective of this work is to provide an overview of the usage of these portals, measured in terms of dataset views and downloads. Furthermore, to assess the possible influence of metadata quality on the use of geospatial datasets, the metadata of each dataset was assessed and the correlation between these two variables was measured. The results showed a significant underutilization of geospatial datasets and a generally poor quality of their metadata. In addition, only a weak correlation was found between dataset use and metadata quality, not strong enough to assert with certainty that the latter is a determining factor of the former.

The dataset consists of six zipped CSV files, containing the collected datasets' usage data, full metadata, and computed quality values, for about 160,000 geospatial datasets belonging to the three national and three international portals considered in the study, i.e. US (catalog.data.gov), Colombia (datos.gov.co), Ireland (data.gov.ie), HDX (data.humdata.org), EUODP (data.europa.eu), and NASA (data.nasa.gov).

Data collection occurred in the period: 2019-12-19 -- 2019-12-23.

The header for each CSV file is:

[ ,portalid,id,downloaddate,metadata,overallq,qvalues,assessdate,dviews,downloads,engine,admindomain]

where each row describes one dataset of a portal and the fields are defined as follows:

portalid: portal identifier

id: dataset identifier

downloaddate: date of data collection

metadata: the overall dataset's metadata downloaded via API from the portal according to the supporting platform schema

overallq: overall quality values computed by applying the methodology presented in [1]

qvalues: json object containing the quality values computed for the 17 metrics presented in [1]

assessdate: date of quality assessment

dviews: number of total views for the dataset

downloads: number of total downloads for the dataset (made available only by the Colombia, HDX, and NASA portals)

engine: identifier of the supporting portal platform: 1 (CKAN), 2 (Socrata)

admindomain: 1 (national), 2 (international)
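
The field list above can be turned into a small reader. The following Python sketch parses rows with that header into typed values, decoding the qvalues JSON object and the engine and admindomain codes. The sample row is invented for illustration, and the value formats (a decimal overallq, integer view counts, an empty downloads cell when the portal reports none) are assumptions rather than facts about the published files.

```python
import csv
import io
import json

FIELDS = ["", "portalid", "id", "downloaddate", "metadata", "overallq",
          "qvalues", "assessdate", "dviews", "downloads", "engine",
          "admindomain"]
ENGINES = {1: "CKAN", 2: "Socrata"}
DOMAINS = {1: "national", 2: "international"}


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""
    return {
        "portalid": row["portalid"],
        "id": row["id"],
        "overallq": float(row["overallq"]),
        "qvalues": json.loads(row["qvalues"]),
        "dviews": int(row["dviews"]),
        # Downloads are only reported by some portals; empty means unknown.
        "downloads": int(row["downloads"]) if row["downloads"] else None,
        "engine": ENGINES[int(row["engine"])],
        "admindomain": DOMAINS[int(row["admindomain"])],
    }


# Build a one-row stand-in for a real file; all values are invented.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerow([0, "portal-a", "ds-1", "2019-12-19", "{}", "0.5",
                 json.dumps({"metric_a": 0.75}), "2019-12-20", "120", "",
                 1, 2])
buffer.seek(0)

records = [parse_row(row) for row in csv.DictReader(buffer)]
print(records[0]["engine"], records[0]["qvalues"])
```

With the real files, the metadata cells may be large; if the csv module rejects a row for that reason, raising the limit with csv.field_size_limit is one possible workaround.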

[1] Neumaier, S.; Umbrich, J.; Polleres, A. Automated Quality Assessment of Metadata Across Open Data Portals. J. Data and Information Quality 2016, 8, 2:1–2:29. doi:10.1145/2964909
