Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2024.
https://www.marketresearchforecast.com/privacy-policy
The data collection and labeling market is experiencing robust growth, fueled by the escalating demand for high-quality training data in artificial intelligence (AI) and machine learning (ML) applications. The market, estimated at $15 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033), reaching approximately $75 billion by 2033. This expansion is primarily driven by the increasing adoption of AI across diverse sectors, including healthcare (medical image analysis, drug discovery), automotive (autonomous driving systems), finance (fraud detection, risk assessment), and retail (personalized recommendations, inventory management). The rising complexity of AI models and the need for more diverse and nuanced datasets are significant contributing factors to this growth. Furthermore, advancements in data annotation tools and techniques, such as active learning and synthetic data generation, are streamlining the data labeling process and making it more cost-effective. However, challenges remain. Data privacy concerns and regulations like GDPR necessitate robust data security measures, adding to the cost and complexity of data collection and labeling. The shortage of skilled data annotators also hinders market growth, necessitating investments in training and upskilling programs. Despite these restraints, the market’s inherent potential, coupled with ongoing technological advancements and increased industry investments, ensures sustained expansion in the coming years. Geographic distribution shows strong concentration in North America and Europe initially, but Asia-Pacific is poised for rapid growth due to increasing AI adoption and the availability of a large workforce. This makes strategic partnerships and global expansion crucial for market players aiming for long-term success.
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
Overview
With extensive experience in speech recognition, Nexdata has a resource pool covering more than 50 countries and regions. Our linguist team works closely with clients to assist them with dictionary and text corpus construction, speech quality inspection, linguistics consulting, and more.
Our Capacity
-Global Resources: Global resources covering hundreds of languages worldwide
-Compliance: All the Machine Learning (ML) Data are collected with proper authorization
-Quality: Multiple rounds of quality inspections ensure high quality data output
-Secure Implementation: An NDA is signed to guarantee secure implementation, and Machine Learning (ML) Data is destroyed upon delivery.
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime: the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).

Certain machine learning based methods, such as those based on deep learning, are known to require very large datasets for training. The lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in previous years aimed at providing large scale datasets to TREC and at creating a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.

As in previous years, one of the main goals of the track in 2023 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?

The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
https://www.archivemarketresearch.com/privacy-policy
The global data collection and labeling market is experiencing robust growth, driven by the escalating demand for high-quality training data to fuel the advancements in artificial intelligence (AI) and machine learning (ML). This market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an impressive $70 billion by 2033. This significant expansion is fueled by several key factors. The increasing adoption of AI across diverse sectors, including IT, automotive, BFSI (Banking, Financial Services, and Insurance), healthcare, and retail and e-commerce, is a primary driver. Furthermore, the growing complexity of AI models necessitates larger and more diverse datasets, thereby increasing the demand for professional data labeling services. The emergence of innovative data annotation tools and techniques further contributes to market growth. However, challenges remain, including the high cost of data collection and labeling, data privacy concerns, and the need for skilled professionals capable of handling diverse data types. The market segmentation highlights the significant contributions from various sectors. The IT sector leads in adoption, followed closely by the automotive and BFSI sectors. Healthcare and retail/e-commerce are also exhibiting rapid growth due to the increasing reliance on AI-powered solutions for improved diagnostics, personalized medicine, and enhanced customer experiences. Geographically, North America currently holds a substantial market share, followed by Europe and Asia Pacific. However, the Asia Pacific region is poised for the fastest growth due to its large and rapidly developing digital economy and increasing government initiatives promoting AI adoption. Key players like Reality AI, Scale AI, and Labelbox are shaping the market landscape through continuous innovation and strategic acquisitions. The market's future trajectory will be significantly influenced by advancements in automation technologies, improvements in data annotation methodologies, and the growing awareness of the importance of high-quality data for successful AI deployments.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains key characteristics about the data described in the Data Descriptor "A cone-beam X-ray computed tomography data collection designed for machine learning". Contents:
1. human readable metadata summary table in CSV format
2. machine readable metadata file in JSON format
Versioning Note: Version 2 was generated when the metadata format was updated from JSON to JSON-LD. This was an automatic process that changed only the format, not the contents, of the metadata.
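For reference, a minimal sketch of loading the two metadata files side by side; the file names metadata_summary.csv and metadata.jsonld are placeholders for the actual files in this record:

import csv
import json

# Placeholder file names; substitute the actual files shipped with the record.
with open("metadata_summary.csv", newline="", encoding="utf-8") as f:
    summary_rows = list(csv.DictReader(f))   # human readable summary table

with open("metadata.jsonld", encoding="utf-8") as f:
    metadata = json.load(f)                  # machine readable JSON-LD record

print(f"{len(summary_rows)} summary rows")
print("JSON-LD context:", metadata.get("@context"))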
https://dataintelo.com/privacy-and-policy
The global data collection and labeling market size was USD 27.1 Billion in 2023 and is likely to reach USD 133.3 Billion by 2032, expanding at a CAGR of 22.4% during 2024–2032. The market growth is attributed to the increasing demand for high-quality labeled datasets to train artificial intelligence and machine learning algorithms across various industries.
Growing adoption of AI in e-commerce is projected to drive the market in the assessment year. E-commerce platforms rely on high-quality images to showcase products effectively and improve the online shopping experience for customers. Accurately labeled images enable better product categorization and search optimization, driving higher conversion rates and customer engagement.
Rising adoption of AI in the financial sector is a significant factor boosting the need for data collection and labeling services for tasks such as fraud detection, risk assessment, and algorithmic trading. Financial institutions leverage labeled datasets to train AI models to analyze vast amounts of transactional data, identify patterns, and detect anomalies indicative of fraudulent activity.
The use of artificial intelligence is revolutionizing the way labeled datasets are created and utilized. With the advancements in AI technologies, such as computer vision and natural language processing, the demand for accurately labeled datasets has surged across various industries.
AI algorithms are increasingly being leveraged to automate and streamline the data labeling process, reducing the manual effort required and improving efficiency. For instance,
In April 2022, Encord, a startup, introduced its beta version of CordVision, an AI-assisted labeling application that inten
Nexdata is equipped with professional recording equipment, has a resource pool covering 70+ countries and regions, and provides various types of speech recognition data collection services for Machine Learning (ML) Data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains samples 1-8 from the data collection described in
Henri Der Sarkissian, Felix Lucka, Maureen van Eijnatten, Giulia Colacicco, Sophia Bethany Coban, Kees Joost Batenburg, "A Cone-Beam X-Ray CT Data Collection Designed for Machine Learning", Sci Data 6, 215 (2019). https://doi.org/10.1038/s41597-019-0235-y or arXiv:1905.04787 (2019)
Abstract:
"Unlike previous works, this open data collection consists of X-ray cone-beam (CB) computed tomography (CT) datasets specifically designed for machine learning applications and high cone-angle artefact reduction: Forty-two walnuts were scanned with a laboratory X-ray setup to provide not only data from a single object but from a class of objects with natural variability. For each walnut, CB projections on three different orbits were acquired to provide CB data with different cone angles as well as being able to compute artefact-free, high-quality ground truth images from the combined data that can be used for supervised learning. We provide the complete image reconstruction pipeline: raw projection data, a description of the scanning geometry, pre-processing and reconstruction scripts using open software, and the reconstructed volumes. Due to this, the dataset can not only be used for high cone-angle artefact reduction but also for algorithm development and evaluation for other tasks, such as image reconstruction from limited or sparse-angle (low-dose) scanning, super resolution, or segmentation."
The scans are performed using a custom-built, highly flexible X-ray CT scanner, the FleX-ray scanner, developed by XRE nv and located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. The general purpose of the FleX-ray Lab is to conduct proof of concept experiments directly accessible to researchers in the field of mathematics and computer science. The scanner consists of a cone-beam microfocus X-ray point source that projects polychromatic X-rays onto a 1536-by-1944 pixel, 14-bit flat panel detector (Dexella 1512NDT) and a rotation stage in-between, upon which a sample is mounted. All three components are mounted on translation stages which allow them to move independently from one another.
Please refer to the paper for all further technical details.
The complete data set can be found via the following links: 1-8, 9-16, 17-24, 25-32, 33-37, 38-42
The corresponding Python scripts for loading, pre-processing and reconstructing the projection data in the way described in the paper can be found on GitHub.
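A minimal sketch of the kind of pre-processing those scripts perform, assuming the projections are TIFF images readable with imageio; the file names below are placeholders rather than the dataset's actual naming convention:

import numpy as np
import imageio.v2 as imageio

# Placeholder file names; the real naming convention is defined by the
# scripts accompanying the dataset.
dark = imageio.imread("dark_field.tif").astype(np.float32)      # dark-field frame
flat = imageio.imread("flat_field.tif").astype(np.float32)      # flat-field frame
proj = imageio.imread("projection_000000.tif").astype(np.float32)  # raw projection

# Standard flat/dark-field correction followed by the negative log transform
# used in transmission tomography.
transmission = (proj - dark) / np.clip(flat - dark, 1e-6, None)
log_projection = -np.log(np.clip(transmission, 1e-6, None))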
For more information or guidance in using this dataset, please get in touch with
https://dataintelo.com/privacy-and-policy
The global data collection software market size is anticipated to significantly expand from USD 1.8 billion in 2023 to USD 4.2 billion by 2032, exhibiting a CAGR of 10.1% during the forecast period. This remarkable growth is fueled by the increasing demand for data-driven decision-making solutions across various industries. As organizations continue to recognize the strategic value of harnessing vast amounts of data, the need for sophisticated data collection tools becomes more pressing. The growing integration of artificial intelligence and machine learning within software solutions is also a critical factor propelling the market forward, enabling more accurate and real-time data insights.
One major growth factor for the data collection software market is the rising importance of real-time analytics. In an era where time-sensitive decisions can define business success, the capability to gather and analyze data in real-time is invaluable. This trend is particularly evident in sectors like healthcare, where prompt data collection can impact patient care, and in retail, where immediate insights into consumer behavior can enhance customer experience and drive sales. Additionally, the proliferation of the Internet of Things (IoT) has further accelerated the demand for data collection software, as connected devices produce a continuous stream of data that organizations must manage efficiently.
The digital transformation sweeping across industries is another crucial driver of market growth. As businesses endeavor to modernize their operations and customer interactions, there is a heightened demand for robust data collection solutions that can seamlessly integrate with existing systems and infrastructure. Companies are increasingly investing in cloud-based data collection software to improve scalability, flexibility, and accessibility. This shift towards cloud solutions is not only enabling organizations to reduce IT costs but also to enhance collaboration by making data more readily available across different departments and geographies.
The intensified focus on regulatory compliance and data protection is also shaping the data collection software market. With the introduction of stringent data privacy regulations such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States, organizations are compelled to adopt data collection practices that ensure compliance and protect customer information. This necessitates the use of sophisticated software capable of managing data responsibly and transparently, thereby fueling market growth. Moreover, the increasing awareness among businesses about the potential financial and reputational risks associated with data breaches is prompting the adoption of secure data collection solutions.
The data collection software market can be segmented into software and services, each playing a pivotal role in the ecosystem. The software component remains the bedrock of this market, providing the essential tools and platforms that enable organizations to collect, store, and analyze data effectively. The software solutions offered vary in complexity and functionality, catering to different organizational needs ranging from basic data entry applications to advanced analytics platforms that incorporate AI and machine learning capabilities. The demand for such sophisticated solutions is on the rise as organizations seek to harness data not just for operational purposes but for strategic insights as well.
The services segment encompasses various offerings that support the deployment and optimization of data collection software. These services include consulting, implementation, training, and maintenance, all crucial for ensuring that the software operates efficiently and meets the evolving needs of the user. As the market evolves, there is an increasing emphasis on offering customized services that address specific industry requirements, thereby enhancing the overall value proposition for clients. The services segment is expected to grow steadily as businesses continue to seek external expertise to complement their internal capabilities, particularly in areas such as data analytics and cybersecurity.
Integration services have become particularly important as organizations strive to create seamless workflows that incorporate new data collection solutions with existing IT infrastructure. This need for integration is driven by the growing complexity of enterprise IT environments, where disparate systems and applications must wo
Imagery and Footage Data Collection | Annotation & Labelling services for Artificial Intelligence, Machine Learning and Computer Vision projects at any scale.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
IntelligentMonitor: Empowering DevOps Environments With Advanced Monitoring and Observability aims to improve monitoring and observability in complex, distributed DevOps environments by leveraging machine learning and data analytics. This repository contains a sample implementation of the IntelligentMonitor system proposed in the research paper, presented and published as part of the 11th International Conference on Information Technology (ICIT 2023).
If you use this dataset and code, or any modified part of it, in any publication, please cite the following paper:
P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.
For any questions and research queries - please reach out via Email.
Abstract - In the dynamic field of software development, DevOps has become a critical tool for enhancing collaboration, streamlining processes, and accelerating delivery. However, monitoring and observability within DevOps environments pose significant challenges, often leading to delayed issue detection, inefficient troubleshooting, and compromised service quality. These issues stem from DevOps environments' complex and ever-changing nature, where traditional monitoring tools often fall short, creating blind spots that can conceal performance issues or system failures. This research addresses these challenges by proposing an innovative approach to improve monitoring and observability in DevOps environments. Our solution, IntelligentMonitor, leverages real-time data collection, intelligent analytics, and automated anomaly detection powered by advanced technologies such as machine learning and artificial intelligence. The experimental results demonstrate that IntelligentMonitor effectively manages data overload, reduces alert fatigue, and improves system visibility, thereby enhancing performance and reliability. For instance, the average CPU usage across all components showed a decrease of 9.10%, indicating improved CPU efficiency. Similarly, memory utilization and network traffic showed an average increase of 7.33% and 0.49%, respectively, suggesting more efficient use of resources. By providing deep insights into system performance and facilitating rapid issue resolution, this research contributes to the DevOps community by offering a comprehensive solution to one of its most pressing challenges. This fosters more efficient, reliable, and resilient software development and delivery processes.
Components
The key components that would need to be implemented are:
Implementation Details
The core of the implementation would involve the following (a sketch of the anomaly detection step follows below):
- Setting up the data collection pipelines.
- Building and training anomaly detection ML models on historical data.
- Developing a real-time data processing pipeline.
- Creating an alerting framework that ties into the ML models.
- Building visualizations and dashboards.
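As a rough sketch of the anomaly detection step, the following example (using scikit-learn's IsolationForest on a hypothetical metrics file; the file name and column names are placeholders, not the ones used in the paper) flags unusual resource usage samples:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical historical metrics file; columns are placeholders.
history = pd.read_csv("metrics_history.csv")           # columns: cpu, memory, network
features = history[["cpu", "memory", "network"]].to_numpy()

# Fit an unsupervised anomaly detector on historical behaviour.
detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(features)

# Score a batch of live samples; -1 marks an anomaly worth alerting on.
live = np.array([[85.0, 70.2, 120.5],
                 [12.3, 40.1, 30.7]])
for sample, label in zip(live, detector.predict(live)):
    if label == -1:
        print("ALERT: anomalous sample", sample)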
The code would need to handle scaled-out, distributed execution for production environments.
Proper code documentation, logging, and testing would be added throughout the implementation.
Usage Examples
Usage examples could include:
References
The implementation would follow the details provided in the original research paper: P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.
Any additional external libraries or sources used would be properly cited.
Tags - DevOps, Software Development, Collaboration, Streamlini...
Data Description
The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings, consisting of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

Data Collection and Generation Procedures
The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

Experimental Sessions
Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:
- News Reading – Students read projected or device-displayed news.
- Brainstorming Session – Idea generation for problem-solving.
- Lecture – Passive listening to an instructor-led session.
- Information Organization – Synthesizing information from different sources.
- Lecture Test – Assessment of lecture content via mobile devices.
- Individual Presentations – Students present their projects.
- Knowledge Test – Conducted using Kahoot.
- Robotics Experimentation – Hands-on session with robotics.
- MTINY Activity Design – Development of educational activities with computational thinking.

Technical Specifications
- RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
- Frame Rate: 9-10 FPS depending on the setup.
- Smartwatch Sensors: Collected heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1–100 Hz.

Data Organization and Formats
The dataset follows a structured directory format: /groupX/experimentY/subjectZ.zip
Each subject-specific folder contains:
- images/ (individual facial images)
- watch_sensors/ (sensor readings in JSON format)
- labels/ (engagement & emotion annotations)
- metadata/ (subject demographics & session details)

Annotations and Labeling
Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

Missing Data and Data Quality
- Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
- Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
- Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

Data Processing Methods
To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

File Formats and Accessibility
- Images: Stored in standard JPEG format.
- Sensor Data: Provided as structured JSON files.
- Labels: Available as CSV files with timestamps.
The dataset is publicly available under the CC-BY license and can be accessed along with the necessary processing scripts via the DIPSER GitHub repository.

Potential Errors and Limitations
- Due to camera angles, some student movements may be out of frame in collaborative sessions.
- Lighting conditions vary slightly across experiments.
- Sensor latency variations are minimal but exist due to embedded device constraints.

Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
  title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
  year={2025},
  eprint={2502.20209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20209},
}

Usage and Reproducibility
Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
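A minimal sketch of loading one subject's annotations and sensor readings, assuming placeholder file names inside the labels/ and watch_sensors/ folders (the actual names may differ):

import json
import pandas as pd
from pathlib import Path

# Placeholder paths following the /groupX/experimentY/subjectZ layout described
# above; the file names inside each folder are assumptions.
subject_dir = Path("group1/experiment1/subject1")

labels = pd.read_csv(subject_dir / "labels" / "engagement.csv")   # timestamped labels
with open(subject_dir / "watch_sensors" / "heart_rate.json", encoding="utf-8") as f:
    heart_rate = json.load(f)

print(labels.head())
print("heart-rate samples:", len(heart_rate))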
https://www.verifiedmarketresearch.com/privacy-policy/
Synthetic Data Generation Market size was valued at USD 0.4 Billion in 2024 and is projected to reach USD 9.3 Billion by 2032, growing at a CAGR of 46.5% from 2026 to 2032.
The Synthetic Data Generation Market is driven by the rising demand for AI and machine learning, where high-quality, privacy-compliant data is crucial for model training. Businesses seek synthetic data to overcome real-data limitations, ensuring security, diversity, and scalability without regulatory concerns. Industries like healthcare, finance, and autonomous vehicles increasingly adopt synthetic data to enhance AI accuracy while complying with stringent privacy laws.
Additionally, cost efficiency and faster data availability fuel market growth, reducing dependency on expensive, time-consuming real-world data collection. Advancements in generative AI, deep learning, and simulation technologies further accelerate adoption, enabling realistic synthetic datasets for robust AI model development.
https://www.archivemarketresearch.com/privacy-policy
The U.S. Data Collection And Labeling Market size was valued at USD 855.0 million in 2023 and is projected to reach USD 3964.16 million by 2032, exhibiting a CAGR of 24.5% during the forecast period. The US data collection and labeling market covers the process of gathering and labeling data for machine learning, artificial intelligence, and other data-driven applications. The market serves sectors including retail, healthcare, automotive, and finance by supplying labeled data that is critical for training and improving the models used in AI and decision-making. Primary applications include image and speech recognition, self-driving cars, and predictive analytics. Current trends point toward greater automation of processes, the use of highly specialized annotation tools, and continued demand for specialized data labeling services. The market is also seeing artificial intelligence incorporated to automate several data labeling tasks. Recent developments include: In July 2022, IBM announced the acquisition of Databand.ai to augment its software portfolio across AI, data and automation; Databand.ai was IBM's fifth acquisition in 2022, signifying the latter's commitment to hybrid cloud and AI skills and capabilities. In June 2022, Oracle completed the acquisition of Cerner as the Austin-based company gears up to ramp up its cloud business in the hospital and health system landscape.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provides a collection of behaviour biometrics data (commonly known as Keyboard, Mouse and Touchscreen (KMT) dynamics). The data was collected for use in a FinTech research project undertaken by academics and researchers at the Computer Science Department, Edge Hill University, United Kingdom. The project, called CyberSignature, uses KMT dynamics data to distinguish between legitimate card owners and fraudsters. An application was developed with a graphical user interface (GUI) similar to a standard online card payment form, including fields for card type, name, card number, card verification code (CVC) and expiry date. User KMT dynamics were then captured while they entered fictitious card information on the GUI application.
The dataset consists of 1,760 KMT dynamic instances collected over 88 user sessions on the GUI application. Each user session involves 20 iterations of data entry in which the user is assigned fictitious card details (drawn at random from a pool) to enter 10 times, and is subsequently presented with 10 additional card details, each to be entered once. The 10 additional card details are drawn from a pool that has been assigned, or is to be assigned, to other users. A KMT data instance is collected during each data entry iteration. Thus, a total of 20 KMT data instances (i.e., 10 legitimate and 10 illegitimate) were collected during each user session on the GUI application.
The raw dataset is stored in .json format within 88 separate files. The root folder, named `behaviour_biometrics_dataset', consists of two sub-folders, `raw_kmt_dataset' and `feature_kmt_dataset', and a Jupyter notebook file (`kmt_feature_classification.ipynb'). Their folder and file contents are described below:

-- `raw_kmt_dataset': this folder contains 88 files, each named `raw_kmt_user_n.json', where n is a number from 0001 to 0088. Each file contains 20 instances of KMT dynamics data corresponding to a given fictitious card, and the data instances are equally split between legitimate (n = 10) and illegitimate (n = 10) classes. The legitimate class corresponds to KMT dynamics captured from the user assigned to the card detail, while the illegitimate class corresponds to KMT dynamics data collected from other users entering the same card detail.

-- `feature_kmt_dataset': this folder contains two sub-folders, namely `feature_kmt_json' and `feature_kmt_xlsx'. Each folder contains 88 files (of the relevant format: .json or .xlsx), each named `feature_kmt_user_n', where n is a number from 0001 to 0088. Each file contains 20 instances of features extracted from the corresponding `raw_kmt_user_n' file, including the class labels (legitimate = 1 or illegitimate = 0).

-- `kmt_feature_classification.ipynb': this file contains Python code necessary to generate features from the raw KMT files and apply a simple machine learning classification task to generate results. The code is designed to run with minimal effort from the user.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a published version of the WGMLEARN literature collection currently managed as a Zotero group library. That library is managed and curated by members of WGMLEARN and aims to be a collection of all the published works at the intersection of machine learning and marine science. The Zotero library is continuously updated, but a static instance of all its contents from May 2023 can be downloaded here for use in reference management software. Custom keywords are included with each item; these allow for classification by data type (data:*), machine learning task (task:*), and algorithm (method:*). Other keywords are included for information but they are not guaranteed to be applied consistently.
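For illustration, a minimal sketch of tallying the prefixed keywords from a CSV export of the library; the file name and the "Manual Tags" column are assumptions about the export format, not guaranteed by this collection:

import csv
from collections import Counter

# Assumes the group library has been exported to CSV; Zotero's CSV export
# typically places tags in a "Manual Tags" column separated by semicolons,
# but check the actual export before relying on these column names.
task_counts = Counter()
with open("wgmlearn_library.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        tags = [t.strip() for t in row.get("Manual Tags", "").split(";") if t.strip()]
        task_counts.update(t for t in tags if t.startswith("task:"))

# Summarise how many items carry each machine learning task keyword.
for tag, count in task_counts.most_common():
    print(tag, count)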
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
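As a rough illustration of the workflow behind such a datasheet, the following minimal sketch (assuming the pgmpy library is available) generates data from a toy ground-truth network, runs hill-climbing structure learning with a BIC score rather than the paper's OrderMCMC/qNML implementation, and compares the recovered edges against the truth:

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.sampling import BayesianModelSampling
from pgmpy.estimators import HillClimbSearch, BicScore

# Toy ground-truth DAG (A -> C <- B); a Causal Datasheet would repeat this kind
# of experiment over synthetic networks mimicking the real dataset's properties.
truth = BayesianNetwork([("A", "C"), ("B", "C")])
truth.add_cpds(
    TabularCPD("A", 2, [[0.7], [0.3]]),
    TabularCPD("B", 2, [[0.6], [0.4]]),
    TabularCPD("C", 2,
               [[0.9, 0.5, 0.4, 0.1],
                [0.1, 0.5, 0.6, 0.9]],
               evidence=["A", "B"], evidence_card=[2, 2]),
)

# Sample a synthetic dataset of a candidate size, then run structure learning.
data = BayesianModelSampling(truth).forward_sample(size=2000)
learned = HillClimbSearch(data).estimate(scoring_method=BicScore(data))

# Compare recovered edges against the ground truth to estimate expected recall.
true_edges, found_edges = set(truth.edges()), set(learned.edges())
print("recovered:", found_edges & true_edges, "missed:", true_edges - found_edges)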
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is a collection of articles indexed in the Web of Science database, used for a bibliometric article on the topic "Data Collection and Analysis Systems Using Machine Learning in Internet of Things". The main idea is to identify articles related to the theme through bibliometric techniques and perform analyses using tools such as VOSviewer and CiteNetExplorer to support the state-of-the-art review.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
2024.