AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites
Overview
Unlock the next generation of agentic commerce and automated shopping experiences with this comprehensive dataset of meticulously annotated checkout flows, sourced directly from leading retail, restaurant, and marketplace websites. Designed for developers, researchers, and AI labs building large language models (LLMs) and agentic systems capable of online purchasing, this dataset captures the real-world complexity of digital transactions—from cart initiation to final payment.
Key Features
Breadth of Coverage: Over 10,000 unique checkout journeys across hundreds of top e-commerce, food delivery, and service platforms, including Walmart, Target, Kroger, Whole Foods, Uber Eats, Instacart, Shopify-powered sites, and more.
Actionable Annotation: Every flow is broken down into granular, step-by-step actions, complete with timestamped events, UI context, form field details, validation logic, and response feedback. Each step includes:
Page state (URL, DOM snapshot, and metadata)
User actions (clicks, taps, text input, dropdown selection, checkbox/radio interactions)
System responses (AJAX calls, error/success messages, cart/price updates)
Authentication and account linking steps where applicable
Payment entry (card, wallet, alternative methods)
Order review and confirmation
Multi-Vertical, Real-World Data: Flows sourced from a wide variety of verticals and real consumer environments, not just demo stores or test accounts. Includes complex cases such as multi-item carts, promo codes, loyalty integration, and split payments.
Structured for Machine Learning: Delivered in standard formats (JSONL, CSV, or your preferred schema), with every event mapped to action types, page features, and expected outcomes. Optional HAR files and raw network request logs provide an extra layer of technical fidelity for action modeling and RLHF pipelines (an illustrative parsing sketch appears just before the "Why This Dataset?" section below).
Rich Context for LLMs and Agents: Every annotation includes both human-readable and model-consumable descriptions:
“What the user did” (natural language)
“What the system did in response”
“What a successful action should look like”
Error/edge case coverage (invalid forms, out-of-stock (OOS) items, address/payment errors)
Privacy-Safe & Compliant: All flows are depersonalized and scrubbed of PII. Sensitive fields (like credit card numbers, user addresses, and login credentials) are replaced with realistic but synthetic data, ensuring compliance with privacy regulations.
Each flow tracks the user journey from cart to payment to confirmation, including:
Adding/removing items
Applying coupons or promo codes
Selecting shipping/delivery options
Account creation, login, or guest checkout
Inputting payment details (card, wallet, Buy Now Pay Later)
Handling validation errors or OOS scenarios
Order review and final placement
Confirmation page capture (including order summary details)
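As a hedged illustration of how these annotated steps might be consumed, the sketch below parses a JSONL export and groups events by flow. The file name and the field names (flow_id, action, system_response) are hypothetical placeholders rather than the dataset's confirmed schema.

import json
from collections import defaultdict

def load_flows(path: str) -> dict:
    """Group step records from a JSONL export by their flow identifier."""
    flows = defaultdict(list)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                step = json.loads(line)
                # 'flow_id' is an assumed key; adjust to the delivered schema.
                flows[step.get("flow_id", "unknown")].append(step)
    return flows

if __name__ == "__main__":
    flows = load_flows("checkout_flows.jsonl")  # placeholder path
    for flow_id, steps in list(flows.items())[:3]:
        print(f"Flow {flow_id}: {len(steps)} steps")
        for step in steps:
            outcome = (step.get("system_response") or {}).get("status")
            print("  ", step.get("action"), "->", outcome)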
Why This Dataset?
Building LLMs, agentic shopping bots, or e-commerce automation tools demands more than just page screenshots or API logs. You need deeply contextualized, action-oriented data that reflects how real users interact with the complex, ever-changing UIs of digital commerce. Our dataset uniquely captures:
The full intent-action-outcome loop
Dynamic UI changes, modals, validation, and error handling
Nuances of cart modification, bundle pricing, delivery constraints, and multi-vendor checkouts
Mobile vs. desktop variations
Diverse merchant tech stacks (custom, Shopify, Magento, BigCommerce, native apps, etc.)
Use Cases
LLM Fine-Tuning: Teach models to reason through step-by-step transaction flows, infer next-best actions, and generate robust, context-sensitive prompts for real-world ordering (a hedged example-construction sketch follows at the end of this list).
Agentic Shopping Bots: Train agents to navigate web/mobile checkouts autonomously, handle edge cases, and complete real purchases on behalf of users.
Action Model & RLHF Training: Provide reinforcement learning pipelines with ground truth “what happens if I do X?” data across hundreds of real merchants.
UI/UX Research & Synthetic User Studies: Identify friction points, bottlenecks, and drop-offs in modern checkout design by replaying flows and testing interventions.
Automated QA & Regression Testing: Use realistic flows as test cases for new features or third-party integrations.
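As a hedged companion to the fine-tuning and action-model use cases above, the sketch below turns a single annotated step into a prompt/target pair for next-action prediction. The field names (goal, page, action, system_response) and the prompt template are illustrative assumptions, not the dataset's prescribed format.

import json

def step_to_example(step: dict) -> dict:
    """Build a next-action prediction example from one annotated checkout step."""
    page_url = (step.get("page") or {}).get("url", "unknown")
    prompt = (
        "You are completing an online checkout.\n"
        f"Goal: {step.get('goal', 'place the order')}\n"
        f"Current page: {page_url}\n"
        "What is the next action?"
    )
    # The target pairs the recorded action with its observed outcome.
    target = json.dumps({
        "action": step.get("action"),
        "expected_outcome": (step.get("system_response") or {}).get("status"),
    })
    return {"prompt": prompt, "completion": target}

# Toy record shaped like the assumed schema.
example = step_to_example({
    "goal": "order two items with a promo code",
    "page": {"url": "https://example.com/cart"},
    "action": "click:apply_promo",
    "system_response": {"status": "success"},
})
print(example["prompt"])
print(example["completion"])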
What’s Included
10,000+ annotated checkout flows (retail, restaurant, marketplace)
Step-by-step event logs with metadata, DOM, and network context
Natural language explanations for each step and transition
All flows are depersonalized and privacy-compliant
Example scripts for ingesting, parsing, and analyzing the dataset
Flexible licensing for research or commercial use
Sample Categories Covered
Grocery delivery (Instacart, Walmart, Kroger, Target, etc.)
Restaurant takeout/delivery (Ub...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Summary
APIGen-MT is an automated agentic data generation pipeline designed to synthesize verifiable, high-quality, realistic datasets for agentic applications. This dataset was released as part of APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay. Code: https://github.com/apigen-mt/apigen-mt.github.io
The repo contains 5,000 multi-turn trajectories collected by APIGen-MT. This dataset is a subset of the data used to train the xLAM-2 model… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/APIGen-MT-5k.
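For a quick local look at the trajectories, a minimal sketch using the Hugging Face datasets library is shown below; the split name and the field layout are assumptions, since the summary above does not spell out the schema.

from datasets import load_dataset

# Repo id taken from the dataset page above; the "train" split name is assumed.
ds = load_dataset("Salesforce/APIGen-MT-5k", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # first multi-turn trajectory, whatever fields it carries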
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Data Generation Demo — UK Retail Dataset
Welcome to this synthetic data generation demo repository by Syncora.ai. This project showcases how to generate synthetic data using real-world tabular structures, demonstrated on a UK retail dataset with columns such as:
Country
CustomerID
UnitPrice
InvoiceDate
Quantity
StockCode
This dataset is designed for LLM training and AI development, enabling developers to work with privacy-safe, high-quality… See the full description on the dataset page: https://huggingface.co/datasets/syncora/uk_retail_store_synthetic_dataset.
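As a hedged illustration of working with this tabular structure, the sketch below builds a tiny frame with the listed columns and computes per-country revenue; the rows are made-up placeholders, and the real records live on the Hugging Face page above.

import pandas as pd

# Toy rows reusing the column names listed above; values are invented.
df = pd.DataFrame([
    {"Country": "United Kingdom", "CustomerID": 17850, "UnitPrice": 2.55,
     "InvoiceDate": "2024-01-05 09:32", "Quantity": 6, "StockCode": "85123A"},
    {"Country": "France", "CustomerID": 12583, "UnitPrice": 3.39,
     "InvoiceDate": "2024-01-05 10:14", "Quantity": 4, "StockCode": "71053"},
])
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["Revenue"] = df["UnitPrice"] * df["Quantity"]
print(df.groupby("Country")["Revenue"].sum())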
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traditional knowledge graphs of water conservancy project risks have supported risk decision-making. However, they are constrained by limited data modalities and low accuracy in information extraction. A multimodal water conservancy project risk knowledge graph is proposed in this study, along with a synergistic strategy involving multimodal large language models. Risk decision-making generation is facilitated through a multi-agent agentic retrieval-augmented generation framework. To enhance visual recognition, a DenseNet-based image classification model is improved by incorporating single-head self-attention and coordinate attention mechanisms. For textual data, risk entities such as locations, components, and events are extracted using a BERT-BiLSTM-CRF architecture. These extracted entities serve as the foundation for constructing the multimodal knowledge graph. To support generation, a multi-agent agentic retrieval-augmented generation mechanism is introduced. This mechanism enhances the reliability and interpretability of risk decision-making outputs. In experiments, the enhanced DenseNet model outperforms the original baseline in both precision and recall for image recognition tasks. In risk decision-making tasks, the proposed approach—combining a multimodal knowledge graph with a multi-agent agentic retrieval-augmented generation method—achieves strong performance on BERTScore and ROUGE-L metrics. This work presents a novel perspective for leveraging multimodal knowledge graphs in water conservancy project risk management.
https://www.intelevoresearch.com/privacy-policy
Middle East & Africa Generative AI in Testing market is set to grow from USD 221.08M in 2024 to USD 884.75M by 2034, at a CAGR of 15.35%. Explore trends, drivers, and growth.
https://www.intelevoresearch.com/privacy-policy
Europe Generative AI in Testing market is set to rise from USD 0.21B in 2024 to USD 3.75B by 2034, growing at a CAGR of 34.21%. Explore drivers, trends and opportunities.
💬 Customer Support Conversation Dataset — Powered by Syncora.ai
A free synthetic dataset for chatbot training, LLM fine-tuning, and synthetic data generation research. Created using Syncora.ai’s privacy-safe synthetic data engine, this dataset is ideal for developing, testing, and benchmarking AI customer support systems. It serves as a dataset for both chatbot training and LLM training, offering rich, structured conversation data for real-world simulation.
🌟… See the full description on the dataset page: https://huggingface.co/datasets/syncora/customer_support_conversations_dataset.
🏃 Synthetic Wearable & Activity Dataset — Powered by Syncora.ai
A free dataset for health analytics, activity recognition, synthetic data generation, and LLM training.
🌟 About This Dataset
This dataset contains synthetic wearable fitness records, modeled on signals from devices such as the Apple Watch. All entries are fully synthetic, generated with Syncora.ai’s synthetic data engine, ensuring privacy-safe and bias-aware data.
The dataset provides rich… See the full description on the dataset page: https://huggingface.co/datasets/syncora/fitness-tracker-dataset.
https://www.intelevoresearch.com/privacy-policy
Global Generative AI in Testing market is set to grow from USD 0.71B in 2024 to USD 14.15B by 2034, at a CAGR of 34.2% (2025–2034). Explore trends, opportunities and drivers.
🧠 Mental Health Posting Dataset — Synthetic Dataset for LLM & Chatbot Training
Free dataset for mental health research, LLM training, and chatbot development, generated using synthetic data generation techniques to ensure privacy and high fidelity.
🌟 About This Dataset
This dataset contains synthetic mental health survey responses across multiple demographics and occupations. It includes participant-reported stress levels, coping mechanisms, mood swings, and social… See the full description on the dataset page: https://huggingface.co/datasets/syncora/mental_health_survey_dataset.
https://www.intelevoresearch.com/privacy-policy
North America Generative AI in Testing market is set to grow from USD 0.31B in 2024 to USD 5.8B by 2034, at a CAGR of 33.91%. Explore trends, drivers, and opportunities.
https://www.technavio.com/content/privacy-notice
The AI procurement intelligence market size is forecast to increase by USD 14.5 billion, at a CAGR of 42.9% between 2024 and 2029.
Enterprises are increasingly adopting AI procurement intelligence to enhance operational efficiency and achieve significant cost savings in response to persistent economic pressures. This drive for strategic cost management is met by the proliferation of generative AI and hyper-automation, which are being integrated into advanced procurement software. These technologies are enabling a shift toward predictive sourcing functions, allowing teams to forecast market conditions and automate complex decision-making processes. By leveraging natural language prompts and cognitive capabilities, these tools make sophisticated data analysis more accessible, empowering procurement professionals to focus on higher-value activities like negotiation and strategic supplier relationship management. The focus is on creating autonomous and strategic sourcing capabilities through industrial AI software.
However, realizing the full potential of these advanced systems is often constrained by foundational issues related to data integrity and accessibility. Many organizations grapple with a fragmented data landscape, where procurement information is trapped in disparate silos with inconsistent taxonomies, making the creation of a unified data view a significant hurdle. Without meticulous data cleansing and normalization, the insights generated by AI algorithms can be skewed or misleading, which erodes user trust and undermines the business case for the technology. This highlights the importance of robust AI governance tools to manage data quality, security, and integration effectively within the framework of agentic AI for data engineering.
What will be the Size of the AI Procurement Intelligence Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019 - 2023 and forecasts 2025-2029 - in the full report.
The market is defined by a strategic shift toward proactive risk mitigation and enhanced supply chain resilience. Organizations are leveraging predictive analytics and real-time monitoring to anticipate disruptions from geopolitical or climate-related events. This move from a reactive to a proactive stance is enabled by AI-powered platforms that provide deep visibility into multi-tier supplier networks. The integration of predictive AI in supply chain systems is becoming standard practice for ensuring business continuity and managing complex global trade dynamics. This focus on foresight and preparedness underscores a fundamental change in procurement strategy.
Operational efficiency is being transformed through procurement workflow automation and the adoption of hyper-automation. These technologies are streamlining routine tasks like invoice processing and purchase order generation, freeing up procurement professionals for more strategic activities. The use of generative AI is also changing user interaction via natural language prompts, making complex data analysis more accessible. This focus on intelligent automation and AI in project management helps organizations reduce sourcing cycle times and improve overall productivity.
Supplier relationship management is evolving with the use of sophisticated AI tools for performance evaluation and strategic decision-making. AI-powered platforms assist in supplier discovery and vetting, ensuring that new partners meet rigorous standards for quality and compliance. These systems analyze supplier performance metrics to inform consolidation strategies and negotiation tactics. The ongoing development of AI for sales, from a procurement perspective, allows for more dynamic and data-driven interactions, fostering a collaborative and resilient supplier ecosystem.
How is this AI Procurement Intelligence Industry segmented?
The AI procurement intelligence industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in "USD million" for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Component: Software, Services
Deployment: Cloud-based, On-premises
End-user: Large enterprises, SMEs, Government and public sector
Geography: North America (US, Canada, Mexico), Europe (Germany, UK, France, The Netherlands, Italy, Spain), APAC (China, Japan, India, Australia, South Korea, Indonesia), South America (Brazil, Argentina, Colombia), Middle East and Africa (UAE, South Africa, Turkey), Rest of World (ROW)
By Component Insights
The software segment is estimated to witness significant growth during the forecast period. The software segment forms the core of the market, comprising digital platforms and applications that enable data-driven procurement. These solutions, predominantly delivered via a Software-as-a-Service model, provide func
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic AI Developer Productivity Dataset — Behavioral + Cognitive Simulation
A synthetic data generation resource for modeling behavioral and cognitive dynamics in developers.
📘 About This Dataset
This dataset simulates productivity data from AI-assisted software developers. It blends behavioral signals, physiological inputs, and productivity metrics to explore the nuanced relationships between deep work, distractions, caffeine, AI usage, and cognitive strain.… See the full description on the dataset page: https://huggingface.co/datasets/syncora/developer-productivity-simulated-behavioral-data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Risk Information Query and Decision Generation Workflow.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
Authors: Shaolei Zhang, Ju Fan*, Meihao Fan, Guoliang Li, Xiaoyong Du
DeepAnalyze is the first agentic LLM for autonomous data science. It can autonomously complete a wide range of data-centric tasks without human intervention, supporting: 🛠 Entire data science pipeline: Automatically perform any data science task such as data preparation, analysis, modeling, visualization, and report generation. 🔍… See the full description on the dataset page: https://huggingface.co/datasets/RUC-DataLab/DataScience-Instruct-500K.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effective animal training depends on well-structured training plans that ensure consistent progress and measurable outcomes. However, the creation of such plans is often time-intensive, repetitive, and detracts from hands-on training. Recent advancements in generative AI powered by large language models (LLMs) provide potential solutions but frequently fail to produce actionable, individualized plans tailored to specific contexts. This limitation is particularly significant given the diverse tasks performed by dogs–ranging from working roles in military and police operations to competitive sports–and the varying training philosophies among practitioners. To address these challenges, a modular agentic workflow framework is proposed, leveraging LLMs while mitigating their shortcomings. By decomposing the training plan generation process into specialized building blocks–autonomous agents that handle subtasks such as structuring progressions, ensuring welfare compliance, and adhering to team-specific standard operating procedures (SOPs)—this approach facilitates the creation of specific, actionable plans. The modular design further allows workflows to be tailored to the unique requirements of individual tasks and philosophies. As a proof of concept, a complete training plan generation workflow is presented, integrating these agents into a cohesive system. This framework prioritizes flexibility and adaptability, empowering trainers to create customized solutions while leveraging generative AI's capabilities. In summary, agentic workflows bridge the gap between cutting-edge technology and the practical, diverse needs of the animal training community. As such, they could form a crucial foundation for advancing computer-assisted animal training methodologies.
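To make the modular idea concrete, here is a toy sketch of composing specialized building blocks into one workflow, under the assumption that each block is simply a function that refines a draft plan; in the paper, each block would wrap an LLM call with its own prompt and checks.

from typing import Callable, Dict, List

Plan = Dict[str, object]
Agent = Callable[[Plan], Plan]

# Toy building blocks standing in for the specialized agents described above.
def structure_progression(plan: Plan) -> Plan:
    plan["sessions"] = [{"week": w, "focus": "to be refined"} for w in range(1, 5)]
    return plan

def check_welfare(plan: Plan) -> Plan:
    plan["welfare_notes"] = ["limit session length", "monitor stress signals"]
    return plan

def apply_sops(plan: Plan) -> Plan:
    plan["sop"] = "team-specific standard operating procedures applied"
    return plan

def run_workflow(goal: str, agents: List[Agent]) -> Plan:
    plan: Plan = {"goal": goal}
    for agent in agents:  # each block refines the draft produced so far
        plan = agent(plan)
    return plan

print(run_workflow("scent detection basics", [structure_progression, check_welfare, apply_sops]))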
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the data presented in Benchmarking Agentic Workflow Generation. Code: https://github.com/zjunlp/WorfBench
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Citation
If you use this dataset, please cite:
@misc{yu2025quasarquantumassemblycode,
  title={QUASAR: Quantum Assembly Code Generation Using Tool-Augmented LLMs via Agentic RL},
  author={Cong Yu and Valter Uotila and Shilong Deng and Qingyuan Wu and Tuo Shi and Songlin Jiang and Lei You and Bo Zhao},
  year={2025},
  eprint={2510.00967},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.00967},
}