Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata: This data product is a unique offering in the realm of AI/ML training data. What sets it apart is the sheer volume and diversity of the dataset, which includes 4.5 million files spanning 20 different categories. These categories range from Animals/Wildlife and The Arts to Technology and Transportation, providing a rich and varied dataset for AI/ML applications.
The data is sourced from Wirestock's platform, where creators upload and sell their photos, videos, and AI art online. This means that the data is not only vast but also constantly updated, ensuring a fresh and relevant dataset for your AI/ML needs. The data is collected in a GDPR-compliant manner, ensuring the privacy and rights of the creators are respected.
The primary use-cases for this data product are numerous. It is ideal for training machine learning models for image recognition, improving computer vision algorithms, and enhancing AI applications in various industries such as retail, healthcare, and transportation. The diversity of the dataset also means it can be used for more niche applications, such as training AI to recognize specific objects or scenes.
This data product fits into Wirestock's broader data offering as a key resource for AI/ML training. Wirestock is a platform for creators to sell their work, and this dataset is a collection of that work. It represents the breadth and depth of content available on Wirestock, making it a valuable resource for any company working with AI/ML.
The core benefits of this dataset are its volume, diversity, and quality. With 4.5 million files, it provides a vast resource for AI training. The diversity of the dataset, spanning 20 categories, ensures a wide range of images for training purposes. The quality of the images is also high, as they are sourced from creators selling their work on Wirestock.
In terms of how the data is collected, creators upload their work to Wirestock, where it is then sold on various marketplaces. This means the data is sourced directly from creators, ensuring a diverse and unique dataset. The data includes both the images themselves and associated metadata, providing additional context for each image.
The different image categories included in this dataset are Animals/Wildlife, The Arts, Backgrounds/Textures, Beauty/Fashion, Buildings/Landmarks, Business/Finance, Celebrities, Education, Emotions, Food Drinks, Holidays, Industrial, Interiors, Nature Parks/Outdoor, People, Religion, Science, Signs/Symbols, Sports/Recreation, Technology, Transportation, Vintage, Healthcare/Medical, Objects, and Miscellaneous. This wide range of categories ensures a diverse dataset that can cater to a variety of AI/ML applications.
This dataset consists of 101 food categories, with 101,000 images in total. For each class, 250 manually reviewed test images are provided as well as 750 training images. The training images were deliberately not cleaned and thus still contain some noise, mostly in the form of intense colors and occasionally wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
To use this dataset:
# Load the Food-101 training split and print a few examples.
import tensorflow_datasets as tfds

ds = tfds.load('food101', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background/Objectives: Advances in artificial intelligence now allow combined use of large language and vision models; however, there has been limited evaluation of their potential in dietary assessment. This data arose from a study that aimed to evaluate the accuracy of ChatGPT-4 in estimating the nutritional content of commonly consumed meals from meal photographs. Methods: Meal photographs (n=114) were uploaded to ChatGPT, which was asked to identify the foods in each meal, estimate their weight, and estimate the nutrient content of the meals for 16 nutrients for comparison with the known values. There were 39 unique meals, each photographed 3 times at 3 different portion sizes, giving rise to 114 photographs. This dataset is an Excel workbook containing four worksheets. The worksheet titled "ChatGPT Foods & Weights" contains the foods identified by ChatGPT in each of the 114 meal photographs, as well as its estimate of the weight of each of those foods. The worksheet titled "Actual Foods & Weights" contains the true foods and weights for each of the meal photographs. The worksheet "ChatGPT Nutrition Estimates" contains ChatGPT's estimates of the nutrition content of each of the 114 meal photographs for 16 different nutrients. The worksheet "Actual Nutrition Content" contains the true nutrition content of the meals in the photographs.
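For orientation, here is a minimal sketch of how the workbook could be loaded and the two nutrition worksheets compared with pandas. The workbook file name and the "Meal ID" join column are assumptions for illustration; only the four worksheet names come from the description above.

# Hypothetical loading and comparison sketch (file name and join key assumed).
import pandas as pd

workbook = "chatgpt_meal_photo_study.xlsx"  # hypothetical file name
estimates = pd.read_excel(workbook, sheet_name="ChatGPT Nutrition Estimates")
actuals = pd.read_excel(workbook, sheet_name="Actual Nutrition Content")

# Align the sheets on an assumed meal identifier and compute the estimation
# error for each of the 16 nutrients.
merged = estimates.merge(actuals, on="Meal ID", suffixes=("_chatgpt", "_actual"))
for nutrient in [c[:-len("_chatgpt")] for c in merged.columns if c.endswith("_chatgpt")]:
    error = merged[f"{nutrient}_chatgpt"] - merged[f"{nutrient}_actual"]
    print(nutrient, "mean absolute error:", error.abs().mean())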
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is part of the following publication at the TransAI 2023 conference: R. Wallsberger, R. Knauer, S. Matzka; "Explainable Artificial Intelligence in Mechanical Engineering: A Synthetic Dataset for Comprehensive Failure Mode Analysis" DOI: http://dx.doi.org/10.1109/TransAI60598.2023.00032
This is the original XAI Drilling dataset, optimized for XAI purposes; it can be used to evaluate the explanations produced by such algorithms. The dataset comprises 20,000 data points, i.e., drilling operations, stored as rows, with 10 features, one binary main failure label, and 4 binary subgroup failure modes stored in columns. The main failure rate is about 5.0% for the whole dataset. The features that constitute this dataset are as follows:
Process time t (s): This feature captures the full duration of each drilling operation, providing insights into efficiency and potential bottlenecks.
Main failure: This binary feature indicates if any significant failure on the drill bit occurred during the drilling process. A value of 1 flags a drilling process that encountered issues, which in this case is true when any of the subgroup failure modes are 1, while 0 indicates a successful drilling operation without any major failures.
Subgroup failures:
Build-up edge failure (215x): Represented as a binary feature, a build-up edge failure indicates the accumulation of material on the cutting edge of the drill bit due to a combination of low cutting speeds and insufficient cooling. A value of 1 signifies the presence of this failure mode, while 0 denotes its absence.
Compression chips failure (344x): This binary feature captures the formation of compressed chips during drilling, resulting from high feed rates, inadequate cooling, or use of an incompatible drill bit. A value of 1 indicates the occurrence of at least two of these three factors, while 0 suggests a smooth drilling operation without compression chips.
Flank wear failure (278x): A binary feature representing wear of the drill bit's flank due to a combination of high feed rates and low cutting speeds. A value of 1 indicates significant flank wear, affecting the drilling operation's accuracy and efficiency, while 0 denotes a wear-free operation.
Wrong drill bit failure (300x): As a binary feature, it indicates the use of an inappropriate drill bit for the material being drilled. A value of 1 signifies a mismatch, leading to potential drilling issues, while 0 indicates correct drill bit usage.
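For quick inspection, a minimal loading sketch is shown below; the CSV file name and column labels are assumptions, since the description does not specify the exact distribution format.

# Hypothetical inspection sketch for the XAI Drilling dataset.
import pandas as pd

df = pd.read_csv("xai_drilling.csv")  # hypothetical file name
print(df.shape)  # expected: 20,000 rows; 10 features + 1 main label + 4 subgroup labels

# Main failure rate (about 5.0% per the description) and subgroup failure counts.
print(df["Main failure"].mean())
for mode in ["Build-up edge failure", "Compression chips failure",
             "Flank wear failure", "Wrong drill bit failure"]:
    print(mode, int(df[mode].sum()))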
AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites Overview
Unlock the next generation of agentic commerce and automated shopping experiences with this comprehensive dataset of meticulously annotated checkout flows, sourced directly from leading retail, restaurant, and marketplace websites. Designed for developers, researchers, and AI labs building large language models (LLMs) and agentic systems capable of online purchasing, this dataset captures the real-world complexity of digital transactions—from cart initiation to final payment.
Key Features
Breadth of Coverage: Over 10,000 unique checkout journeys across hundreds of top e-commerce, food delivery, and service platforms, including but not limited to Walmart, Target, Kroger, Whole Foods, Uber Eats, Instacart, Shopify-powered sites, and more.
Actionable Annotation: Every flow is broken down into granular, step-by-step actions, complete with timestamped events, UI context, form field details, validation logic, and response feedback. Each step includes:
Page state (URL, DOM snapshot, and metadata)
User actions (clicks, taps, text input, dropdown selection, checkbox/radio interactions)
System responses (AJAX calls, error/success messages, cart/price updates)
Authentication and account linking steps where applicable
Payment entry (card, wallet, alternative methods)
Order review and confirmation
Multi-Vertical, Real-World Data: Flows sourced from a wide variety of verticals and real consumer environments, not just demo stores or test accounts. Includes complex cases such as multi-item carts, promo codes, loyalty integration, and split payments.
Structured for Machine Learning: Delivered in standard formats (JSONL, CSV, or your preferred schema), with every event mapped to action types, page features, and expected outcomes. Optional HAR files and raw network request logs provide an extra layer of technical fidelity for action modeling and RLHF pipelines.
Rich Context for LLMs and Agents: Every annotation includes both human-readable and model-consumable descriptions:
“What the user did” (natural language)
“What the system did in response”
“What a successful action should look like”
Error/edge case coverage (invalid forms, out-of-stock (OOS) items, address/payment errors)
Privacy-Safe & Compliant: All flows are depersonalized and scrubbed of PII. Sensitive fields (like credit card numbers, user addresses, and login credentials) are replaced with realistic but synthetic data, ensuring compliance with privacy regulations.
Each flow tracks the user journey from cart to payment to confirmation, including:
Adding/removing items
Applying coupons or promo codes
Selecting shipping/delivery options
Account creation, login, or guest checkout
Inputting payment details (card, wallet, Buy Now Pay Later)
Handling validation errors or OOS scenarios
Order review and final placement
Confirmation page capture (including order summary details)
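Since the flows are delivered as JSONL with one event per line, a minimal parsing sketch might look like the following; the field names ("flow_id", "action", "page", "outcome") are illustrative only, and the real schema is defined by the provider.

# Hypothetical JSONL parsing sketch (field names assumed, not confirmed).
import json

with open("checkout_flows.jsonl", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        event = json.loads(line)
        # Example filter: keep only payment-entry steps for action-model training.
        if event.get("action") == "payment_entry":
            print(event.get("flow_id"), event.get("page"), event.get("outcome"))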
Why This Dataset?
Building LLMs, agentic shopping bots, or e-commerce automation tools demands more than just page screenshots or API logs. You need deeply contextualized, action-oriented data that reflects how real users interact with the complex, ever-changing UIs of digital commerce. Our dataset uniquely captures:
The full intent-action-outcome loop
Dynamic UI changes, modals, validation, and error handling
Nuances of cart modification, bundle pricing, delivery constraints, and multi-vendor checkouts
Mobile vs. desktop variations
Diverse merchant tech stacks (custom, Shopify, Magento, BigCommerce, native apps, etc.)
Use Cases
LLM Fine-Tuning: Teach models to reason through step-by-step transaction flows, infer next-best-actions, and generate robust, context-sensitive prompts for real-world ordering.
Agentic Shopping Bots: Train agents to navigate web/mobile checkouts autonomously, handle edge cases, and complete real purchases on behalf of users.
Action Model & RLHF Training: Provide reinforcement learning pipelines with ground truth “what happens if I do X?” data across hundreds of real merchants.
UI/UX Research & Synthetic User Studies: Identify friction points, bottlenecks, and drop-offs in modern checkout design by replaying flows and testing interventions.
Automated QA & Regression Testing: Use realistic flows as test cases for new features or third-party integrations.
What’s Included
10,000+ annotated checkout flows (retail, restaurant, marketplace)
Step-by-step event logs with metadata, DOM, and network context
Natural language explanations for each step and transition
All flows are depersonalized and privacy-compliant
Example scripts for ingesting, parsing, and analyzing the dataset
Flexible licensing for research or commercial use
Sample Categories Covered
Grocery delivery (Instacart, Walmart, Kroger, Target, etc.)
Restaurant takeout/delivery (Ub...
https://www.marketresearchforecast.com/privacy-policy
The size of the U.S. Machine Learning (ML) Market was valued at USD 4.74 billion in 2023 and is projected to reach USD 43.38 billion by 2032, with an expected CAGR of 37.2% during the forecast period. The U.S. Machine Learning (ML) Market refers to the application and development of machine learning technologies within the United States. Machine learning, a subset of artificial intelligence (AI), involves algorithms and models that allow systems to learn from data, identify patterns, and make decisions or predictions without being explicitly programmed. In the U.S., the ML market is growing rapidly, driven by advancements in computing power, large data sets, and the increasing demand for automation and AI across industries. Key drivers for this market are: Growing Adoption of Mobile Commerce to Augment the Demand for Virtual Fitting Room Tool. Potential restraints include: Lack of Coding Skills Likely to Limit Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.
In the process of migrating data to the current DDL platform, datasets with a large number of variables required splitting into multiple spreadsheets. They should be reassembled by the user to understand the data fully. This is the third spreadsheet of three in the Feed The Future Interim Population-Based Assessment of Cambodia, Modules H-I, Anthropometry and Food Consumed by Children.
https://www.marketreportanalytics.com/privacy-policy
The AI in Food & Beverage market is experiencing explosive growth, projected to reach a market size of $9.68 billion in 2025 and exhibiting a remarkable Compound Annual Growth Rate (CAGR) of 38.30% from 2025 to 2033. This rapid expansion is driven by several key factors. Firstly, increasing demand for enhanced food safety and quality control is pushing adoption of AI-powered solutions for inspection and quality assurance throughout the supply chain. Secondly, the growing need for efficient production and optimized packaging processes is driving the integration of AI-powered automation and predictive maintenance systems. Thirdly, consumer engagement is increasingly leveraging AI through personalized recommendations and targeted marketing campaigns, particularly in the burgeoning e-commerce food sector. The market is segmented by application (food sorting, consumer engagement, quality control and safety compliance, production and packaging, maintenance, other applications) and end-user (hotels and restaurants, food processing industry, beverage industry). North America and Europe currently hold significant market shares, but the Asia-Pacific region is poised for substantial growth fueled by rapid technological advancements and increasing adoption in emerging economies. The presence of established players like Rockwell Automation, ABB, and TOMRA Sorting Solutions, alongside innovative startups, contributes to a dynamic and competitive landscape. The continued growth trajectory is expected to be fueled by ongoing technological advancements in computer vision, machine learning, and deep learning, enabling more sophisticated AI solutions for the food and beverage industry. The increasing availability of large datasets for training AI algorithms will further enhance the accuracy and efficiency of these solutions. However, challenges remain, including the high initial investment costs associated with implementing AI systems and the need for a skilled workforce capable of deploying and maintaining these technologies. Addressing these challenges through strategic partnerships, government incentives, and ongoing technological advancements will be crucial in sustaining the market's impressive growth trajectory throughout the forecast period. Further segmentation analysis reveals a strong preference for AI-powered quality control solutions, driven by stricter regulatory compliance standards and consumer demand for high-quality, safe products. Recent developments include: May 2022: FANUC America, a provider of CNC, robotics, and ROBOMACHINE solutions, introduced the new DR-3iB/6 STAINLESS delta robot for primary food handling and picking and packing primary food products. The new DR-3iB/6 Stainless robot was expected to help companies maximize production efficiencies without compromising food safety. April 2022: Pudu Robotics, the global leader in commercial service robots, unveiled PUDU A1, its first compound delivery robot designed for use in a restaurant setting. It includes food recognition, positioning, and grasping technology. The robot incorporates a mechanical arm for the restaurant scenario, bridging the gap between the kitchen and the dining table. It calculates the space where the dishes are to be placed and correctly places the dishes on the table with optimal obstacle-avoidance path planning in real time.
Key drivers for this market are: Drastic Improvements in Efficiency Across the Supply Chain, Reduced Chance of Human Error and Associated Inaccuracies; Attractive, with the Ability to Generate Consumer Interest. Potential restraints include: Drastic Improvements in Efficiency Across the Supply Chain, Reduced Chance of Human Error and Associated Inaccuracies; Attractive, with the Ability to Generate Consumer Interest. Notable trends are: Consumer Engagement is Expected to Register a Significant Growth.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the UK English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world UK English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic British accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of UK English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
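As a rough illustration, the sketch below pairs each audio file with its JSON transcription and applies a metadata filter. The directory layout, file naming convention, and metadata keys are assumptions; the description only states that each audio file has a human-verified JSON transcription and that speaker/recording metadata is provided.

# Hypothetical pairing-and-filtering sketch for the conversation corpus.
import json
from pathlib import Path

root = Path("uk_english_conversations")  # hypothetical dataset root
for wav_path in sorted(root.glob("audio/*.wav")):
    json_path = root / "transcripts" / (wav_path.stem + ".json")
    with open(json_path, encoding="utf-8") as f:
        record = json.load(f)
    # Example metadata filter (key name is hypothetical).
    if record.get("speaker_region") in (None, "UK"):
        print(wav_path.name, record.get("transcript", "")[:80])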
This dataset is a versatile resource for multiple English speech and language AI applications:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Download the Meat Freshness Image Dataset with 2,266 images labeled into Fresh, Half-Fresh, and Spoiled categories. Perfect for building AI models in food safety and quality control to detect meat freshness based on visual cues.
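As a quick-start sketch, the images could be loaded with tf.keras once they are organised into one folder per class; the directory layout below is an assumption, not part of the dataset description.

# Hypothetical loading sketch, assuming one subfolder per freshness class.
import tensorflow as tf

ds = tf.keras.utils.image_dataset_from_directory(
    "meat_freshness",      # hypothetical root folder with Fresh/Half-Fresh/Spoiled subfolders
    image_size=(224, 224),
    batch_size=32,
)
print(ds.class_names)      # expected: ['Fresh', 'Half-Fresh', 'Spoiled']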
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides recipe ingredients with token-level annotations, originally sourced from the research paper "A Named Entity Based Approach to Model Recipes" by Diwan, Batra, and Bagler. It is designed to facilitate the training of Named Entity Recognition (NER) models capable of extracting key entities such as ingredient names, quantities, and units from recipe text. The data was obtained from the authors' GitHub repository, offering a structured resource for advanced natural language processing in the culinary domain.
The dataset is primarily composed of data from FOOD.com (gk), accounting for 78% of the content, with the remaining 22% originating from AllRecipes.com (ar). While specific row or record counts are not provided, the dataset is structured for training purposes, with token-level annotations. Data files are typically in CSV format.
This dataset is ideally suited for training and evaluating Named Entity Recognition (NER) models. It can be applied to extract specific entities from recipe ingredient descriptions, such as identifying ingredient names, parsing quantities and their corresponding units, and recognising processing states, temperatures, and other descriptive attributes of ingredients. It is valuable for knowledge mining in the food and beverage sector and for developing intelligent systems that understand recipe structures.
The dataset's coverage is global, without specific geographical limitations mentioned for the ingredients themselves. The listed date for the dataset is 17/06/2025, which appears to be a listing date. The content is derived from two prominent recipe websites, AllRecipes.com and FOOD.com, providing a broad range of ingredient descriptions.
CC0
This dataset is intended for researchers, data scientists, and developers working in natural language processing (NLP), machine learning (ML) and artificial intelligence (AI), and food science and culinary informatics, as well as those building applications for recipe analysis, smart kitchens, or dietary planning that require structured ingredient data.
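As a starting point, the token-level annotations could be read and regrouped into ingredient phrases as sketched below; the CSV file name and column labels ("sentence_id", "token", "tag") are assumptions and should be checked against the actual header.

# Hypothetical sketch for regrouping token-level NER annotations.
import pandas as pd

df = pd.read_csv("recipe_ner.csv")  # hypothetical file name
for sent_id, group in df.groupby("sentence_id"):
    tokens = list(zip(group["token"], group["tag"]))
    print(sent_id, tokens)
    break  # show only the first ingredient phrase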
Original Data Source: Recipe Ingredient NER for Knowledge Mining
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Agricultural pests and diseases pose major losses to agricultural productivity, leading to significant economic losses and food safety risks. However, accurately identifying and controlling these pests is still very challenging due to the scarcity of labeled data for agricultural pests and the wide variety of pest species with different morphologies. To this end, we propose a two-stage target detection method that combines Cascade R-CNN and Swin Transformer models. To address the scarcity of labeled data, we employ random cut-and-paste and traditional online augmentation techniques to expand the pest dataset and use Swin Transformer for basic feature extraction. Subsequently, we designed the SCF-FPN module to enhance the basic features and extract richer pest features. Specifically, the SCF component provides a self-attention mechanism with a flexible sliding window to enable adaptive feature extraction based on different pest features. Meanwhile, the feature pyramid network (FPN) enriches multiple levels of features and enhances the discriminative ability of the whole network. Finally, to further improve our detection results, we incorporated soft non-maximum suppression (Soft-NMS) and Cascade R-CNN’s cascade structure into the optimization process to ensure more accurate and reliable prediction results. In a detection task involving 28 pest species, our algorithm achieves 92.5%, 91.8%, and 93.7% in terms of precision, recall, and mean average precision (mAP), respectively, an improvement of 12.1%, 5.4%, and 7.6% over the original baseline model. The results demonstrate that our method can accurately identify and localize farmland pests, which can help improve farmland’s ecological environment.
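For readers unfamiliar with the Soft-NMS step mentioned above, the following is a generic Gaussian Soft-NMS sketch (not the authors' implementation): rather than discarding detections that overlap a selected box, their confidence scores are decayed in proportion to the overlap.

# Generic Gaussian Soft-NMS sketch; boxes are (x1, y1, x2, y2) arrays.
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    boxes, scores = boxes.copy(), scores.astype(float).copy()
    kept = []
    while scores.max() > score_thresh:
        i = int(scores.argmax())
        kept.append(boxes[i])
        overlaps = iou(boxes[i], boxes)
        scores = scores * np.exp(-(overlaps ** 2) / sigma)  # decay instead of discard
        scores[i] = 0.0  # exclude the selected box from further rounds
    return np.array(kept)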
https://dataintelo.com/privacy-and-policy
The global food waste management software market size was valued at USD 1.2 billion in 2023 and is projected to reach USD 3.5 billion by 2032, growing at a Compound Annual Growth Rate (CAGR) of 12.5% from 2024 to 2032. The significant growth in this market is driven by increasing awareness about food waste, stringent government regulations, and the adoption of advanced technologies for efficient food waste management.
One of the key growth factors propelling the food waste management software market is the rising global concern over food waste and its environmental impact. With approximately one-third of all food produced for human consumption wasted globally, there is a compelling need for efficient solutions to tackle this issue. Governments and organizations are increasingly recognizing that effective food waste management can mitigate environmental damage, save resources, and improve food security. This awareness has led to the adoption of sophisticated software solutions designed to streamline food waste tracking, reduction, and management processes.
The implementation of stringent regulations and policies by governments worldwide is another critical driver for the market. For instance, the European Union has set ambitious targets to reduce food waste by 50% by 2030, while countries like France and the United Kingdom have introduced laws that mandate businesses to donate unsold food. Such regulatory initiatives are compelling businesses to adopt food waste management software to comply with legal requirements, thus boosting market growth. These regulations not only encourage businesses to reduce waste but also foster collaboration across the food supply chain to achieve sustainable practices.
Advancements in technology are further catalyzing the growth of the food waste management software market. The integration of Internet of Things (IoT) devices, Artificial Intelligence (AI), and data analytics into food waste management solutions has revolutionized the way food waste is monitored and managed. These technologies enable real-time tracking of food waste, predictive analytics for waste reduction, and efficient resource allocation. The ability to analyze large datasets and derive actionable insights allows businesses to implement proactive measures, thereby reducing food waste and optimizing operations. This technological evolution is expected to continue driving market expansion over the forecast period.
Regionally, North America is anticipated to hold a significant share of the food waste management software market, owing to the presence of major market players, advanced technological infrastructure, and supportive government policies. The region's proactive stance on sustainability and waste reduction, coupled with the high adoption rate of innovative technologies, positions it as a key market for food waste management solutions. Additionally, Europe and Asia Pacific are also expected to witness substantial growth, driven by increasing regulatory pressures and rising consumer awareness about food waste issues.
The food waste management software market can be segmented by component into software and services. The software segment includes various types of applications designed to track, monitor, and manage food waste across different stages of the supply chain. These software solutions offer features such as data analytics, reporting, and integration with other systems to provide comprehensive waste management capabilities. The growing demand for such sophisticated software solutions is driven by the need for real-time tracking, predictive analytics, and enhanced operational efficiency. As businesses continue to seek ways to optimize their waste management processes, the software segment is expected to witness robust growth.
On the other hand, the services segment encompasses consulting, implementation, training, and support services provided alongside the software solutions. These services are crucial for ensuring the successful deployment and operation of food waste management software. Consulting services help organizations assess their waste management needs and design customized solutions, while implementation services ensure seamless integration of the software with existing systems. Training and support services are essential for educating users on how to effectively utilize the software and address any issues that may arise. The demand for these services is likely to grow in tandem with the increasing adoption of food waste management software, as organizations seek to maximize the
According to our latest research, the global AI Taste-Profile Generator market size reached USD 1.26 billion in 2024, and is expected to grow at a robust CAGR of 21.8% during the forecast period, reaching USD 8.61 billion by 2033. The rapid expansion of the AI Taste-Profile Generator market is primarily driven by increasing demand for personalized food and beverage experiences, technological advancements in artificial intelligence, and the growing adoption of data-driven solutions across the food, beverage, and hospitality sectors. As per the latest research, the market continues to witness significant investments from both established enterprises and emerging startups, further fueling innovation and market growth.
The primary growth factor propelling the AI Taste-Profile Generator market is the surging demand for personalized consumer experiences in the food and beverage industry. Consumers today expect tailored recommendations and unique product offerings that match their individual taste preferences. AI-powered taste-profile generators leverage advanced machine learning algorithms and large datasets to analyze consumer behavior, flavor preferences, and sensory data. This enables food manufacturers, restaurants, and beverage companies to develop new products and menus that cater to specific customer segments, thereby enhancing customer satisfaction and brand loyalty. The integration of AI-driven personalization not only improves the consumer experience but also drives higher sales conversion rates and repeat business, making it a critical growth lever for the industry.
Another key driver for the AI Taste-Profile Generator market is the accelerated digital transformation within the food and hospitality sectors. The adoption of cloud-based AI solutions and IoT-enabled devices allows for real-time data collection and analysis, enabling businesses to rapidly respond to changing consumer trends and preferences. Moreover, the proliferation of smart kitchens, connected appliances, and digital ordering platforms has created a fertile environment for AI-powered taste profiling tools. These technologies help businesses optimize their product offerings, reduce food waste, and streamline supply chain operations. The ability of AI taste-profile generators to deliver actionable insights and automate complex decision-making processes is significantly enhancing operational efficiency and profitability across the value chain.
Furthermore, the increasing focus on health and wellness is shaping the evolution of the AI Taste-Profile Generator market. Consumers are becoming more conscious of their dietary choices, seeking healthier alternatives without compromising on taste. AI-powered solutions can analyze nutritional data, allergen information, and individual health profiles to recommend food and beverage options that align with both taste preferences and health goals. This trend is particularly pronounced in the healthcare and wellness industries, where personalized meal planning and dietary recommendations are gaining traction. As regulatory frameworks around food safety and labeling become more stringent, AI taste-profile generators are poised to play a pivotal role in ensuring compliance while delivering value-added services to consumers.
Regionally, North America currently dominates the AI Taste-Profile Generator market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The strong presence of leading technology providers, high consumer awareness, and early adoption of AI-driven solutions in the food and beverage industry are key factors supporting market growth in these regions. Meanwhile, Asia Pacific is witnessing the fastest growth, driven by rising disposable incomes, a burgeoning food service sector, and increasing investments in digital transformation initiatives. Latin America and the Middle East & Africa are also emerging as promising markets, supported by urbanization and evolving consumer preferences, although their market sizes remain comparatively smaller at present.
Bluesky Social Dataset
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.
This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.
Dataset
Here is a description of the dataset files.
followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers u, v, representing a directed following relation (i.e., user u follows user v).
posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in 100 files, each containing the full posts of up to 50,000 users. Each post is stored as a JSON-formatted line.
interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers and represents a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Posts are stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values, namely the feed name, the user id, and the timestamp.
feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data and to perform experiments. These scripts are detailed in a document released within the folder.
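For reference, a minimal loading sketch for two of the files described above is given here; the column labels for followers.csv.gz are our own (the file is described as headerless rows of two integers), and the tar member handling is illustrative.

# Hypothetical loading sketch for followers.csv.gz and posts.tar.gz.
import json
import tarfile
import pandas as pd

followers = pd.read_csv("followers.csv.gz", header=None,
                        names=["follower", "followee"])
print(followers.shape)

with tarfile.open("posts.tar.gz", "r:gz") as tar:
    member = next(m for m in tar.getmembers() if m.isfile())
    with tar.extractfile(member) as f:
        first_post = json.loads(f.readline())  # one JSON-formatted post per line
        print(sorted(first_post.keys()))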
Citation
If used for research purposes, please cite the following paper describing the dataset details:
Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data. (2024) arXiv:2404.18984
Acknowledgments: This work is supported by:
the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu); SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021; EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The NEFSC Food Habits Database has two major sources of data. The first, and most extensive, is the standard NEFSC Bottom Trawl Surveys Program. During these surveys, food habits data are collected for a variety of species. Additionally, "process-oriented" cruises are conducted periodically to address specific questions related to the feeding ecology of the fish in the ecosystem. Both sources provide primarily stomach content information; composition, total and individual prey weights or volumes, and length of prey. Additional information associated with each fish predator is also collected. Other databases encompass the prey fields of these fish, and include zooplankton, ichthyoplankton, and benthic surveys.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
According to our latest research, the AI‑Guided Biocatalyst Discovery market size reached USD 1.47 billion in 2024, reflecting robust growth in the sector. The market is expected to exhibit a remarkable CAGR of 22.3% from 2025 to 2033, reaching a forecasted value of USD 11.97 billion by 2033. This exceptional growth trajectory is driven by the increasing integration of artificial intelligence into biocatalyst discovery processes, which significantly accelerates enzyme identification and optimization, thereby transforming industries such as pharmaceuticals, chemicals, food & beverages, and agriculture. As per our latest research, the market’s expansion is also attributed to the rising demand for sustainable and efficient bioprocesses, coupled with advancements in machine learning and deep learning algorithms, which are revolutionizing the field of biocatalysis.
The primary growth factor for the AI‑Guided Biocatalyst Discovery market is the mounting need for rapid and cost-effective enzyme discovery and engineering. Traditional biocatalyst discovery methods are often labor-intensive, time-consuming, and expensive. AI-guided techniques, leveraging advanced algorithms and large datasets, are enabling researchers to predict enzyme-substrate interactions, optimize reaction conditions, and design novel biocatalysts with unprecedented precision. This technological leap is not only reducing the time-to-market for new products but also enhancing the overall efficiency and sustainability of bioprocesses. The pharmaceutical sector, in particular, is witnessing significant benefits, as AI-driven biocatalyst discovery accelerates drug development pipelines and facilitates the production of novel therapeutics.
Another key driver propelling the AI‑Guided Biocatalyst Discovery market is the growing emphasis on green chemistry and sustainable industrial processes. With increasing regulatory pressure to minimize environmental impact, industries are turning to biocatalysts as eco-friendly alternatives to traditional chemical catalysts. AI-guided approaches are making it feasible to discover and engineer biocatalysts that exhibit high selectivity, stability, and activity under industrial conditions. This is particularly relevant in the chemicals and food & beverages sectors, where demand for cleaner and more efficient production methods is soaring. The convergence of AI and biotechnology is thus fostering a paradigm shift towards sustainability, further fueling market growth.
Furthermore, the proliferation of big data, advancements in high-throughput screening technologies, and increased collaboration between academia, research institutes, and industry players are catalyzing innovation in the AI‑Guided Biocatalyst Discovery market. The availability of vast biological datasets and the development of sophisticated AI models are enabling the systematic exploration of enzyme sequence-function relationships. This is paving the way for the discovery of novel biocatalysts with tailored properties for diverse applications. Additionally, significant investments from venture capitalists and government agencies are supporting R&D activities in this domain, further accelerating market expansion. The trend towards open innovation and data sharing is also fostering a collaborative ecosystem that is conducive to rapid technological advancements.
From a regional perspective, North America currently dominates the AI‑Guided Biocatalyst Discovery market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of leading biotechnology firms, advanced research infrastructure, and supportive regulatory frameworks are key factors driving market growth in these regions. Asia Pacific is emerging as a high-growth market, fueled by increasing investments in AI and biotechnology, a burgeoning pharmaceutical industry, and supportive government initiatives. Latin America and the Middle East & Africa are also witnessing gradual adoption of AI-guided biocatalyst discovery technologies, albeit at a slower pace, primarily due to limited R&D infrastructure and funding constraints. Overall, the global landscape is characterized by dynamic innovation and increasing cross-border collaborations, which are expected to shape the future trajectory of the market.
What is it?
The “Regional self-reliance model of the New England food system” explores future scenarios of regional food self-reliance. In this model, self-reliance is defined as the ratio of production to consumption and can be expressed for individual commodities, food groups, or the overall diet. The model allows a user to define assumptions about diet composition and target self-reliance for different groups of foods. The model estimates the regional self-reliance across seven food groups (grains, vegetables, fruits, dairy, protein-rich foods, fats and oils, and sweeteners) and for the overall diet. In addition, the model calculates land requirements for producing the target amounts of food from New England agriculture. These estimates are presented beside data on current land use to place the results in context.
Why was it generated?
The model was generated as part of the New England Feeding New England (NEFNE) project. The central question of NEFNE was, "What would it take for 30% of the food consumed in New England to be regionally produced by 2030?" The model addresses the agricultural production capacity of the region, while accounting for the contribution of capture fisheries and aquaculture to food production. The purpose of the model is to estimate the production capacity of the region’s land resources to evaluate the land requirements of increasing regional self-reliance in food.
How was it generated?
A team of researchers collaborated to construct the model. The model builds on prior work on regional self-reliance, the human carrying capacity of agricultural resources, and analysis of livestock feed requirements. As described below, the model estimates the land requirements of supplying a given level of self-reliance, accounting for food needs, food losses and waste, livestock feed requirements, crop yields, and land availability.
Starting from the food consumption end of the food system, the model takes input data on food intake (in servings per person per day) by food group (e.g., grains) and distributes consumption across primary food commodities from that food group (e.g., corn meal, wheat flour) in the Loss-Adjusted Food Supply. Intake for each primary food commodity is then converted into the equivalent quantity of agricultural commodity (in pounds per year) needed to supply the region with a sufficient amount of that commodity to meet the target level of self-reliance, at a given projected population size. This conversion accounts for the serving size of the commodity (in grams), losses at different stages of the food system, and processing conversions. For animal products, a further step is taken to convert the quantity of food consumed into equivalent quantities of crop biomass required to feed the requisite livestock. Land requirements for each food are determined by dividing the agricultural commodity (for plant foods) or crop biomass requirements (for animal products) by regional average yields for the appropriate crop(s).
Input data were collected from an array of secondary data sources, including the Loss-Adjusted Food Supply, the Census of Agriculture, the New England Agricultural Bulletin, Major Land Uses, the Atlantic Coastal Cooperative Statistics Program Data Warehouse, and the NOAA Fisheries Landings data portal. Additional sources used to develop the model are cited in the workbook, and reference information is provided in each worksheet.
The unique contribution of the model is to organize the data in a form that permits exploration of alternative scenarios of diet, target self-reliance, and land availability for the New England region.
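To make the calculation concrete, here is an illustrative single-commodity version of the land-requirement arithmetic described above; every number and the one-step loss adjustment are assumptions for demonstration, whereas the actual workbook applies losses at multiple stages and treats animal products via livestock feed requirements.

# Illustrative land-requirement arithmetic for one plant commodity (all values assumed).
servings_per_person_per_day = 1.0   # e.g. one grain serving
serving_size_g = 30.0               # grams per serving
population = 15_000_000             # approximate New England population
target_self_reliance = 0.30         # the "30% by 2030" scenario
loss_fraction = 0.25                # combined losses and waste (assumed)
yield_lb_per_acre = 2500.0          # regional average crop yield (assumed)

grams_consumed = (servings_per_person_per_day * serving_size_g
                  * population * 365 * target_self_reliance)
pounds_produced = grams_consumed / 453.592 / (1 - loss_fraction)
acres_needed = pounds_produced / yield_lb_per_acre
print(f"{pounds_produced:,.0f} lb/year -> {acres_needed:,.0f} acres")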
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.