Note: only publicly available data can be processed.
AI & ML Training Data, encompassing Artificial Intelligence (AI) and Machine Learning Datasets, plays a pivotal role in empowering your models. At APISCRAPY, we take pride in our ability to aggregate data from a multitude of sources, ensuring that your models are trained on a rich and diverse set of information. This diversity is crucial for enhancing your model's robustness, allowing it to excel in real-world scenarios and challenges.
Our commitment to quality extends to providing organized and annotated data, saving you valuable time on preprocessing tasks. This not only expedites the training process but also ensures that you receive highly enriched datasets, primed for use in your AI and ML projects, including Deep Learning Datasets. Furthermore, our data is customizable to suit the unique requirements of your project, whether it involves text, images, audio, or other data types.
We understand that data quality and privacy are paramount in the world of AI & ML. Our stringent data quality control procedures eliminate inconsistencies and bias, while data anonymization safeguards sensitive information. As your AI and ML projects evolve, so do your data requirements.
APISCRAPY's AI & ML Training Data service offers several benefits for organizations and individuals involved in artificial intelligence (AI) and machine learning (ML) development. Here are the key advantages of these advanced training data solutions:
AI & ML Training Data: APISCRAPY specializes in providing high-quality AI & ML Training Data, ensuring that datasets are meticulously curated and tailored to meet the specific needs of AI and ML projects.
Deep Learning Datasets: The service extends its support to deep learning projects by providing Deep Learning Datasets. These datasets offer the complexity and depth necessary for training advanced deep learning models.
Diverse Data Sources: APISCRAPY leverages a diverse range of data sources to compile AI & ML Training Data, providing datasets that encompass a wide array of real-world scenarios and variables.
Quality Assurance: The training data undergoes rigorous quality assurance processes, ensuring that it meets the highest standards for accuracy, relevance, and consistency, crucial for effective model training.
Versatile Applications: APISCRAPY's AI & ML Training Data is versatile and applicable to various AI and ML applications, including image recognition, natural language processing, and other advanced AI-driven functionalities.
APISCRAPY's services are highly scalable, ensuring you have access to the necessary resources when you need them. With real-time data feeds, data curation by experts, constant updates, and cost-efficiency, we are dedicated to providing high-value AI & ML Training Data solutions so that your models remain current and effective.
[Related tags: AI Training Data, Textual Data, Machine Learning (ML) Data, Deep Learning (DL) Data, Annotated Imagery Data, Synthetic Data, Audio Data, Large Language Model (LLM) Data, ML Training Data, LLM Data, Generative AI Data, Code Base Training Data, Healthcare Training Data, Audio Annotation Services, AI-assisted Labeling, Natural Language Processing (NLP) Data, Audio & Speech Training Data, Image Training Data, Video Training Data]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically inspiring images with a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes).
Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and human images (i) is correlated with the number of likes; and (ii) can be used to identify human images that served as inspiration for the gen-AI creations. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our analyses and findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.
Note that, in accordance with Twitter's privacy and control policy, NO raw content from Twitter is included in this dataset; users must retrieve the original Twitter content used for analysis via the tweet IDs. In addition, users who want to access Twitter data should consult and closely follow the official Twitter developer policy at https://developer.twitter.com/en/developer-terms/policy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Providing one-on-one support to large cohorts is challenging, yet emerging AI technologies show promise in bridging the gap between the support students want and what educators can provide. They offer students a way to engage with their course material in a way that feels fluent and instinctive. Whilst educators may have views on the appropriateness of AI use, the tools themselves, as well as the novel ways in which they can be used, are continually changing.
Methods
The aim of this study was to probe students' familiarity with AI tools, their views on its current uses, their understanding of universities' AI policies, and finally their impressions of its importance, both to their degree and their future careers. We surveyed 453 psychology and sport science students across two institutions in the UK, predominantly those in the first and second year of undergraduate study, and conducted a series of five focus groups to explore the emerging themes of the survey in more detail.
Results
Our results showed a wide range of responses in terms of students' familiarity with the tools and what they believed AI tools could and should not be used for. Most students emphasized the importance of understanding how AI tools function and their potential applications in both their academic studies and future careers. The results indicated a strong desire among students to learn more about AI technologies, as well as significant interest in receiving dedicated support for integrating these tools into their coursework, driven by the belief that such skills will be sought after by future employers. However, most students were not familiar with their university's published AI policies.
Discussion
This research on pedagogical methods supports a broader long-term ambition to better understand and improve our teaching, learning, and student engagement through the adoption of AI and the effective use of technology. It suggests the need for a more comprehensive, ongoing approach to communicating these important guidelines, especially as the tools and guidelines evolve.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset covers posts from The AI Thread series, a collection of X (Twitter) threads about artificial intelligence. It includes various metrics such as word count, readability score, post frequency, and engagement rate, collected over time. The goal of this dataset is to analyse how different features of the posts (such as length, media content, and readability) influence the engagement they receive (likes, comments, shares, etc.).
Features:
id: Unique identifier for each post.
word_count: The number of words in the post.
reading_time(s): Estimated time in seconds it takes to read the post.
readability_score: A score representing the post's readability (Flesch Reading Ease).
posts_per_thread: The number of posts in a given thread.
topic_complexity: A subjective score for the complexity of the topic (1 = novice, 3 = advanced).
media_count: The number of images, videos, or quizzes included in the post.
posting_time: The time at which the post was made (IST).
post_frequency: How often the author posted (3 = thrice a week; 1 = once a week).
impressions: The number of times the post was seen by users.
emojis: The number of emojis used in the post.
engagements: The number of likes, shares, comments, and expands the post received (target variable). This is recorded directly from X.
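The readability_score feature is described as Flesch Reading Ease. As a point of reference, the standard formula can be computed from word, sentence, and syllable counts; this is a minimal sketch of the published formula, not necessarily the exact tool used to score this dataset.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Example: a 100-word post with 5 sentences and 150 syllables.
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
# 206.835 - 1.015*20 - 84.6*1.5 = 59.635, in the "plain English" range
```

Syllable counting itself is the error-prone part in practice, which is why libraries that implement this metric differ slightly in their scores.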
Purpose of the Dataset: This dataset can be used for analysing the relationship between various post characteristics (content, timing, frequency) and the engagement metrics. It's ideal for training machine learning models to predict engagement based on these features, or for analysing which features most strongly correlate with higher engagement.
Usage:
Predictive Modelling: Train machine learning models (such as regression or classification models) to predict engagements.
Exploratory Data Analysis (EDA): Explore trends in engagement based on different features such as posting time, media type, or word count.
Feature Engineering: Develop new features or perform feature selection to improve model performance.
Target Audience: Data scientists, researchers, social media analysts, and AI enthusiasts who want to explore the relationship between content features and engagement, as well as those working with predictive analytics or AI applications in social media.
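The predictive-modelling use case above can be sketched in miniature with a closed-form simple linear regression of engagements on a single feature. The numbers below are illustrative stand-ins, not values from the dataset, and a real analysis would use the full feature set with a proper ML library.

```python
from statistics import mean

# Toy stand-ins for two of the dataset's columns (illustrative values only).
word_count  = [120, 250, 90, 400, 310, 180]
engagements = [30, 55, 25, 90, 70, 40]

# Ordinary least squares for one predictor:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
mx, my = mean(word_count), mean(engagements)
slope = sum((x - mx) * (y - my) for x, y in zip(word_count, engagements)) \
        / sum((x - mx) ** 2 for x in word_count)
intercept = my - slope * mx

# Expected engagements for a hypothetical 200-word post:
predicted = intercept + slope * 200
```

The same fit generalizes to the remaining features (media_count, posting_time, etc.) via multiple regression or tree-based models.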
License: This dataset is publicly available for research and educational purposes. Commercial use is not permitted unless specified by the license.
Source: X Analytics. X account: https://x.com/PulkitSahu89
This dataset originates from a series of experimental studies titled "Tough on People, Tolerant to AI? Differential Effects of Human vs. AI Unfairness on Trust". The project investigates how individuals respond to unfair behavior (distributive, procedural, and interactional unfairness) enacted by artificial intelligence versus human agents, and how such behavior affects cognitive and affective trust.

1 Experiment 1a: The Impact of AI vs. Human Distributive Unfairness on Trust
Overview: This dataset comes from an experimental study aimed at examining how individuals respond in terms of cognitive and affective trust when distributive unfairness is enacted by either an artificial intelligence (AI) agent or a human decision-maker. Experiment 1a specifically focuses on the main effect of the "type of decision-maker" on trust.
Data Generation and Processing: The data were collected through Credamo, an online survey platform. Initially, 98 responses were gathered from students at a university in China. Additional student participants were recruited via Credamo to supplement the sample. Attention check items were embedded in the questionnaire, and participants who failed were automatically excluded in real time. Data collection continued until 202 valid responses were obtained. SPSS software was used for data cleaning and analysis.
Data Structure and Format: The data file is named "Experiment1a.sav" and is in SPSS format. It contains 28 columns and 202 rows, where each row corresponds to one participant. Columns represent measured variables, including: grouping and randomization variables, one manipulation check item, four items measuring distributive fairness perception, six items on cognitive trust, five items on affective trust, three items for honesty checks, and four demographic variables (gender, age, education, and grade level). The final three columns contain computed means for distributive fairness, cognitive trust, and affective trust.
Additional Information: No missing data are present. All variable names are labeled in English abbreviations to facilitate further analysis. The dataset can be directly opened in SPSS or exported to other formats.

2 Experiment 1b: The Mediating Role of Perceived Ability and Benevolence (Distributive Unfairness)
Overview: This dataset originates from an experimental study designed to replicate the findings of Experiment 1a and further examine the potential mediating role of perceived ability and perceived benevolence.
Data Generation and Processing: Participants were recruited via the Credamo online platform. Attention check items were embedded in the survey to ensure data quality. Data were collected using a rolling recruitment method, with invalid responses removed in real time. A total of 228 valid responses were obtained.
Data Structure and Format: The dataset is stored in a file named Experiment1b.sav in SPSS format and can be directly opened in SPSS software. It consists of 228 rows and 40 columns. Each row represents one participant's data record, and each column corresponds to a different measured variable. Specifically, the dataset includes: random assignment and grouping variables; one manipulation check item; four items measuring perceived distributive fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three items for attention check; and three demographic variables (gender, age, and education). The last five columns contain the computed mean scores for perceived distributive fairness, ability, benevolence, cognitive trust, and affective trust.
Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be analyzed directly in SPSS or exported to other formats as needed.

3 Experiment 2a: Differential Effects of AI vs. Human Procedural Unfairness on Trust
Overview: This dataset originates from an experimental study aimed at examining whether individuals respond differently in terms of cognitive and affective trust when procedural unfairness is enacted by artificial intelligence versus human decision-makers. Experiment 2a focuses on the main effect of the decision agent on trust outcomes.
Data Generation and Processing: Participants were recruited via the Credamo online survey platform from two universities located in different regions of China. A total of 227 responses were collected. After excluding those who failed the attention check items, 204 valid responses were retained for analysis. Data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in a file named Experiment2a.sav in SPSS format and can be directly opened in SPSS software. It contains 204 rows and 30 columns. Each row represents one participant's response record, while each column corresponds to a specific variable. Variables include: random assignment and grouping; one manipulation check item; seven items measuring perceived procedural fairness; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for procedural fairness, cognitive trust, and affective trust.
Additional Notes: The dataset contains no missing values. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be directly analyzed in SPSS or exported to other formats as needed.

4 Experiment 2b: Mediating Role of Perceived Ability and Benevolence (Procedural Unfairness)
Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 2a and to further examine the potential mediating roles of perceived ability and perceived benevolence in shaping trust responses under procedural unfairness.
Data Generation and Processing: Participants were working adults recruited through the Credamo online platform. A rolling data collection strategy was used, where responses failing attention checks were excluded in real time. The final dataset includes 235 valid responses. All data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in a file named Experiment2b.sav, which is in SPSS format and can be directly opened using SPSS software. It contains 235 rows and 43 columns. Each row corresponds to a single participant, and each column represents a specific measured variable. These include: random assignment and group labels; one manipulation check item; seven items measuring procedural fairness; six items for perceived ability; five items for perceived benevolence; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final five columns contain the computed average scores for procedural fairness, perceived ability, perceived benevolence, cognitive trust, and affective trust.
Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to support future reuse and secondary analysis. The dataset can be directly analyzed in SPSS and easily converted into other formats if needed.

5 Experiment 3a: Effects of AI vs. Human Interactional Unfairness on Trust
Overview: This dataset comes from an experimental study that investigates how interactional unfairness, when enacted by either artificial intelligence or human decision-makers, influences individuals' cognitive and affective trust. Experiment 3a focuses on the main effect of the "decision-maker type" under interactional unfairness conditions.
Data Generation and Processing: Participants were college students recruited from two universities in different regions of China through the Credamo survey platform. After excluding responses that failed attention checks, a total of 203 valid cases were retained from an initial pool of 223 responses. All data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in the file named Experiment3a.sav, in SPSS format and compatible with SPSS software. It contains 203 rows and 27 columns. Each row represents a single participant, while each column corresponds to a specific measured variable. These include: random assignment and condition labels; one manipulation check item; four items measuring interactional fairness perception; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final three columns contain computed average scores for interactional fairness, cognitive trust, and affective trust.
Additional Notes: There are no missing values in the dataset. All variable names are provided using standardized English abbreviations to facilitate secondary analysis. The data can be directly analyzed using SPSS and exported to other formats as needed.

6 Experiment 3b: The Mediating Role of Perceived Ability and Benevolence (Interactional Unfairness)
Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 3a and further examine the potential mediating roles of perceived ability and perceived benevolence under conditions of interactional unfairness.
Data Generation and Processing: Participants were working adults recruited via the Credamo platform. Attention check questions were embedded in the survey, and responses that failed these checks were excluded in real time. Data collection proceeded in a rolling manner until a total of 227 valid responses were obtained. All data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in the file named Experiment3b.sav, in SPSS format and compatible with SPSS software. It includes 227 rows and
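Each experiment file stores item-level responses plus computed scale means in its final columns. Those means are simple averages over the scale items, which can be reproduced outside SPSS; the sketch below uses hypothetical item names (ct1..ct6 standing in for the six cognitive-trust items), since the files' actual variable abbreviations are not listed here.

```python
from statistics import mean

# One participant's record with hypothetical item names; the real .sav files
# use their own English abbreviations for the six cognitive-trust items.
record = {"ct1": 4, "ct2": 5, "ct3": 4, "ct4": 3, "ct5": 5, "ct6": 4}

# The "computed mean" columns described above are plain scale averages:
cognitive_trust_mean = mean(record[f"ct{i}"] for i in range(1, 7))
```

The same pattern applies to the fairness, benevolence, ability, and affective-trust scales; reading the .sav files themselves would require an SPSS-aware library such as pyreadstat.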
"Collection of 100,000 high-quality video clips across diverse real-world domains, designed to accelerate the training and optimization of computer vision and multimodal AI models."
Overview
This dataset contains 100,000 proprietary and partner-produced video clips filmed in 4K/6K with cinema-grade RED cameras. Each clip is commercially cleared with full releases, structured metadata, and available in RAW or MOV/MP4 formats. The collection spans a wide variety of domains: people and lifestyle, healthcare and medical, food and cooking, office and business, sports and fitness, nature and landscapes, education, and more. This breadth ensures robust training data for computer vision, multimodal, and machine learning projects.

The dataset
All 100,000 videos have been reviewed for quality and compliance. The dataset is optimized for AI model training, supporting use cases from face and activity recognition to scene understanding and generative AI. Custom datasets can also be produced on demand, enabling clients to close data gaps with tailored, high-quality content.

About M-ART
M-ART is a leading provider of cinematic-grade datasets for AI training. With extensive expertise in large-scale content production and curation, M-ART delivers both ready-to-use video datasets and fully customized collections. All data is proprietary, rights-cleared, and designed to help global AI leaders accelerate research, development, and deployment of next-generation models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the doctoral dissertation The AI of the Beholder: A Quantitative Study on Human Perception and Appraisal of AI-Generated Images by Joshua Cunningham (Robert Morris University, 2025). The study investigates how individuals perceive and appraise artwork generated by artificial intelligence (AI) in comparison to human-created pieces. Specifically, it examines: (1) whether participants can accurately distinguish AI-generated from human-created artwork, (2) how age and exposure to AI art influence this ability and related appraisals, and (3) how digital versus traditional visual styles of AI art are perceived. The dataset includes anonymized survey responses collected from a diverse group of adult participants. Respondents were asked to evaluate a series of visual artworks, some created by humans and others by AI, across a range of styles, including both digital and traditional aesthetics. Additional demographic information such as age and prior exposure to AI tools was collected to assess moderating effects. The data were analyzed using SPSS to evaluate participant accuracy, preferences, and perceptions. This dataset can support further research into the psychological, aesthetic, and cultural dynamics of AI-generated content, as well as human-machine interaction in the creative arts. You may find the images used in this study, the original survey instrument, and a legend detailing each of the variables here: https://drive.google.com/drive/folders/125oaW82HpJUjI7EbaQWk_51puz0DgWKc?usp=sharing
The rapid advancements in generative AI models present new opportunities in the education sector. However, it is imperative to acknowledge and address the potential risks and concerns that may arise with their use. We collected Twitter data to identify key concerns related to the use of ChatGPT in education. This dataset is used to support the study "ChatGPT in education: A discourse analysis of worries and concerns on social media."
In this study, we explored two research questions.
RQ1 (Concerns): What are the key concerns that Twitter users perceive with using ChatGPT in education?
RQ2 (Accounts): Which accounts are implicated in the discussion of these concerns?
In summary, our study underscores the importance of responsible and ethical use of AI in education and highlights the need for collaboration among stakeholders to regulate AI policy.
Supplementary material for the article: Herm, Lukas-Valentin: "Impact of Explainable AI on Cognitive Load: Insights from an Empirical Study." In: 31st European Conference on Information Systems (ECIS), AIS Virtual Conference Series, AIS, 2023 (conditionally accepted).
Abstract: "While the emerging research field of explainable artificial intelligence (XAI) claims to address the lack of explainability in high-performance machine learning models, in practice, XAI targets developers rather than actual end-users. Unsurprisingly, end-users are often unwilling to use XAI-based decision support systems. Similarly, there is limited interdisciplinary research on end-users' behavior during XAI explanations usage, rendering it unknown how explanations may impact cognitive load and further affect end-user performance. Therefore, we conducted an empirical study with 271 prospective physicians, measuring their cognitive load, task performance, and task time for distinct implementation-independent XAI explanation types using a COVID-19 use case. We found that these explanation types strongly influence end-users' cognitive load, task performance, and task time. Further, we contextualized a mental efficiency metric, ranking local XAI explanation types best, to provide recommendations for future applications and implications for sociotechnical XAI research."
Using this data for academic publications is explicitly granted.
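The abstract mentions a contextualized "mental efficiency" metric used to rank explanation types. A common formulation, assumed here since the paper's exact variant is not given in this listing, is the relative-efficiency measure of Paas and Van Merriënboer, which combines standardized performance and standardized mental effort:

```python
from math import sqrt
from statistics import mean, stdev

def z_scores(xs):
    """Standardize scores using the sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical per-condition means (illustrative numbers, not the study's data):
performance = [0.62, 0.71, 0.80]   # task performance per explanation type
effort      = [5.8, 4.9, 4.0]      # cognitive-load rating per explanation type

zp, ze = z_scores(performance), z_scores(effort)
# Relative efficiency: high performance achieved with low mental effort ranks best.
efficiency = [(p - e) / sqrt(2) for p, e in zip(zp, ze)]
```

Under this measure, the condition with the highest performance and lowest effort receives the largest efficiency value, which matches the kind of ranking the abstract describes.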
https://spdx.org/licenses/CC0-1.0.html
Creativity is core to being human. Generative AI, made readily available by powerful large language models (LLMs), holds promise for humans to be more creative by offering new ideas, or less creative by anchoring on generative AI ideas. We study the causal impact of generative AI ideas on the production of short stories in an online experiment where some writers obtained story ideas from an LLM. We find that access to generative AI ideas causes stories to be evaluated as more creative, better written, and more enjoyable, especially among less creative writers. However, generative AI-enabled stories are more similar to each other than stories by humans alone. These results point to an increase in individual creativity at the risk of losing collective novelty. This dynamic resembles a social dilemma: with generative AI, writers are individually better off, but collectively a narrower scope of novel content is produced. Our results have implications for researchers, policy-makers, and practitioners interested in bolstering creativity.
Methods
This dataset is based on a pre-registered, two-phase experimental online study. In the first phase of our study, we recruited a group of N=293 participants ("writers") who are asked to write a short, eight-sentence story. Participants are randomly assigned to one of three conditions: Human only, Human with 1 GenAI idea, and Human with 5 GenAI ideas. In our Human only baseline condition, writers are assigned the task with no mention of or access to GenAI. In the two GenAI conditions, we provide writers with the option to call upon a GenAI technology (OpenAI's GPT-4 model) to provide a three-sentence starting idea to inspire their own story writing. In one of the two GenAI conditions (Human with 5 GenAI ideas), writers can choose to receive up to five GenAI ideas, each providing a possibly different inspiration for their story.
After completing their story, writers are asked to self-evaluate their story on novelty, usefulness, and several emotional characteristics. In the second phase, the stories composed by the writers are then evaluated by a separate group of N=600 participants (“evaluators”). Evaluators read six randomly selected stories without being informed about writers being randomly assigned to access GenAI in some conditions (or not). All stories are evaluated by multiple evaluators on novelty, usefulness, and several emotional characteristics. After disclosing to evaluators whether GenAI was used during the creative process, we ask evaluators to rate the extent to which ownership and hypothetical profits should be split between the writer and the AI. Finally, we elicit evaluators’ general views on the extent to which they believe that the use of AI in producing creative output is ethical, how story ownership and hypothetical profits should be shared between AI creators and human creators, and how AI should be credited in the involvement of the creative output. The data was collected on the online study platform Prolific. The data was then cleaned, processed and analyzed with Stata. For the Writer Study, of the 500 participants who began the study, 169 exited the study prior to giving consent, 22 were dropped for not giving consent, and 13 dropped out prior to completing the study. Three participants in the Human only condition admitted to using GenAI during their story writing exercise and—as per our pre-registration—they were therefore dropped from the analysis, resulting in a total number of writers and stories of 293. For the Evaluator Study, each evaluator was shown 6 stories (2 stories from each topic). The evaluations associated with the writers who did not complete the writer study and those in the Human only condition who acknowledged using AI to complete the story were dropped. Thus, there are a total of 3,519 evaluations of 293 stories made by 600 evaluators. 
Four evaluations remained for five evaluators, five evaluations remained for 71, and all six remained for 524 evaluators.
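The finding that GenAI-enabled stories are more similar to one another rests on pairwise text similarity. The sketch below illustrates the idea with a simple bag-of-words cosine similarity; the study's actual similarity measure is not specified in this listing, and embedding-based measures are more common in practice.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity: 0 for disjoint texts, 1 for identical ones."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Two invented story openings (hypothetical text, not from the dataset):
s1 = "the lighthouse keeper watched the storm roll in"
s2 = "the lighthouse keeper watched the waves crash below"
similarity = cosine_similarity(s1, s2)  # higher values mean more word overlap
```

Averaging such pairwise similarities within each condition is one way to quantify whether one group of stories is collectively narrower than another.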
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Generative artificial intelligence (Gen-AI) has emerged as a transformative tool in research and education. However, there is a mixed perception about its use. This study assessed the use, perception, prospect, and challenges of Gen-AI use in higher education.
Methods
This is a prospective, cross-sectional survey of university students in the United Kingdom (UK) distributed online between January and April 2024. Demography of participants and their perception of Gen-AI and other AI tools were collected and statistically analyzed to assess the difference in perception between various subgroups.
Results
A total of 136 students responded to the survey, of which 59% (80) were male. The majority were aware of Gen-AI and other AI use in academia (61%), with 52% having personal experience of the tools. Grammar correction and idea generation were the two most common tasks of use, with 37% being regular users. Fifty-six percent of respondents agreed that AI gives an academic edge, with 40% holding a positive overall perception about the use in academia. Comparatively, there was a statistically significant difference in overall perception between different age ranges (I2 = 27.39; p = 0.002) and levels of education (I2 = 20.07; p
More and more companies use artificial intelligence (AI). Prior research aimed to understand acceptance from the perspective of AI users or people affected by AI decisions. However, the perspective of decision-makers in companies (i.e., managers) has not been considered. To address this gap, we investigate managers' acceptance of AI usage in companies, focusing on two potential determinants. Across four experimental studies (Ntotal = 2025), we tested whether the business area (i.e., human resources vs. finances/marketing) and AI functionality affect managers' acceptance of AI (i.e., perceived risk of negative consequences, willingness to invest). Findings indicate that managers (a) perceive more risk of and (b) are less willing to invest in AI usage in human resources than in finances and marketing. Besides, the results suggest that acceptance declines if functionality crosses a critical boundary and AI autonomously implements decisions without prior human control. Accordingly, the current research sheds light on the AI acceptance of managers and gives insights into the role of the business area and AI functionality.
A main aim of the study was to understand how experts predict the automation of unpaid domestic work. To do so, we conducted a Delphi survey with technology experts in the UK and Japan. The data set includes answers collected from a forecasting exercise in which 65 AI experts from the UK (29 respondents) and Japan (36 respondents) were asked to estimate how automatable 17 housework and care work tasks are in the next 5 to 10 years. The experts were also asked to estimate the cost of the automations. In addition, background information, such as the experts' gender, age, and field of expertise, was collected. Based on the respondents' answers, the Delphi survey shows that on average 27% of the time that people currently spend doing unpaid domestic work could be automated in the next 5 years, and 39% in the next 10 years. This project brings unpaid domestic work into the discussion of AI and the future of labour and predicts the degree of automation of unpaid work in two distinct countries – the UK and Japan. To do this we evaluate the technological likelihood of automatability of domestic work tasks using a grid of 17 such tasks identified in the UK Time Use Survey 2014-15 and the Japanese Survey on Time Use and Leisure Activities 2016. We use a panel consisting of technology experts to assess how quickly AI-powered domestic technologies will become not only technologically possible but also affordable for households. For the Delphi survey we recruited 65 respondents who are technology experts. 29 respondents are based in the UK, 36 are based in Japan. We consider "technology experts" to be people with expert knowledge in AI or AI-related technologies, including machine learning, robotics, or the social and/or business aspects of AI-related technologies. Our approach was to recruit a balanced number of female and male respondents, as well as a balanced number in three different professional fields: academia, research and development, and business.
We recruited the respondents through our own network, through snowball sampling, and through desktop research. We contacted the experts via email and LinkedIn, sending them the invitation to participate in the Delphi survey. Respondents based in the UK were contacted by the UK team using English as the communication language, and respondents based in Japan were contacted by the Japanese team using Japanese as the correspondence language. While Japanese respondents received a small monetary compensation, as is customary in Japan, the UK respondents did not receive any monetary compensation.
Our Cinematic Dataset is a carefully selected collection of audio files with rich metadata, providing a wealth of information for machine learning applications such as generative AI music, Music Information Retrieval (MIR), and source separation. This dataset is specifically created to capture the rich and expressive quality of cinematic music, making it an ideal training environment for AI models. The dataset, which includes chords, instrumentation, key, tempo, and timestamps, is an invaluable resource for those looking to push AI's boundaries in the field of audio innovation.
Strings, brass, woodwinds, and percussion are among the instruments used in the orchestral ensemble, which is a staple of film music. Strings, including violins, cellos, and double basses, are vital for communicating emotion, while brass instruments, such as trumpets and trombones, contribute grandeur and intensity. Woodwinds, such as flutes and clarinets, give texture and nuance, while percussion instruments bring rhythm and impact. The careful arrangement of these parts produces distinct cinematic soundscapes, making the genre excellent for teaching AI models to recognize and reproduce complex musical patterns.
Training models on this dataset provides a unique opportunity to explore the complexities of cinematic composition. The dataset's emphasis on important cinematic components, along with cinematic music's natural emotional storytelling ability, provides a solid platform for AI models to learn and compose music that captures the essence of engaging storylines. As AI continues to push creative boundaries, this Cinematic Music Dataset is a valuable tool for anybody looking to harness the compelling power of music in the digital environment.
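To illustrate how such metadata might be used when assembling a training split, here is a minimal sketch. The field names (file, key, tempo, instrumentation, chords) mirror the attributes listed above, but the exact schema of the dataset is an assumption for illustration only.

```python
# Hypothetical metadata records for two cinematic cues; the schema below is
# an illustrative assumption, not the dataset's actual layout.
records = [
    {"file": "cue_001.wav", "key": "D minor", "tempo": 72,
     "instrumentation": ["strings", "brass"], "chords": ["Dm", "Bb", "F", "C"]},
    {"file": "cue_002.wav", "key": "C major", "tempo": 120,
     "instrumentation": ["woodwinds", "percussion"], "chords": ["C", "G", "Am", "F"]},
]

# Select slow, minor-key cues -- a typical slice for emotion-focused training.
slow_minor = [r for r in records if r["tempo"] < 90 and "minor" in r["key"]]
print([r["file"] for r in slow_minor])
```

Filtering on tempo and key like this is one way to curate genre- or mood-specific subsets before model training.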
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Hypothesis:
The hypothesis is that service quality and trust significantly influence customer satisfaction with Telkomsel’s Veronika chatbot. Key dimensions include reliability, responsiveness, and empathy in service quality, and trust based on the chatbot's ability, benevolence, and integrity.
Data and Data Collection:
Data for this study were collected from Generation Z users who have experience using Telkomsel’s Veronika chatbot. A structured questionnaire was administered to 240 respondents, 52.9% of whom were female and 47.1% male, with ages ranging from 18 to 22 years. The data collection occurred between May and June 2024, and the questionnaire was distributed via social media platforms such as Instagram, Line, and WhatsApp. Non-probability sampling methods, specifically purposive and quota sampling, were used to ensure that only those familiar with the chatbot were surveyed.
The questionnaire comprised 31 questions designed to assess three key variables: service quality, trust, and customer satisfaction. A five-point Likert scale, ranging from "Strongly Disagree" to "Strongly Agree," was employed for all questions. Service quality was evaluated using the SERVQUAL model, while trust was measured through dimensions of ability, benevolence, and integrity. Customer satisfaction was assessed using items adapted from the Customer Satisfaction Index (CSI).
Key Findings:
1. Service Quality: A significant positive impact on customer satisfaction was found (β = 0.496, p < 0.001), with reliability and responsiveness being key factors. The highest loading (0.837) was on Veronika’s ability to provide alternative solutions.
2. Trust: Trust was also a significant predictor (β = 0.337, p < 0.001), with confidentiality being the most important trust factor (outer loading = 0.835).
3. Customer Satisfaction: Satisfaction was strongly influenced by both service quality and trust, with outer loadings from 0.908 to 0.918, particularly in terms of the chatbot's clarity and communication effectiveness.
Data Interpretation:
Both service quality and trust are essential to customer satisfaction, with service quality being a stronger predictor. Users value reliability and responsiveness more than trust, though both are necessary for high satisfaction. The reliability of the questionnaire was confirmed with high Cronbach’s alpha values, such as 0.938 for service quality.
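As a reminder of what the reported Cronbach’s alpha (e.g., 0.938 for service quality) measures, here is a minimal sketch of its computation from Likert-scale responses. The response matrix below is synthetic illustration data, not the study’s data.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a response matrix: one row per respondent,
    one column per questionnaire item."""
    k = len(scores[0])                       # number of items
    def var(xs):                             # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Four synthetic respondents rating three items on a 1-5 Likert scale.
sample = [[5, 4, 5], [4, 4, 4], [2, 3, 2], [5, 5, 4]]
print(round(cronbach_alpha(sample), 3))
```

Values above roughly 0.9, as reported in the study, indicate high internal consistency among the questionnaire items.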
Conclusion and Implications:
Improving service quality, especially reliability and responsiveness, will enhance user satisfaction. Strengthening trust, particularly in data security, is also crucial. Future research should explore broader demographics and long-term effects, while qualitative studies could offer more insights into user experiences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While social media has proven an exceptionally useful tool for interacting with other people and for spreading helpful information quickly and at scale, its great potential has also been ill-intentionally leveraged to distort political elections and manipulate constituents. In the paper at hand, we analyzed the presence and behavior of social bots on Twitter in the context of the November 2019 Spanish general election. Throughout our study, we classified involved users as social bots or humans, and examined their interactions from quantitative (i.e., amount of traffic generated and existing relations) and qualitative (i.e., users' political affinity and sentiment towards the most important parties) perspectives. Results demonstrated that a non-negligible number of those bots actively participated in the election, supporting each of the five principal political parties.
The dataset at hand presents the data collected during the observation period (from October 4th, 2019 to November 11th, 2019). It includes both the anonymized tweets and the users' data.
Data have been exported in three formats to provide maximum flexibility:
- MongoDB Dump BSONs: To import these data, please refer to the official MongoDB documentation.
- JSON Exports: Both the users and the tweets collections have been exported as canonical JSON files.
- CSV Exports (tweets only): The tweet collection has been exported as a plain CSV file with comma separators.
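A minimal sketch of reading the JSON export of the tweets collection follows. The field names ("id", "text", "is_bot") are illustrative assumptions, not the dataset's documented schema; consult the dataset's accompanying documentation for the actual fields.

```python
import json

# A stand-in for the canonical JSON export of the tweets collection;
# the field names here are assumed for illustration.
sample_export = '''
[{"id": "1", "text": "anonymized tweet text", "is_bot": true},
 {"id": "2", "text": "another anonymized tweet", "is_bot": false}]
'''

tweets = json.loads(sample_export)
bot_tweets = [t for t in tweets if t.get("is_bot")]
print(len(tweets), len(bot_tweets))
```

For the BSON dumps, the standard `mongorestore` tool handles the import, as noted in the official MongoDB documentation.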
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Polish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Polish language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Polish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Polish speakers, with references drawn from diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraph types of answers. The answer contains text strings, numerical values, date and time formats as well. Such diversity strengthens the Language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Polish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
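To make the annotation schema concrete, here is a single illustrative record using the fields listed above. The field names come from the dataset description; the values are invented examples, not actual dataset content.

```python
import json

# One hypothetical annotation record; values are invented for illustration.
record = {
    "id": "pl-qa-000001",
    "language": "Polish",
    "domain": "science",
    "question_length": 12,
    "prompt_type": "instruction",
    "question_category": "fact-based",
    "question_type": "direct",
    "complexity": "medium",
    "answer_type": "short phrase",
    "rich_text": False,
}

# The JSON/CSV files ship one such record per question-answer pair.
print(json.dumps(record, ensure_ascii=False))
```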
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Polish are grammatically accurate, free of spelling and grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Polish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.archivemarketresearch.com/privacy-policy
The AI Training Dataset In Healthcare Market size was valued at USD 341.8 million in 2023 and is projected to reach USD 1464.13 million by 2032, exhibiting a CAGR of 23.1% during the forecast period. The growth is attributed to the rising adoption of AI in healthcare, increasing demand for accurate and reliable training datasets, government initiatives to promote AI in healthcare, and technological advancements in data collection and annotation. These factors are contributing to the expansion of the AI Training Dataset In Healthcare Market. Healthcare AI training datasets are vital for building effective algorithms and enhancing patient care and diagnosis in the industry. These datasets include large volumes of Electronic Health Records, images such as X-ray and MRI scans, and genomics data, all of which are thoroughly labeled. They help AI systems identify trends, make forecasts, and even help develop novel approaches to treating disease. However, patient privacy and the ethical use of patient information are of the utmost importance, requiring high levels of anonymization and compliance with laws such as HIPAA. Ongoing expansion and variety of datasets are crucial to address existing bias and improve the efficiency of AI across different populations and diseases, providing safer solutions for health worldwide.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2
Abstract The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications, such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.
Data Description The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files.
• Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022)
• Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022)
• Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022)
• Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022)
• Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022)
• Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)
The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
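A minimal sketch of loading the Tweet IDs prior to hydration follows. Hydration itself requires a Twitter API client (the twarc tool is a common choice); the file handling below assumes one ID per line, which is the conventional layout for such files, though the dataset's own formatting should be checked.

```python
from pathlib import Path

def load_tweet_ids(path):
    """Return the Tweet IDs in a TweetIDs_Part*.txt file, one ID per line."""
    return [line.strip() for line in Path(path).read_text().splitlines()
            if line.strip()]

# Demonstration with a temporary file standing in for TweetIDs_Part1.txt;
# the IDs below are invented placeholders.
demo = Path("demo_ids.txt")
demo.write_text("1523000000000000001\n1523000000000000002\n")
ids = load_tweet_ids(demo)
demo.unlink()
print(len(ids))
```

The resulting ID list can then be passed to a hydration tool to retrieve the full tweet objects via the Twitter API.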
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provides a collection of behaviour biometrics data (commonly known as Keyboard, Mouse and Touchscreen (KMT) dynamics). The data was collected for use in a FinTech research project undertaken by academics and researchers at Computer Science Department, Edge Hill University, United Kingdom. The project called CyberSIgnature uses KMT dynamics data to distinguish between legitimate card owners and fraudsters. An application was developed that has a graphical user interface (GUI) similar to a standard online card payment form including fields for card type, name, card number, card verification code (cvc) and expiry date. Then, user KMT dynamics were captured while they entered fictitious card information on the GUI application.
The dataset consists of 1,760 KMT dynamic instances collected over 88 user sessions on the GUI application. Each user session involves 20 iterations of data entry in which the user is assigned fictitious card information (drawn at random from a pool) to enter 10 times, and is subsequently presented with 10 additional card details, each to be entered once. The 10 additional card details are drawn from a pool that has been assigned, or is to be assigned, to other users. A KMT data instance is collected during each data entry iteration. Thus, a total of 20 KMT data instances (i.e., 10 legitimate and 10 illegitimate) were collected during each user entry session on the GUI application.
The raw dataset is stored in .json format within 88 separate files. The root folder, named `behaviour_biometrics_dataset`, consists of two sub-folders, `raw_kmt_dataset` and `feature_kmt_dataset`, and a Jupyter notebook file (`kmt_feature_classification.ipynb`). The folder and file contents are described below:
-- `raw_kmt_dataset`: this folder contains 88 files, each named `raw_kmt_user_n.json`, where n is a number from 0001 to 0088. Each file contains 20 instances of KMT dynamics data corresponding to a given fictitious card, and the data instances are equally split between legitimate (n = 10) and illegitimate (n = 10) classes. The legitimate class corresponds to KMT dynamics captured from the user assigned to the card detail, while the illegitimate class corresponds to KMT dynamics data collected from other users entering the same card detail.
-- `feature_kmt_dataset`: this folder contains two sub-folders, namely `feature_kmt_json` and `feature_kmt_xlsx`. Each contains 88 files (of the relevant format: .json or .xlsx), each named `feature_kmt_user_n`, where n is a number from 0001 to 0088. Each file contains 20 instances of features extracted from the corresponding `raw_kmt_user_n` file, including the class labels (legitimate = 1 or illegitimate = 0).
-- `kmt_feature_classification.ipynb`: this file contains the Python code necessary to generate features from the raw KMT files and apply a simple machine learning classification task to generate results. The code is designed to run with minimal effort from the user.
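As an illustration of the workflow the notebook automates, here is a minimal sketch of loading one feature file and separating instances by the documented class labels (legitimate = 1, illegitimate = 0). The per-instance field names ("features", "label") are assumptions about the file layout, not the dataset's verified schema.

```python
import json

# A stand-in for the contents of one feature_kmt_user_n.json file;
# the "features"/"label" field names are assumed for illustration.
sample_file = json.dumps([
    {"features": [0.12, 0.30, 0.85], "label": 1},   # legitimate card owner
    {"features": [0.40, 0.11, 0.22], "label": 0},   # illegitimate entry
])

instances = json.loads(sample_file)
legit = [i for i in instances if i["label"] == 1]
illegit = [i for i in instances if i["label"] == 0]
print(len(legit), len(illegit))
```

In the full dataset, each of the 88 files would yield 10 instances per class, which can then feed a binary classifier as done in the accompanying notebook.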
Quality Assurance: The training data undergoes rigorous quality assurance processes, ensuring that it meets the highest standards for accuracy, relevance, and consistency, crucial for effective model training.
Versatile Applications: APISCRAPY's AI & ML Training Data is versatile and applicable to various AI and ML applications, including image recognition, natural language processing, and other advanced AI-driven functionalities.
APISCRAPY's services are highly scalable, ensuring you have access to the necessary resources when you need them. With real-time data feeds, data curation by experts, constant updates, and cost-efficiency, we are dedicated to providing high-value AI & ML Training Data solutions, ensuring your models remain current and effective.
[Related tags: AI Training Data, Textual Data, Machine Learning (ML) Data, Deep Learning (DL) Data, Annotated Imagery Data, Synthetic Data, Audio Data, Large Language Model (LLM) Data, ML Training Data, Generative AI Data, Code Base Training Data, Healthcare Training Data, Audio Annotation Services, AI-assisted Labeling, Natural Language Processing (NLP) Data, Audio & Speech Training Data, Image Training Data, Video Training Data]