100+ datasets found

Artificial Intelligence (AI) awareness, use and impact, Great Britain
ons.gov.uk
cy.ons.gov.uk
xlsx
Updated Jun 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office for National Statistics (2023). Artificial Intelligence (AI) awareness, use and impact, Great Britain [Dataset]. https://www.ons.gov.uk/businessindustryandtrade/itandinternetindustry/datasets/artificialintelligenceaiawarenessuseandimpactgreatbritain
Explore at:
xlsxAvailable download formats
Dataset updated
Jun 16, 2023
Dataset provided by
Office for National Statisticshttp://www.ons.gov.uk/
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Area covered
United Kingdom
Description
Data from the Opinion and Lifestyle Survey (OPN) on the use of Artificial Intelligence (AI) and how people feel about its uptake in today’s society.
Z
Data from: TWIGMA: A dataset of AI-Generated Images with Metadata From...
data.niaid.nih.gov
zenodo.org
Updated May 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
James Zou (2024). TWIGMA: A dataset of AI-Generated Images with Metadata From Twitter [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8031784
Explore at:
Dataset updated
May 28, 2024
Dataset provided by
James Zou
Yiqun Chen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Update May 2024: Fixed a data type issue with "id" column that prevented twitter ids from rendering correctly.

Recent progress in generative artificial intelligence (gen-AI) has enabled the generation of photo-realistic and artistically-inspiring photos at a single click, catering to millions of users online. To explore how people use gen-AI models such as DALLE and StableDiffusion, it is critical to understand the themes, contents, and variations present in the AI-generated photos. In this work, we introduce TWIGMA (TWItter Generative-ai images with MetadatA), a comprehensive dataset encompassing 800,000 gen-AI images collected from Jan 2021 to March 2023 on Twitter, with associated metadata (e.g., tweet text, creation date, number of likes).

Through a comparative analysis of TWIGMA with natural images and human artwork, we find that gen-AI images possess distinctive characteristics and exhibit, on average, lower variability when compared to their non-gen-AI counterparts. Additionally, we find that the similarity between a gen-AI image and human images (i) is correlated with the number of likes; and (ii) can be used to identify human images that served as inspiration for the gen-AI creations. Finally, we observe a longitudinal shift in the themes of AI-generated images on Twitter, with users increasingly sharing artistically sophisticated content such as intricate human portraits, whereas their interest in simple subjects such as natural scenes and animals has decreased. Our analyses and findings underscore the significance of TWIGMA as a unique data resource for studying AI-generated images.

Note that in accordance with the privacy and control policy of Twitter, NO raw content from Twitter is included in this dataset and users could and need to retrieve the original Twitter content used for analysis using the Twitter id. In addition, users who want to access Twitter data should consult and follow rules and regulations closely at the official Twitter developer policy at https://developer.twitter.com/en/developer-terms/policy.
A
AI Training Dataset In Healthcare Market Report
archivemarketresearch.com
doc, pdf, ppt
Updated Jun 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archive Market Research (2025). AI Training Dataset In Healthcare Market Report [Dataset]. https://www.archivemarketresearch.com/reports/ai-training-dataset-in-healthcare-market-5352
Explore at:
pdf, ppt, docAvailable download formats
Dataset updated
Jun 20, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
global
Variables measured
Market Size
Description
The AI Training Dataset In Healthcare Market size was valued at USD 341.8 million in 2023 and is projected to reach USD 1464.13 million by 2032, exhibiting a CAGR of 23.1 % during the forecasts period. The growth is attributed to the rising adoption of AI in healthcare, increasing demand for accurate and reliable training datasets, government initiatives to promote AI in healthcare, and technological advancements in data collection and annotation. These factors are contributing to the expansion of the AI Training Dataset In Healthcare Market. Healthcare AI training data sets are vital for building effective algorithms, and enhancing patient care and diagnosis in the industry. These datasets include large volumes of Electronic Health Records, images such as X-ray and MRI scans, and genomics data which are thoroughly labeled. They help the AI systems to identify trends, forecast and even help in developing unique approaches to treating the disease. However, patient privacy and ethical use of a patient’s information is of the utmost importance, thus requiring high levels of anonymization and compliance with laws such as HIPAA. Ongoing expansion and variety of datasets are crucial to address existing bias and improve the efficiency of AI for different populations and diseases to provide safer solutions for global people’s health.
F
Polish Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Polish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/polish-open-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Polish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Polish language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Polish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Polish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains the question with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraph types of answers. The answer contains text strings, numerical values, date and time formats as well. Such diversity strengthens the Language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Polish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the question and answers in Polish are grammatically accurate without any word or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Polish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative ai models, improve response generation, and explore new approaches to NLP question-answering tasks.
d
M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and...
datarade.ai
Updated Sep 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M-ART (2025). M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and releases | Dataset for AI & ML [Dataset]. https://datarade.ai/data-products/m-art-video-data-global-100-000-stock-videos-includin-m-art
Explore at:
.csv, .jpeg, .mp4, .movAvailable download formats
Dataset updated
Sep 11, 2025
Dataset authored and provided by
M-ART
Area covered
Andorra, El Salvador, Bangladesh, Saint Helena, Estonia, Curaçao, Tunisia, Benin, Chad, Paraguay
Description
"Collection of 100,000 high-quality video clips across diverse real-world domains, designed to accelerate the training and optimization of computer vision and multimodal AI models."

Overview This dataset contains 100,000 proprietary and partner-produced video clips filmed in 4K/6K with cinema-grade RED cameras. Each clip is commercially cleared with full releases, structured metadata, and available in RAW or MOV/MP4 formats. The collection spans a wide variety of domains — people and lifestyle, healthcare and medical, food and cooking, office and business, sports and fitness, nature and landscapes, education, and more. This breadth ensures robust training data for computer vision, multimodal, and machine learning projects.

The data set All 100,000 videos have been reviewed for quality and compliance. The dataset is optimized for AI model training, supporting use cases from face and activity recognition to scene understanding and generative AI. Custom datasets can also be produced on demand, enabling clients to close data gaps with tailored, high-quality content.

About M-ART M-ART is a leading provider of cinematic-grade datasets for AI training. With extensive expertise in large-scale content production and curation, M-ART delivers both ready-to-use video datasets and fully customized collections. All data is proprietary, rights-cleared, and designed to help global AI leaders accelerate research, development, and deployment of next-generation models.
Global impact of AI and big-data analytics on jobs 2023-2027
statista.com
Updated Jun 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Global impact of AI and big-data analytics on jobs 2023-2027 [Dataset]. https://www.statista.com/statistics/1383919/ai-bigdata-impact-jobs/
Explore at:
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Nov 2022 - Feb 2023
Area covered
Worldwide
Description
Between 2023 and 2027, the majority of companies surveyed worldwide expect big data to have a more positive than negative impact on the global job market and employment, with ** percent of the companies reporting the technology will create jobs and * percent expecting the technology to displace jobs. Meanwhile, artificial intelligence (AI) is expected to result in more significant labor market disruptions, with ** percent of organizations expecting the technology to displace jobs and ** percent expecting AI to create jobs.
AI corporate investment worldwide 2015-2022
statista.com
Updated Jun 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). AI corporate investment worldwide 2015-2022 [Dataset]. https://www.statista.com/statistics/941137/ai-investment-and-funding-worldwide/
Explore at:
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
Worldwide
Description
In 2022, the global total corporate investment in artificial intelligence (AI) reached almost ** billion U.S. dollars, a slight decrease from the previous year. In 2018, the yearly investment in AI saw a slight downturn, but that was only temporary. Private investments account for a bulk of total AI corporate investment. AI investment has increased more than ******* since 2016, a staggering growth in any market. It is a testament to the importance of the development of AI around the world. What is Artificial Intelligence (AI)? Artificial intelligence, once the subject of people’s imaginations and the main plot of science fiction movies for decades, is no longer a piece of fiction, but rather commonplace in people’s daily lives whether they realize it or not. AI refers to the ability of a computer or machine to imitate the capacities of the human brain, which often learns from previous experiences to understand and respond to language, decisions, and problems. These AI capabilities, such as computer vision and conversational interfaces, have become embedded throughout various industries’ standard business processes. AI investment and startups The global AI market, valued at ***** billion U.S. dollars as of 2023, continues to grow driven by the influx of investments it receives. This is a rapidly growing market, looking to expand from billions to trillions of U.S. dollars in market size in the coming years. From 2020 to 2022, investment in startups globally, and in particular AI startups, increased by **** billion U.S. dollars, nearly double its previous investments, with much of it coming from private capital from U.S. companies. The most recent top-funded AI businesses are all machine learning and chatbot companies, focusing on human interface with machines.
e
Domestic Artificial Intelligence Delphi Survey, 2020-2021 - Dataset - B2FIND...
b2find.eudat.eu
Updated Oct 23, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Domestic Artificial Intelligence Delphi Survey, 2020-2021 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/87d909f7-8bbd-5a93-8319-6735139fa148
Explore at:
Dataset updated
Oct 23, 2023
Description
A main aim of the study was understand how experts predict the automation of unpaid domestic work. To do so, we conducted a Delphi survey with technology experts in the UK and Japan. The data set includes answers collected from a forecasting exercise in which 65 AI experts from the UK (29 respondents) and Japan (36 respondents) were asked to estimate how automatable 17 housework and care work tasks are in the next 5 to 10 years. The experts were also asked to estimate the cost of the automations. In addition, background information, such as the experts' gender, age and field of expertise, were collected. Based on the respondents answers, the Delphi survey shows that on average 27% of time that people currently spend on doing unpaid domestic work could be automated in the next 5 years, and 39% in the next 10 years.This project brings unpaid domestic work into the discussion of AI and the future of labour and predicts the degree of automation of unpaid work in two distinct countries – the UK and Japan. To do this we evaluate the technological likelihood of automatibility of domestic work tasks using a grid of 17 such tasks identified in the UK Time Use Survey 2014-15 and the Japanese Survey on Time Use and Leisure Activities 2016. We use a panel consisting of technology experts to assess how quickly AI-powered domestic technologies will become not only technologically possible but also affordable for households. For the Delphi survey we recruited 65 respondents who are technology experts. 29 respondents are based in the UK, 36 are based in Japan. We consider "technology experts" as people with expert knowledge in AI or AI related technologies, including machine learning, robotics, or the social and/or business aspects of AI related technologies. Our approach was to recruit a balanced number of female and male respondents, as well as a balanced number in three different professional fields: academia, research and development, and business. We recruited the respondents through our own network, through snowball sampling, and through desktop research. We contacted the experts via email and LinkedIn, sending them the invitation to participate in the Delphi survey. Respondents based in the UK were contacted by the UK team using English as the communication language, and respondents based in Japan were contacted by the Japanese team using Japanese as the correspondence language. While Japanese respondents received a small monetary compensation, as it is expected in Japan, the UK respondents did not receive any monetary compensation.
Success.ai | | US Premium B2B Emails & Phone Numbers Dataset - APIs and flat...
datarade.ai
Updated Oct 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Success.ai (2024). Success.ai | | US Premium B2B Emails & Phone Numbers Dataset - APIs and flat files available – 170M+, Verified Profiles - Best Price Guarantee [Dataset]. https://datarade.ai/data-products/success-ai-us-premium-b2b-emails-phone-numbers-dataset-success-ai
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Oct 12, 2024
Dataset provided by
Area covered
United States
Description
Success.ai offers a comprehensive, enterprise-ready B2B leads data solution, ideal for businesses seeking access to over 150 million verified employee profiles and 170 million work emails. Our data empowers organizations across industries to target key decision-makers, optimize recruitment, and fuel B2B marketing efforts. Whether you're looking for UK B2B data, B2B marketing data, or global B2B contact data, Success.ai provides the insights you need with pinpoint accuracy.

Tailored for B2B Sales, Marketing, Recruitment and more: Our B2B contact data and B2B email data solutions are designed to enhance your lead generation, sales, and recruitment efforts. Build hyper-targeted lists based on job title, industry, seniority, and geographic location. Whether you’re reaching mid-level professionals or C-suite executives, Success.ai delivers the data you need to connect with the right people.

API Features:

Real-Time Updates: Our APIs deliver real-time updates, ensuring that the contact data your business relies on is always current and accurate.

High Volume Handling: Designed to support up to 860k API calls per day, our system is built for scalability and responsiveness, catering to enterprises of all sizes.

Flexible Integration: Easily integrate with CRM systems, marketing automation tools, and other enterprise applications to streamline your workflows and enhance productivity.

Key Categories Served: B2B sales leads – Identify decision-makers in key industries, B2B marketing data – Target professionals for your marketing campaigns, Recruitment data – Source top talent efficiently and reduce hiring times, CRM enrichment – Update and enhance your CRM with verified, updated data, Global reach – Coverage across 195 countries, including the United States, United Kingdom, Germany, India, Singapore, and more.

Global Coverage with Real-Time Accuracy: Success.ai’s dataset spans a wide range of industries such as technology, finance, healthcare, and manufacturing. With continuous real-time updates, your team can rely on the most accurate data available: 150M+ Employee Profiles: Access professional profiles worldwide with insights including full name, job title, seniority, and industry. 170M Verified Work Emails: Reach decision-makers directly with verified work emails, available across industries and geographies, including Singapore and UK B2B data. GDPR-Compliant: Our data is fully compliant with GDPR and other global privacy regulations, ensuring safe and legal use of B2B marketing data.

Key Data Points for Every Employee Profile: Every profile in Success.ai’s database includes over 20 critical data points, providing the information needed to power B2B sales and marketing campaigns: Full Name, Job Title, Company, Work Email, Location, Phone Number, LinkedIn Profile, Experience, Education, Technographic Data, Languages, Certifications, Industry, Publications & Awards.

Use Cases Across Industries: Success.ai’s B2B data solution is incredibly versatile and can support various enterprise use cases, including: B2B Marketing Campaigns: Reach high-value professionals in industries such as technology, finance, and healthcare. Enterprise Sales Outreach: Build targeted B2B contact lists to improve sales efforts and increase conversions. Talent Acquisition: Accelerate hiring by sourcing top talent with accurate and updated employee data, filtered by job title, industry, and location. Market Research: Gain insights into employment trends and company profiles to enrich market research. CRM Data Enrichment: Ensure your CRM stays accurate by integrating updated B2B contact data. Event Targeting: Create lists for webinars, conferences, and product launches by targeting professionals in key industries.

Use Cases for Success.ai's Contact Data - Targeted B2B Marketing: Create precise campaigns by targeting key professionals in industries like tech and finance. - Sales Outreach: Build focused sales lists of decision-makers and C-suite executives for faster deal cycles. - Recruiting Top Talent: Easily find and hire qualified professionals with updated employee profiles. - CRM Enrichment: Keep your CRM current with verified, accurate employee data. - Event Targeting: Create attendee lists for events by targeting relevant professionals in key sectors. - Market Research: Gain insights into employment trends and company profiles for better business decisions. - Executive Search: Source senior executives and leaders for headhunting and recruitment. - Partnership Building: Find the right companies and key people to develop strategic partnerships.

Why Choose Success.ai’s Employee Data? Success.ai is the top choice for enterprises looking for comprehensive and affordable B2B data solutions. Here’s why: Unmatched Accuracy: Our AI-powered validation process ensures 99% accuracy across all data points, resulting in higher engagement and fewer bounces. Global Scale: With 150M+ employee profiles and 170M veri...
R
People Ai Dataset
universe.roboflow.com
zip
Updated Jun 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Milestone01 (2025). People Ai Dataset [Dataset]. https://universe.roboflow.com/milestone01/people-ai
Explore at:
zipAvailable download formats
Dataset updated
Jun 23, 2025
Dataset authored and provided by
Milestone01
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
People Bounding Boxes
Description
People AI

## Overview People AI is a dataset for object detection tasks - it contains People annotations for 1,250 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
d
Customer Service Call Dataset [Multisector] – Annotated support transcripts...
datarade.ai
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
Explore at:
.json, .csv, .xls, .txtAvailable download formats
Dataset updated
Apr 11, 2025
Dataset provided by
WiserBrand.com
Area covered
United States of America
Description
"This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

Included in each record:

Full call transcription with labeled speakers (system, agent, customer)

Concise human-written summary of the conversation

Sentiment tag for the overall interaction: positive, neutral, or negative

Company name, duration, and geographic location of the caller

Call context includes industries such as eCommerce, banking, telecom, and streaming services

Common use cases:

Train NLP models to understand support calls and detect churn risk

Power complaint detection engines for customer success and support teams

Create high-quality LLM training sets with real support narratives

Build summarization and topic tagging pipelines for CX dashboards

Analyze tone shifts and resolution language in customer-agent interaction

This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."
Face Dataset Of People That Don't Exist
kaggle.com
Updated Sep 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BwandoWando (2023). Face Dataset Of People That Don't Exist [Dataset]. http://doi.org/10.34740/kaggle/dsv/6433550
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/6433550
Dataset updated
Sep 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
BwandoWando
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

All the images of faces here are generated using https://thispersondoesnotexist.com/

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1842206%2F4c3d3569f4f9c12fc898d76390f68dab%2FBeFunky-collage.jpg?generation=1662079836729388&alt=media" alt="">

Copyrighting of AI Generated images

Under US copyright law, these images are technically not subject to copyright protection. Only "original works of authorship" are considered. "To qualify as a work of 'authorship' a work must be created by a human being," according to a US Copyright Office's report [PDF].

https://www.theregister.com/2022/08/14/ai_digital_artwork_copyright/

Tagging

I manually tagged all images as best as I could and separated them between the two classes below

Female- 3860 images

Male- 3013 images

Some may pass either female or male, but I will leave it to you to do the reviewing. I included toddlers and babies under Male/ Female

How it works

Each of the faces are totally fake, created using an algorithm called Generative Adversarial Networks (GANs).

A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in June 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent's gain is another agent's loss).

Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning, fully supervised learning,and reinforcement learning.

https://www.youtube.com/watch?v=u8qPvzk0AfY

https://www.youtube.com/watch?v=dCKbRCUyop8

https://www.youtube.com/watch?v=SWoravHhsUU

Github implementation of website

https://github.com/NVlabs/stylegan2

https://github.com/lucidrains/stylegan2-pytorch

https://github.com/lucidrains/lightweight-gan

How I gathered the images

Just a simple Jupyter notebook that looped and invoked the website https://thispersondoesnotexist.com/ , saving all images locally
People Data Labs - Person Dataset
datarade.ai
.json, .csv
Updated Jul 6, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
People Data Labs (2020). People Data Labs - Person Dataset [Dataset]. https://datarade.ai/data-products/global-license
Explore at:
.json, .csvAvailable download formats
Dataset updated
Jul 6, 2020
Dataset provided by
People Data Labs Inc.
Authors
People Data Labs
Area covered
United States of America, Anguilla, Maldives, Tunisia, Wallis and Futuna, Bosnia and Herzegovina, Guinea, Afghanistan, Antarctica, Kenya
Description
People Data Labs is an aggregator of B2B person and company data. We source our globally compliant person dataset via our "Data Union".

The "Data Union" is our proprietary data sharing co-op. Customers opt-in to sharing their data and warrant that their data is fully compliant with global data privacy regulations. Some data sources are provided as a one time dump, others are refreshed every time we do a new data build. Our data sources come from a variety of verticals including HR Tech, Real Estate Tech, Identity/Anti-Fraud, Martech, and others. People Data Labs works with customers on compliance based topics. If a customer wishes to ensure anonymity, we work with them to anonymize the data.

Our person data has over 100 fields including resume data (work history, education), contact information (email, phone), demographic info (name, gender, birth date) and social profile information (linkedin, github, twitter, facebook, etc...).
m
Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...
data.mendeley.com
Updated Jul 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nirmalya Thakur (2022). MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak [Dataset]. http://doi.org/10.17632/xmcg82mx9k.3
Explore at:
Unique identifier
https://doi.org/10.17632/xmcg82mx9k.3
Dataset updated
Jul 25, 2022
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset: N. Thakur, “MonkeyPox2022Tweets: The first public Twitter dataset on the 2022 MonkeyPox outbreak,” Preprints, 2022, DOI: 10.20944/preprints202206.0172.v2

Abstract The world is currently facing an outbreak of the monkeypox virus, and confirmed cases have been reported from 28 countries. Following a recent “emergency meeting”, the World Health Organization just declared monkeypox a global health emergency. As a result, people from all over the world are using social media platforms, such as Twitter, for information seeking and sharing related to the outbreak, as well as for familiarizing themselves with the guidelines and protocols that are being recommended by various policy-making bodies to reduce the spread of the virus. This is resulting in the generation of tremendous amounts of Big Data related to such paradigms of social media behavior. Mining this Big Data and compiling it in the form of a dataset can serve a wide range of use-cases and applications such as analysis of public opinions, interests, views, perspectives, attitudes, and sentiment towards this outbreak. Therefore, this work presents MonkeyPox2022Tweets, an open-access dataset of Tweets related to the 2022 monkeypox outbreak that were posted on Twitter since the first detected case of this outbreak on May 7, 2022. The dataset is compliant with the privacy policy, developer agreement, and guidelines for content redistribution of Twitter, as well as with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data management.

Data Description The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 to 23rd July 2022 (the most recent date at the time of dataset upload). The Tweet IDs are presented in 6 different .txt files based on the timelines of the associated tweets. The following provides the details of these dataset files. • Filename: TweetIDs_Part1.txt (No. of Tweet IDs: 13926, Date Range of the Tweet IDs: May 7, 2022 to May 21, 2022) • Filename: TweetIDs_Part2.txt (No. of Tweet IDs: 17705, Date Range of the Tweet IDs: May 21, 2022 to May 27, 2022) • Filename: TweetIDs_Part3.txt (No. of Tweet IDs: 17585, Date Range of the Tweet IDs: May 27, 2022 to June 5, 2022) • Filename: TweetIDs_Part4.txt (No. of Tweet IDs: 19718, Date Range of the Tweet IDs: June 5, 2022 to June 11, 2022) • Filename: TweetIDs_Part5.txt (No. of Tweet IDs: 47718, Date Range of the Tweet IDs: June 12, 2022 to June 30, 2022) • Filename: TweetIDs_Part6.txt (No. of Tweet IDs: 138711, Date Range of the Tweet IDs: July 1, 2022 to July 23, 2022)

The dataset contains only Tweet IDs in compliance with the terms and conditions mentioned in the privacy policy, developer agreement, and guidelines for content redistribution of Twitter. The Tweet IDs need to be hydrated to be used.
Data from: Using Artificial Intelligence (AI)? Risk and Opportunity...
figshare.com
txt
Updated Sep 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rebekka Schwesig; Nadia Said (2023). Using Artificial Intelligence (AI)? Risk and Opportunity Perception of AI Predict People’s Willingness to Use AI [Dataset]. http://doi.org/10.6084/m9.figshare.20589597.v5
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20589597.v5
Dataset updated
Sep 18, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Rebekka Schwesig; Nadia Said
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset study1 and study2 / R code study1 and study 2
MultiSocial
zenodo.org
data.niaid.nih.gov
Updated Aug 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13846152
Dataset updated
Aug 20, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MultiSocial is a dataset (described in a paper) for multilingual (22 languages) machine-generated text detection benchmark in social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models by using 3 iterations of paraphrasing. The dataset has been anonymized to minimize amount of sensitive data by hiding email addresses, usernames, and phone numbers.

If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

Disclaimer

Due to data source (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (lower probability to include machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hiden all the sensitive/personal content; thus, use the data cautiously (if feeling affected by such content, report the found issues in this regard to dpo[at]kinit.sk). The intended use if for non-commercial research purpose only.

Data Source

The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022, combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.

The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

The dataset has the following fields:

'text' - a text sample,

'label' - 0 for human-written text, 1 for machine-generated text,

'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

'language' - the ISO 639-1 language code identifying the detected language of the given text,

'length' - word count of the given text,

'source' - a string identifying the source dataset / platform of the given text,

'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.

ToDo Statistics (under construction)
e
Dataset for: Do managers accept artificial intelligence? Insights into the...
b2find.eudat.eu
Updated Feb 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Dataset for: Do managers accept artificial intelligence? Insights into the role of business area and AI functionality - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/616df94a-5444-5d31-95a4-05b3516f798f
Explore at:
Dataset updated
Feb 20, 2025
Description
More and more companies use artificial intelligence (AI). Research aimed to understand acceptance from the perspective of AI users or people affected by AI decisions. However, the perspective of decision-makers in companies (i.e., managers) has not been considered. To address this gap, we investigate managers’ acceptance of AI usage in companies, focusing on two potential determinants. Across four experimental studies (Ntotal = 2025), we tested whether the business area (i.e., human resources vs. finances/ marketing) and AI functionality affect managers’ acceptance of AI (i.e., perceived risk of negative consequences, willingness to invest). Findings indicate that managers (a) perceive more risk of and (b) are less willing to invest in AI usage in human resources than in finances and marketing. Besides, the results suggest that acceptance declines if functionality crosses a critical boundary and AI autonomously implements decisions without prior human control. Accordingly, the current research sheds light on the AI acceptance of managers and gives insights into the role of the business area and AI functionality.
11,130 People – Real Surveillance Re-ID Dataset (Indoor & Outdoor)
nexdata.ai
m.nexdata.ai
Updated Sep 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2025). 11,130 People – Real Surveillance Re-ID Dataset (Indoor & Outdoor) [Dataset]. https://www.nexdata.ai/datasets/computervision/1160
Explore at:
Dataset updated
Sep 10, 2025
Dataset authored and provided by
Nexdata
Variables measured
Device, Data size, Data format, Data diversity, Annotation content, Quality Requirements, Collecting environment, Population distribution
Description
The data includes indoor scenes and outdoor scenes. The data includes males and females, and the age distribution is from children to the elderly. The data diversity includes different age groups, different time periods, different shooting angles, different human body orientations and postures, seasonal clothing. For annotation, the rectangular bounding boxes and 15 attributes of human body were annotated. This data can be used for re-ID, person tracking, AI/computer vision research in realistic surveillance environments and other tasks.
d
25M+ Images | AI Training Data | Annotated imagery data for AI | Object &...
datarade.ai
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Seeds, 25M+ Images | AI Training Data | Annotated imagery data for AI | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/15m-images-ai-training-data-annotated-imagery-data-for-a-data-seeds
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset authored and provided by
Data Seeds
Area covered
Yemen, Barbados, French Polynesia, Iceland, Saint Lucia, United Arab Emirates, Liberia, Morocco, Virgin Islands (U.S.), Nepal
Description
This dataset features over 25,000,000 high-quality general-purpose images sourced from photographers worldwide. Designed to support a wide range of AI and machine learning applications, it offers a richly diverse and extensively annotated collection of everyday visual content.

Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data, detailing camera settings such as aperture, ISO, shutter speed, and focal length. Additionally, each image is pre-annotated with object and scene detection metadata, making it ideal for tasks like classification, detection, and segmentation. Popularity metrics, derived from engagement on our proprietary platform, are also included.

2.Unique Sourcing Capabilities: the images are collected through a proprietary gamified platform for photographers. Competitions spanning various themes ensure a steady influx of diverse, high-quality submissions. Custom datasets can be sourced on-demand within 72 hours, allowing for specific requirements—such as themes, subjects, or scenarios—to be met efficiently.

Global Diversity: photographs have been sourced from contributors in over 100 countries, covering a wide range of human experiences, cultures, environments, and activities. The dataset includes images of people, nature, objects, animals, urban and rural life, and more—captured across different times of day, seasons, and lighting conditions.

High-Quality Imagery: the dataset includes images with resolutions ranging from standard to high-definition to meet the needs of various projects. Both professional and amateur photography styles are represented, offering a balance of realism and creativity across visual domains.

Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. This unique metric reflects how well the image resonates with a global audience, offering an additional layer of insight for AI models focused on aesthetics, engagement, or content curation.

AI-Ready Design: this dataset is optimized for AI applications, making it ideal for training models in general image recognition, multi-label classification, content filtering, and scene understanding. It integrates easily with leading machine learning frameworks and pipelines.

Licensing & Compliance: the dataset complies fully with data privacy regulations and offers transparent licensing for both commercial and academic use.

Use Cases: 1. Training AI models for general-purpose image classification and tagging. 2. Enhancing content moderation and visual search systems. 3. Building foundational datasets for large-scale vision-language models. 4. Supporting research in computer vision, multimodal AI, and generative modeling.

This dataset offers a comprehensive, diverse, and high-quality resource for training AI and ML models across a wide array of domains. Customizations are available to suit specific project needs. Contact us to learn more!
F
Canadian French Call Center Data for Healthcare AI
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Canadian French Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-french-canada
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Canada, French
Dataset funded by
FutureBeeAI
Description
Introduction
This Canadian French Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of French speech recognition, spoken language understanding, and conversational AI systems. With 30 Hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.
Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.
Speech Data
The dataset features 30 Hours of dual-channel call center conversations between native Canadian French speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.
•Participant Diversity:
•
Speakers: 60 verified native Canadian French speakers from our contributor community.

•
Regions: Diverse provinces across Canada to ensure broad dialectal representation.

•
Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.

•RecordingDetails:
•
Conversation Nature: Naturally flowing, unscripted conversations.

•
Call Duration: Each session ranges between 5 to 15 minutes.

•
Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.

•
Recording Environment: Captured in clear conditions without background noise or echo.

Topic Diversity
The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).
•Inbound Calls:
•Appointment Scheduling
•New Patient Registration
•Surgical Consultation
•Dietary Advice and Consultations
•Insurance Coverage Inquiries
•Follow-up Treatment Requests, and more
•OutboundCalls:
•Appointment Reminders
•Preventive Care Campaigns
•Test Results & Lab Reports
•Health Risk Assessment Calls
•Vaccination Updates
•Wellness Subscription Outreach, and more
These real-world interactions help build speech models that understand healthcare domain nuances and user intent.
Transcription
Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.
•Transcription Includes:
•Speaker-identified Dialogues
•Time-coded Segments
•Non-speech Annotations (e.g., silence, cough)
•High transcription accuracy with word error rate is below 5%, backed by dual-layer QA checks.
Metadata
Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.
•
Participant Metadata: ID, gender, age, region, accent, and dialect.

•
Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

Usage and Applications
This dataset can be used across a range of healthcare and voice AI use cases:
•
<b

Facebook

Twitter

Click to copy link

Link copied

Cite

Office for National Statistics (2023). Artificial Intelligence (AI) awareness, use and impact, Great Britain [Dataset]. https://www.ons.gov.uk/businessindustryandtrade/itandinternetindustry/datasets/artificialintelligenceaiawarenessuseandimpactgreatbritain

Artificial Intelligence (AI) awareness, use and impact, Great Britain

Explore at:

xlsxAvailable download formats

Dataset updated

Jun 16, 2023

Dataset provided by

Office for National Statisticshttp://www.ons.gov.uk/

License

Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically

Area covered

United Kingdom

Description

Data from the Opinion and Lifestyle Survey (OPN) on the use of Artificial Intelligence (AI) and how people feel about its uptake in today’s society.

Clear search

Close search

Google apps

Main menu

Artificial Intelligence (AI) awareness, use and impact, Great Britain

Data from: TWIGMA: A dataset of AI-Generated Images with Metadata From...

AI Training Dataset In Healthcare Market Report

Polish Open Ended Question Answer Text Dataset

Dataset Content:

Question Diversity:

Answer Formats:

Data Format and Annotation Details:

Quality and Accuracy:

Continuous Updates and Customization:

License:

M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and...

Global impact of AI and big-data analytics on jobs 2023-2027

AI corporate investment worldwide 2015-2022

Domestic Artificial Intelligence Delphi Survey, 2020-2021 - Dataset - B2FIND...

Success.ai | | US Premium B2B Emails & Phone Numbers Dataset - APIs and flat...

People Ai Dataset

People AI

Customer Service Call Dataset [Multisector] – Annotated support transcripts...

Face Dataset Of People That Don't Exist

Context

Copyrighting of AI Generated images

Tagging

How it works

Github implementation of website

How I gathered the images

People Data Labs - Person Dataset

Data from: MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022...

Data from: Using Artificial Intelligence (AI)? Risk and Opportunity...

MultiSocial

Disclaimer

Data Source

Dataset for: Do managers accept artificial intelligence? Insights into the...

11,130 People – Real Surveillance Re-ID Dataset (Indoor & Outdoor)

25M+ Images | AI Training Data | Annotated imagery data for AI | Object &...

Canadian French Call Center Data for Healthcare AI

Introduction

Speech Data

Topic Diversity

Transcription

Metadata

Usage and Applications

Artificial Intelligence (AI) awareness, use and impact, Great Britain