Traffic analytics, rankings, and competitive metrics for openai.com as of May 2025
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for OpenAI HumanEval
Dataset Summary
The HumanEval dataset released by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure they were not included in the training sets of code generation models.
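As an illustration of how HumanEval-style problems are typically checked, the sketch below executes a prompt plus a model completion and then runs the problem's unit tests. The toy problem dict only mimics the dataset's fields (prompt, canonical_solution, test, entry_point); it is not an actual dataset entry.

```python
# Toy problem mimicking HumanEval's fields; NOT a real dataset entry.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

def run_problem(problem, completion):
    """Define the candidate function from prompt + completion, then run the unit tests."""
    ns = {}
    exec(problem["prompt"] + completion, ns)   # build the full function definition
    exec(problem["test"], ns)                  # defines check()
    ns["check"](ns[problem["entry_point"]])    # raises AssertionError on failure
    return True

ok = run_problem(problem, problem["canonical_solution"])  # reference solution passes
```

Note that real evaluation harnesses sandbox this `exec` step, since model completions are untrusted code.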
Supported Tasks and Leaderboards
Languages
The programming problems are written in Python and contain English natural-language text in comments and docstrings. … See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw data and questionnaire - Research Transformation: change in the era of AI, open and impact. To explore how the research world is transforming, Digital Science released Research Transformation: change in the era of AI, open and impact to uncover how technology, open access, funding policies and other drivers of change are impacting the academic community. This is the anonymised raw data and questionnaire for the Research Transformation survey, developed by Digital Science and fielded between 29 May and 12 July 2024. The survey looked to understand how the research world is transforming, what's influencing change and how roles are impacted. A total of 380 respondents from 70 countries in Europe, the Middle East, and Africa (EMEA), Asia-Pacific (APAC) and the Americas participated in the survey. They held roles within the academic library, research office, faculty and leadership.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generative Artificial Intelligence (AI) models such as OpenAI's ChatGPT have the potential to revolutionize Statistical Process Control (SPC) practice, learning, and research. However, these tools are in the early stages of development and can be easily misused or misunderstood. In this paper, we give an overview of the development of Generative AI. Specifically, we explore ChatGPT's ability to provide code, explain basic concepts, and create knowledge related to SPC practice, learning, and research. By investigating responses to structured prompts, we highlight the benefits and limitations of the results. Our study indicates that the current version of ChatGPT performs well for structured tasks, such as translating code from one language to another and explaining well-known concepts, but struggles with more nuanced tasks, such as explaining less widely known terms and creating code from scratch. We find that using new AI tools may help practitioners, educators, and researchers to be more efficient and productive. However, in their current stages of development, some results are misleading and wrong. Overall, the use of generative AI models in SPC must be properly validated and used in conjunction with other methods to ensure accurate results.
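As a small illustration of the kind of structured SPC task the paper discusses (this sketch is not taken from the paper), computing Shewhart-style control limits for an individuals chart takes only a few lines of Python:

```python
import statistics

def shewhart_limits(samples, k=3.0):
    """Classic individuals control chart limits: mean +/- k standard deviations."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return mean - k * sd, mean + k * sd

# Illustrative process measurements (made-up data):
data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.7, 10.3, 10.0]
lcl, ucl = shewhart_limits(data)
out_of_control = [x for x in data if not lcl <= x <= ucl]  # empty: process in control
```

This is exactly the sort of well-known, structured task the study found ChatGPT handles well.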
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This repo is prepared for my lecture on fine-tuning OpenAI models.
You can simply upload train_chat_200.jsonl and val_chat_15.jsonl to the OpenAI fine-tuning dashboard to fine-tune a model and observe how the model's behavior changes.
This dataset is created based on the following dataset: Multi-Aspect Multi-Sentiment (MAMS) dataset for Aspect-Based Sentiment Analysis https://github.com/siat-nlp/MAMS-for-ABSA
The source code for processing the dataset is at https://github.com/harrywang/openai-finetuning.
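For reference, records in chat-format fine-tuning files like these generally follow OpenAI's `messages` schema, one JSON object per line. The example below is a hand-made sketch; the prompts and labels are illustrative, not taken from the actual files:

```python
import json

# One illustrative chat-format record (contents made up for this sketch):
record = {
    "messages": [
        {"role": "system", "content": "You are an aspect-based sentiment classifier."},
        {"role": "user", "content": "The staff was friendly but the food was cold."},
        {"role": "assistant", "content": "staff: positive, food: negative"},
    ]
}

line = json.dumps(record)   # one JSON object per line in the .jsonl file
parsed = json.loads(line)   # round-trips cleanly
```

Each line of the training file is one such conversation; the assistant message is the target the model is fine-tuned to reproduce.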
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Post-training-Data-Flywheel/openai-gsm8k dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
language: en
Model Description: GPT-2 Large is the 774M-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. It is pretrained on English text using a causal language modeling (CLM) objective.
Use the code below to get started with the model. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='gpt2-large')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
[{'generated_text': "Hello, I'm a language model, I can do language modeling. In fact, this is one of the reasons I use languages. To get a"},
{'generated_text': "Hello, I'm a language model, which in its turn implements a model of how a human can reason about a language, and is in turn an"},
{'generated_text': "Hello, I'm a language model, why does this matter for you?\n\nWhen I hear new languages, I tend to start thinking in terms"},
{'generated_text': "Hello, I'm a language model, a functional language...\n\nI don't need to know anything else. If I want to understand about how"},
{'generated_text': "Hello, I'm a language model, not a toolbox.\n\nIn a nutshell, a language model is a set of attributes that define how"}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
and in TensorFlow:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = TFGPT2Model.from_pretrained('gpt2-large')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
In their model card about GPT-2, OpenAI wrote:
The primary intended users of these models are AI researchers and practitioners.
We primarily imagine these language models will be used by researchers to better understand the behaviors, capabilities, biases, and constraints of large-scale generative language models.
In their model card about GPT-2, OpenAI wrote:
Here are some secondary use cases we believe are likely:
- Writing assistance: Grammar assistance, autocompletion (for normal prose or code)
- Creative writing and art: exploring the generation of creative, fictional texts; aiding creation of poetry and other literary art.
- Entertainment: Creation of games, chat bots, and amusing generations.
In their model card about GPT-2, OpenAI wrote:
Because large-scale language models like GPT-2 ...
Dataset Card for "openai-news" Dataset
This dataset was created from blog posts and news articles about OpenAI from their website. Queries are handcrafted.
Disclaimer
This dataset may contain publicly available images or text data. All data is provided for research and educational purposes only. If you are the rights holder of any content and have concerns regarding intellectual property or copyright, please contact us at "support-data (at) jina.ai" for removal. We do not… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/openai-news.
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global AI Training Data market size was USD 1,865.2 million in 2023 and will expand at a compound annual growth rate (CAGR) of 23.50% from 2023 to 2030.
The demand for AI Training Data is rising due to the growing demand for labelled data and the diversification of AI applications.
Demand for Image/Video data remains higher in the AI Training Data market.
The Healthcare category held the highest AI Training Data market revenue share in 2023.
North America will continue to lead the AI Training Data market, whereas the Asia-Pacific market will experience the most substantial growth until 2030.
Market Dynamics of AI Training Data Market
Key Drivers of AI Training Data Market
Rising Demand for Industry-Specific Datasets to Provide Viable Market Output
A key driver in the AI Training Data market is the escalating demand for industry-specific datasets. As businesses across sectors increasingly adopt AI applications, the need for highly specialized and domain-specific training data becomes critical. Industries such as healthcare, finance, and automotive require datasets that reflect the nuances and complexities unique to their domains. This demand fuels the growth of providers offering curated datasets tailored to specific industries, ensuring that AI models are trained with relevant and representative data, leading to enhanced performance and accuracy in diverse applications.
In July 2021, Amazon and Hugging Face, a provider of open-source natural language processing (NLP) technologies, announced a collaboration. The objective of this partnership was to accelerate the deployment of sophisticated NLP capabilities while making it easier for businesses to use cutting-edge machine-learning models. Under this partnership, Hugging Face recommends Amazon Web Services as a cloud service provider to its clients.
Advancements in Data Labelling Technologies to Propel Market Growth
The continuous advancements in data labelling technologies serve as another significant driver for the AI Training Data market. Efficient and accurate labelling is essential for training robust AI models. Innovations in automated and semi-automated labelling tools, leveraging techniques like computer vision and natural language processing, streamline the data annotation process. These technologies not only improve the speed and scalability of dataset preparation but also contribute to the overall quality and consistency of labelled data. The adoption of advanced labelling solutions addresses industry challenges related to data annotation, driving the market forward amidst the increasing demand for high-quality training data.
In June 2021, Scale AI began working with the MIT Media Lab, a Massachusetts Institute of Technology research centre. The collaboration aimed to apply machine learning in healthcare to help doctors treat patients more effectively.
www.ncbi.nlm.nih.gov/pmc/articles/PMC7325854/
Restraint Factors Of AI Training Data Market
Data Privacy and Security Concerns to Restrict Market Growth
A significant restraint in the AI Training Data market is the growing concern over data privacy and security. As the demand for diverse and expansive datasets rises, so does the need for sensitive information. However, the collection and utilization of personal or proprietary data raise ethical and privacy issues. Companies and data providers face challenges in ensuring compliance with regulations and safeguarding against unauthorized access or misuse of sensitive information. Addressing these concerns becomes imperative to gain user trust and navigate the evolving landscape of data protection laws, which, in turn, poses a restraint on the smooth progression of the AI Training Data market.
How did COVID-19 impact the AI Training Data market?
The COVID-19 pandemic has had a multifaceted impact on the AI Training Data market. While the demand for AI solutions has accelerated across industries, the availability and collection of training data faced challenges. The pandemic disrupted traditional data collection methods, leading to a slowdown in the generation of labeled datasets due to restrictions on physical operations. Simultaneously, the surge in remote work and the increased reliance on AI-driven technologies for various applications fueled the need for diverse and relevant training data. This duali...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multilingual Massive Multitask Language Understanding (MMMLU)
The MMLU is a widely recognized benchmark of general knowledge attained by AI models. It covers a broad range of topics from 57 different categories, covering elementary-level knowledge up to advanced professional subjects like law, physics, history, and computer science. We translated the MMLU's test set into 14 languages using professional human translators. Relying on human translators for this evaluation increases… See the full description on the dataset page: https://huggingface.co/datasets/openai/MMMLU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The emergence of artificial intelligence (AI) large language models (LLMs), which can produce text that closely resembles human-written content, presents both opportunities and risks. While these developments offer significant opportunities for improving communication, such as in health-related crisis communication, they also pose substantial risks by facilitating the creation of convincing fake news and disinformation. The widespread dissemination of AI-generated disinformation adds complexity to the existing challenges of the ongoing infodemic, significantly affecting public health and the stability of democratic institutions.
Rationale
Prompt engineering is a technique that involves the creation of specific queries given to LLMs. It has emerged as a strategy to guide LLMs in generating the desired outputs. Recent research shows that the output of LLMs depends on emotional framing within prompts, suggesting that incorporating emotional cues into prompts could influence their response behavior. In this study, we investigated how the politeness or impoliteness of prompts affects the frequency of disinformation generation by various LLMs.
Results
We generated and evaluated a corpus of 19,800 social media posts on public health topics to assess the disinformation generation capabilities of OpenAI's LLMs, including davinci-002, davinci-003, gpt-3.5-turbo, and gpt-4. Our findings revealed that all LLMs efficiently generated disinformation (davinci-002, 67%; davinci-003, 86%; gpt-3.5-turbo, 77%; and gpt-4, 99%). Introducing polite language to prompt requests yielded significantly higher success rates for disinformation (davinci-002, 79%; davinci-003, 90%; gpt-3.5-turbo, 94%; and gpt-4, 100%). Impolite prompting resulted in a significant decrease in disinformation production across all models (davinci-002, 59%; davinci-003, 44%; and gpt-3.5-turbo, 28%) and a slight reduction for gpt-4 (94%).
Conclusion
Our study reveals that all tested LLMs effectively generate disinformation. Notably, emotional prompting had a significant impact on disinformation production rates, with models showing higher success rates when prompted with polite language compared to neutral or impolite requests. Our investigation highlights that LLMs can be exploited to create disinformation and emphasizes the critical need for ethics-by-design approaches in developing AI technologies. We maintain that identifying ways to mitigate the exploitation of LLMs through emotional prompting is crucial to prevent their misuse for purposes detrimental to public health and society.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI's benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch "Mathematics B" final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff of o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is "luck" (the answer is correct) or "bad luck" (the output has diverged into something that is incorrect). We demonstrate that a self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI's new model series holds great potential, certain risks must be considered.
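The self-consistency strategy described above (repeat the prompt, keep the most common answer) can be sketched in a few lines; the sample model outputs here are hypothetical:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote: return the most common answer across repeated model runs."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical outputs from five repeated prompts of the same exam question:
runs = ["42", "42", "41", "42", "40"]
best = self_consistency(runs)  # "42"
```

Majority voting smooths over the "luck"/"bad luck" variability the study observed, at the cost of issuing the same prompt several times.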
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for GSM8K
Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
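For illustration, GSM8K reference solutions conventionally end with a final-answer marker of the form `#### <answer>`, which evaluation scripts strip out. A small sketch of extracting it (the solution text below is written in that style, not copied from the dataset):

```python
def extract_answer(solution: str) -> str:
    """GSM8K reference solutions end with a line like '#### 72'; pull out the final answer."""
    return solution.split("####")[-1].strip()

# Solution text written in the GSM8K style:
solution = (
    "She sold 48 clips in April.\n"
    "In May she sold 48 / 2 = 24 clips.\n"
    "In total she sold 48 + 24 = 72 clips.\n"
    "#### 72"
)
answer = extract_answer(solution)  # "72"
```

Comparing this extracted string against a model's final answer is the usual way multi-step reasoning on GSM8K is scored.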
https://www.datainsightsmarket.com/privacy-policy
The synthetic data generation market is experiencing explosive growth, driven by the increasing need for high-quality data in various applications, including AI/ML model training, data privacy compliance, and software testing. The market, currently estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated $10 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the rising adoption of artificial intelligence and machine learning across industries demands large, high-quality datasets, often unavailable due to privacy concerns or data scarcity. Synthetic data provides a solution by generating realistic, privacy-preserving datasets that mirror real-world data without compromising sensitive information. Secondly, stringent data privacy regulations like GDPR and CCPA are compelling organizations to explore alternative data solutions, making synthetic data a crucial tool for compliance. Finally, the advancements in generative AI models and algorithms are improving the quality and realism of synthetic data, expanding its applicability in various domains. Major players like Microsoft, Google, and AWS are actively investing in this space, driving further market expansion. The market segmentation reveals a diverse landscape with numerous specialized solutions. While large technology firms dominate the broader market, smaller, more agile companies are making significant inroads with specialized offerings focused on specific industry needs or data types. The geographical distribution is expected to be skewed towards North America and Europe initially, given the high concentration of technology companies and early adoption of advanced data technologies. However, growing awareness and increasing data needs in other regions are expected to drive substantial market growth in Asia-Pacific and other emerging markets in the coming years. 
The competitive landscape is characterized by a mix of established players and innovative startups, leading to continuous innovation and expansion of market applications. This dynamic environment indicates sustained growth in the foreseeable future, driven by an increasing recognition of synthetic data's potential to address critical data challenges across industries.
Swetha92/openai-synthetic-data-skill-twin-classification dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of "Getting Started with the Open Data Portal" provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/9ea219aa-135c-426a-a12b-8415440cbdd2 on 12 November 2021.
--- Dataset description provided by original source is as follows ---
Documentation for working with the open data catalog, datasets, visualizations, and filters.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geoparsing with Large Language Models
The .zip file included in this repository contains all the code and data required to reproduce the results from our paper. Note, however, that in order to run the OpenAI models, users will require an OpenAI API key and sufficient API credits.
Data
The data used for the paper are in the datasets and results folders.
**Datasets:** This contains the XML files (LGL and GeoVirus) and JSON files (News2024) used to benchmark the models. It also contains all the data used to fine-tune the gpt-3.5 model, the prompt templates sent to the LLMs, and other data used for mapping and data creation.
**Results:** This contains the results for the models on the three datasets. The folder is separated by dataset, with a single .csv file giving the results for each model on each dataset. Each row contains either a predicted toponym and an associated true toponym (along with assigned spatial coordinates) when the model correctly identified a toponym; otherwise, the true-toponym columns are empty for false positives and the predicted columns are empty for false negatives.
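Given that row layout, precision, recall, and F1 can be recovered by counting filled and empty cells. A minimal sketch with hypothetical rows (the column names and values here are assumptions for illustration, not taken from the actual files):

```python
# Hypothetical rows mirroring the described CSV layout: both fields filled for a
# true positive, empty true field for a false positive, empty predicted field
# for a false negative.
rows = [
    {"predicted_toponym": "Paris",  "true_toponym": "Paris"},
    {"predicted_toponym": "Berlin", "true_toponym": ""},      # false positive
    {"predicted_toponym": "",       "true_toponym": "Lagos"}, # false negative
    {"predicted_toponym": "Tokyo",  "true_toponym": "Tokyo"},
]

tp = sum(1 for r in rows if r["predicted_toponym"] and r["true_toponym"])
fp = sum(1 for r in rows if r["predicted_toponym"] and not r["true_toponym"])
fn = sum(1 for r in rows if not r["predicted_toponym"] and r["true_toponym"])

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

With real result files, the same counting would be done per dataset and per model after loading each .csv.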
Code
The code is split into two separate folders: gpt_geoparser and notebooks.
**GPT_Geoparser:** This contains the classes and methods used to process the XML and JSON articles (data.py), interact with the Nominatim API for geocoding (gazetteer.py), interact with the OpenAI API (gpt_handler.py), process the outputs from the GPT models (geoparser.py), and analyse the results (analysis.py).
**Notebooks:** This series of notebooks can be used to reproduce the results given in the paper. The file names are reasonably descriptive of what they do within the context of the paper.
Code/software
Requirements
Numpy
Pandas
Geopy
Scikit-learn
lxml
openai
matplotlib
Contextily
Shapely
Geopandas
tqdm
huggingface_hub
Gnews
Access information
Other publicly accessible locations of the data:
The LGL and GeoVirus datasets can also be obtained here.
Abstract
Geoparsing, the process of associating textual data with geographic locations, is a key challenge in natural language processing. The often ambiguous and complex nature of geospatial language makes geoparsing a difficult task, requiring sophisticated language modelling techniques. Recent developments in Large Language Models (LLMs) have demonstrated their impressive capability in natural language modelling, suggesting suitability for a wide range of complex linguistic tasks. In this paper, we evaluate the performance of four LLMs (GPT-3.5, GPT-4o, Llama-3.1-8b and Gemma-2-9b) in geographic information extraction by testing them on three geoparsing benchmark datasets: GeoVirus, LGL, and a novel dataset, News2024, composed of geotagged news articles published outside the models' training window. We demonstrate that, through techniques such as fine-tuning and retrieval-augmented generation, LLMs significantly outperform existing geoparsing models. The best performing models achieve a toponym extraction F1 score of 0.985 and toponym resolution accuracy within 161 km of 0.921. Additionally, we show that the spatial information encoded within the embedding space of these models may explain their strong performance in geographic information extraction. Finally, we discuss the spatial biases inherent in the models' predictions and emphasize the need for caution when applying these techniques in certain contexts.
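The toponym resolution metric quoted above (accuracy within 161 km) can be illustrated with the standard haversine great-circle distance. The coordinate pairs below are hypothetical examples, not results from the paper:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical (predicted, true) coordinate pairs:
pairs = [
    ((48.8566, 2.3522), (48.85, 2.35)),          # Paris resolved almost exactly
    ((51.5074, -0.1278), (40.7128, -74.0060)),   # London predicted for New York
]
hits = sum(1 for p, t in pairs if haversine_km(*p, *t) <= 161)
accuracy_at_161 = hits / len(pairs)  # 0.5
```

A prediction counts as correct when its resolved coordinates fall within 161 km (100 miles) of the true location, a common threshold in the geoparsing literature.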
Methods
This contains the data and code required to reproduce the results from our paper. The LGL and GeoVirus datasets are pre-existing datasets, with references given in the manuscript. The News2024 dataset was constructed specifically for the paper.
To construct the News2024 dataset, we first created a list of 50 cities from around the world with populations greater than 1,000,000. We then used the GNews Python package (https://pypi.org/project/gnews/) to find a news article for each location, published between 2024-05-01 and 2024-06-30 (inclusive). Of these articles, 47 were found to contain toponyms; the three rejected articles referred to businesses which share a name with a city and did not otherwise mention any place names.
We used a semi-autonomous approach to geotagging the articles. The articles were first processed using a Distil-BERT model fine-tuned for named entity recognition, which provided a first estimate of the toponyms within the text. A human reviewer then read the articles, accepted or rejected the machine tags, and added any tags missing from the machine tagging process. We then used OpenStreetMap to obtain geographic coordinates for each location and to identify the toponym type (e.g. city, town, village, river). We also flagged whether a toponym was acting as a geo-political entity, as these were removed from the analysis. In total, 534 toponyms were identified in the 47 news articles.
This page provides some instructions, tips, and videos to help you perform common tasks on data.seattle.gov. It also contains links to more in-depth help content.
https://spdx.org/licenses/CC0-1.0.html
Creativity is core to being human. Generative AI, made readily available by powerful large language models (LLMs), holds promise for humans to be more creative by offering new ideas, or less creative by anchoring on generative AI ideas. We study the causal impact of generative AI ideas on the production of short stories in an online experiment where some writers obtained story ideas from an LLM. We find that access to generative AI ideas causes stories to be evaluated as more creative, better written, and more enjoyable, especially among less creative writers. However, generative AI-enabled stories are more similar to each other than stories by humans alone. These results point to an increase in individual creativity at the risk of losing collective novelty. This dynamic resembles a social dilemma: with generative AI, writers are individually better off, but collectively a narrower scope of novel content is produced. Our results have implications for researchers, policy-makers, and practitioners interested in bolstering creativity.
Methods
This dataset is based on a pre-registered, two-phase experimental online study. In the first phase of our study, we recruited a group of N=293 participants ("writers") who are asked to write a short, eight-sentence story. Participants are randomly assigned to one of three conditions: Human only, Human with 1 GenAI idea, and Human with 5 GenAI ideas. In our Human only baseline condition, writers are assigned the task with no mention of or access to GenAI. In the two GenAI conditions, we provide writers with the option to call upon a GenAI technology (OpenAI's GPT-4 model) to provide a three-sentence starting idea to inspire their own story writing. In one of the two GenAI conditions (Human with 5 GenAI ideas), writers can choose to receive up to five GenAI ideas, each providing a possibly different inspiration for their story.
After completing their story, writers are asked to self-evaluate their story on novelty, usefulness, and several emotional characteristics. In the second phase, the stories composed by the writers are then evaluated by a separate group of N=600 participants ("evaluators"). Evaluators read six randomly selected stories without being informed that writers had been randomly assigned to access GenAI in some conditions (or not). All stories are evaluated by multiple evaluators on novelty, usefulness, and several emotional characteristics. After disclosing to evaluators whether GenAI was used during the creative process, we ask evaluators to rate the extent to which ownership and hypothetical profits should be split between the writer and the AI. Finally, we elicit evaluators' general views on the extent to which they believe that the use of AI in producing creative output is ethical, how story ownership and hypothetical profits should be shared between AI creators and human creators, and how AI should be credited for its involvement in the creative output. The data was collected on the online study platform Prolific, then cleaned, processed and analyzed with Stata. For the Writer Study, of the 500 participants who began the study, 169 exited the study prior to giving consent, 22 were dropped for not giving consent, and 13 dropped out prior to completing the study. Three participants in the Human only condition admitted to using GenAI during their story writing exercise and, as per our pre-registration, they were therefore dropped from the analysis, resulting in a total number of writers and stories of 293. For the Evaluator Study, each evaluator was shown 6 stories (2 stories from each topic). The evaluations associated with the writers who did not complete the writer study and those in the Human only condition who acknowledged using AI to complete the story were dropped. Thus, there are a total of 3,519 evaluations of 293 stories made by 600 evaluators.
Four evaluations remained for five evaluators, five evaluations remained for 71, and all six remained for 524 evaluators.
https://dataintelo.com/privacy-and-policy
The global market size for Open Source Data Labelling Tools was valued at USD 1.5 billion in 2023 and is projected to reach USD 4.6 billion by 2032, growing at a compound annual growth rate (CAGR) of 13.2% during the forecast period. This significant growth can be attributed to the increasing adoption of artificial intelligence (AI) and machine learning (ML) across various industries, which drives the need for accurately labelled data to train these technologies effectively.
The rapid advancement and integration of AI and ML in numerous sectors serve as a primary growth factor for the Open Source Data Labelling Tool market. With the proliferation of big data, organizations are increasingly recognizing the importance of high-quality, annotated data sets to enhance the accuracy and efficiency of their AI models. The open-source nature of these tools offers flexibility and cost-effectiveness, making them an attractive choice for businesses of all sizes, especially startups and SMEs, which further fuels market growth.
Another key driver is the rising demand for automated data labelling solutions. Manual data labelling is a time-consuming and error-prone task, leading many organizations to seek automated tools that can swiftly and accurately label large datasets. Open source data labelling tools, often augmented with advanced features like natural language processing (NLP) and computer vision, provide a scalable solution to this challenge. This trend is particularly pronounced in data-intensive industries such as healthcare, automotive, and finance, where the precision of data labelling can significantly impact operational outcomes.
Additionally, the collaborative nature of open-source communities contributes to the market's growth. Continuous improvements and updates are driven by a global community of developers and researchers, ensuring that these tools remain at the cutting edge of technology. This ongoing innovation not only boosts the functionality and reliability of open-source data labelling tools but also fosters a sense of community and shared knowledge, encouraging more organizations to adopt these solutions.
In the realm of data labelling, Premium Annotation Tools have emerged as a significant player, offering advanced features that cater to the needs of enterprises seeking high-quality data annotation. These tools often come equipped with enhanced functionalities such as collaborative interfaces, real-time updates, and integration capabilities with existing AI systems. The premium nature of these tools ensures that they are designed to handle complex datasets with precision, thereby reducing the margin of error in data labelling processes. As businesses increasingly prioritize accuracy and efficiency, the demand for premium solutions is on the rise, providing a competitive edge in sectors where data quality is paramount.
From a regional perspective, North America holds a significant share of the market due to the robust presence of tech giants and a well-established IT infrastructure. The region's strong focus on AI research and development, coupled with substantial investments in technology, drives the demand for data labelling tools. Meanwhile, the Asia Pacific region is expected to exhibit the highest growth rate during the forecast period, attributed to the rapid digital transformation and increasing AI adoption across countries like China, India, and Japan.
When dissecting the Open Source Data Labelling Tool market by component, it is evident that the segment is bifurcated into software and services. The software segment dominates the market, primarily due to the extensive range of features and functionalities that open-source data labelling software offers. These tools are customizable and can be tailored to meet specific needs, making them highly versatile and efficient. The software segment is expected to continue its dominance as more organizations seek comprehensive solutions that integrate seamlessly with their existing systems.
The services segment, while smaller in comparison, plays a crucial role in the overall market landscape. Services include support, training, and consulting, which are vital for organizations to effectively implement and utilize open-source data labelling tools. As the adoption of these tools grows, so does the demand for professional services that can aid in deployment, customization