63 datasets found
  1. openai-news

    • huggingface.co
    Updated Jul 20, 2025
    Cite
    Jina AI (2025). openai-news [Dataset]. https://huggingface.co/datasets/jinaai/openai-news
    Explore at:
    Dataset updated
    Jul 20, 2025
    Dataset authored and provided by
    Jina AI
    Description

    Dataset Card for "openai-news" Dataset

    This dataset was created from blog posts and news articles about OpenAI from their website. Queries are handcrafted.

      Disclaimer
    

    This dataset may contain publicly available images or text data. All data is provided for research and educational purposes only. If you are the rights holder of any content and have concerns regarding intellectual property or copyright, please contact us at "support-data (at) jina.ai" for removal. We do not… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/openai-news.

  2. OpenAI HumanEval Code Gen

    • kaggle.com
    Updated Nov 27, 2023
    Cite
    The Devastator (2023). OpenAI HumanEval Code Gen [Dataset]. https://www.kaggle.com/datasets/thedevastator/openai-humaneval-code-gen/data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    OpenAI HumanEval Code Gen

    Handcrafted Python Programming Problems for Accurate Model Evaluation

    By Huggingface Hub [source]

    About this dataset

    HumanEval, a dataset released by OpenAI, lets developers and researchers evaluate their code generation models in a safe environment. It includes 164 handcrafted programming problems, written by engineers and researchers from OpenAI and specifically designed to test the correctness and scalability of code generation models. The problems are written in Python and include docstrings and comments in natural English, which can be difficult for computers to comprehend. Each problem also includes a function signature, a body, and several unit tests. Released under the MIT License, HumanEval suits any practitioner who wants to judge the efficacy of machine-generated code against trusted results!

    How to use the dataset

    The first step is to explore the data by viewing the columns it includes. This guide focuses on four key columns: prompt, canonical_solution, test and entry_point.
    - The prompt column contains natural English text describing the programming problem.
    - The canonical_solution column holds the correct solution to each programming problem, as determined by the OpenAI researchers or engineers who handcrafted the dataset.
    - The test column contains unit tests designed to check correctness when debugging or evaluating code generated by neural networks or other automated tools.
    - The entry_point column names the function that the unit tests call, which serves as the starting point when solving a problem from this dataset.
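
As a minimal sketch of how the four columns fit together, the snippet below assembles a program from prompt, a completion, and test, then calls the tests on the function named by entry_point. The row is a hypothetical toy example rather than a record from the dataset, and the sketch assumes the test column defines a function check(candidate), so treat it as an illustration only.

```python
# Hypothetical row that mimics the four columns; not taken from the dataset.
row = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    # Assumption: the test column defines check(candidate).
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes_tests(row, completion):
    """Return True if prompt + completion passes the row's unit tests."""
    program = row["prompt"] + completion + "\n" + row["test"]
    namespace = {}
    try:
        # Caution: exec runs arbitrary code; use a sandbox for model output.
        exec(program, namespace)
        namespace["check"](namespace[row["entry_point"]])
        return True
    except Exception:
        return False


print(passes_tests(row, row["canonical_solution"]))  # reference solution
print(passes_tests(row, "    return a - b\n"))       # deliberately wrong
```

Because exec runs whatever code it is given, model-generated completions should only be evaluated inside an isolated environment.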

    With this information, we can begin using the dataset in our own projects, from building new case studies for specific AI algorithms to developing automated programs that generate compatible source code based on OpenAI datasets like HumanEval!

    Research Ideas

    • Training code generation models in a limited and supervised environment.
    • Benchmarking the performance of existing code generation models, as HumanEval consists of both the canonical solution for each problem and unit tests that can be used to evaluate model accuracy.
    • Using Natural Language Processing (NLP) algorithms on the docstrings and comments within HumanEval to develop better natural language understanding for programming contexts

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: test.csv

    | Column name        | Description                                                  |
    |:-------------------|:-------------------------------------------------------------|
    | prompt             | A description of the programming problem. (String)           |
    | canonical_solution | The expected solution to the programming problem. (String)   |
    | test               | Unit tests to verify the accuracy of the solution. (String)  |
    | entry_point        | The entry point for running the unit tests. (String)         |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  3. OpenAI releases their first open source models Price Prediction Data

    • coinbase.com
    Updated Oct 6, 2025
    + more versions
    Cite
    (2025). OpenAI releases their first open source models Price Prediction Data [Dataset]. https://www.coinbase.com/en-fr/price-prediction/base-openai-releases-their-first-open-source-models-997f
    Explore at:
    Dataset updated
    Oct 6, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset OpenAI releases their first open source models over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.
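
The compound-growth projection described here can be sketched as follows. The function name, the starting price, and the exact compounding convention (one compounding step per year) are assumptions for illustration; the page only states the default 5 percent rate, the 16-year horizon, and the -100 to 100 percent bounds.

```python
def project_prices(current_price, annual_growth_pct=5.0, years=16):
    """Project a price forward with yearly compound growth.

    Hypothetical helper: returns one projected price per year,
    for years 1 through `years`.
    """
    if not -100.0 <= annual_growth_pct <= 100.0:
        raise ValueError("growth rate must be between -100 and 100 percent")
    rate = annual_growth_pct / 100.0
    return [current_price * (1.0 + rate) ** year for year in range(1, years + 1)]


# Example with an assumed starting price of 100.0 and the default 5 percent rate.
prices = project_prices(100.0)
print(len(prices))           # number of projected years
print(round(prices[0], 2))   # projected price after one year
```

A negative rate shrinks the price each year, and a rate of -100 percent drives every projected value to zero.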

  4. Engagement with OpenAI and ChatGPT in Italy 2022-2023

    • statista.com
    Updated Apr 25, 2023
    Cite
    Statista (2023). Engagement with OpenAI and ChatGPT in Italy 2022-2023 [Dataset]. https://www.statista.com/statistics/1379705/italy-openai-chatgpt-engagement/
    Explore at:
    Dataset updated
    Apr 25, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2022 - Jan 2023
    Area covered
    Italy
    Description

    In January 2023, ChatGPT registered over nine million interactions from users in Italy, up by over 300 percent compared to the previous month. By comparison, the OpenAI website registered 1.2 million actions performed by Italian users. At the end of March 2023, the main national privacy regulator in Italy prompted OpenAI to provide information on how and why the company collects user data, if the company wanted to avoid seeing its access to the Italian market blocked.

  5. OpenAI.com traffic in Italy 2023, by device

    • statista.com
    Updated Jan 10, 2024
    Cite
    Tiago Bianchi (2024). OpenAI.com traffic in Italy 2023, by device [Dataset]. https://www.statista.com/topics/4217/internet-usage-in-italy/
    Explore at:
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Tiago Bianchi
    Area covered
    Italy
    Description

    In January 2023, over 60 percent of web traffic to the OpenAI website from Italy came from mobile devices. By comparison, approximately 40 percent of visitors accessed the website via desktop devices. In March 2023, the national privacy regulator banned OpenAI's main product ChatGPT - an AI-powered chatbot that can mimic human interactions - alleging that the chatbot was violating European privacy laws. In April 2023, the Italian privacy regulator reported that ChatGPT would be allowed to operate in the country if OpenAI provided information on the purpose of its data collection and barred minors from accessing the website.

  6. Operator by OpenAI Price Prediction Data

    • coinbase.com
    Updated Oct 4, 2025
    Cite
    (2025). Operator by OpenAI Price Prediction Data [Dataset]. https://www.coinbase.com/en-sg/price-prediction/base-operator-by-openai-8e31
    Explore at:
    Dataset updated
    Oct 4, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset Operator by OpenAI over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.

  7. OpenAI PreStocks Price Prediction Data

    • coinbase.com
    Updated Oct 2, 2025
    + more versions
    Cite
    (2025). OpenAI PreStocks Price Prediction Data [Dataset]. https://www.coinbase.com/price-prediction/solana-openai-prestocks-rpgf
    Explore at:
    Dataset updated
    Oct 2, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset OpenAI PreStocks over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.

  8. ChatGPT Revenue and Usage Statistics (2025)

    • businessofapps.com
    Updated Feb 9, 2023
    Cite
    Business of Apps (2023). ChatGPT Revenue and Usage Statistics (2025) [Dataset]. https://www.businessofapps.com/data/chatgpt-statistics/
    Explore at:
    Dataset updated
    Feb 9, 2023
    Dataset authored and provided by
    Business of Apps
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    ChatGPT was the chatbot that kickstarted the generative AI revolution, which has been responsible for hundreds of billions of dollars in data centres, graphics chips and AI startups. Launched by...

  9. OpenAI Agent Price Prediction Data

    • coinbase.com
    Updated Oct 1, 2025
    Cite
    (2025). OpenAI Agent Price Prediction Data [Dataset]. https://www.coinbase.com/en-ar/price-prediction/openai-agent
    Explore at:
    Dataset updated
    Oct 1, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset OpenAI Agent over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.

  10. openai-news_deprecated

    • huggingface.co
    Updated Apr 11, 2024
    Cite
    Jina AI (2024). openai-news_deprecated [Dataset]. https://huggingface.co/datasets/jinaai/openai-news_deprecated
    Explore at:
    Dataset updated
    Apr 11, 2024
    Dataset authored and provided by
    Jina AI
    Description

    Dataset Card for "openai-news" Dataset

    This dataset was created from blog posts and news articles about OpenAI from their website. Queries are handcrafted.

      Disclaimer
    

    This dataset may contain publicly available images or text data. All data is provided for research and educational purposes only. If you are the rights holder of any content and have concerns regarding intellectual property or copyright, please contact us at "support-data (at) jina.ai" for removal. We do not… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/openai-news_deprecated.

  11. Dataset of news about OPENAI

    • workwithdata.com
    Updated May 16, 2025
    Cite
    Work With Data (2025). Dataset of news about OPENAI [Dataset]. https://www.workwithdata.com/datasets/news?f=1&fcol0=page_name&fop0=%3D&fval0=OPENAI
    Explore at:
    Dataset updated
    May 16, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about news. It has 238 rows and is filtered to entries whose keywords include OPENAI. It features 10 columns, including source, publication date, section, and news link.

  12. Geoparsing with Large Language Models: Leveraging the linguistic...

    • data.niaid.nih.gov
    Updated Oct 2, 2024
    Cite
    Anonymous, Anonymous (2024). Geoparsing with Large Language Models: Leveraging the linguistic capabilities of generative AI to improve geographic information extraction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13862654
    Explore at:
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    Anonymous, Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geoparsing with Large Language Models

    The .zip file included in this repository contains all the code and data required to reproduce the results from our paper. Note, however, that in order to run the OpenAI models, users will require an OpenAI API key and sufficient API credits.

    Data

    The data used for the paper are in the datasets and results folders.

    **Datasets:** This contains the XML files (LGL and GeoVirus) and JSON files (News2024) used to benchmark the models. It also contains all the data used to fine-tune the gpt-3.5 model, the prompt templates sent to the LLMs, and other data used for mapping and data creation.

    **Results:** This contains the results for the models on the three datasets. The folder is separated by dataset, with a single .csv file giving the results for each model on each dataset separately. Each row of a .csv file pairs a predicted toponym with its associated true toponym (along with assigned spatial coordinates) when the model correctly identified a toponym. For false positives the true toponym columns are empty, and for false negatives the predicted columns are empty.
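
Given that row layout, extraction precision, recall, and F1 can be tallied by checking which side of each row is empty. The sketch below is only an illustration: the column names predicted_toponym and true_toponym and the inline sample rows are assumptions, not the actual headers or contents of the repository's files.

```python
import csv
import io

# Hypothetical sample in the described layout: matched pair, false positive
# (empty true column), false negative (empty predicted column), matched pair.
SAMPLE = """predicted_toponym,true_toponym
Paris,Paris
London,
,Berlin
Rome,Rome
"""


def toponym_prf(rows):
    """Return (precision, recall, f1) from rows in the described layout."""
    tp = fp = fn = 0
    for row in rows:
        has_pred = bool(row["predicted_toponym"].strip())
        has_true = bool(row["true_toponym"].strip())
        if has_pred and has_true:
            tp += 1          # correctly identified toponym
        elif has_pred:
            fp += 1          # predicted but not in the ground truth
        elif has_true:
            fn += 1          # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(toponym_prf(rows))
```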

    Code

    The code is split into two separate folders, gpt_geoparser and notebooks.

    **GPT_Geoparser:** This contains the classes and methods used to process the XML and JSON articles (data.py), interact with the Nominatim API for geocoding (gazetteer.py), interact with the OpenAI API (gpt_handler.py), process the outputs from the GPT models (geoparser.py) and analyse the results (analysis.py).

    **Notebooks:** This series of notebooks can be used to reproduce the results given in the paper. The file names are reasonably descriptive of what they do within the context of the paper.

    Code/software

    Requirements

    Numpy

    Pandas

    Geopy

    Scikit-learn

    lxml

    openai

    matplotlib

    Contextily

    Shapely

    Geopandas

    tqdm

    huggingface_hub

    Gnews

    Access information

    Other publicly accessible locations of the data:

    The LGL and GeoVirus datasets can also be obtained here.

    Abstract

    Geoparsing - the process of associating textual data with geographic locations - is a key challenge in natural language processing. The often ambiguous and complex nature of geospatial language makes geoparsing a difficult task, requiring sophisticated language modelling techniques. Recent developments in Large Language Models (LLMs) have demonstrated their impressive capability in natural language modelling, suggesting suitability to a wide range of complex linguistic tasks. In this paper, we evaluate the performance of four LLMs - GPT-3.5, GPT-4o, Llama-3.1-8b and Gemma-2-9b - in geographic information extraction by testing them on three geoparsing benchmark datasets: GeoVirus, LGL, and a novel dataset, News2024, composed of geotagged news articles published outside the models' training window. We demonstrate that, through techniques such as fine-tuning and retrieval-augmented generation, LLMs significantly outperform existing geoparsing models. The best performing models achieve a toponym extraction F1 score of 0.985 and toponym resolution accuracy within 161 km of 0.921. Additionally, we show that the spatial information encoded within the embedding space of these models may explain their strong performance in geographic information extraction. Finally, we discuss the spatial biases inherent in the models' predictions and emphasize the need for caution when applying these techniques in certain contexts.
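
To make the "accuracy within 161 km" metric concrete (161 km is roughly 100 miles), the sketch below counts a resolved toponym as correct when its predicted coordinates fall within that distance of the true coordinates. It uses the haversine great-circle distance on a spherical Earth, which is an assumption; the paper's own code may compute distance differently, and the function names and sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(pairs, threshold_km=161.0):
    """Share of (predicted, true) coordinate pairs within the threshold."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for pred, true in pairs
        if haversine_km(pred[0], pred[1], true[0], true[1]) <= threshold_km
    )
    return hits / len(pairs)


# Illustrative pairs: one exact match, one London prediction for Paris.
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
print(accuracy_within([(paris, paris), (london, paris)]))
```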

    Methods

    This contains the data and code required to reproduce the results from our paper. The LGL and GeoVirus datasets are pre-existing datasets, with references given in the manuscript. The News2024 dataset was constructed specifically for the paper.

    To construct the News2024 dataset, we first created a list of 50 cities from around the world with a population greater than 1,000,000. We then used the GNews Python package (https://pypi.org/project/gnews/) to find a news article for each location, published between 2024-05-01 and 2024-06-30 (inclusive). Of these articles, 47 were found to contain toponyms, with the three rejected articles referring to businesses which share a name with a city, and which did not otherwise mention any place names.

    We used a semi-autonomous approach to geotagging the articles. The articles were first processed using a Distil-BERT model, fine-tuned for named entity recognition. This provided a first estimate of the toponyms within the text. A human reviewer then read the articles, accepted or rejected the machine tags, and added any tags missing from the machine tagging process. We then used OpenStreetMap to obtain geographic coordinates for the location, and to identify the toponym type (e.g. city, town, village, river etc). We also flagged if the toponym was acting as a geo-political entity, as these were removed from the analysis process. In total, 534 toponyms were identified in the 47 news articles.

  13. OpenAI tokenized stock (PreStocks) Price Prediction Data

    • coinbase.com
    Updated Sep 29, 2025
    Cite
    (2025). OpenAI tokenized stock (PreStocks) Price Prediction Data [Dataset]. https://www.coinbase.com/en-au/price-prediction/solana-openai-prestocks-ebv9
    Explore at:
    Dataset updated
    Sep 29, 2025
    Variables measured
    Growth Rate, Predicted Price
    Measurement technique
    User-defined projections based on compound growth. This is not a formal financial forecast.
    Description

    This dataset contains the predicted prices of the asset OpenAI tokenized stock (PreStocks) over the next 16 years. This data is calculated initially using a default 5 percent annual growth rate, and after page load, it features a sliding scale component where the user can then further adjust the growth rate to their own positive or negative projections. The maximum positive adjustable growth rate is 100 percent, and the minimum adjustable growth rate is -100 percent.

  14. openai-news_beir

    • huggingface.co
    Updated Jul 20, 2025
    Cite
    Jina AI (2025). openai-news_beir [Dataset]. https://huggingface.co/datasets/jinaai/openai-news_beir
    Explore at:
    Dataset updated
    Jul 20, 2025
    Dataset authored and provided by
    Jina AI
    Description

    This is a copy of https://huggingface.co/datasets/jinaai/openai-news reformatted into the BEIR format. For any further information like license, please refer to the original dataset.

      Disclaimer
    

    This dataset may contain publicly available images or text data. All data is provided for research and educational purposes only. If you are the rights holder of any content and have concerns regarding intellectual property or copyright, please contact us at "support-data (at) jina.ai" for… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/openai-news_beir.

  15. Supplementary data for the paper: System 2 thinking in OpenAI’s o1-preview...

    • data.4tu.nl
    zip
    Updated Sep 23, 2024
    + more versions
    Cite
    Joost de Winter; Dimitra Dodou; Yke Bauke Eisma (2024). Supplementary data for the paper: System 2 thinking in OpenAI’s o1-preview model: Near-perfect performance on a mathematics exam [Dataset]. http://doi.org/10.4121/2e663686-f656-4ff2-bb21-567ba4d4f03e.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 23, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Joost de Winter; Dimitra Dodou; Yke Bauke Eisma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The processes underlying human cognition are often divided into System 1, which involves fast, intuitive thinking, and System 2, which involves slow, deliberate reasoning. Previously, large language models were criticized for lacking the deeper, more analytical capabilities of System 2. In September 2024, OpenAI introduced the o1 model series, designed to handle System 2-like reasoning. While OpenAI’s benchmarks are promising, independent validation is still needed. In this study, we tested the o1-preview model twice on the Dutch ‘Mathematics B’ final exam. It scored a near-perfect 76 and 74 out of 76 points. For context, only 24 out of 16,414 students in the Netherlands achieved a perfect score. By comparison, the GPT-4o model scored 66 and 62 out of 76, well above the Dutch average of 40.63 points. Neither model had access to the exam figures. Since there was a risk of model contamination (i.e., the knowledge cutoff of o1-preview and GPT-4o was after the exam was published online), we repeated the procedure with a new Mathematics B exam that was published after the cutoff date. The results again indicated that o1-preview performed strongly (97.8th percentile), which suggests that contamination was not a factor. We also show that there is some variability in the output of o1-preview, which means that sometimes there is ‘luck’ (the answer is correct) or ‘bad luck’ (the output has diverged into something that is incorrect). We demonstrate that a self-consistency approach, where repeated prompts are given and the most common answer is selected, is a useful strategy for identifying the correct answer. It is concluded that while OpenAI’s new model series holds great potential, certain risks must be considered.
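
The self-consistency approach mentioned above (ask repeatedly, keep the most common answer) can be sketched as a simple majority vote. The function name and the sample answers are hypothetical, and real model outputs would first need to be normalized into comparable final answers.

```python
from collections import Counter


def most_common_answer(answers):
    """Return (answer, vote_share) for the most frequent answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers from four repeated prompts of one exam question.
print(most_common_answer(["12", "12", "15", "12 "]))
```

The vote share gives a rough signal of how stable the output is across repeated prompts.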

  16. OpenAI vs. Anthropic Statistics 2025: Scale, Revenue & Trust Compared

    • sqmagazine.co.uk
    Updated Oct 7, 2025
    Cite
    SQ Magazine (2025). OpenAI vs. Anthropic Statistics 2025: Scale, Revenue & Trust Compared [Dataset]. https://sqmagazine.co.uk/openai-vs-anthropic-statistics/
    Explore at:
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    SQ Magazine
    License

    https://sqmagazine.co.uk/privacy-policy/

    Time period covered
    Jan 1, 2024 - Dec 31, 2025
    Area covered
    Global
    Description

    OpenAI and Anthropic lead the generative AI field with impressive growth, expanding capabilities, and mounting investor attention. Their competition shapes how businesses, developers, and governments adopt AI tools, from automating workflows to powering advanced coding assistants. Dive into the data to see how their trajectories compare, and explore insights that...

  17. MMMLU

    • huggingface.co
    Updated Sep 17, 2024
    + more versions
    Cite
    OpenAI (2024). MMMLU [Dataset]. https://huggingface.co/datasets/openai/MMMLU
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 17, 2024
    Dataset authored and provided by
    OpenAI (http://openai.com/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multilingual Massive Multitask Language Understanding (MMMLU)

    The MMLU is a widely recognized benchmark of general knowledge attained by AI models. It covers a broad range of topics from 57 different categories, covering elementary-level knowledge up to advanced professional subjects like law, physics, history, and computer science. We translated the MMLU’s test set into 14 languages using professional human translators. Relying on human translators for this evaluation increases… See the full description on the dataset page: https://huggingface.co/datasets/openai/MMMLU.

  18. Data from: Hallucination by Design: The Hidden Incentives of AI

    • figshare.com
    pdf
    Updated Sep 9, 2025
    Cite
    Victor Habib Lantyer (2025). Hallucination by Design: The Hidden Incentives of AI [Dataset]. http://doi.org/10.6084/m9.figshare.30081982.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 9, 2025
    Dataset provided by
    figshare
    Authors
    Victor Habib Lantyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hallucination by Design: The Hidden Incentives of AI investigates the structural roots and systemic persistence of hallucinations in generative artificial intelligence. Moving beyond anecdotal accounts such as Mata v. Avianca (2023), where lawyers relied on fabricated precedents produced by ChatGPT, this paper reframes hallucination as an inevitable statistical consequence of language model training and evaluation. Drawing on the theoretical framework proposed by Kalai, Nachum, and Zhang in their seminal 2025 paper Why Language Models Hallucinate, the analysis demonstrates that generative error is not a mysterious anomaly but a mathematically predictable outcome of epistemic uncertainty, data sparsity, and inadequate modeling. More crucially, it argues that the persistence of hallucinations is reinforced by sociotechnical incentives: benchmark regimes that penalize abstention and reward confident guessing, effectively training models to behave like “test-taking students” who never leave a question blank. Technical mitigations such as Retrieval-Augmented Generation (RAG) alleviate but do not resolve this incentive misalignment. The study concludes that trustworthy AI will not emerge spontaneously from larger models, but must be engineered through new evaluation paradigms, regulatory frameworks, and ethical commitments that reward epistemic humility and veracity. For law, medicine, and other high-stakes domains, this shift reframes hallucination from a computational defect into a matter of professional responsibility, demanding a cultural, legal, and philosophical reorientation toward integrity rather than mere performance.
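
The incentive argument can be illustrated with a small expected-score calculation. The scoring scheme here is an assumption made for illustration, not one taken from the paper: a correct answer earns 1 point, a wrong answer costs wrong_penalty points, and abstaining earns 0.

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """Expected score of answering, given the chance of being right."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct, wrong_penalty=0.0):
    """Guessing pays off when its expected score beats abstaining (0)."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# With no penalty, even a 10 percent chance of being right favors guessing.
print(should_guess(0.1))
# With a 1-point penalty for wrong answers, the same guess no longer pays.
print(should_guess(0.1, wrong_penalty=1.0))
```

Under plain right-or-wrong grading, any nonzero chance of being correct makes guessing score at least as well as leaving the question blank, which is the "test-taking student" behavior the abstract describes.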

  19. Dataset of city, country, foundation year and revenues of companies called...

    • workwithdata.com
    Updated May 6, 2025
    Cite
    Work With Data (2025). Dataset of city, country, foundation year and revenues of companies called OpenAI [Dataset]. https://www.workwithdata.com/datasets/companies?col=city%2Ccompany%2Ccountry%2Cfoundation_year%2Crevenues&f=1&fcol0=company&fop0=%3D&fval0=OpenAI
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about companies. It has 2 rows and is filtered where the company is OpenAI. It features 5 columns: city, company, country, foundation year, and revenues.

  20. Artificial Intelligence Market in the Education Sector in US by End-user and...

    • technavio.com
    pdf
    Updated Aug 24, 2022
    Cite
    Technavio (2022). Artificial Intelligence Market in the Education Sector in US by End-user and Education model - Forecast and Analysis 2022-2026 [Dataset]. https://www.technavio.com/report/artificial-intelligence-market-in-the-education-sector-in-us-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Aug 24, 2022
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2022 - 2026
    Description

    The artificial intelligence market share in the education sector in the US is expected to increase by USD 374.3 million from 2021 to 2026, and the market’s growth momentum will accelerate at a CAGR of 48.15%.

    This research report on the artificial intelligence market in the education sector in the US provides insights on the post-COVID-19 impact on the market, which will help companies evaluate their business approaches. The report also covers the market segmentation by end-user (higher education and K-12) and education model (learner model, pedagogical model, and domain model). In addition, it offers information on several market vendors, including Alphabet Inc., Carnegie Learning Inc., Century-Tech Ltd., Cognii, DreamBox Learning Inc., Fishtree Inc., Intellinetics Inc., International Business Machines Corp., Jenzabar Inc., John Wiley and Sons Inc., LAIX Inc., McGraw Hill Education Inc., Microsoft Corp., Nuance Communications Inc., Pearson Plc, PleIQ Smart Toys Spa, Providence Equity Partners LLC, Quantum Adaptive Learning LLC, Tangible Play Inc., and True Group Inc., among others.

    What will the Artificial Intelligence Market Size in the Education Sector in US be During the Forecast Period?


    Artificial Intelligence Market in the Education Sector in the US: Key Drivers, Trends, and Challenges

    Based on our research output, the market has seen a positive impact on growth during and after the COVID-19 era. The increasing demand for intelligent tutoring systems (ITS) is notably driving the artificial intelligence market growth in the education sector in the US, although factors such as security and privacy concerns may impede the market growth. Our research analysts have studied the historical data and deduced the key market drivers and the COVID-19 pandemic impact on the artificial intelligence industry in the education sector. The holistic analysis of the drivers will help in deducing end goals and refining marketing strategies to gain a competitive edge.

    Key Artificial Intelligence Market Driver in the Education Sector in US

    The increasing demand for ITS is one of the major drivers of growth in the artificial intelligence market in the education sector. ITS is increasingly being adopted in schools, colleges, and universities owing to the various benefits it offers. Vendors such as Carnegie Mellon University offer AI software that acts as a tutor, guiding students by devising step-by-step personalized learning paths. Carnegie Mellon University offers a series of mathematics tutors for middle schoolers. In addition, the increasing adoption of IAL software further drives the demand for ITS. McGraw Hill offers IAL software called ALEKS, a web-based AI assessment and learning system that uses adaptive learning to assess the knowledge of students. The advent of these AI technologies drives the growth of the market.

    Key Artificial Intelligence Market Trend in the Education Sector in US

    The growing emphasis on the use of AI for crowdsourced tutoring is one of the major trends influencing growth in the artificial intelligence market in the education sector. Today, children do not just learn in the classroom; social media platforms also play an important role in their learning. The advent of online educational services has further fostered knowledge acquisition from social platforms. With the rise of AI learning technologies such as ML, deep learning, and NLP, it has become easy to obtain remote help from social websites and social networks. For example, the Brainly app enables users to ask homework questions and receive automatic answers that are verified by fellow students as well as educators on the platform. It also uses AI algorithms to personalize its platform's networking features and provide users with an experiential learning environment.

    Key Artificial Intelligence Market Challenge in the Education Sector in US

    Security and privacy concerns are one of the major challenges impeding growth in the artificial intelligence market in the education sector. Artificial intelligence software is highly vulnerable to cyber-attacks. Because it holds large amounts of data, hackers are constantly devising ways to attack this software and breach the data. It could be dangerous for the victims of such cyber-attacks to have their personal information in the open. AI models use student data to design personalized pathways for students. The process of developing an AI algorithm and its functioning often requires the algorithm to collect huge amounts of student data such as their perfo

Cite
Jina AI (2025). openai-news [Dataset]. https://huggingface.co/datasets/jinaai/openai-news
openai-news

jinaai/openai-news

Explore at:
42 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jul 20, 2025
Dataset authored and provided by
Jina AI
Description

Dataset Card for "openai-news" Dataset

This dataset was created from blog posts and news articles about OpenAI from their website. Queries are handcrafted.

  Disclaimer

This dataset may contain publicly available images or text data. All data is provided for research and educational purposes only. If you are the rights holder of any content and have concerns regarding intellectual property or copyright, please contact us at "support-data (at) jina.ai" for removal. We do not… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/openai-news.
