26 datasets found
  1. Data from: Analyzing student prompts and their effect on ChatGPT’s...

    • tandf.figshare.com
    txt
    Updated Dec 12, 2024
    Cite
    Ghadeer Sawalha; Imran Taj; Abdulhadi Shoufan (2024). Analyzing student prompts and their effect on ChatGPT’s performance [Dataset]. http://doi.org/10.6084/m9.figshare.26970708.v1
    Available download formats: txt
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Ghadeer Sawalha; Imran Taj; Abdulhadi Shoufan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large language models present new opportunities for teaching and learning. The response accuracy of these models, however, is believed to depend on the prompt quality which can be a challenge for students. In this study, we aimed to explore how undergraduate students use ChatGPT for problem-solving, what prompting strategies they develop, the link between these strategies and the model’s response accuracy, the existence of individual prompting tendencies, and the impact of gender in this context. Our students used ChatGPT to solve five problems related to embedded systems and provided the solutions and the conversations with this model. We analyzed the conversations thematically to identify prompting strategies and applied different quantitative analyses to establish relationships between these strategies and the response accuracy and other factors. The findings indicate that students predominantly employ three types of prompting strategies: single copy-and-paste prompting (SCP), single reformulated prompting (SRP), and multiple-question prompting (MQP). ChatGPT’s response accuracy using SRP and MQP was significantly higher than using SCP, with effect sizes of -0.94 and -0.69, respectively. The student-by-student analysis revealed some tendencies. For example, 26 percent of the students consistently copied and pasted the questions into ChatGPT without any modification. Students who used MQP showed better performance in the final exam than those who did not use this prompting strategy. As for gender, female students tended to make extensive use of SCP, whereas male students tended to mix SCP and MQP. We conclude that students develop different prompting strategies that lead to different response qualities and learning. More research is needed to deepen our understanding and inform effective educational practices in the AI era.
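    The effect sizes quoted above read like standardized mean differences (Cohen's d). As a reference for how such a value is computed, here is a minimal sketch using a pooled standard deviation; the accuracy arrays are invented for illustration and are not the study's data.

    ```python
    # Minimal sketch of Cohen's d with pooled standard deviation.
    # The accuracy samples below are invented for illustration only.
    import numpy as np

    def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)

    scp = np.array([0.40, 0.55, 0.35, 0.50, 0.45])  # accuracy under SCP (invented)
    srp = np.array([0.70, 0.80, 0.65, 0.75, 0.85])  # accuracy under SRP (invented)
    print(round(cohens_d(scp, srp), 2))  # negative value: SCP accuracy below SRP
    ```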

  2. awesome-chatgpt-prompts

    • huggingface.co
    Updated Dec 9, 2022
    Cite
    Fatih Kadir Akın (2022). awesome-chatgpt-prompts [Dataset]. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2022
    Authors
    Fatih Kadir Akın
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    🧠 Awesome ChatGPT Prompts [CSV dataset]

    This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.

    License

    CC-0
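    For quick use in Python, the dataset can be pulled with the Hugging Face datasets library. A minimal sketch, assuming the repository's CSV columns act and prompt:

    ```python
    # Minimal sketch: load the prompts via the Hugging Face `datasets` library.
    # Assumes the CSV columns `act` and `prompt` of this repository.
    from datasets import load_dataset

    ds = load_dataset("fka/awesome-chatgpt-prompts", split="train")
    for row in ds.select(range(3)):  # preview the first three entries
        print(row["act"], "->", row["prompt"][:60])
    ```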

  3. [WSESE] [Prompt Engineering in Data Analysis] Included and Excluded Papers

    • figshare.com
    xlsx
    Updated Feb 2, 2025
    Cite
    Lucas Valença; Ronnie de Souza Santos; Reydne Santos; Matheus de Morais Leça (2025). [WSESE] [Prompt Engineering in Data Analysis] Included and Excluded Papers [Dataset]. http://doi.org/10.6084/m9.figshare.28326737.v6
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Valença; Ronnie de Souza Santos; Reydne Santos; Matheus de Morais Leça
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context. The use of large language models for qualitative analysis is gaining attention in various fields, including software engineering, where qualitative methods are essential to understanding human and social factors. Goal. This study aimed to investigate how LLMs are currently used in qualitative analysis and how they can be used in software engineering research, focusing on identifying the benefits, limitations, and practices associated with their application. Method. We conducted a systematic mapping study and analyzed 21 relevant studies to explore how the use of LLMs for qualitative analysis is reported in the literature. Findings. Our findings indicate that LLMs are primarily used for tasks such as coding, thematic analysis, and data categorization, with benefits including increased efficiency and support for new researchers. However, limitations such as output variability, challenges capturing nuanced perspectives, and ethical concerns regarding privacy and transparency were also evident. Discussion. The study highlights the need for structured strategies and guidelines to optimize LLM use in qualitative research within software engineering. Such strategies could enhance the effectiveness of LLMs while addressing ethical considerations. Conclusion. While LLMs show promise for supporting qualitative analysis, human expertise remains essential for data interpretation, and continued exploration of best practices will be crucial for their effective integration into empirical software engineering research.

  4. Data Analytics Consulting Service Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    Cite
    Data Insights Market (2025). Data Analytics Consulting Service Report [Dataset]. https://www.datainsightsmarket.com/reports/data-analytics-consulting-service-1458356
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Analytics Consulting Services market is experiencing robust growth, driven by the increasing adoption of data-driven decision-making across various industries. The market, estimated at $150 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033, reaching approximately $400 billion by 2033. This expansion is fueled by several key factors. The surge in data volume and variety necessitates specialized expertise in data analytics, pushing organizations to seek professional consulting services. Furthermore, the growing need for advanced analytics techniques, including predictive modeling, machine learning, and AI, is driving demand for sophisticated consulting solutions. The rise of cloud computing and big data technologies is also contributing to market growth by enabling easier data storage, processing, and analysis. Finally, regulatory compliance requirements, such as GDPR and CCPA, are prompting businesses to invest in data governance and analytics consulting to ensure data security and privacy.

    The competitive landscape is characterized by a mix of large multinational consulting firms (Accenture, Deloitte, EY, PwC, McKinsey, BCG) and specialized data analytics consultancies (DataArt, Infosys, Appnovation, InData Labs, etc.). These firms offer a wide range of services, including data strategy development, data warehousing and integration, business intelligence implementation, advanced analytics solutions, and data visualization services.

    While significant growth is anticipated, challenges remain. These include the shortage of skilled data scientists and analysts, the complexity of integrating various data sources, and the need for robust data security measures. The market is segmented based on various factors such as service type, industry vertical, and geographic region, allowing firms to target specific niches and maximize their market penetration. The North American market currently holds the largest market share, followed by Europe and Asia Pacific, but growth in emerging economies is expected to be substantial in the coming years.

  5. Replication Package of the paper "Large Language Models for Multilingual...

    • zenodo.org
    zip
    Updated Mar 14, 2025
    Cite
    Anonymous; Anonymous (2025). Replication Package of the paper "Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality" [Dataset]. http://doi.org/10.5281/zenodo.15028641
    Available download formats: zip
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality

    Abstract

    Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain mixes of comments and literals written in English and the language used to formulate the prompt.

    Replication Package

    This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.

    Data

    The data directory contains six subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:

    1. prompt_translation: Contains files with manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The original English prompt.
      • summary: The original summary.
      • code: The original code.
      • translation: The translation generated by GPT.
      • correction: The manual correction of the GPT-generated translation.
      • correction_tag: A list of tags indicating the corrections made to the translation.
      • generated_code: This column is initially empty and will contain the code generated from the translated prompt.
    2. generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains the following:

      • files: The files with the generated code (named by the query ID).
      • report: Reports generated by static analysis tools.
      • A CSV file (e.g., java_chinese_claude.csv) containing the generated code in the corresponding column.
    3. tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.

    4. quantitative_analysis: Contains all the CSV reports of the static analysis tools and the test output for all languages and models. These files are the inputs for the statistical analysis. The directory stats contains all the output tables for the statistical analysis, which are shown in the paper's tables.

    5. qualitative_analysis: Contains files used for the qualitative analysis:

      • Cohen_Kappa_agreement.csv: A file containing the subset used to compute Cohen's kappa metrics for manual analysis.
      • files: Contains all files for the qualitative analysis. Each file has the following columns:
        • id: The ID of the query in the CoderEval benchmark.
        • generated_code: The code generated by the model.
        • comments: The language used for comments.
        • identifiers: The language used for identifiers.
        • literals: The language used for literals.
        • notes: Additional notes.
    6. ablation_study: Contains files for the ablation study. Each file has the following columns:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The prompt used for code generation.
      • generated_code, comments, identifiers, and literals: Same as in the qualitative analysis.
      • results.pdf: This file shows the table containing all the percentages of comments, identifiers, and literals extracted from the CSV files of the ablation study.

      Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:

    You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
    Use a Python code block to write your response.
    Comments and identifiers must be in Italian. 
    For example:
    ```python
    print("Hello World!")
    ```
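    For orientation, here is a minimal sketch of inspecting one of the prompt_translation CSV files with pandas; the file path is hypothetical, while the column names are those documented above.

    ```python
    # Minimal sketch: inspect a translated-prompts CSV with pandas.
    # The path is hypothetical; the columns follow the list documented above.
    import pandas as pd

    df = pd.read_csv("data/1_prompt_translation/python_italian.csv")
    print(df.columns.tolist())  # id, prompt, summary, code, translation, ...
    print(df[["id", "translation", "correction_tag"]].head())
    ```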

    Scripts

    The scripts directory contains all the scripts used to perform the generation and analysis steps. All files are properly commented. Here is a brief description of each file:

    • code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.

    • compute_all_analysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.

    • create_test_java.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.

    • deepseek_model.py: This script defines a function that sends a request to the DeepSeek API, passing a system and user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.

    • extract_pmd_report.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.

    • flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.

    • generate_prediction_claude_java.py: The generate_code_from_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it is JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.

    • generate_prediction_claude_python.py: This code defines a function generate_code_from_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are
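    To illustrate the pattern these analysis scripts follow, here is a minimal, hypothetical sketch of running Flake8 over a directory and summarizing the findings in a CSV, in the spirit of flake_analysis.py (not the package's actual code):

    ```python
    # Hypothetical sketch in the spirit of flake_analysis.py: run Flake8 and
    # summarize errors per file in a CSV. Not the replication package's code.
    import csv
    import subprocess

    def flake_report(src_dir: str, out_csv: str) -> None:
        # Flake8's default output format is "path:line:col: CODE message".
        proc = subprocess.run(["flake8", src_dir], capture_output=True, text=True)
        with open(out_csv, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["file", "line", "code", "message"])
            for line in proc.stdout.splitlines():
                path, lineno, _col, rest = line.split(":", 3)
                code, _, message = rest.strip().partition(" ")
                writer.writerow([path, lineno, code, message])

    flake_report("generation/python_italian_gpt/files", "flake8_report.csv")
    ```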

  6. Example prompts, their task-related features, and their assigned complexity...

    • plos.figshare.com
    xls
    Updated Feb 21, 2025
    Cite
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin (2025). Example prompts, their task-related features, and their assigned complexity values. [Dataset]. http://doi.org/10.1371/journal.pone.0317084.t001
    Available download formats: xls
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example prompts, their task-related features, and their assigned complexity values.

  7. Can Developers Prompt? A Controlled Experiment for Code Documentation...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 11, 2024
    Cite
    Maalej, Walid (2024). Can Developers Prompt? A Controlled Experiment for Code Documentation Generation [Replication Package] [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13127237
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Kruse, Hans-Alexander
    Puhlfürß, Tim
    Maalej, Walid
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Summary of Artifacts

    This is the replication package for the paper titled 'Can Developers Prompt? A Controlled Experiment for Code Documentation Generation' that is part of the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME), from October 6 to 11, 2024, located in Flagstaff, AZ, USA.

    Full Abstract

    Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Especially students perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.
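    For context, the control group's predefined few-shot prompt pairs example inputs with target outputs before the actual task. The sketch below is hypothetical and only illustrates the pattern; the study's actual few shots are listed in few_shots.txt in the appendix folder.

    ```python
    # Hypothetical few-shot prompt for docstring generation, illustrating the
    # pattern only; the study's real few shots are in few_shots.txt.
    FEW_SHOT_PROMPT = '''Write a concise docstring for the last function.

    def add(a: int, b: int) -> int:
        return a + b

    Docstring: """Return the sum of a and b."""

    def {function_to_document}:
        {body}

    Docstring:'''
    ```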

    Author Information

    | Name | Affiliation | Email |
    |------|-------------|-------|
    | Hans-Alexander Kruse | Universität Hamburg | hans-alexander.kruse@studium.uni-hamburg.de |
    | Tim Puhlfürß | Universität Hamburg | tim.puhlfuerss@uni-hamburg.de |
    | Walid Maalej | Universität Hamburg | walid.maalej@uni-hamburg.de |

    Citation Information

    @inproceedings{kruse-icsme-2024,
      author    = {Kruse, Hans-Alexander and Puhlf{\"u}r{\ss}, Tim and Maalej, Walid},
      booktitle = {2024 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
      title     = {Can Developers Prompt? A Controlled Experiment for Code Documentation Generation},
      year      = {2024},
      doi       = {tba},
    }

    Artifacts Overview

    1. Preprint

    The file kruse-icsme-2024-preprint.pdf is the preprint version of the official paper. You should read the paper in detail to understand the study, especially its methodology and results.

    2. Results

    The folder results includes two subfolders, explained in the following.

    Demographics RQ1 RQ2

    The subfolder Demographics RQ1 RQ2 provides the Jupyter Notebook file evaluation.ipynb for analyzing (1) the experiment participants' submissions of the digital survey and (2) the ad-hoc prompts that the experimental group entered into their tool. Hence, this file provides demographic information about the participants and results for research questions 1 and 2. Please refer to the README file inside this subfolder for installation steps of the Jupyter Notebook file.

    RQ2

    The subfolder RQ2 contains further subfolders with Microsoft Excel files specific to the results of research question 2:

    The subfolder UEQ contains three copies of the official User Experience Questionnaire (UEQ) analysis Excel tool, with data entered from all participants, students, and professionals, respectively.

    The subfolder Open Coding contains three Excel files with the open-coding results for the free-text answers that participants could enter at the end of the survey to state additional positive and negative comments about their experience during the experiment. The Consensus file provides the finalized version of the open coding process.

    3. Extension

    The folder extension contains the code of the Visual Studio Code (VS Code) extension developed in this study to generate code documentation with predefined prompts. Please refer to the README file inside the folder for installation steps. Alternatively, you can install the deployed version of this tool, called Code Docs AI, via the VS Code Marketplace.

    You can install the tool to generate code documentation with ad-hoc prompts directly via the VS Code Marketplace. We did not include the code of this extension in this replication package due to license conflicts (GNUv3 vs. MIT).

    4. Survey

    The folder survey contains PDFs of the digital survey in two versions:

    The file Survey.pdf contains the rendered version of the survey (how it was presented to participants).

    The file SurveyOptions.pdf is an export of the LimeSurvey web platform. Its main purpose is to provide the technical answer codes, e.g., AO01 and AO02, that refer to the rendered answer texts, e.g., Yes and No. This can help you if you want to analyze the CSV files inside the results folder (instead of using the Jupyter Notebook file), as the CSVs contain the answer codes, not the answer texts. Please note that an export issue caused page 9 to be almost blank. However, this problem is negligible as the question on this page only contained one free-text answer field.

    5. Appendix

    The folder appendix provides additional material about the study:

    The subfolder tool_screenshots contains screenshots of both tools.

    The file few_shots.txt lists the few shots used for the predefined prompt tool.

    The file test_functions.py lists the functions used in the experiment.

    Revisions

    Version Changelog

    1.0.0 Initial upload

    1.1.0 Add paper preprint. Update abstract.

    1.2.0 Update replication package based on ICSME Artifact Track reviews

    License

    See LICENSE file.

  8. Results of chi-square test testing all different prompting strategies over...

    • plos.figshare.com
    xls
    Updated Feb 21, 2025
    Cite
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin (2025). Results of chi-square test testing all different prompting strategies over the various complexities. [Dataset]. http://doi.org/10.1371/journal.pone.0317084.t002
    Available download formats: xls
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of chi-square test testing all different prompting strategies over the various complexities.
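    To illustrate the kind of test reported here, a minimal sketch of a chi-square test over a prompting-strategies-by-complexity contingency table; the counts are invented, and scipy's chi2_contingency stands in for the authors' exact procedure:

    ```python
    # Minimal sketch of a chi-square test over prompting strategies (rows)
    # vs. task complexity levels (columns). Counts are invented.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[30, 22, 10],
                      [18, 25, 15],
                      [12, 20, 28]])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
    ```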

  9. Cookie Tracking Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Cookie Tracking Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/cookie-tracking-software-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Cookie Tracking Software Market Outlook



    The global cookie tracking software market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach around USD 4.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 16.8% during the forecast period. This growth is driven by increasing digitalization, heightened demand for personalized marketing, and stringent data privacy regulations. Companies are investing heavily in technologies that can help them track user behavior, optimize user experiences, and ensure compliance with evolving privacy laws, which fuels market growth.



    One of the primary growth factors for the cookie tracking software market is the increasing emphasis on personalized marketing. As companies strive to offer more tailored user experiences, they require sophisticated tools to collect and analyze user data. Cookie tracking software enables businesses to capture detailed insights into user preferences and behaviors, allowing them to deliver personalized content and advertisements. This capability significantly enhances customer engagement and conversion rates, making it a critical component for digital marketing strategies.



    Another contributing factor is the rising implementation of data privacy regulations worldwide. Legislation such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States has necessitated more transparent and secure data tracking practices. Cookie tracking software helps organizations comply with these regulations by providing features like consent management and data anonymization. This ensures that businesses can continue to leverage user data while adhering to legal requirements, thereby mitigating the risk of hefty fines and reputational damage.



    The growing adoption of digital platforms during the COVID-19 pandemic has further accelerated the demand for cookie tracking software. With an increasing number of consumers shifting to online shopping and remote work environments, businesses have had to ramp up their digital presence. This surge in digital activity has underscored the importance of effective user tracking and data analysis, prompting more companies to invest in advanced cookie tracking solutions to better understand and cater to their online audiences.



    In the realm of digital marketing, Subscription Analytics Software is becoming increasingly vital as businesses transition to subscription-based models. This software provides companies with the tools to analyze customer subscription data, offering insights into customer behavior, preferences, and churn rates. By leveraging these insights, businesses can optimize their subscription offerings, tailor marketing strategies, and enhance customer retention. The integration of subscription analytics with cookie tracking software can further enrich data collection, enabling a more comprehensive understanding of user interactions and preferences. As the subscription economy continues to grow, the demand for robust analytics solutions that can seamlessly integrate with existing digital marketing tools is expected to rise.



    Regionally, North America is expected to hold a significant share of the cookie tracking software market, driven by the presence of major technology companies and high internet penetration rates. Europe is also anticipated to see robust growth due to stringent data protection regulations and widespread digitalization initiatives. Meanwhile, the Asia Pacific region is projected to experience the fastest growth, fueled by rapid economic development, increasing internet usage, and the proliferation of e-commerce platforms in countries like China and India.



    Component Analysis



    The cookie tracking software market can be segmented by component into software and services. The software segment, which includes various types of cookie tracking applications and platforms, is expected to dominate the market. This is largely due to the continuous advancements in technology and the increasing need for sophisticated tools to analyze vast amounts of user data. Companies are constantly seeking robust software solutions that can provide real-time insights and seamless integration with other digital marketing tools.



    Within the software segment, several sub-categories exist, including standalone cookie tracking software and integrated solutions that form part of broader digital marketing platforms. Sta

  10. PROSPECT: Professional Role Effects on Specialized Perspective Enhancement...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Dec 29, 2024
    Cite
    Keisuke Sato (2024). PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task [Dataset]. http://doi.org/10.5281/zenodo.14567800
    Available download formats: zip
    Dataset updated
    Dec 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Keisuke Sato
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 29, 2024
    Description

    ### Data Availability Statement (for the paper)

    All dialogue logs and final responses collected in this study are publicly available in the PROSPECT repository on Zenodo (DOI: [to be assigned]). The repository contains PDF files of complete dialogue histories and Markdown files of final comprehensive analyses for all conditions and models used in this study, allowing for reproducibility and further analysis.

    ### README.md for Zenodo

    # PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task

    ## Overview
    This repository (PROSPECT) contains the dataset associated with the paper:
    > "Empirical Investigation of Expertise, Multiperspectivity, and Abstraction Induction in Conversational AI Outputs through Professional Role Assignment to Both User and AI"

    This research analyzed changes in dialogue logs and final responses when professional roles were assigned to both user and AI sides across multiple Large Language Models (LLMs). This repository provides the complete dialogue logs (PDF format) and final responses (Markdown format) used in the analysis.

    ## Directory Structure
    The repository structure under the top directory (`PROSPECT/`) is as follows:

    ```
    PROSPECT/
    ├── dialogue/          # Dialogue histories (PDF)
    │   ├── none/
    │   ├── ai_only/
    │   ├── user_only/
    │   └── both/
    └── final_answers/     # Final responses (Markdown)
        ├── none/
        ├── ai_only/
        ├── user_only/
        └── both/
    ```

    - **dialogue/**
    - Contains raw dialogue logs in PDF format. Subdirectories represent role assignment conditions:
    - `none/`: No roles assigned to either user or AI
    - `ai_only/`: Role assigned to AI only
    - `user_only/`: Role assigned to user only
    - `both/`: Roles assigned to both user and AI
    - **final_answers/**
    - Contains final comprehensive analysis responses in Markdown format. Directory structure mirrors that of `dialogue/`.

    ## File Naming Convention
    Files in each directory follow this naming convention:
    ```
    [AI]_[conditionNumber]-[roleNumber].pdf
    [AI]_[conditionNumber]-[roleNumber].md
    ```
    - `[AI]`: AI model name used for dialogue (e.g., ChatGPT, ChatGPT-o1, Claude, Gemini)
    - `[conditionNumber]`: Number indicating role assignment condition
    - 0: none
    - 1: ai_only
    - 2: user_only
    - 3: both
    - `[roleNumber]`: Professional role number
    - 0: No role
    - 1: Detective
    - 2: Psychologist
    - 3: Artist
    - 4: Architect
    - 5: Natural Scientist

    ### Examples:
    - `ChatGPT_3-1.pdf`: Dialogue log with ChatGPT-4o model under "both" condition (3) with detective role (1)
    - `Gemini_1-4.md`: Final response from Gemini model under "ai_only" condition (1) with architect role (4)
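    A minimal sketch of decoding this naming convention in Python, using the condition and role mappings documented in this README:

    ```python
    # Minimal sketch: decode PROSPECT file names such as "ChatGPT_3-1.pdf"
    # using the condition and role mappings documented above.
    import re

    CONDITIONS = {0: "none", 1: "ai_only", 2: "user_only", 3: "both"}
    ROLES = {0: "No role", 1: "Detective", 2: "Psychologist",
             3: "Artist", 4: "Architect", 5: "Natural Scientist"}

    def decode(filename: str) -> dict:
        model, cond, role = re.match(r"(.+)_(\d)-(\d)\.(?:pdf|md)$", filename).groups()
        return {"model": model,
                "condition": CONDITIONS[int(cond)],
                "role": ROLES[int(role)]}

    print(decode("ChatGPT_3-1.pdf"))
    # {'model': 'ChatGPT', 'condition': 'both', 'role': 'Detective'}
    ```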

    ## Role Number Reference
    | roleNumber | Professional Role |
    |-----------:|:-----------------|
    | 0 | No role |
    | 1 | Detective |
    | 2 | Psychologist |
    | 3 | Artist |
    | 4 | Architect |
    | 5 | Natural Scientist|

    ## Data Description
    - **Dialogue Histories (PDF format)**
    Complete logs of questions and responses from each session, preserved as captured during the research. All dialogues were conducted in Japanese. While assistant version information is not included, implementation dates and model names are recorded within the files.
    - **Final Responses (Markdown format)**
    Excerpted responses to the final "comprehensive analysis request" as Markdown files, intended for text analysis and keyword extraction. All responses are in Japanese.

    Note: This dataset contains dialogues and responses exclusively in Japanese. Researchers interested in lexical analysis or content analysis should consider this language specification.

    ## How to Use
    1. Please maintain the folder hierarchy after downloading.
    2. For meta-analysis or lexical analysis, refer to PDFs for complete dialogues and Markdown files for final responses.
    3. Utilize for research reproduction, secondary analysis, or meta-analysis.

    ## License
    This dataset is released under the **CC BY 4.0** License.
    - Free to use and modify, but please cite this repository (DOI) and the associated paper when using the data.

    ## Related Publication


    ## Disclaimer
    - The dialogue logs contain no personal information or confidential data.
    - The provided logs and responses reflect the research timing; identical prompts may yield different responses due to AI model updates.
    - The creators assume no responsibility for any damages resulting from the use of this dataset.

    ## Contact
    For questions or requests, please contact skeisuke@ibaraki-ct.ac.jp.

  11. Global Laser Methane Telemetry Detection Module Market Industry Best...

    • statsndata.org
    excel, pdf
    Updated Jul 2025
    Cite
    Stats N Data (2025). Global Laser Methane Telemetry Detection Module Market Industry Best Practices 2025-2032 [Dataset]. https://www.statsndata.org/report/laser-methane-telemetry-detection-module-market-299604
    Available download formats: pdf, excel
    Dataset updated
    Jul 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Laser Methane Telemetry Detection Module market is witnessing significant growth as industries increasingly prioritize safety and environmental compliance. These modules utilize advanced laser technology to detect methane and other gases in real-time, providing critical data for prompt decision-making and risk m

  12. Data from: Can Large Language Models Identify Locations Better Than Linked...

    • portaldelaciencia.uva.es
    • zenodo.org
    Updated 2025
    Cite
    García-Zarza, Pablo; Asensio-Pérez, Juan I.; Bote-Lorenzo, Miguel L.; Sánchez-Turrión, Luis F.; Taibi, Davide; Vega-Gorgojo, Guillermo (2025). Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? [Dataset]. https://portaldelaciencia.uva.es/documentos/6856990b6364e456d3a65544
    Dataset updated
    2025
    Authors
    García-Zarza, Pablo; Asensio-Pérez, Juan I.; Bote-Lorenzo, Miguel L.; Sánchez-Turrión, Luis F.; Taibi, Davide; Vega-Gorgojo, Guillermo
    Description

    This repository contains the dataset and analysis associated with the research paper "Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning?", presented at the 20th European Conference on Technology Enhanced Learning (ECTEL), 2025.

    Overview

    Ubiquitous learning (u-learning) applications often rely on identifying relevant Points of Interest (POIs) where students can engage in contextualized learning tasks. Traditionally, these POIs have been retrieved from structured datasets like Linked Open Data (LOD). However, with the rise of Large Language Models (LLMs), a new question arises: can LLMs outperform LOD in identifying such locations?

    This study compares the performance of a LOD dataset (Wikidata) and two LLMs (ChatGPT and DeepSeek) in retrieving 16th-century cultural heritage sites (churches, cathedrals, castles, and palaces) across three European cities (two in Spain and one in Italy) and their regions.

    Dataset

    The file LODvsLLMs.xlsx includes:

    Raw data retrieved from Wikidata and the two LLMs.

    SPARQL queries and LLM prompts used for data collection.

    Comparative analysis across four key dimensions:

    Accuracy: Are the retrieved sites real and verifiable?

    Consistency: Do repeated queries yield stable results?

    Completeness: How exhaustive are the lists of POIs?

    Validity: Are the geographic coordinates and Wikipedia links correct?
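    As a concrete illustration of the LOD side of this comparison, here is a minimal sketch of querying Wikidata's SPARQL endpoint from Python for 16th-century church buildings in a city. The city QID is a placeholder; the study's actual SPARQL queries are provided in LODvsLLMs.xlsx.

    ```python
    # Minimal sketch: query Wikidata for 16th-century church buildings in a
    # city. The city QID is a placeholder; the study's actual queries are in
    # LODvsLLMs.xlsx.
    import requests

    CITY_QID = "Q8356"  # placeholder; substitute the QID of the target city

    QUERY = f"""
    SELECT ?poi ?poiLabel WHERE {{
      ?poi wdt:P31/wdt:P279* wd:Q16970 ;  # instance of (a kind of) church building
           wdt:P131 wd:{CITY_QID} ;       # located in the target city
           wdt:P571 ?inception .          # inception date
      FILTER(YEAR(?inception) >= 1501 && YEAR(?inception) <= 1600)
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "u-learning-poi-demo/0.1"})
    for row in resp.json()["results"]["bindings"]:
        print(row["poiLabel"]["value"])
    ```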

    Key Findings

    LOD (Wikidata) outperformed LLMs in terms of consistency, completeness (especially in larger regions), and validity of data.

    LLMs were able to retrieve some POIs not found in Wikidata, but also introduced hallucinations and invalid links.

    A hybrid approach combining LOD and LLMs is proposed for future u-learning applications to maximize coverage and reliability.

    Citation

    If you use this dataset or refer to the findings in your work, please cite the original paper presented at ECTEL 2025.

    García-Zarza, P., Asensio-Pérez, J.I., Bote-Lorenzo, M.L., Sánchez-Turrión, L.F., Taibi, D., Vega-Gorgojo, G.: Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? In: Proceedings of the 20th European Conference on Technology Enhanced Learning (ECTEL 2025), Newcastle & Durham, United Kingdom, September 2025.

  13. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to...

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Aug 1, 2025
    Cite
    Scott McGrath (2025). A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data [Dataset]. http://doi.org/10.5061/dryad.s4mw6m9cv
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Scott McGrath
    Time period covered
    Jan 1, 2023
    Description

    Objective: Our objective is to evaluate the efficacy of ChatGPT 4 in accurately and effectively delivering genetic information, building on previous findings with ChatGPT 3.5. We focus on assessing the utility, limitations, and ethical implications of using ChatGPT in medical settings. Materials and Methods: A structured questionnaire, including the Brief User Survey (BUS-15) and custom questions, was developed to assess ChatGPT 4's clinical value. An expert panel of genetic counselors and clinical geneticists independently evaluated ChatGPT 4's responses to these questions. We also involved comparative analysis with ChatGPT 3.5, utilizing descriptive statistics and using R for data analysis. Results: ChatGPT 4 demonstrated improvements over 3.5 in context recognition, relevance, and informativeness. However, performance variability and concerns about the naturalness of the output were noted. No significant difference in accuracy was found between ChatGPT 3.5 and 4.0. Notably, the effic...

    Study Design: This study was conducted to evaluate the performance of ChatGPT 4 (March 23rd, 2023 model) in the context of genetic counseling and education. The evaluation involved a structured questionnaire, which included questions selected from the Brief User Survey (BUS-15) and additional custom questions designed to assess the clinical value of ChatGPT 4's responses.

    Questionnaire Development: The questionnaire was built on Qualtrics and comprised twelve questions: seven selected from the BUS-15, preceded by two additional questions that we designed. The initial questions focused on quality and answer relevancy:

    1. The overall quality of the Chatbot's response is: (5-point Likert: Very poor to Very good)
    2. The Chatbot delivered an answer that provided the relevant information you would include if asked the question. (5-point Likert: Strongly disagree to Strongly agree)

    The BUS-15 questions (7-point Likert: Strongly disagree to Strongly agree) focused on:

    1. Recogniti...

    # A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data

    https://doi.org/10.5061/dryad.s4mw6m9cv

    This data was captured when evaluating the ability of ChatGPT to address questions patients may ask about three genetic conditions (BRCA1, HFE, and MLH1). The data is associated with the similarly named JAMIA article with the DOI 10.1093/jamia/ocae128.

    Description of the data and file structure

    1. Key: This tab contains the data structure, explaining the survey questions, and potential responses available.
    2. Prompt Responses: This tab contains the prompts used for ChatGPT, and the response provided from each model (3.5 and 4)
    3. GPT 4 Results: This tab provides the responses collected from the medical experts (genetic counselors and clinical geneticist) from the Qualtrics survey.
    4. Accuracy (Qx_1): This tab contains the subset of results from both the Ch...
  14. Oil and Gas Pipeline Leak Detection Market Report | Global Forecast From...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Oil and Gas Pipeline Leak Detection Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/oil-and-gas-pipeline-leak-detection-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Oil and Gas Pipeline Leak Detection Market Outlook



    The global oil and gas pipeline leak detection market size is projected to experience significant growth, with an expected valuation rising from USD 2.37 billion in 2023 to USD 3.89 billion by 2032, reflecting a healthy compound annual growth rate (CAGR) of 5.6% from 2024 to 2032. This market expansion is largely fueled by the increasing emphasis on safety and environmental regulations, the growing complexity of pipeline networks, and the dire need for efficient and reliable leak detection systems. As governments and organizations worldwide become more aware of and committed to reducing the environmental impacts of fossil fuel extraction and transportation, the demand for advanced leak detection technologies has intensified, driving market growth.



    One of the primary factors contributing to the growth of the oil and gas pipeline leak detection market is the stringent regulatory frameworks being implemented globally to prevent environmental disasters. These regulations mandate the installation of sophisticated leak detection systems to minimize the risk of oil spills and gas leaks, which can have catastrophic environmental and economic consequences. The increasing public awareness and pressure on governments to ensure the safety and integrity of oil and gas infrastructure have also played a crucial role in driving the market's expansion. Furthermore, the adoption of best practices and international standards in pipeline monitoring and maintenance is further propelling the demand for innovative and reliable leak detection technologies.



    Technological advancements in the oil and gas industry have paved the way for the development of more efficient and accurate leak detection systems. Innovations such as acoustic/ultrasonic sensors, fiber optic technologies, and advanced data analytics are improving the precision and reliability of leak detection, thereby reducing operational risks and potential losses. The integration of Internet of Things (IoT) and artificial intelligence (AI) in pipeline monitoring systems enhances real-time data collection and analysis, enabling prompt detection and response to leaks. These cutting-edge technologies are not only enhancing the effectiveness of leak detection but also reducing the overall costs associated with pipeline monitoring and maintenance, making them increasingly attractive to oil and gas companies.



    The growing global energy demand and the expansion of oil and gas pipeline networks, especially in emerging economies, are also driving the need for efficient leak detection systems. As countries endeavor to secure their energy supply and improve infrastructure, significant investments are being made in the construction and maintenance of extensive pipeline networks. This expansion necessitates robust leak detection solutions to ensure the safe and efficient transportation of oil and gas resources. Additionally, the shift towards unconventional oil and gas resources, such as shale gas and deepwater drilling, presents new challenges in leak detection, further increasing the demand for advanced technologies.



    Pipeline Leak Detectors play a crucial role in ensuring the safety and efficiency of oil and gas transportation. These detectors are designed to identify leaks quickly and accurately, minimizing the risk of environmental damage and economic loss. By utilizing advanced technologies such as acoustic sensors and fiber optics, pipeline leak detectors can provide real-time monitoring and immediate alerts, allowing operators to respond swiftly to any potential issues. This capability is particularly important in complex pipeline networks, where undetected leaks can lead to significant operational challenges. As the industry continues to evolve, the integration of pipeline leak detectors with digital technologies like AI and IoT is enhancing their effectiveness, offering more precise detection and predictive maintenance capabilities.



    Technology Analysis



    The technology segment of the oil and gas pipeline leak detection market encompasses various sophisticated systems, each offering unique advantages in detecting leaks with precision. Acoustic/ultrasonic technology, for instance, stands out for its ability to detect leaks through sound waves. This method is particularly effective in situations where traditional methods may fall short, as it can monitor for changes in noise levels along pipeline routes, indicating potential leaks. The sensitivity of acoustic/ultrasonic systems to sound variations makes th

  15. Data Center Construction Market in Southeast Asia by Construction Components...

    • technavio.com
    pdf
    Updated May 31, 2021
    Cite
    Technavio (2021). Data Center Construction Market in Southeast Asia by Construction Components and Geography - Forecast and Analysis 2021-2025 [Dataset]. https://www.technavio.com/report/data-center-construction-market-industry-in-southeast-asia-analysis
    Available download formats: pdf
    Dataset updated
    May 31, 2021
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2020 - 2025
    Area covered
    South East Asia
    Description


    The data center construction market in Southeast Asia is expected to grow by USD 3.61 billion and record a CAGR of 12% during 2021-2025. This post-pandemic report has assessed the shift in consumer behavior and has identified and explored the upcoming trends and drivers that vendors can capitalize on to support prompt business decisions. In this analysis report, key drivers such as the increase in investment in data centers have been discussed together with emerging growth regions, which will offer immense business opportunities. Our analysts have also identified challenges such as system integration and interoperability issues, which will impede market growth. With these insights, vendors can recreate their plan of action to obtain growth opportunities in the future. This report further entails segmentation by geography (Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia) and construction component (electrical construction, mechanical construction, consulting and other services, and integrating software). The available actionable insights on the segmentations will enable a better understanding of the target audience and changing demand patterns.

    Who are the Key Vendors in the Data Center Construction Market In Southeast Asia?

    The data center construction market in Southeast Asia forecast report provides insights on complete key vendor profiles and their business strategies to reimagine themselves. The profiles include information on the production, competitive landscape, sustainability, and prospects of the leading companies, including:

    ABB Ltd.
    AECOM
    Eaton Corporation Plc
    Hewlett Packard Enterprise Development LP
    Legrand SA 
    M+W Group GmbH
    Ove Arup & Partners International Ltd.
    Rittal GmbH & Co. KG
    Schneider Electric SE
    Vertiv Holdings Co.
    

    Our analysts have extensively outlined successful business strategies deployed by the key vendors in this market research report. The data center construction market in Southeast Asia is fragmented, and the vendors are deploying various organic and inorganic growth strategies to compete in the market.

    To make the most of the opportunities, vendors should focus on fast-growing segments, while maintaining their positions in the slow-growing segments. The report further offers well-structured marketing strategies to overcome the negative post-COVID-19 impact, if any, on each product and service segment.

    Which are the Key Regional Markets for the Data Center Construction Market in Southeast Asia?

    The report offers an up-to-date analysis of the geographical composition of the market. Singapore will record a fast growth rate during 2021-2025, owing to which the region should offer several growth opportunities to market vendors. The rise in IoT solutions will significantly influence market growth in this region. From the statistical study of the geographic landscape, you can interpret and understand the competitive intelligence and regional opportunities in store for vendors for 2021-2025.

    35% of the market's growth will originate from Singapore during the forecast period. Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia are the key regional markets. This report provides estimations of the contribution of all regions to the growth of the data center construction market in Southeast Asia.

    Data Center Construction Market in Southeast Asia Scope

    | Report Coverage | Details |
    |---|---|
    | Page number | 120 |
    | Base year | 2020 |
    | Forecast period | 2021-2025 |
    | Growth momentum & CAGR | Accelerate at a CAGR of 12% |
    | Market growth 2021-2025 | USD 3.61 billion |
    | Market structure | Fragmented |
    | YoY growth (%) | 9.45 |
    | Regional analysis | Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia |
    | Performing market contribution | Singapore at 35% |
    | Key consumer countries | Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia |
    | Competitive landscape | Leading companies, competitive strategies, consumer engagement scope |
    | Companies profiled | ABB Ltd., AECOM, Eaton Corporation Plc, Hewlett Packard Enterprise Development LP, Legrand SA, M+W Group GmbH, Ove Arup & Partners International Ltd., Rittal GmbH & Co. KG, Schneider Electric SE, and Vertiv Holdings Co. |
    | Market Dynamics | Parent market a |
    
  16. Z

    Geoparsing with Large Language Models: Leveraging the linguistic...

    • data.niaid.nih.gov
    Updated Oct 2, 2024
    Cite
    Anonymous, Anonymous (2024). Geoparsing with Large Language Models: Leveraging the linguistic capabilities of generative AI to improve geographic information extraction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13862654
    Explore at:
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    Anonymous, Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geoparsing with Large Language Models

    The .zip file included in this repository contains all the code and data required to reproduce the results from our paper. Note, however, that in order to run the OpenAI models, users will require an OpenAI API key and sufficient API credits.

    Data

    The data used for the paper are in the datasets and results folders.

    **Datasets:** This contains the XML files (LGL and GeoVirus) and JSON files (News2024) used to benchmark the models. It also contains all the data used to fine-tune the GPT-3.5 model, the prompt templates sent to the LLMs, and other data used for mapping and data creation.

    **Results:** This contains the results for the models on the three datasets. The folder is separated by dataset, with a single .csv file giving the results for each model on each dataset. Each row contains a predicted toponym and its associated true toponym (along with assigned spatial coordinates) when the model correctly identified a toponym; otherwise, the true-toponym columns are empty for false positives and the predicted columns are empty for false negatives. A scoring sketch based on this structure follows.
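    Given that row structure, toponym-extraction precision, recall, and F1, as well as the resolution accuracy within 161 km quoted in the abstract below, can be scored directly from one of these .csv files. A minimal sketch, assuming hypothetical column names and a hypothetical file path, since the exact headers are not listed here:

```python
# Minimal scoring sketch for one per-row results file (column names assumed).
import pandas as pd
from geopy.distance import geodesic

df = pd.read_csv("results/lgl/gpt-4o.csv")  # hypothetical path

# Rows with both toponyms are true positives; an empty true-toponym column
# marks a false positive, an empty predicted column a false negative.
tp = df.dropna(subset=["predicted_toponym", "true_toponym"])
fp = int(df["true_toponym"].isna().sum())
fn = int(df["predicted_toponym"].isna().sum())

precision = len(tp) / (len(tp) + fp)
recall = len(tp) / (len(tp) + fn)
f1 = 2 * precision * recall / (precision + recall)

# Toponym resolution: share of true positives resolved within 161 km.
within = tp.apply(
    lambda r: geodesic((r["true_lat"], r["true_lon"]),
                       (r["pred_lat"], r["pred_lon"])).km <= 161,
    axis=1,
)
print(f"F1={f1:.3f}, accuracy@161km={within.mean():.3f}")
```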

    Code

    The code is split into two separate folders: gpt_geoparser and notebooks.

    **GPT_Geoparser:** This contains the classes and methods used to process the XML and JSON articles (data.py), interact with the Nominatim API for geocoding (gazetteer.py), interact with the OpenAI API (gpt_handler.py), process the outputs from the GPT models (geoparser.py), and analyse the results (analysis.py).
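    For illustration, the geocoding step in gazetteer.py can be approximated with Geopy's Nominatim wrapper (Geopy appears in the requirements below); this is a sketch of the general pattern, not the package's exact code:

```python
# Sketch: geocode an extracted toponym with Nominatim via Geopy.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="gpt-geoparser-demo")  # identify your app
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # respect rate limits

loc = geocode("Kuala Lumpur")
if loc is not None:
    print(loc.latitude, loc.longitude, loc.raw.get("type"))
```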

    **Notebooks:** This series of notebooks can be used to reproduce the results given in the paper. The file names are reasonably descriptive of what they do within the context of the paper.

    Code/software

    Requirements

    Numpy

    Pandas

    Geopy

    Scikit-learn

    lxml

    openai

    matplotlib

    Contextily

    Shapely

    Geopandas

    tqdm

    huggingface_hub

    GNews

    Access information

    Other publicly accessible locations of the data:

    The LGL and GeoVirus datasets can also be obtained here.

    Abstract

    Geoparsing, the process of associating textual data with geographic locations, is a key challenge in natural language processing. The often ambiguous and complex nature of geospatial language makes geoparsing a difficult task, requiring sophisticated language modelling techniques. Recent developments in Large Language Models (LLMs) have demonstrated their impressive capability in natural language modelling, suggesting their suitability for a wide range of complex linguistic tasks. In this paper, we evaluate the performance of four LLMs - GPT-3.5, GPT-4o, Llama-3.1-8b and Gemma-2-9b - in geographic information extraction by testing them on three geoparsing benchmark datasets: GeoVirus, LGL, and a novel dataset, News2024, composed of geotagged news articles published outside the models' training window. We demonstrate that, through techniques such as fine-tuning and retrieval-augmented generation, LLMs significantly outperform existing geoparsing models. The best performing models achieve a toponym extraction F1 score of 0.985 and toponym resolution accuracy within 161 km of 0.921. Additionally, we show that the spatial information encoded within the embedding space of these models may explain their strong performance in geographic information extraction. Finally, we discuss the spatial biases inherent in the models' predictions and emphasize the need for caution when applying these techniques in certain contexts.

    Methods

    This contains the data and code required to reproduce the results from our paper. The LGL and GeoVirus datasets are pre-existing datasets, with references given in the manuscript. The News2024 dataset was constructed specifically for the paper.

    To construct the News2024 dataset, we first created a list of 50 cities from around the world with populations greater than 1,000,000. We then used the GNews Python package (https://pypi.org/project/gnews/) to find a news article for each location, published between 2024-05-01 and 2024-06-30 (inclusive). Of these articles, 47 were found to contain toponyms; the three rejected articles referred to businesses that share a name with a city and did not otherwise mention any place names.
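    A minimal sketch of that collection step with the GNews package; the language setting, result limit, and example cities are assumptions rather than the paper's exact configuration:

```python
# Sketch: fetch candidate news articles per city for a fixed date window.
from gnews import GNews

google_news = GNews(language="en",
                    start_date=(2024, 5, 1),
                    end_date=(2024, 6, 30),
                    max_results=5)

for city in ["Jakarta", "Lagos", "Toronto"]:  # stand-ins for the 50 cities
    articles = google_news.get_news(city)
    if articles:
        print(city, "->", articles[0]["title"], articles[0]["url"])
```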

    We used a semi autonmous approach to geotagging the articles. The articles were first processed using a Distil-BERT model, fine tuned for named entity recognicion. This provided a first estimate of the toponyms within the text. A human reviewer then read the articles, and accepted or rejected the machine tags, and added any tags missing from the machine tagging process. We then used OpenStreetMap to obtain geographic coordinates for the location, and to identify the toponym type (e.g. city, town, village, river etc). We also flagged if the toponym was acting as a geo-political entity, as these were reomved from the analysis process. In total, 534 toponyms were identified in the 47 news articles.
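    The machine-tagging first pass can be sketched with a Hugging Face token-classification pipeline; the checkpoint named here is a stand-in, since the fine-tuned model used in the paper is not identified on this page:

```python
# Sketch: propose toponym candidates with a DistilBERT NER model for human review.
from transformers import pipeline

ner = pipeline("ner",
               model="dslim/distilbert-NER",  # assumed stand-in checkpoint
               aggregation_strategy="simple")

text = "Flooding closed several roads in Jakarta and delayed trains to Bandung."
candidates = [e for e in ner(text) if e["entity_group"] == "LOC"]
print(candidates)  # a human reviewer then accepts/rejects and adds missing tags
```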

  17. f

    Representative open-ended responses to the prompt “how would having the...

    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Melissa McCartney; Jessica Colon (2023). Representative open-ended responses to the prompt “how would having the [module] during your chosen time have better prepared you for life after FIU? [Dataset]. http://doi.org/10.1371/journal.pone.0285176.t007
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Melissa McCartney; Jessica Colon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Representative open-ended responses to the prompt “How would having the [module] during your chosen time have better prepared you for life after FIU? What would you have done differently in regards to career preparation?” that connect to the inductive code of “better prepared to apply to graduate/professional schools or jobs”. Relevant sections of the student responses are bolded and underlined.

  18. Replication package of the paper "Where is Code Generated by LLMs Coming...

    • zenodo.org
    zip
    Updated Nov 7, 2024
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2024). Replication package of the paper "Where is Code Generated by LLMs Coming From? A Study with Gemini and Bing CoPilot" [Dataset]. http://doi.org/10.5281/zenodo.14051606
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package

    This replication package contains the necessary tools, data, and scripts for reproducing the results of our paper: "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot". Below is a detailed description of the directory structure and the contents of this package.

    Contents

    The replication package is organized into two main directories:

    • assets: This directory contains all .csv files used as input for the scripts and the output .csv files used to perform the manual and automated analyses for RQ1 and RQ2.

    • script: This directory contains all scripts for RQ1 and RQ2.

    In the following, we describe the content of each directory:

    assets

    This directory contains the tools and resources required for our study.

    dataset: Contains the main datasets used in the study.

    • annotationStore.csv: Input dataset for our analyses, originating from the CODESEARCHNET dataset.

    • queries.csv: .csv file containing the queries used for the experiments filtered from the CODESEARCHNET dataset. This file contains the following columns:

      • Language: Programming language of the query
      • Query: Query used for the experiment
      • GitHubUrl: GitHub URL related to a snippet that addresses the query
      • Relevance: Relevance of the linked GitHub snippet to the query

    data: Contains the datasets and results of all analyses.

    • queries.csv: General input queries. This file contains the following columns:

      • Language: Programming language of the query
      • Query: Query used for the snippet generation
      • Prompt: LLM prompt generated for the query as: “You are a Senior developer. Then give me a code snippet about:” (a construction sketch follows this file list)
    • queries_filled.csv: Similar to the previous file, but also containing the output produced by the LLM-based assistants. This file contains the following columns:

      • Language: Programming language of the query
      • Query: Query used for the snippet generation
      • Prompt: LLM prompt generated for the query as: “You are a Senior developer. Then give me a code snippet about:”
      • Notes: General notes that provide additional context or information about the query or prompt.
      • Gemini_Answer(n): The generated code snippets by Gemini.
      • Gemini(n): The external links provided by Gemini.
      • Prompt (repeated)
      • Note: Notes that provide additional context or information about the query or prompt.
      • Copilot_Answer(n): The generated code snippets by Bing-Copilot.
      • Copilot_Bing(n): The external links provided by Bing-Copilot.
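    A minimal pandas sketch of how the Prompt column described above could be derived from the queries; the file path, output name, and the exact way the query is appended to the template are assumptions:

```python
# Sketch: build the Prompt column from the Query column.
import pandas as pd

df = pd.read_csv("assets/data/queries.csv")  # assumed location
template = ("You are a Senior developer. "
            "Then give me a code snippet about: {query}")
df["Prompt"] = df["Query"].map(lambda q: template.format(query=q))
df.to_csv("queries_with_prompts.csv", index=False)  # hypothetical output
```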

    copilot || gemini: Contains the data related to the specific LLM. These two subdirectories have the same internal structure.

    • queries.csv: The queries_filled.csv file, filtered for the specific LLM.
    • queries_noTrivial.csv: Contains only the queries with at least one nontrivial generated snippet.
    • external_links.csv: External links extracted from the LLMs output.

    • external_links_filled.csv: Snippets extracted from the external links.

      • index: Query ID
      • source: Snippet ID
      • url: Link URL
      • note: Notes that provide additional context or information about the query or prompt
      • code(n): The n-th code snippet extracted from the source

    manual_analysis: Manual analysis results.

    • manual_analysis.csv:
      • index: Query ID
      • query: Query used for the snippet generation
      • generatedsnippet(n): The n-th code snippet generated by the LLM-based assistant
      • trivial_1: Manual analysis of whether or not the snippet was trivial (validator 1)
      • trivial_2: Manual analysis of whether or not the snippet was trivial (validator 2)
      • trivial_final: Manual analysis of whether or not the snippet was trivial (final classification if there is a disagreement)
      • source: URL to analyze
      • sourcetype1: Type of the source (validator 1)
      • sourcetype2: Type of the source (validator 2)
      • sourcetypefinal: Type of the source (final classification if there is a disagreement)
      • relatedtoquery_1: Relevance of the link to the query (validator 1)
      • relatedtoquery_2: Relevance of the link to the query (validator 2)
      • relatedtoquery_final: Relevance of the link to the query (final classification if there is a disagreement)
      • relatedtosnippets_1: Relevance of the generated snippet to those in the link (validator 1)
      • relatedtosnippets_2: Relevance of the generated snippet to those in the link (validator 2)
      • relatedtosnippets_final: Relevance of the generated snippet to those in the link (final classification if there is a disagreement)
    • manual_analysis_noTrivial.csv: As in the previous file, but only the queries with at least one nontrivial generated code snippet.

    clone_detector: Output and intermediate files for clone detection with Copilot data.

    • copilot_tokens || gemini_tokens: Contains the output of the tokenization of the generated code snippets and of the code snippets extracted from the external links.
    • merged_llm_ext_link.csv: All possible pairs (Cartesian product) of (code snippet extracted from the external links, generated code snippet). This file is the input to the clone detection tool; a pandas sketch for building it appears at the end of this clone_detector list.
      • ID_query: Query ID
      • query: Query used for the snippet generation
      • language: Programming language of the query
      • generated_snippet: The generated code snippet by the LLM-based assistant
      • IDgensnippet: The index of the generated code snippet
      • LOCgensnippet: The number of lines of code of the generated code snippet
      • ID_source: Source ID
      • source: Source URL
      • source_snippet: Code snippet extracted from the source
      • IDsourcesnippet: ID of the code snippet extracted from the source
      • LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
      • note: Notes that provide additional context or information about the query or prompt
    • clone_detection_output.csv: Contains the clone detection results.
      • ID_query: The index of the query
      • query: Query used for the snippet generation
      • language: The programming language of the query
      • generated_snippet: The generated code snippet by the LLM-based assistant
      • IDgensnippet: The index of the generated code snippet
      • LOCgensnippet: The number of lines of code of the generated code snippet
      • ID_source: Source ID
      • source: Source URL
      • source_snippet: Code snippet extracted from the source
      • IDsourcesnippet: ID of the code snippet extracted from the source
      • LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
      • note: Notes that provide additional context or information about the query or prompt
      • clone_detected: Boolean value that indicates whether a clone has been detected (1 = detected, 0 = not detected)
      • cloning_ratio: Ratio of lines of the generated code snippet that were detected as a clone in the code snippet extracted from the source
      • cloned_lines: The number of lines of the generated code snippet that were detected as a clone in the code snippet extracted from the source
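    The Cartesian-product construction of merged_llm_ext_link.csv mentioned above can be sketched in pandas as follows; the intermediate file names are hypothetical, and only the pairing logic is illustrated:

```python
# Sketch: pair every generated snippet with every source snippet.
import pandas as pd

gen = pd.read_csv("generated_snippets.csv")  # hypothetical intermediate file
src = pd.read_csv("source_snippets.csv")     # hypothetical intermediate file

# Per-query pairing: merging on a shared key with duplicate key values
# yields all combinations within that key.
pairs = gen.merge(src, on="ID_query", suffixes=("_gen", "_src"))
# For a full Cartesian product irrespective of query, use how="cross":
# pairs = gen.merge(src, how="cross")
pairs.to_csv("merged_llm_ext_link.csv", index=False)
```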

    cosine_sim: Cosine similarity results.

    • cosine_sim_output.csv: Contains the cosine similarity results (a computation sketch follows this list)
      • query_id: Query ID
      • snippet_id: ID of the generated code snippet
      • source_id: ID of the source
      • sourcesnippetid: ID of the code snippet extracted from the source
      • cosine_similarity: The cosine similarity between the generated code snippet and the code snippet extracted from the source
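    A minimal sketch of one way to compute such a score, using TF-IDF token vectors and scikit-learn's cosine_similarity; the replication package's actual vectorization may differ:

```python
# Sketch: cosine similarity between a generated snippet and a source snippet.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

generated = "def add(a, b):\n    return a + b"
source = "def add_numbers(x, y):\n    return x + y"

vec = TfidfVectorizer(token_pattern=r"\w+")  # crude code tokenization
m = vec.fit_transform([generated, source])
print(cosine_similarity(m[0], m[1])[0, 0])
```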

    quant_analysis: Quantitative analysis results.

    • topN_links_se.csv: Contains the top-N links extracted from the search engine.
      • id: Query ID
      • query: The query
      • url: Link URL

    • merged_clone_cosine.csv: Contains the merged results of the clone detection and cosine similarity.
      • ID_query: Query ID
      • query: The query
      • language: The programming language of the query
      • generated_snippet: The generated code snippet by the LLM-based assistant
      • IDgensnippet: The ID of the generated code snippet

  19. h

    llama2-sst2-fine-tuning

    • huggingface.co
    Updated Aug 2, 2023
    Cite
    Yifei (2023). llama2-sst2-fine-tuning [Dataset]. https://huggingface.co/datasets/OneFly7/llama2-sst2-fine-tuning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2023
    Authors
    Yifei
    Description

    Dataset Card for "llama2-sst2-finetuning"

      Dataset Description
    

    The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA V2 on the GLUE SST-2 sentiment classification task. We provide two subsets: training and validation. To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, where the data follows this format:
    [INST] <
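    The template line above is truncated on this page. For reference, a minimal sketch of the standard LLaMA V2 [INST]/<<SYS>> chat format applied to an SST-2 example; the system prompt and label strings are assumptions, not the dataset's exact text:

```python
# Sketch: format one SST-2 example in the LLaMA V2 instruction template.
def to_llama2_example(sentence: str, label: int) -> str:
    system = "You are a sentiment classifier."          # assumed system prompt
    answer = "positive" if label == 1 else "negative"   # assumed label strings
    return (f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
            f"Classify the sentiment of: {sentence} [/INST] {answer} </s>")

print(to_llama2_example("a gorgeous, witty, seductive movie", 1))
```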

  20. O

    5 Day Payment Target; Better Payment Practice Code (BPPC)

    • opalpro.cs.upb.de
    • cloud.csiss.gmu.edu
    • +2more
    Updated Jun 23, 2019
    Cite
    NHS Digital (2019). 5 Day Payment Target; Better Payment Practice Code (BPPC) [Dataset]. http://opalpro.cs.upb.de/zh_CN/dataset/groups/5_day_payment_target_better_payment_practice_code_bppc_
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/csv (available download formats)
    Dataset updated
    Jun 23, 2019
    Dataset provided by
    NHS Digital
    License

    Open Government Licence: http://reference.data.gov.uk/id/open-government-licence

    Description

    The NHS Information Centre 5 Day Payment Target; Better Payment Practice Code (BPPC)

    This information shows the Additional Monitor Returns Report (prompt payment: analysis of the duration between invoice receipt and invoice payment, in working days).
