26 datasets found
  1. Data from: Analyzing student prompts and their effect on ChatGPT’s...

    • tandf.figshare.com
    txt
    Updated Dec 12, 2024
    Cite
    Ghadeer Sawalha; Imran Taj; Abdulhadi Shoufan (2024). Analyzing student prompts and their effect on ChatGPT’s performance [Dataset]. http://doi.org/10.6084/m9.figshare.26970708.v1
    Available download formats: txt
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Ghadeer Sawalha; Imran Taj; Abdulhadi Shoufan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large language models present new opportunities for teaching and learning. The response accuracy of these models, however, is believed to depend on the prompt quality which can be a challenge for students. In this study, we aimed to explore how undergraduate students use ChatGPT for problem-solving, what prompting strategies they develop, the link between these strategies and the model’s response accuracy, the existence of individual prompting tendencies, and the impact of gender in this context. Our students used ChatGPT to solve five problems related to embedded systems and provided the solutions and the conversations with this model. We analyzed the conversations thematically to identify prompting strategies and applied different quantitative analyses to establish relationships between these strategies and the response accuracy and other factors. The findings indicate that students predominantly employ three types of prompting strategies: single copy-and-paste prompting (SCP), single reformulated prompting (SRP), and multiple-question prompting (MQP). ChatGPT’s response accuracy using SRP and MQP was significantly higher than using SCP, with effect sizes of -0.94 and -0.69, respectively. The student-by-student analysis revealed some tendencies. For example, 26 percent of the students consistently copied and pasted the questions into ChatGPT without any modification. Students who used MQP showed better performance in the final exam than those who did not use this prompting strategy. As for gender, female students tended to make extensive use of SCP, whereas male students tended to mix SCP and MQP. We conclude that students develop different prompting strategies that lead to different response qualities and learning. More research is needed to deepen our understanding and inform effective educational practices in the AI era.
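    The effect sizes quoted above read like standardized mean differences (Cohen's d). As a reference for how such a value is computed, here is a minimal sketch using a pooled standard deviation; the accuracy arrays are invented for illustration and are not the study's data.

    ```python
    # Minimal sketch of Cohen's d with pooled standard deviation.
    # The accuracy samples below are invented for illustration only.
    import numpy as np

    def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)

    scp = np.array([0.40, 0.55, 0.35, 0.50, 0.45])  # accuracy under SCP (invented)
    srp = np.array([0.70, 0.80, 0.65, 0.75, 0.85])  # accuracy under SRP (invented)
    print(round(cohens_d(scp, srp), 2))  # negative value: SCP accuracy below SRP
    ```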

  2. awesome-chatgpt-prompts

    • huggingface.co
    Updated Dec 9, 2022
    Cite
    Fatih Kadir Akın (2022). awesome-chatgpt-prompts [Dataset]. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2022
    Authors
    Fatih Kadir Akın
    License

    CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/

    Description

    🧠 Awesome ChatGPT Prompts [CSV dataset]

    This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.

    License

    CC-0
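    For quick use in Python, the dataset can be pulled with the Hugging Face datasets library. A minimal sketch, assuming the repository's CSV columns act and prompt:

    ```python
    # Minimal sketch: load the prompts via the Hugging Face `datasets` library.
    # Assumes the CSV columns `act` and `prompt` of this repository.
    from datasets import load_dataset

    ds = load_dataset("fka/awesome-chatgpt-prompts", split="train")
    for row in ds.select(range(3)):  # preview the first three entries
        print(row["act"], "->", row["prompt"][:60])
    ```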

  3. [WSESE] [Prompt Engineering in Data Analysis] Included and Excluded Papers

    • figshare.com
    xlsx
    Updated Feb 2, 2025
    Cite
    Lucas Valença; Ronnie de Souza Santos; Reydne Santos; Matheus de Morais Leça (2025). [WSESE] [Prompt Engineering in Data Analysis] Included and Excluded Papers [Dataset]. http://doi.org/10.6084/m9.figshare.28326737.v6
    Available download formats: xlsx
    Dataset updated
    Feb 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Lucas Valença; Ronnie de Souza Santos; Reydne Santos; Matheus de Morais Leça
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context. The use of large language models for qualitative analysis is gaining attention in various fields, including software engineering, where qualitative methods are essential to understanding human and social factors. Goal. This study aimed to investigate how LLMs are currently used in qualitative analysis and how they can be used in software engineering research, focusing on identifying the benefits, limitations, and practices associated with their application. Method. We conducted a systematic mapping study and analyzed 21 relevant studies to explore how the use of LLMs for qualitative analysis is reported in the literature. Findings. Our findings indicate that LLMs are primarily used for tasks such as coding, thematic analysis, and data categorization, with benefits including increased efficiency and support for new researchers. However, limitations such as output variability, challenges capturing nuanced perspectives, and ethical concerns regarding privacy and transparency were also evident. Discussion. The study highlights the need for structured strategies and guidelines to optimize LLM use in qualitative research within software engineering. Such strategies could enhance the effectiveness of LLMs while addressing ethical considerations. Conclusion. While LLMs show promise for supporting qualitative analysis, human expertise remains essential for data interpretation, and continued exploration of best practices will be crucial for their effective integration into empirical software engineering research.

  4. Data Analytics Consulting Service Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    Cite
    Data Insights Market (2025). Data Analytics Consulting Service Report [Dataset]. https://www.datainsightsmarket.com/reports/data-analytics-consulting-service-1458356
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Analytics Consulting Services market is experiencing robust growth, driven by the increasing adoption of data-driven decision-making across various industries. The market, estimated at $150 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033, reaching approximately $400 billion by 2033. This expansion is fueled by several key factors. The surge in data volume and variety necessitates specialized expertise in data analytics, pushing organizations to seek professional consulting services. Furthermore, the growing need for advanced analytics techniques, including predictive modeling, machine learning, and AI, is driving demand for sophisticated consulting solutions. The rise of cloud computing and big data technologies is also contributing to market growth by enabling easier data storage, processing, and analysis. Finally, regulatory compliance requirements, such as GDPR and CCPA, are prompting businesses to invest in data governance and analytics consulting to ensure data security and privacy.

    The competitive landscape is characterized by a mix of large multinational consulting firms (Accenture, Deloitte, EY, PwC, McKinsey, BCG) and specialized data analytics consultancies (DataArt, Infosys, Appnovation, InData Labs, etc.). These firms offer a wide range of services, including data strategy development, data warehousing and integration, business intelligence implementation, advanced analytics solutions, and data visualization services.

    While significant growth is anticipated, challenges remain. These include the shortage of skilled data scientists and analysts, the complexity of integrating various data sources, and the need for robust data security measures. The market is segmented based on various factors such as service type, industry vertical, and geographic region, allowing firms to target specific niches and maximize their market penetration. The North American market currently holds the largest market share, followed by Europe and Asia Pacific, but growth in emerging economies is expected to be substantial in the coming years.

  5. Replication Package of the paper "Large Language Models for Multilingual...

    • zenodo.org
    zip
    Updated Mar 14, 2025
    Cite
    Anonymous; Anonymous (2025). Replication Package of the paper "Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality" [Dataset]. http://doi.org/10.5281/zenodo.15028641
    Available download formats: zip
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality

    Abstract

    Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain mixes of comments and literals written in English and the language used to formulate the prompt.

    Replication Package

    This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.

    Data

    The data directory contains six subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:

    1. prompt_translation: Contains files with manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The original English prompt.
      • summary: The original summary.
      • code: The original code.
      • translation: The translation generated by GPT.
      • correction: The manual correction of the GPT-generated translation.
      • correction_tag: A list of tags indicating the corrections made to the translation.
      • generated_code: This column is initially empty and will contain the code generated from the translated prompt.
    2. generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains the following:

      • files: The files with the generated code (named by the query ID).
      • report: Reports generated by static analysis tools.
      • A CSV file (e.g., java_chinese_claude.csv) containing the generated code in the corresponding column.
    3. tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.

    4. quantitative_analysis: Contains all the CSV reports of the static analysis tools and the test output for all languages and models. These files are the inputs for the statistical analysis. The directory stats contains all the output tables for the statistical analysis, which are shown in the paper's tables.

    5. qualitative_analysis: Contains files used for the qualitative analysis:

      • Cohen_Kappa_agreement.csv: A file containing the subset used to compute Cohen's kappa metrics for manual analysis.
      • files: Contains all files for the qualitative analysis. Each file has the following columns:
        • id: The ID of the query in the CoderEval benchmark.
        • generated_code: The code generated by the model.
        • comments: The language used for comments.
        • identifiers: The language used for identifiers.
        • literals: The language used for literals.
        • notes: Additional notes.
    6. ablation_study: Contains files for the ablation study. Each file has the following columns:

      • id: The ID of the query in the CoderEval benchmark.
      • prompt: The prompt used for code generation.
      • generated_code, comments, identifiers, and literals: Same as in the qualitative analysis.
      • results.pdf: This file shows the table containing all the percentages of comments, identifiers, and literals extracted from the CSV files of the ablation study.

      Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:

    You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
    Use a Python code block to write your response.
    Comments and identifiers must be in Italian. 
    For example:
    ```python
    print("Hello World!")
    ```
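    For orientation, here is a minimal sketch of inspecting one of the prompt_translation CSV files with pandas; the file path is hypothetical, while the column names are those documented above.

    ```python
    # Minimal sketch: inspect a translated-prompts CSV with pandas.
    # The path is hypothetical; the columns follow the list documented above.
    import pandas as pd

    df = pd.read_csv("data/1_prompt_translation/python_italian.csv")
    print(df.columns.tolist())  # id, prompt, summary, code, translation, ...
    print(df[["id", "translation", "correction_tag"]].head())
    ```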

    Scripts

    The scripts directory contains all the scripts used to perform the generation and analysis steps. All files are properly commented. Here is a brief description of each file:

    • code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.

    • compute_all_analysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.

    • create_test_java.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.

    • deepseek_model.py: This script defines a function that sends a request to the DeepSeek API, passing a system and user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.

    • extract_pmd_report.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.

    • flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.

    • generate_prediction_claude_java.py: The generate_code_from_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it is JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.

    • generate_prediction_claude_python.py: This code defines a function generate_code_from_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are
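    To illustrate the pattern these analysis scripts follow, here is a minimal, hypothetical sketch of running Flake8 over a directory and summarizing the findings in a CSV, in the spirit of flake_analysis.py (not the package's actual code):

    ```python
    # Hypothetical sketch in the spirit of flake_analysis.py: run Flake8 and
    # summarize errors per file in a CSV. Not the replication package's code.
    import csv
    import subprocess

    def flake_report(src_dir: str, out_csv: str) -> None:
        # Flake8's default output format is "path:line:col: CODE message".
        proc = subprocess.run(["flake8", src_dir], capture_output=True, text=True)
        with open(out_csv, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["file", "line", "code", "message"])
            for line in proc.stdout.splitlines():
                path, lineno, _col, rest = line.split(":", 3)
                code, _, message = rest.strip().partition(" ")
                writer.writerow([path, lineno, code, message])

    flake_report("generation/python_italian_gpt/files", "flake8_report.csv")
    ```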

  6. Example prompts, their task-related features, and their assigned complexity...

    • plos.figshare.com
    xls
    Updated Feb 21, 2025
    Cite
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin (2025). Example prompts, their task-related features, and their assigned complexity values. [Dataset]. http://doi.org/10.1371/journal.pone.0317084.t001
    Available download formats: xls
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example prompts, their task-related features, and their assigned complexity values.

  7. Can Developers Prompt? A Controlled Experiment for Code Documentation...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 11, 2024
    Cite
    Maalej, Walid (2024). Can Developers Prompt? A Controlled Experiment for Code Documentation Generation [Replication Package] [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13127237
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Kruse, Hans-Alexander
    Puhlfürß, Tim
    Maalej, Walid
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Summary of Artifacts

    This is the replication package for the paper titled 'Can Developers Prompt? A Controlled Experiment for Code Documentation Generation' that is part of the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME), from October 6 to 11, 2024, located in Flagstaff, AZ, USA.

    Full Abstract

    Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Especially students perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.
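    For context, the control group's predefined few-shot prompt pairs example inputs with target outputs before the actual task. The sketch below is hypothetical and only illustrates the pattern; the study's actual few shots are listed in few_shots.txt in the appendix folder.

    ```python
    # Hypothetical few-shot prompt for docstring generation, illustrating the
    # pattern only; the study's real few shots are in few_shots.txt.
    FEW_SHOT_PROMPT = '''Write a concise docstring for the last function.

    def add(a: int, b: int) -> int:
        return a + b

    Docstring: """Return the sum of a and b."""

    def {function_to_document}:
        {body}

    Docstring:'''
    ```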

    Author Information

    | Name | Affiliation | Email |
    |------|-------------|-------|
    | Hans-Alexander Kruse | Universität Hamburg | hans-alexander.kruse@studium.uni-hamburg.de |
    | Tim Puhlfürß | Universität Hamburg | tim.puhlfuerss@uni-hamburg.de |
    | Walid Maalej | Universität Hamburg | walid.maalej@uni-hamburg.de |

    Citation Information

    @inproceedings{kruse-icsme-2024,
      author    = {Kruse, Hans-Alexander and Puhlf{\"u}r{\ss}, Tim and Maalej, Walid},
      booktitle = {2024 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
      title     = {Can Developers Prompt? A Controlled Experiment for Code Documentation Generation},
      year      = {2024},
      doi       = {tba},
    }

    Artifacts Overview

    1. Preprint

    The file kruse-icsme-2024-preprint.pdf is the preprint version of the official paper. You should read the paper in detail to understand the study, especially its methodology and results.

    2. Results

    The folder results includes two subfolders, explained in the following.

    Demographics RQ1 RQ2

    The subfolder Demographics RQ1 RQ2 provides the Jupyter Notebook file evaluation.ipynb for analyzing (1) the experiment participants' submissions of the digital survey and (2) the ad-hoc prompts that the experimental group entered into their tool. Hence, this file provides demographic information about the participants and results for research questions 1 and 2. Please refer to the README file inside this subfolder for installation steps of the Jupyter Notebook file.

    RQ2

    The subfolder RQ2 contains further subfolders with Microsoft Excel files specific to the results of research question 2:

    The subfolder UEQ contains three copies of the official User Experience Questionnaire (UEQ) analysis Excel tool, with data entered from all participants, students, and professionals, respectively.

    The subfolder Open Coding contains three Excel files with the open-coding results for the free-text answers that participants could enter at the end of the survey to state additional positive and negative comments about their experience during the experiment. The Consensus file provides the finalized version of the open coding process.

    3. Extension

    The folder extension contains the code of the Visual Studio Code (VS Code) extension developed in this study to generate code documentation with predefined prompts. Please refer to the README file inside the folder for installation steps. Alternatively, you can install the deployed version of this tool, called Code Docs AI, via the VS Code Marketplace.

    You can install the tool to generate code documentation with ad-hoc prompts directly via the VS Code Marketplace. We did not include the code of this extension in this replication package due to license conflicts (GNUv3 vs. MIT).

    4. Survey

    The folder survey contains PDFs of the digital survey in two versions:

    The file Survey.pdf contains the rendered version of the survey (how it was presented to participants).

    The file SurveyOptions.pdf is an export of the LimeSurvey web platform. Its main purpose is to provide the technical answer codes, e.g., AO01 and AO02, that refer to the rendered answer texts, e.g., Yes and No. This can help you if you want to analyze the CSV files inside the results folder (instead of using the Jupyter Notebook file), as the CSVs contain the answer codes, not the answer texts. Please note that an export issue caused page 9 to be almost blank. However, this problem is negligible as the question on this page only contained one free-text answer field.

    5. Appendix

    The folder appendix provides additional material about the study:

    The subfolder tool_screenshots contains screenshots of both tools.

    The file few_shots.txt lists the few shots used for the predefined prompt tool.

    The file test_functions.py lists the functions used in the experiment.

    Revisions

    Version Changelog

    1.0.0 Initial upload

    1.1.0 Add paper preprint. Update abstract.

    1.2.0 Update replication package based on ICSME Artifact Track reviews

    License

    See LICENSE file.

  8. Results of chi-square test testing all different prompting strategies over...

    • plos.figshare.com
    xls
    Updated Feb 21, 2025
    Cite
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin (2025). Results of chi-square test testing all different prompting strategies over the various complexities. [Dataset]. http://doi.org/10.1371/journal.pone.0317084.t002
    Available download formats: xls
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jacqueline A. Jansen; Artür Manukyan; Nour Al Khoury; Altuna Akalin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results of chi-square test testing all different prompting strategies over the various complexities.
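    To illustrate the kind of test reported here, a minimal sketch of a chi-square test over a prompting-strategies-by-complexity contingency table; the counts are invented, and scipy's chi2_contingency stands in for the authors' exact procedure:

    ```python
    # Minimal sketch of a chi-square test over prompting strategies (rows)
    # vs. task complexity levels (columns). Counts are invented.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[30, 22, 10],
                      [18, 25, 15],
                      [12, 20, 28]])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
    ```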

  9. Cookie Tracking Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Cookie Tracking Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/cookie-tracking-software-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Cookie Tracking Software Market Outlook



    The global cookie tracking software market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach around USD 4.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 16.8% during the forecast period. This growth is driven by increasing digitalization, heightened demand for personalized marketing, and stringent data privacy regulations. Companies are investing heavily in technologies that can help them track user behavior, optimize user experiences, and ensure compliance with evolving privacy laws, which fuels market growth.



    One of the primary growth factors for the cookie tracking software market is the increasing emphasis on personalized marketing. As companies strive to offer more tailored user experiences, they require sophisticated tools to collect and analyze user data. Cookie tracking software enables businesses to capture detailed insights into user preferences and behaviors, allowing them to deliver personalized content and advertisements. This capability significantly enhances customer engagement and conversion rates, making it a critical component for digital marketing strategies.



    Another contributing factor is the rising implementation of data privacy regulations worldwide. Legislation such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States has necessitated more transparent and secure data tracking practices. Cookie tracking software helps organizations comply with these regulations by providing features like consent management and data anonymization. This ensures that businesses can continue to leverage user data while adhering to legal requirements, thereby mitigating the risk of hefty fines and reputational damage.



    The growing adoption of digital platforms during the COVID-19 pandemic has further accelerated the demand for cookie tracking software. With an increasing number of consumers shifting to online shopping and remote work environments, businesses have had to ramp up their digital presence. This surge in digital activity has underscored the importance of effective user tracking and data analysis, prompting more companies to invest in advanced cookie tracking solutions to better understand and cater to their online audiences.



    In the realm of digital marketing, Subscription Analytics Software is becoming increasingly vital as businesses transition to subscription-based models. This software provides companies with the tools to analyze customer subscription data, offering insights into customer behavior, preferences, and churn rates. By leveraging these insights, businesses can optimize their subscription offerings, tailor marketing strategies, and enhance customer retention. The integration of subscription analytics with cookie tracking software can further enrich data collection, enabling a more comprehensive understanding of user interactions and preferences. As the subscription economy continues to grow, the demand for robust analytics solutions that can seamlessly integrate with existing digital marketing tools is expected to rise.



    Regionally, North America is expected to hold a significant share of the cookie tracking software market, driven by the presence of major technology companies and high internet penetration rates. Europe is also anticipated to see robust growth due to stringent data protection regulations and widespread digitalization initiatives. Meanwhile, the Asia Pacific region is projected to experience the fastest growth, fueled by rapid economic development, increasing internet usage, and the proliferation of e-commerce platforms in countries like China and India.



    Component Analysis



    The cookie tracking software market can be segmented by component into software and services. The software segment, which includes various types of cookie tracking applications and platforms, is expected to dominate the market. This is largely due to the continuous advancements in technology and the increasing need for sophisticated tools to analyze vast amounts of user data. Companies are constantly seeking robust software solutions that can provide real-time insights and seamless integration with other digital marketing tools.



    Within the software segment, several sub-categories exist, including standalone cookie tracking software and integrated solutions that form part of broader digital marketing platforms. Sta

  10. PROSPECT: Professional Role Effects on Specialized Perspective Enhancement...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Dec 29, 2024
    Cite
    Keisuke Sato (2024). PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task [Dataset]. http://doi.org/10.5281/zenodo.14567800
    Available download formats: zip
    Dataset updated
    Dec 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Keisuke Sato
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 29, 2024
    Description

    ### Data Availability Statement (for the paper)

    All dialogue logs and final responses collected in this study are publicly available in the PROSPECT repository on Zenodo (DOI: [to be assigned]). The repository contains PDF files of complete dialogue histories and Markdown files of final comprehensive analyses for all conditions and models used in this study, allowing for reproducibility and further analysis.

    ### README.md for Zenodo

    # PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task

    ## Overview
    This repository (PROSPECT) contains the dataset associated with the paper:
    > "Empirical Investigation of Expertise, Multiperspectivity, and Abstraction Induction in Conversational AI Outputs through Professional Role Assignment to Both User and AI"

    This research analyzed changes in dialogue logs and final responses when professional roles were assigned to both user and AI sides across multiple Large Language Models (LLMs). This repository provides the complete dialogue logs (PDF format) and final responses (Markdown format) used in the analysis.

    ## Directory Structure
    The repository structure under the top directory (`PROSPECT/`) is as follows:

    ```
    PROSPECT/
    ├── dialogue/          # Dialogue histories (PDF)
    │   ├── none/
    │   ├── ai_only/
    │   ├── user_only/
    │   └── both/
    └── final_answers/     # Final responses (Markdown)
        ├── none/
        ├── ai_only/
        ├── user_only/
        └── both/
    ```

    - **dialogue/**
    - Contains raw dialogue logs in PDF format. Subdirectories represent role assignment conditions:
    - `none/`: No roles assigned to either user or AI
    - `ai_only/`: Role assigned to AI only
    - `user_only/`: Role assigned to user only
    - `both/`: Roles assigned to both user and AI
    - **final_answers/**
    - Contains final comprehensive analysis responses in Markdown format. Directory structure mirrors that of `dialogue/`.

    ## File Naming Convention
    Files in each directory follow this naming convention:
    ```
    [AI]_[conditionNumber]-[roleNumber].pdf
    [AI]_[conditionNumber]-[roleNumber].md
    ```
    - `[AI]`: AI model name used for dialogue (e.g., ChatGPT, ChatGPT-o1, Claude, Gemini)
    - `[conditionNumber]`: Number indicating role assignment condition
    - 0: none
    - 1: ai_only
    - 2: user_only
    - 3: both
    - `[roleNumber]`: Professional role number
    - 0: No role
    - 1: Detective
    - 2: Psychologist
    - 3: Artist
    - 4: Architect
    - 5: Natural Scientist

    ### Examples:
    - `ChatGPT_3-1.pdf`: Dialogue log with ChatGPT-4o model under "both" condition (3) with detective role (1)
    - `Gemini_1-4.md`: Final response from Gemini model under "ai_only" condition (1) with architect role (4)
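    A minimal sketch of decoding this naming convention in Python, using the condition and role mappings documented in this README:

    ```python
    # Minimal sketch: decode PROSPECT file names such as "ChatGPT_3-1.pdf"
    # using the condition and role mappings documented above.
    import re

    CONDITIONS = {0: "none", 1: "ai_only", 2: "user_only", 3: "both"}
    ROLES = {0: "No role", 1: "Detective", 2: "Psychologist",
             3: "Artist", 4: "Architect", 5: "Natural Scientist"}

    def decode(filename: str) -> dict:
        model, cond, role = re.match(r"(.+)_(\d)-(\d)\.(?:pdf|md)$", filename).groups()
        return {"model": model,
                "condition": CONDITIONS[int(cond)],
                "role": ROLES[int(role)]}

    print(decode("ChatGPT_3-1.pdf"))
    # {'model': 'ChatGPT', 'condition': 'both', 'role': 'Detective'}
    ```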

    ## Role Number Reference
    | roleNumber | Professional Role |
    |-----------:|:-----------------|
    | 0 | No role |
    | 1 | Detective |
    | 2 | Psychologist |
    | 3 | Artist |
    | 4 | Architect |
    | 5 | Natural Scientist|

    ## Data Description
    - **Dialogue Histories (PDF format)**
    Complete logs of questions and responses from each session, preserved as captured during the research. All dialogues were conducted in Japanese. While assistant version information is not included, implementation dates and model names are recorded within the files.
    - **Final Responses (Markdown format)**
    Excerpted responses to the final "comprehensive analysis request" as Markdown files, intended for text analysis and keyword extraction. All responses are in Japanese.

    Note: This dataset contains dialogues and responses exclusively in Japanese. Researchers interested in lexical analysis or content analysis should consider this language specification.

    ## How to Use
    1. Please maintain the folder hierarchy after downloading.
    2. For meta-analysis or lexical analysis, refer to PDFs for complete dialogues and Markdown files for final responses.
    3. Utilize for research reproduction, secondary analysis, or meta-analysis.

    ## License
    This dataset is released under the **CC BY 4.0** License.
    - Free to use and modify, but please cite this repository (DOI) and the associated paper when using the data.

    ## Related Publication


    ## Disclaimer
    - The dialogue logs contain no personal information or confidential data.
    - The provided logs and responses reflect the research timing; identical prompts may yield different responses due to AI model updates.
    - The creators assume no responsibility for any damages resulting from the use of this dataset.

    ## Contact
    For questions or requests, please contact skeisuke@ibaraki-ct.ac.jp.

  11. Global Laser Methane Telemetry Detection Module Market Industry Best...

    • statsndata.org
    excel, pdf
    Updated Jul 2025
    Cite
    Stats N Data (2025). Global Laser Methane Telemetry Detection Module Market Industry Best Practices 2025-2032 [Dataset]. https://www.statsndata.org/report/laser-methane-telemetry-detection-module-market-299604
    Available download formats: pdf, excel
    Dataset updated
    Jul 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Laser Methane Telemetry Detection Module market is witnessing significant growth as industries increasingly prioritize safety and environmental compliance. These modules utilize advanced laser technology to detect methane and other gases in real-time, providing critical data for prompt decision-making and risk m

  12. Data from: Can Large Language Models Identify Locations Better Than Linked...

    • portaldelaciencia.uva.es
    • zenodo.org
    Updated 2025
    Cite
    García-Zarza, Pablo; Asensio-Pérez, Juan I.; Bote-Lorenzo, Miguel L.; Sánchez-Turrión, Luis F.; Taibi, Davide; Vega-Gorgojo, Guillermo (2025). Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? [Dataset]. https://portaldelaciencia.uva.es/documentos/6856990b6364e456d3a65544
    Dataset updated
    2025
    Authors
    García-Zarza, Pablo; Asensio-Pérez, Juan I.; Bote-Lorenzo, Miguel L.; Sánchez-Turrión, Luis F.; Taibi, Davide; Vega-Gorgojo, Guillermo
    Description

    This repository contains the dataset and analysis associated with the research paper "Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning?", presented at the 20th European Conference on Technology Enhanced Learning (ECTEL), 2025.

    Overview

    Ubiquitous learning (u-learning) applications often rely on identifying relevant Points of Interest (POIs) where students can engage in contextualized learning tasks. Traditionally, these POIs have been retrieved from structured datasets like Linked Open Data (LOD). However, with the rise of Large Language Models (LLMs), a new question arises: can LLMs outperform LOD in identifying such locations?

    This study compares the performance of a LOD dataset (Wikidata) and two LLMs (ChatGPT and DeepSeek) in retrieving 16th-century cultural heritage sites (churches, cathedrals, castles, and palaces) across three European cities (two in Spain and one in Italy) and their regions.

    Dataset

    The file LODvsLLMs.xlsx includes:

    Raw data retrieved from Wikidata and the two LLMs.

    SPARQL queries and LLM prompts used for data collection.

    Comparative analysis across four key dimensions:

    Accuracy: Are the retrieved sites real and verifiable?

    Consistency: Do repeated queries yield stable results?

    Completeness: How exhaustive are the lists of POIs?

    Validity: Are the geographic coordinates and Wikipedia links correct?
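    As a concrete illustration of the LOD side of this comparison, here is a minimal sketch of querying Wikidata's SPARQL endpoint from Python for 16th-century church buildings in a city. The city QID is a placeholder; the study's actual SPARQL queries are provided in LODvsLLMs.xlsx.

    ```python
    # Minimal sketch: query Wikidata for 16th-century church buildings in a
    # city. The city QID is a placeholder; the study's actual queries are in
    # LODvsLLMs.xlsx.
    import requests

    CITY_QID = "Q8356"  # placeholder; substitute the QID of the target city

    QUERY = f"""
    SELECT ?poi ?poiLabel WHERE {{
      ?poi wdt:P31/wdt:P279* wd:Q16970 ;  # instance of (a kind of) church building
           wdt:P131 wd:{CITY_QID} ;       # located in the target city
           wdt:P571 ?inception .          # inception date
      FILTER(YEAR(?inception) >= 1501 && YEAR(?inception) <= 1600)
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "u-learning-poi-demo/0.1"})
    for row in resp.json()["results"]["bindings"]:
        print(row["poiLabel"]["value"])
    ```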

    Key Findings

    LOD (Wikidata) outperformed LLMs in terms of consistency, completeness (especially in larger regions), and validity of data.

    LLMs were able to retrieve some POIs not found in Wikidata, but also introduced hallucinations and invalid links.

    A hybrid approach combining LOD and LLMs is proposed for future u-learning applications to maximize coverage and reliability.

    Citation

    If you use this dataset or refer to the findings in your work, please cite the original paper presented at ECTEL 2025.

    García-Zarza, P., Asensio-Pérez, J.I., Bote-Lorenzo, M.L., Sánchez-Turrión, L.F., Taibi, D., Vega-Gorgojo, G.: Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? In: Proceedings of the 20th European Conference on Technology Enhanced Learning (ECTEL 2025), Newcastle & Durham, United Kingdom, September 2025.

  13. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to...

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Aug 1, 2025
    Cite
    Scott McGrath (2025). A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data [Dataset]. http://doi.org/10.5061/dryad.s4mw6m9cv
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Scott McGrath
    Time period covered
    Jan 1, 2023
    Description

    Objective: Our objective is to evaluate the efficacy of ChatGPT 4 in accurately and effectively delivering genetic information, building on previous findings with ChatGPT 3.5. We focus on assessing the utility, limitations, and ethical implications of using ChatGPT in medical settings. Materials and Methods: A structured questionnaire, including the Brief User Survey (BUS-15) and custom questions, was developed to assess ChatGPT 4's clinical value. An expert panel of genetic counselors and clinical geneticists independently evaluated ChatGPT 4's responses to these questions. We also involved comparative analysis with ChatGPT 3.5, utilizing descriptive statistics and using R for data analysis. Results: ChatGPT 4 demonstrated improvements over 3.5 in context recognition, relevance, and informativeness. However, performance variability and concerns about the naturalness of the output were noted. No significant difference in accuracy was found between ChatGPT 3.5 and 4.0. Notably, the effic...

    Study Design: This study was conducted to evaluate the performance of ChatGPT 4 (March 23rd, 2023 model) in the context of genetic counseling and education. The evaluation involved a structured questionnaire, which included questions selected from the Brief User Survey (BUS-15) and additional custom questions designed to assess the clinical value of ChatGPT 4's responses.

    Questionnaire Development: The questionnaire was built on Qualtrics and comprised twelve questions: seven selected from the BUS-15, preceded by two additional questions that we designed. The initial questions focused on quality and answer relevancy:

    1. The overall quality of the Chatbot's response is: (5-point Likert: Very poor to Very good)
    2. The Chatbot delivered an answer that provided the relevant information you would include if asked the question. (5-point Likert: Strongly disagree to Strongly agree)

    The BUS-15 questions (7-point Likert: Strongly disagree to Strongly agree) focused on:

    1. Recogniti...

    # A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data

    https://doi.org/10.5061/dryad.s4mw6m9cv

    This data was captured when evaluating the ability of ChatGPT to address questions patients may ask about three genetic conditions (BRCA1, HFE, and MLH1). The data is associated with the similarly named JAMIA article with the DOI 10.1093/jamia/ocae128.

    Description of the data and file structure

    1. Key: This tab contains the data structure, explaining the survey questions, and potential responses available.
    2. Prompt Responses: This tab contains the prompts used for ChatGPT, and the response provided from each model (3.5 and 4)
    3. GPT 4 Results: This tab provides the responses collected from the medical experts (genetic counselors and clinical geneticist) from the Qualtrics survey.
    4. Accuracy (Qx_1): This tab contains the subset of results from both the Ch...
  14. Oil and Gas Pipeline Leak Detection Market Report | Global Forecast From...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Oil and Gas Pipeline Leak Detection Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/oil-and-gas-pipeline-leak-detection-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Oil and Gas Pipeline Leak Detection Market Outlook



    The global oil and gas pipeline leak detection market size is projected to experience significant growth, with an expected valuation rising from USD 2.37 billion in 2023 to USD 3.89 billion by 2032, reflecting a healthy compound annual growth rate (CAGR) of 5.6% from 2024 to 2032. This market expansion is largely fueled by the increasing emphasis on safety and environmental regulations, the growing complexity of pipeline networks, and the dire need for efficient and reliable leak detection systems. As governments and organizations worldwide become more aware of and committed to reducing the environmental impacts of fossil fuel extraction and transportation, the demand for advanced leak detection technologies has intensified, driving market growth.



    One of the primary factors contributing to the growth of the oil and gas pipeline leak detection market is the stringent regulatory frameworks being implemented globally to prevent environmental disasters. These regulations mandate the installation of sophisticated leak detection systems to minimize the risk of oil spills and gas leaks, which can have catastrophic environmental and economic consequences. The increasing public awareness and pressure on governments to ensure the safety and integrity of oil and gas infrastructure have also played a crucial role in driving the market's expansion. Furthermore, the adoption of best practices and international standards in pipeline monitoring and maintenance is further propelling the demand for innovative and reliable leak detection technologies.



    Technological advancements in the oil and gas industry have paved the way for the development of more efficient and accurate leak detection systems. Innovations such as acoustic/ultrasonic sensors, fiber optic technologies, and advanced data analytics are improving the precision and reliability of leak detection, thereby reducing operational risks and potential losses. The integration of Internet of Things (IoT) and artificial intelligence (AI) in pipeline monitoring systems enhances real-time data collection and analysis, enabling prompt detection and response to leaks. These cutting-edge technologies are not only enhancing the effectiveness of leak detection but also reducing the overall costs associated with pipeline monitoring and maintenance, making them increasingly attractive to oil and gas companies.



    The growing global energy demand and the expansion of oil and gas pipeline networks, especially in emerging economies, are also driving the need for efficient leak detection systems. As countries endeavor to secure their energy supply and improve infrastructure, significant investments are being made in the construction and maintenance of extensive pipeline networks. This expansion necessitates robust leak detection solutions to ensure the safe and efficient transportation of oil and gas resources. Additionally, the shift towards unconventional oil and gas resources, such as shale gas and deepwater drilling, presents new challenges in leak detection, further increasing the demand for advanced technologies.



    Pipeline Leak Detectors play a crucial role in ensuring the safety and efficiency of oil and gas transportation. These detectors are designed to identify leaks quickly and accurately, minimizing the risk of environmental damage and economic loss. By utilizing advanced technologies such as acoustic sensors and fiber optics, pipeline leak detectors can provide real-time monitoring and immediate alerts, allowing operators to respond swiftly to any potential issues. This capability is particularly important in complex pipeline networks, where undetected leaks can lead to significant operational challenges. As the industry continues to evolve, the integration of pipeline leak detectors with digital technologies like AI and IoT is enhancing their effectiveness, offering more precise detection and predictive maintenance capabilities.



    Technology Analysis



    The technology segment of the oil and gas pipeline leak detection market encompasses various sophisticated systems, each offering unique advantages in detecting leaks with precision. Acoustic/ultrasonic technology, for instance, stands out for its ability to detect leaks through sound waves. This method is particularly effective in situations where traditional methods may fall short, as it can monitor for changes in noise levels along pipeline routes, indicating potential leaks. The sensitivity of acoustic/ultrasonic systems to sound variations makes th

  15. Data Center Construction Market in Southeast Asia by Construction Components...

    • technavio.com
    pdf
    Updated May 31, 2021
    Cite
    Technavio (2021). Data Center Construction Market in Southeast Asia by Construction Components and Geography - Forecast and Analysis 2021-2025 [Dataset]. https://www.technavio.com/report/data-center-construction-market-industry-in-southeast-asia-analysis
    Available download formats: pdf
    Dataset updated
    May 31, 2021
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2020 - 2025
    Area covered
    South East Asia
    Description


    The data center construction market in Southeast Asia is expected to grow by USD 3.61 billion and record a CAGR of 12% during 2021-2025. This post-pandemic report has assessed the shift in consumer behavior and has identified and explored the upcoming trends and drivers that vendors can capitalize on to support prompt business decisions. In this analysis report, key drivers such as the increase in investment in data centers have been discussed together with emerging growth regions, which will offer immense business opportunities. Our analysts have also identified challenges such as system integration and interoperability issues, which will impede market growth. With these insights, vendors can recreate their plan of action to obtain growth opportunities in the future. This report further entails segmentation by geography (Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia) and construction component (electrical construction, mechanical construction, consulting and other services, and integrating software). The available actionable insights on the segmentations will enable a better understanding of the target audience and changing demand patterns.

    Who are the Key Vendors in the Data Center Construction Market In Southeast Asia?

    The data center construction market in Southeast Asia forecast report provides insights on complete key vendor profiles and their business strategies to reimagine themselves. The profiles include information on the production, competitive landscape, sustainability, and prospects of the leading companies, including:

    ABB Ltd.
    AECOM
    Eaton Corporation Plc
    Hewlett Packard Enterprise Development LP
    Legrand SA 
    M+W Group GmbH
    Ove Arup & Partners International Ltd.
    Rittal GmbH & Co. KG
    Schneider Electric SE
    Vertiv Holdings Co.
    

    Our analysts have extensively outlined successful business strategies deployed by the key vendors in this market research report. The data center construction market in Southeast Asia is fragmented, and the vendors are deploying various organic and inorganic growth strategies to compete in the market.

    To make the most of the opportunities, vendors should focus on fast-growing segments, while maintaining their positions in the slow-growing segments. The report further offers well-structured marketing strategies to overcome the negative post-COVID-19 impact, if any, on each product and service segment.

    Which are the Key Regional Markets for the Data Center Construction Market in Southeast Asia?

    The report offers an up-to-date analysis of the geographical composition of the market. Singapore will record a fast growth rate during 2021-2025, owing to which the region should offer several growth opportunities to market vendors. The rise in IoT solutions will significantly influence market growth in this region. From the statistical study of the geographic landscape, you can interpret and understand the competitive intelligence and regional opportunities in store for vendors for 2021-2025.

    35% of the market's growth will originate from Singapore during the forecast period. Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia are the key regional markets. This report provides estimations of the contribution of all regions to the growth of the data center construction market in Southeast Asia.

    Data Center Construction Market in Southeast Asia Scope

    | Report Coverage | Details |
    |---|---|
    | Page number | 120 |
    | Base year | 2020 |
    | Forecast period | 2021-2025 |
    | Growth momentum & CAGR | Accelerate at a CAGR of 12% |
    | Market growth 2021-2025 | USD 3.61 billion |
    | Market structure | Fragmented |
    | YoY growth (%) | 9.45 |
    | Regional analysis | Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia |
    | Performing market contribution | Singapore at 35% |
    | Key consumer countries | Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia |
    | Competitive landscape | Leading companies, competitive strategies, consumer engagement scope |
    | Companies profiled | ABB Ltd., AECOM, Eaton Corporation Plc, Hewlett Packard Enterprise Development LP, Legrand SA, M+W Group GmbH, Ove Arup & Partners International Ltd., Rittal GmbH & Co. KG, Schneider Electric SE, and Vertiv Holdings Co. |
    | Market Dynamics | Parent market a |
    
  16. Z

    Geoparsing with Large Language Models: Leveraging the linguistic...

    • data.niaid.nih.gov
    Updated Oct 2, 2024
    Cite
    Anonymous, Anonymous (2024). Geoparsing with Large Language Models: Leveraging the linguistic capabilities of generative AI to improve geographic information extraction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13862654
    Explore at:
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    Anonymous, Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geoparsing with Large Language Models

    The .zip file included in this repository contains all the code and data required to reproduce the results from our paper. Note, however, that in order to run the OpenAI models, users will require an OpenAI API key and sufficient API credits.

    Data

    The data used for the paper are in the datasets and results folders.

    **Datasets:** This contains the XML files (LGL and GeoVirus) and JSON files (News2024) used to benchmark the models. It also contains all the data used to fine-tune the GPT-3.5 model, the prompt templates sent to the LLMs, and other data used for mapping and data creation.

    **Results:** This contains the results for the models on the three datasets. The folder is separated by dataset, with a single .csv file giving the results for each model on each dataset. Each row contains a predicted toponym and its associated true toponym (along with assigned spatial coordinates) when the model correctly identified a toponym; otherwise, the true-toponym columns are empty for false positives and the predicted columns are empty for false negatives. A scoring sketch based on this structure follows.
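    Given that row structure, toponym-extraction precision, recall, and F1, as well as the resolution accuracy within 161 km quoted in the abstract below, can be scored directly from one of these .csv files. A minimal sketch, assuming hypothetical column names and a hypothetical file path, since the exact headers are not listed here:

```python
# Minimal scoring sketch for one per-row results file (column names assumed).
import pandas as pd
from geopy.distance import geodesic

df = pd.read_csv("results/lgl/gpt-4o.csv")  # hypothetical path

# Rows with both toponyms are true positives; an empty true-toponym column
# marks a false positive, an empty predicted column a false negative.
tp = df.dropna(subset=["predicted_toponym", "true_toponym"])
fp = int(df["true_toponym"].isna().sum())
fn = int(df["predicted_toponym"].isna().sum())

precision = len(tp) / (len(tp) + fp)
recall = len(tp) / (len(tp) + fn)
f1 = 2 * precision * recall / (precision + recall)

# Toponym resolution: share of true positives resolved within 161 km.
within = tp.apply(
    lambda r: geodesic((r["true_lat"], r["true_lon"]),
                       (r["pred_lat"], r["pred_lon"])).km <= 161,
    axis=1,
)
print(f"F1={f1:.3f}, accuracy@161km={within.mean():.3f}")
```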

    Code

    The code is split into two separate folders: gpt_geoparser and notebooks.

    **GPT_Geoparser:** This contains the classes and methods used to process the XML and JSON articles (data.py), interact with the Nominatim API for geocoding (gazetteer.py), interact with the OpenAI API (gpt_handler.py), process the outputs from the GPT models (geoparser.py), and analyse the results (analysis.py).
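    For illustration, the geocoding step in gazetteer.py can be approximated with Geopy's Nominatim wrapper (Geopy appears in the requirements below); this is a sketch of the general pattern, not the package's exact code:

```python
# Sketch: geocode an extracted toponym with Nominatim via Geopy.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="gpt-geoparser-demo")  # identify your app
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # respect rate limits

loc = geocode("Kuala Lumpur")
if loc is not None:
    print(loc.latitude, loc.longitude, loc.raw.get("type"))
```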

    **Notebooks:** This series of notebooks can be used to reproduce the results given in the paper. The file names are reasonably descriptive of what they do within the context of the paper.

    Code/software

    Requirements

    Numpy

    Pandas

    Geopy

    Scikit-learn

    lxml

    openai

    matplotlib

    Contextily

    Shapely

    Geopandas

    tqdm

    huggingface_hub

    GNews

    Access information

    Other publicly accessible locations of the data:

    The LGL and GeoVirus datasets can also be obtained here.

    Abstract

    Geoparsing, the process of associating textual data with geographic locations, is a key challenge in natural language processing. The often ambiguous and complex nature of geospatial language makes geoparsing a difficult task, requiring sophisticated language modelling techniques. Recent developments in Large Language Models (LLMs) have demonstrated their impressive capability in natural language modelling, suggesting their suitability for a wide range of complex linguistic tasks. In this paper, we evaluate the performance of four LLMs - GPT-3.5, GPT-4o, Llama-3.1-8b and Gemma-2-9b - in geographic information extraction by testing them on three geoparsing benchmark datasets: GeoVirus, LGL, and a novel dataset, News2024, composed of geotagged news articles published outside the models' training window. We demonstrate that, through techniques such as fine-tuning and retrieval-augmented generation, LLMs significantly outperform existing geoparsing models. The best performing models achieve a toponym extraction F1 score of 0.985 and toponym resolution accuracy within 161 km of 0.921. Additionally, we show that the spatial information encoded within the embedding space of these models may explain their strong performance in geographic information extraction. Finally, we discuss the spatial biases inherent in the models' predictions and emphasize the need for caution when applying these techniques in certain contexts.

    Methods

    This contains the data and code required to reproduce the results from our paper. The LGL and GeoVirus datasets are pre-existing datasets, with references given in the manuscript. The News2024 dataset was constructed specifically for the paper.

    To construct the News2024 dataset, we first created a list of 50 cities from around the world with populations greater than 1,000,000. We then used the GNews Python package (https://pypi.org/project/gnews/) to find a news article for each location, published between 2024-05-01 and 2024-06-30 (inclusive). Of these articles, 47 were found to contain toponyms; the three rejected articles referred to businesses that share a name with a city and did not otherwise mention any place names.
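    A minimal sketch of that collection step with the GNews package; the language setting, result limit, and example cities are assumptions rather than the paper's exact configuration:

```python
# Sketch: fetch candidate news articles per city for a fixed date window.
from gnews import GNews

google_news = GNews(language="en",
                    start_date=(2024, 5, 1),
                    end_date=(2024, 6, 30),
                    max_results=5)

for city in ["Jakarta", "Lagos", "Toronto"]:  # stand-ins for the 50 cities
    articles = google_news.get_news(city)
    if articles:
        print(city, "->", articles[0]["title"], articles[0]["url"])
```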

    We used a semi autonmous approach to geotagging the articles. The articles were first processed using a Distil-BERT model, fine tuned for named entity recognicion. This provided a first estimate of the toponyms within the text. A human reviewer then read the articles, and accepted or rejected the machine tags, and added any tags missing from the machine tagging process. We then used OpenStreetMap to obtain geographic coordinates for the location, and to identify the toponym type (e.g. city, town, village, river etc). We also flagged if the toponym was acting as a geo-political entity, as these were reomved from the analysis process. In total, 534 toponyms were identified in the 47 news articles.
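    The machine-tagging first pass can be sketched with a Hugging Face token-classification pipeline; the checkpoint named here is a stand-in, since the fine-tuned model used in the paper is not identified on this page:

```python
# Sketch: propose toponym candidates with a DistilBERT NER model for human review.
from transformers import pipeline

ner = pipeline("ner",
               model="dslim/distilbert-NER",  # assumed stand-in checkpoint
               aggregation_strategy="simple")

text = "Flooding closed several roads in Jakarta and delayed trains to Bandung."
candidates = [e for e in ner(text) if e["entity_group"] == "LOC"]
print(candidates)  # a human reviewer then accepts/rejects and adds missing tags
```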

  17. f

    Representative open-ended responses to the prompt “how would having the...

    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Melissa McCartney; Jessica Colon (2023). Representative open-ended responses to the prompt “how would having the [module] during your chosen time have better prepared you for life after FIU? [Dataset]. http://doi.org/10.1371/journal.pone.0285176.t007
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Melissa McCartney; Jessica Colon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Representative open-ended responses to the prompt “How would having the [module] during your chosen time have better prepared you for life after FIU? What would you have done differently in regards to career preparation?” that connect to the inductive code of “better prepared to apply to graduate/professional schools or jobs”. Relevant sections of the student responses are bolded and underlined.

  18. Replication package of the paper "Where is Code Generated by LLMs Coming...

    • zenodo.org
    zip
    Updated Nov 7, 2024
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2024). Replication package of the paper "Where is Code Generated by LLMs Coming From? A Study with Gemini and Bing CoPilot" [Dataset]. http://doi.org/10.5281/zenodo.14051606
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package

    This replication package contains the necessary tools, data, and scripts for reproducing the results of our paper: "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot". Below is a detailed description of the directory structure and the contents of this package.

    Contents

    The replication package is organized into two main directories:

    • assets: This directory contains all .csv files used as input for the scripts and the output .csv files used to perform the manual and automated analyses for RQ1 and RQ2.

    • script: This directory contains all scripts for RQ1 and RQ2.

    In the following, we describe the content of each directory:

    assets

    This directory contains the tools and resources required for our study.

    dataset: Contains the main datasets used in the study.

    • annotationStore.csv: Input dataset for our analyses, originating from the CODESEARCHNET dataset.

    • queries.csv: .csv file containing the queries used for the experiments filtered from the CODESEARCHNET dataset. This file contains the following columns:

      • Language: Programming language of the query
      • Query: Query used for the experiment
      • GitHubUrl: GitHub URL related to a snippet that addresses the query
      • Relevance: Relevance of the linked GitHub snippet to the query

    data: Contains the datasets and results of all analyses.

    • queries.csv: General input queries. This file contains the following columns:

      • Language: Programming language of the query
      • Query: Query used for the snippet generation
      • Prompt: LLM prompt generated for the query as: “You are a Senior developer. Then give me a code snippet about:” (a construction sketch follows this file list)
    • queries_filled.csv: Similar to the previous file, but also containing the output produced by the LLM-based assistants. This file contains the following columns:

      • Language: Programming language of the query
      • Query: Query used for the snippet generation
      • Prompt: LLM prompt generated for the query as: “You are a Senior developer. Then give me a code snippet about:”
      • Notes: General notes that provide additional context or information about the query or prompt.
      • Gemini_Answer(n): The generated code snippets by Gemini.
      • Gemini(n): The external links provided by Gemini.
      • Prompt (repeated)
      • Note: Notes that provide additional context or information about the query or prompt.
      • Copilot_Answer(n): The generated code snippets by Bing-Copilot.
      • Copilot_Bing(n): The external links provided by Bing-Copilot.
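    A minimal pandas sketch of how the Prompt column described above could be derived from the queries; the file path, output name, and the exact way the query is appended to the template are assumptions:

```python
# Sketch: build the Prompt column from the Query column.
import pandas as pd

df = pd.read_csv("assets/data/queries.csv")  # assumed location
template = ("You are a Senior developer. "
            "Then give me a code snippet about: {query}")
df["Prompt"] = df["Query"].map(lambda q: template.format(query=q))
df.to_csv("queries_with_prompts.csv", index=False)  # hypothetical output
```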

    copilot || gemini: Contains the data related to the specific LLM. These two subdirectories have the same internal structure.

    • queries.csv: The queries_filled.csv file, filtered for the specific LLM.
    • queries_noTrivial.csv: Contains only the queries with at least one nontrivial generated snippet.
    • external_links.csv: External links extracted from the LLMs output.

    • external_links_filled.csv: Snippets extracted from the external links.

      • index: Query ID
      • source: Snippet ID
      • url: Link URL
      • note: Notes that provide additional context or information about the query or prompt
      • code(n): The n-th code snippet extracted from the source

    manual_analysis: Manual analysis results.

    • manual_analysis.csv:
      • index: Query ID
      • query: Query used for the snippet generation
      • generatedsnippet(n): The n-th code snippet generated by the LLM-based assistant
      • trivial_1: Manual analysis of whether or not the snippet was trivial (validator 1)
      • trivial_2: Manual analysis of whether or not the snippet was trivial (validator 2)
      • trivial_final: Manual analysis of whether or not the snippet was trivial (final classification if there is a disagreement)
      • source: URL to analyze
      • sourcetype1: Type of the source (validator 1)
      • sourcetype2: Type of the source (validator 2)
      • sourcetypefinal: Type of the source (final classification if there is a disagreement)
      • relatedtoquery_1: Relevance of the link to the query (validator 1)
      • relatedtoquery_2: Relevance of the link to the query (validator 2)
      • relatedtoquery_final: Relevance of the link to the query (final classification if there is a disagreement)
      • relatedtosnippets_1: Relevance of the generated snippet to those in the link (validator 1)
      • relatedtosnippets_2: Relevance of the generated snippet to those in the link (validator 2)
      • relatedtosnippets_final: Relevance of the generated snippet to those in the link (final classification if there is a disagreement)
    • manual_analysis_noTrivial.csv: As in the previous file, but only the queries with at least one nontrivial generated code snippet.

    clone_detector: Output and intermediate files for clone detection with Copilot data.

    • copilot_tokens || gemini_tokens: Contains the output of the tokenization of the generated code snippets and of the code snippets extracted from the external links.
    • merged_llm_ext_link.csv: All possible pairs (Cartesian product) of (code snippet extracted from the external links, generated code snippet). This file is the input to the clone detection tool; a pandas sketch for building it appears at the end of this clone_detector list.
      • ID_query: Query ID
      • query: Query used for the snippet generation
      • language: Programming language of the query
      • generated_snippet: The generated code snippet by the LLM-based assistant
      • IDgensnippet: The index of the generated code snippet
      • LOCgensnippet: The number of lines of code of the generated code snippet
      • ID_source: Source ID
      • source: Source URL
      • source_snippet: Code snippet extracted from the source
      • IDsourcesnippet: ID of the code snippet extracted from the source
      • LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
      • note: Notes that provide additional context or information about the query or prompt
    • clone_detection_output.csv: Contains the clone detection results.
      • ID_query: The index of the query
      • query: Query used for the snippet generation
      • language: The programming language of the query
      • generated_snippet: The generated code snippet by the LLM-based assistant
      • IDgensnippet: The index of the generated code snippet
      • LOCgensnippet: The number of lines of code of the generated code snippet
      • ID_source: Source ID
      • source: Source URL
      • source_snippet: Code snippet extracted from the source
      • IDsourcesnippet: ID of the code snippet extracted from the source
      • LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
      • note: Notes that provide additional context or information about the query or prompt
      • clone_detected: Boolean value that indicates whether a clone has been detected (1 = detected, 0 = not detected)
      • cloning_ratio: Ratio of lines of the generated code snippet that were detected as a clone in the code snippet extracted from the source
      • cloned_lines: The number of lines of the generated code snippet that were detected as a clone in the code snippet extracted from the source
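    The Cartesian-product construction of merged_llm_ext_link.csv mentioned above can be sketched in pandas as follows; the intermediate file names are hypothetical, and only the pairing logic is illustrated:

```python
# Sketch: pair every generated snippet with every source snippet.
import pandas as pd

gen = pd.read_csv("generated_snippets.csv")  # hypothetical intermediate file
src = pd.read_csv("source_snippets.csv")     # hypothetical intermediate file

# Per-query pairing: merging on a shared key with duplicate key values
# yields all combinations within that key.
pairs = gen.merge(src, on="ID_query", suffixes=("_gen", "_src"))
# For a full Cartesian product irrespective of query, use how="cross":
# pairs = gen.merge(src, how="cross")
pairs.to_csv("merged_llm_ext_link.csv", index=False)
```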

    cosine_sim: Cosine similarity results.

    • cosine_sim_output.csv: Contains the cosine similarity results (a computation sketch follows this list)
      • query_id: Query ID
      • snippet_id: ID of the generated code snippet
      • source_id: ID of the source
      • sourcesnippetid: ID of the code snippet extracted from the source
      • cosine_similarity: The cosine similarity between the generated code snippet and the code snippet extracted from the source
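    A minimal sketch of one way to compute such a score, using TF-IDF token vectors and scikit-learn's cosine_similarity; the replication package's actual vectorization may differ:

```python
# Sketch: cosine similarity between a generated snippet and a source snippet.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

generated = "def add(a, b):\n    return a + b"
source = "def add_numbers(x, y):\n    return x + y"

vec = TfidfVectorizer(token_pattern=r"\w+")  # crude code tokenization
m = vec.fit_transform([generated, source])
print(cosine_similarity(m[0], m[1])[0, 0])
```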

    quant_analysis: Quantitative analysis results.

    • topN_links_se.csv: Contains the top-N links extracted from the search engine.
      • id: Query ID
      • query: The query
      • url: Link URL

    • merged_clone_cosine.csv: Contains the merged results of the clone detection and cosine similarity.
      • ID_query: Query ID
      • query: The query
      • language: The programming language of the query
      • generated_snippet: The generated code snippet by the LLM-based assistant
      • IDgensnippet: The ID of the generated code snippet

  19. h

    llama2-sst2-fine-tuning

    • huggingface.co
    Updated Aug 2, 2023
    Cite
    Yifei (2023). llama2-sst2-fine-tuning [Dataset]. https://huggingface.co/datasets/OneFly7/llama2-sst2-fine-tuning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2023
    Authors
    Yifei
    Description

    Dataset Card for "llama2-sst2-finetuning"

      Dataset Description
    

    The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA V2 on the GLUE SST-2 sentiment classification task. We provide two subsets: training and validation. To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, where the data follows this format:
    [INST] <
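    The template line above is truncated on this page. For reference, a minimal sketch of the standard LLaMA V2 [INST]/<<SYS>> chat format applied to an SST-2 example; the system prompt and label strings are assumptions, not the dataset's exact text:

```python
# Sketch: format one SST-2 example in the LLaMA V2 instruction template.
def to_llama2_example(sentence: str, label: int) -> str:
    system = "You are a sentiment classifier."          # assumed system prompt
    answer = "positive" if label == 1 else "negative"   # assumed label strings
    return (f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
            f"Classify the sentiment of: {sentence} [/INST] {answer} </s>")

print(to_llama2_example("a gorgeous, witty, seductive movie", 1))
```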

  20. O

    5 Day Payment Target; Better Payment Practice Code (BPPC)

    • opalpro.cs.upb.de
    • cloud.csiss.gmu.edu
    • +2more
    Updated Jun 23, 2019
    Cite
    NHS Digital (2019). 5 Day Payment Target; Better Payment Practice Code (BPPC) [Dataset]. http://opalpro.cs.upb.de/zh_CN/dataset/groups/5_day_payment_target_better_payment_practice_code_bppc_
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/csv (available download formats)
    Dataset updated
    Jun 23, 2019
    Dataset provided by
    NHS Digital
    License

    Open Government Licence: http://reference.data.gov.uk/id/open-government-licence

    Description

    The NHS Information Centre 5 Day Payment Target; Better Payment Practice Code (BPPC)

    This information shows the Additional Monitor Returns Report (prompt payment: analysis of the duration between invoice receipt and invoice payment, in working days).
