Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Large language models present new opportunities for teaching and learning. The response accuracy of these models, however, is believed to depend on the prompt quality which can be a challenge for students. In this study, we aimed to explore how undergraduate students use ChatGPT for problem-solving, what prompting strategies they develop, the link between these strategies and the model’s response accuracy, the existence of individual prompting tendencies, and the impact of gender in this context. Our students used ChatGPT to solve five problems related to embedded systems and provided the solutions and the conversations with this model. We analyzed the conversations thematically to identify prompting strategies and applied different quantitative analyses to establish relationships between these strategies and the response accuracy and other factors. The findings indicate that students predominantly employ three types of prompting strategies: single copy-and-paste prompting (SCP), single reformulated prompting (SRP), and multiple-question prompting (MQP). ChatGPT’s response accuracy using SRP and MQP was significantly higher than using SCP, with effect sizes of -0.94 and -0.69, respectively. The student-by-student analysis revealed some tendencies. For example, 26 percent of the students consistently copied and pasted the questions into ChatGPT without any modification. Students who used MQP showed better performance in the final exam than those who did not use this prompting strategy. As for gender, female students tended to make extensive use of SCP, whereas male students tended to mix SCP and MQP. We conclude that students develop different prompting strategies that lead to different response qualities and learning. More research is needed to deepen our understanding and inform effective educational practices in the AI era.
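To make the reported effect sizes concrete, the following is a minimal sketch of the standard Cohen's d computation on per-student accuracy scores; the numbers below are hypothetical placeholders, not values from the study.

```python
import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

# Hypothetical per-student response-accuracy scores (NOT the study's data)
scp = np.array([0.40, 0.55, 0.35, 0.50, 0.45])  # single copy-and-paste prompting
srp = np.array([0.70, 0.85, 0.65, 0.80, 0.75])  # single reformulated prompting
print(f"d = {cohens_d(scp, srp):.2f}")  # negative d means SCP scored lower than SRP
```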
https://choosealicense.com/licenses/cc0-1.0/
🧠 Awesome ChatGPT Prompts [CSV dataset]
This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.
License
CC0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Context. The use of large language models for qualitative analysis is gaining attention in various fields, including software engineering, where qualitative methods are essential to understanding human and social factors. Goal. This study aimed to investigate how LLMs are currently used in qualitative analysis and how they can be used in software engineering research, focusing on identifying the benefits, limitations, and practices associated with their application. Method. We conducted a systematic mapping study and analyzed 21 relevant studies to explore the uses of LLMs for qualitative analysis reported in the literature. Findings. Our findings indicate that LLMs are primarily used for tasks such as coding, thematic analysis, and data categorization, with benefits including increased efficiency and support for new researchers. However, limitations such as output variability, challenges capturing nuanced perspectives, and ethical concerns regarding privacy and transparency were also evident. Discussion. The study highlights the need for structured strategies and guidelines to optimize LLM use in qualitative research within software engineering. Such strategies could enhance the effectiveness of LLMs while addressing ethical considerations. Conclusion. While LLMs show promise for supporting qualitative analysis, human expertise remains essential for data interpretation, and continued exploration of best practices will be crucial for their effective integration into empirical software engineering research.
https://www.datainsightsmarket.com/privacy-policy
The Data Analytics Consulting Services market is experiencing robust growth, driven by the increasing adoption of data-driven decision-making across various industries. The market, estimated at $150 billion in 2025, is projected to exhibit a compound annual growth rate (CAGR) of 12% from 2025 to 2033, reaching approximately $400 billion by 2033. This expansion is fueled by several key factors. The surge in data volume and variety necessitates specialized expertise in data analytics, pushing organizations to seek professional consulting services. Furthermore, the growing need for advanced analytics techniques, including predictive modeling, machine learning, and AI, is driving demand for sophisticated consulting solutions. The rise of cloud computing and big data technologies is also contributing to market growth by enabling easier data storage, processing, and analysis. Finally, regulatory compliance requirements, such as GDPR and CCPA, are prompting businesses to invest in data governance and analytics consulting to ensure data security and privacy.

The competitive landscape is characterized by a mix of large multinational consulting firms (Accenture, Deloitte, EY, PwC, McKinsey, BCG) and specialized data analytics consultancies (DataArt, Infosys, Appnovation, InData Labs, etc.). These firms offer a wide range of services, including data strategy development, data warehousing and integration, business intelligence implementation, advanced analytics solutions, and data visualization services.

While significant growth is anticipated, challenges remain. These include the shortage of skilled data scientists and analysts, the complexity of integrating various data sources, and the need for robust data security measures. The market is segmented based on various factors such as service type, industry vertical, and geographic region, allowing firms to target specific niches and maximize their market penetration. The North American market currently holds the largest market share, followed by Europe and Asia Pacific, but growth in emerging economies is expected to be substantial in the coming years.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and the language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain mixes of comments and literals written in English and in the language used to formulate the prompt.
This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.
The data directory contains five subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:
prompt_translation: Contains files with manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:

- id: The ID of the query in the CoderEval benchmark.
- prompt: The original English prompt.
- summary: The original summary.
- code: The original code.
- translation: The translation generated by GPT.
- correction: The manual correction of the GPT-generated translation.
- correction_tag: A list of tags indicating the corrections made to the translation.
- generated_code: This column is initially empty and will contain the code generated from the translated prompt.

generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains a .csv file (e.g., java_chinese_claude.csv) containing the generated code in the corresponding column.

tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.
quantitative_analysis: Contains all the .csv reports of the static analysis tools and the test output for all languages and models. These files are the inputs for the statistical analysis. The directory stats contains all the output tables for the statistical analysis, which are shown in the paper's tables.
qualitative_analysis: Contains files used for the qualitative analysis:

- id: The ID of the query in the CoderEval benchmark.
- generated_code: The code generated by the model.
- comments: The language used for comments.
- identifiers: The language used for identifiers.
- literals: The language used for literals.
- notes: Additional notes.

ablation_study: Contains files for the ablation study. Each file has the following columns:

- id: The ID of the query in the CoderEval benchmark.
- prompt: The prompt used for code generation.
- generated_code, comments, identifiers, and literals: Same as in the qualitative analysis.

results.pdf: This file shows the table containing all the percentages of comments, identifiers, and literals extracted from the .csv files of the ablation study.

Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:
You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
Use a Python code block to write your response.
Comments and identifiers must be in Italian.
For example:
```python
print("Hello World!")
The scripts directory contains all the scripts used to perform the generations and analyses. All files are properly commented. Here is a brief description of each file:
code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.
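As an illustration of the generation loop this script implements, here is a minimal sketch; the CSV column names follow the data description above, while the file paths and the generate() stub are hypothetical placeholders for the model-specific API calls.

```python
import csv
from pathlib import Path

def generate(prompt: str, model: str) -> str:
    """Placeholder for the model-specific API call (GPT, DeepSeek, or Claude)."""
    raise NotImplementedError

def run_generation(csv_in: str, model: str, out_dir: str, iteration: int) -> None:
    out = Path(out_dir) / f"iter_{iteration}"
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_in, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        try:
            # 'correction' holds the manually corrected translated prompt
            row["generated_code"] = generate(row["correction"], model)
        except Exception as exc:
            print(f"error on id {row['id']}: {exc}")  # log the error and continue
            row["generated_code"] = ""
    with open(out / Path(csv_in).name, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```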
computeallanalysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.
createtestjava.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.
deepseek_model.py: Defines a function that sends a request to the DeepSeek API, passing a system and a user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console; if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.
extractpmdreport.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.
flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.
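A minimal sketch of such a Flake8-to-CSV step (hypothetical paths and column layout; the actual logic lives in flake_analysis.py):

```python
import csv
import subprocess

def flake8_to_csv(source_dir: str, report_csv: str) -> None:
    # Ask Flake8 for machine-readable output: path|line|col|code|message
    proc = subprocess.run(
        ["flake8", "--format=%(path)s|%(row)d|%(col)d|%(code)s|%(text)s", source_dir],
        capture_output=True, text=True,
    )
    with open(report_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "line", "col", "code", "message"])
        for line in proc.stdout.splitlines():
            writer.writerow(line.split("|", 4))

flake8_to_csv("generated_code/", "flake8_report.csv")
```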
generatepredictionclaude_java.py: The generatecodefrom_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it's JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.
generatepredictionclaude_python.py: This code defines a function generatecodefrom_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Example prompts, their task-related features, and their assigned complexity values.
MIT License: https://opensource.org/licenses/MIT
Summary of Artifacts
This is the replication package for the paper titled 'Can Developers Prompt? A Controlled Experiment for Code Documentation Generation' that is part of the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME), from October 6 to 11, 2024, located in Flagstaff, AZ, USA.
Full Abstract
Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Students especially perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.
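To illustrate the difference between the two conditions, here is a minimal sketch of how a predefined few-shot prompt for docstring generation can be assembled; the example shots are illustrative placeholders, while the actual shots used in the experiment are listed in few_shots.txt below.

```python
# Illustrative (function source, reference docstring) pairs; the study's real
# few shots are in few_shots.txt of this replication package.
FEW_SHOTS = [
    ("def add(a, b):\n    return a + b",
     '"""Return the sum of a and b."""'),
    ("def is_even(n):\n    return n % 2 == 0",
     '"""Return True if n is even, False otherwise."""'),
]

def build_prompt(target_function: str) -> str:
    parts = ["Write a concise Python docstring for the last function."]
    for source, docstring in FEW_SHOTS:
        parts.append(f"Function:\n{source}\nDocstring:\n{docstring}")
    parts.append(f"Function:\n{target_function}\nDocstring:")
    return "\n\n".join(parts)

print(build_prompt("def square(x):\n    return x * x"))
```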
Author Information
| Name | Affiliation | Email |
|------|-------------|-------|
| Hans-Alexander Kruse | Universität Hamburg | hans-alexander.kruse@studium.uni-hamburg.de |
| Tim Puhlfürß | Universität Hamburg | tim.puhlfuerss@uni-hamburg.de |
| Walid Maalej | Universität Hamburg | walid.maalej@uni-hamburg.de |
Citation Information
@inproceedings{kruse-icsme-2024,
  author={Kruse, Hans-Alexander and Puhlf{\"u}r{\ss}, Tim and Maalej, Walid},
  booktitle={2024 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  title={Can Developers Prompt? A Controlled Experiment for Code Documentation Generation},
  year={2024},
  doi={tba},
}
Artifacts Overview
The file kruse-icsme-2024-preprint.pdf is the preprint version of the official paper. You should read the paper in detail to understand the study, especially its methodology and results.
The folder results includes two subfolders, explained in the following.
Demographics RQ1 RQ2
The subfolder Demographics RQ1 RQ2 provides the Jupyter Notebook file evaluation.ipynb for analyzing (1) the experiment participants' submissions to the digital survey and (2) the ad-hoc prompts that the experimental group entered into their tool. Hence, this file provides demographic information about the participants and results for research questions 1 and 2. Please refer to the README file inside this subfolder for installation steps for the Jupyter Notebook file.
RQ2
The subfolder RQ2 contains further subfolders with Microsoft Excel files specific to the results of research question 2:
The subfolder UEQ contains three copies of the official User Experience Questionnaire (UEQ) analysis Excel tool, with data entered for all participants, for students only, and for professionals only.
The subfolder Open Coding contains three Excel files with the open-coding results for the free-text answers that participants could enter at the end of the survey to state additional positive and negative comments about their experience during the experiment. The Consensus file provides the finalized version of the open coding process.
The folder extension contains the code of the Visual Studio Code (VS Code) extension developed in this study to generate code documentation with predefined prompts. Please refer to the README file inside the folder for installation steps. Alternatively, you can install the deployed version of this tool, called Code Docs AI, via the VS Code Marketplace.
You can install the tool to generate code documentation with ad-hoc prompts directly via the VS Code Marketplace. We did not include the code of this extension in this replication package due to license conflicts (GNUv3 vs. MIT).
The folder survey contains PDFs of the digital survey in two versions:
The file Survey.pdf contains the rendered version of the survey (how it was presented to participants).
The file SurveyOptions.pdf is an export of the LimeSurvey web platform. Its main purpose is to provide the technical answer codes, e.g., AO01 and AO02, that refer to the rendered answer texts, e.g., Yes and No. This can help you if you want to analyze the CSV files inside the results folder (instead of using the Jupyter Notebook file), as the CSVs contain the answer codes, not the answer texts. Please note that an export issue caused page 9 to be almost blank. However, this problem is negligible as the question on this page only contained one free-text answer field.
The folder appendix provides additional material about the study:
The subfolder tool_screenshots contains screenshots of both tools.
The file few_shots.txt lists the few shots used for the predefined prompt tool.
The file test_functions.py lists the functions used in the experiment.
Revisions
| Version | Changelog |
|---------|-----------|
| 1.0.0 | Initial upload |
| 1.1.0 | Add paper preprint. Update abstract. |
| 1.2.0 | Update replication package based on ICSME Artifact Track reviews |
License
See LICENSE file.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Results of the chi-square test comparing all the different prompting strategies across the various complexities.
https://dataintelo.com/privacy-and-policy
The global cookie tracking software market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach around USD 4.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 16.8% during the forecast period. This growth is driven by increasing digitalization, heightened demand for personalized marketing, and stringent data privacy regulations. Companies are investing heavily in technologies that can help them track user behavior, optimize user experiences, and ensure compliance with evolving privacy laws, which fuels market growth.
One of the primary growth factors for the cookie tracking software market is the increasing emphasis on personalized marketing. As companies strive to offer more tailored user experiences, they require sophisticated tools to collect and analyze user data. Cookie tracking software enables businesses to capture detailed insights into user preferences and behaviors, allowing them to deliver personalized content and advertisements. This capability significantly enhances customer engagement and conversion rates, making it a critical component for digital marketing strategies.
Another contributing factor is the rising implementation of data privacy regulations worldwide. Legislation such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States has necessitated more transparent and secure data tracking practices. Cookie tracking software helps organizations comply with these regulations by providing features like consent management and data anonymization. This ensures that businesses can continue to leverage user data while adhering to legal requirements, thereby mitigating the risk of hefty fines and reputational damage.
The growing adoption of digital platforms during the COVID-19 pandemic has further accelerated the demand for cookie tracking software. With an increasing number of consumers shifting to online shopping and remote work environments, businesses have had to ramp up their digital presence. This surge in digital activity has underscored the importance of effective user tracking and data analysis, prompting more companies to invest in advanced cookie tracking solutions to better understand and cater to their online audiences.
In the realm of digital marketing, Subscription Analytics Software is becoming increasingly vital as businesses transition to subscription-based models. This software provides companies with the tools to analyze customer subscription data, offering insights into customer behavior, preferences, and churn rates. By leveraging these insights, businesses can optimize their subscription offerings, tailor marketing strategies, and enhance customer retention. The integration of subscription analytics with cookie tracking software can further enrich data collection, enabling a more comprehensive understanding of user interactions and preferences. As the subscription economy continues to grow, the demand for robust analytics solutions that can seamlessly integrate with existing digital marketing tools is expected to rise.
Regionally, North America is expected to hold a significant share of the cookie tracking software market, driven by the presence of major technology companies and high internet penetration rates. Europe is also anticipated to see robust growth due to stringent data protection regulations and widespread digitalization initiatives. Meanwhile, the Asia Pacific region is projected to experience the fastest growth, fueled by rapid economic development, increasing internet usage, and the proliferation of e-commerce platforms in countries like China and India.
The cookie tracking software market can be segmented by component into software and services. The software segment, which includes various types of cookie tracking applications and platforms, is expected to dominate the market. This is largely due to the continuous advancements in technology and the increasing need for sophisticated tools to analyze vast amounts of user data. Companies are constantly seeking robust software solutions that can provide real-time insights and seamless integration with other digital marketing tools.
Within the software segment, several sub-categories exist, including standalone cookie tracking software and integrated solutions that form part of broader digital marketing platforms. Sta
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
### Data Availability Statement (for the paper)
All dialogue logs and final responses collected in this study are publicly available in the PROSPECT repository on Zenodo (DOI: [to be assigned]). The repository contains PDF files of complete dialogue histories and Markdown files of final comprehensive analyses for all conditions and models used in this study, allowing for reproducibility and further analysis.
### README.md for Zenodo
# PROSPECT: Professional Role Effects on Specialized Perspective Enhancement in Conversational Task
## Overview
This repository (PROSPECT) contains the dataset associated with the paper:
> "Empirical Investigation of Expertise, Multiperspectivity, and Abstraction Induction in Conversational AI Outputs through Professional Role Assignment to Both User and AI"
This research analyzed changes in dialogue logs and final responses when professional roles were assigned to both user and AI sides across multiple Large Language Models (LLMs). This repository provides the complete dialogue logs (PDF format) and final responses (Markdown format) used in the analysis.
## Directory Structure
The repository structure under the top directory (`PROSPECT/`) is as follows:
```
PROSPECT/
├── dialogue/ # Dialogue histories (PDF)
│ ├── none/
│ ├── ai_only/
│ ├── user_only/
│ └── both/
└── final_answers/ # Final responses (Markdown)
├── none/
├── ai_only/
├── user_only/
└── both/
```
- **dialogue/**
- Contains raw dialogue logs in PDF format. Subdirectories represent role assignment conditions:
- `none/`: No roles assigned to either user or AI
- `ai_only/`: Role assigned to AI only
- `user_only/`: Role assigned to user only
- `both/`: Roles assigned to both user and AI
- **final_answers/**
- Contains final comprehensive analysis responses in Markdown format. Directory structure mirrors that of `dialogue/`.
## File Naming Convention
Files in each directory follow this naming convention:
```
[AI]_[conditionNumber]-[roleNumber].pdf
[AI]_[conditionNumber]-[roleNumber].md
```
- `[AI]`: AI model name used for dialogue (e.g., ChatGPT, ChatGPT-o1, Claude, Gemini)
- `[conditionNumber]`: Number indicating role assignment condition
- 0: none
- 1: ai_only
- 2: user_only
- 3: both
- `[roleNumber]`: Professional role number
- 0: No role
- 1: Detective
- 2: Psychologist
- 3: Artist
- 4: Architect
- 5: Natural Scientist
### Examples:
- `ChatGPT_3-1.pdf`: Dialogue log with ChatGPT-4o model under "both" condition (3) with detective role (1)
- `Gemini_1-4.md`: Final response from Gemini model under "ai_only" condition (1) with architect role (4)
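Since downstream analysis usually needs the model, condition, and role as separate variables, here is a small sketch that parses the naming convention above (it assumes file names match the documented pattern exactly):

```python
import re

CONDITIONS = {0: "none", 1: "ai_only", 2: "user_only", 3: "both"}
ROLES = {0: "No role", 1: "Detective", 2: "Psychologist",
         3: "Artist", 4: "Architect", 5: "Natural Scientist"}

def parse_name(filename: str) -> dict:
    """Split e.g. 'ChatGPT_3-1.pdf' into model, condition, and role."""
    m = re.fullmatch(r"(?P<ai>.+)_(?P<cond>\d)-(?P<role>\d)\.(pdf|md)", filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "model": m.group("ai"),
        "condition": CONDITIONS[int(m.group("cond"))],
        "role": ROLES[int(m.group("role"))],
    }

print(parse_name("ChatGPT_3-1.pdf"))  # {'model': 'ChatGPT', 'condition': 'both', 'role': 'Detective'}
```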
## Role Number Reference
| roleNumber | Professional Role |
|-----------:|:-----------------|
| 0 | No role |
| 1 | Detective |
| 2 | Psychologist |
| 3 | Artist |
| 4 | Architect |
| 5 | Natural Scientist|
## Data Description
- **Dialogue Histories (PDF format)**
Complete logs of questions and responses from each session, preserved as captured during the research. All dialogues were conducted in Japanese. While assistant version information is not included, implementation dates and model names are recorded within the files.
- **Final Responses (Markdown format)**
Excerpted responses to the final "comprehensive analysis request" as Markdown files, intended for text analysis and keyword extraction. All responses are in Japanese.
*Note: This dataset contains dialogues and responses exclusively in Japanese. Researchers interested in lexical or content analysis should take this language restriction into account.*
## How to Use
1. Please maintain the folder hierarchy after downloading.
2. For meta-analysis or lexical analysis, refer to PDFs for complete dialogues and Markdown files for final responses.
3. Utilize for research reproduction, secondary analysis, or meta-analysis.
## License
This dataset is released under the **CC BY 4.0** License.
- Free to use and modify, but please cite this repository (DOI) and the associated paper when using the data.
## Related Publication
## Disclaimer
- The dialogue logs contain no personal information or confidential data.
- The provided logs and responses reflect the research timing; identical prompts may yield different responses due to AI model updates.
- The creators assume no responsibility for any damages resulting from the use of this dataset.
## Contact
For questions or requests, please contact skeisuke@ibaraki-ct.ac.jp.
https://www.statsndata.org/how-to-order
The Laser Methane Telemetry Detection Module market is witnessing significant growth as industries increasingly prioritize safety and environmental compliance. These modules utilize advanced laser technology to detect methane and other gases in real-time, providing critical data for prompt decision-making and risk m
This repository contains the dataset and analysis associated with the research paper "Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning?", presented at the 20th European Conference on Technology Enhanced Learning (ECTEL), 2025.
Overview
Ubiquitous learning (u-learning) applications often rely on identifying relevant Points of Interest (POIs) where students can engage in contextualized learning tasks. Traditionally, these POIs have been retrieved from structured datasets like Linked Open Data (LOD). However, with the rise of Large Language Models (LLMs), a new question arises: can LLMs outperform LOD in identifying such locations?
This study compares the performance of a LOD dataset (Wikidata) and two LLMs (ChatGPT and DeepSeek) in retrieving 16th-century cultural heritage sites (churches, cathedrals, castles, and palaces) across three European cities (two in Spain and one in Italy) and their regions.
Dataset
The file LODvsLLMs.xlsx includes:
Raw data retrieved from Wikidata and the two LLMs.
SPARQL queries and LLM prompts used for data collection.
Comparative analysis across four key dimensions:
Accuracy: Are the retrieved sites real and verifiable?
Consistency: Do repeated queries yield stable results?
Completeness: How exhaustive are the lists of POIs?
Validity: Are the geographic coordinates and Wikipedia links correct?
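For context, the Wikidata side of the comparison relies on SPARQL queries like the hedged sketch below; the exact queries and LLM prompts are in LODvsLLMs.xlsx, and the class and filters here are illustrative approximations only.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative query: churches (wd:Q16970) with coordinates; the real queries
# in LODvsLLMs.xlsx also restrict by century and by city/region.
QUERY = """
SELECT ?site ?siteLabel ?coord WHERE {
  ?site wdt:P31/wdt:P279* wd:Q16970 ;   # instance of church building (or subclass)
        wdt:P625 ?coord .               # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="u-learning-poi-demo")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["siteLabel"]["value"], row["coord"]["value"])
```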
Key Findings
LOD (Wikidata) outperformed LLMs in terms of consistency, completeness (especially in larger regions), and validity of data.
LLMs were able to retrieve some POIs not found in Wikidata, but also introduced hallucinations and invalid links.
A hybrid approach combining LOD and LLMs is proposed for future u-learning applications to maximize coverage and reliability.
Citation
If you use this dataset or refer to the findings in your work, please cite the original paper presented at ECTEL 2025.
García-Zarza, P., Asensio-Pérez, J.I., Bote-Lorenzo, M.L., Sánchez-Turrión, L.F., Taibi, D., Vega-Gorgojo, G.: "Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning?" In Proceedings of the 20th European Conference on Technology Enhanced Learning (ECTEL 2025), Newcastle & Durham, United Kingdom, September 2025.
Objective: Our objective is to evaluate the efficacy of ChatGPT 4 in accurately and effectively delivering genetic information, building on previous findings with ChatGPT 3.5. We focus on assessing the utility, limitations, and ethical implications of using ChatGPT in medical settings.

Materials and Methods: A structured questionnaire, including the Brief User Survey (BUS-15) and custom questions, was developed to assess ChatGPT 4's clinical value. An expert panel of genetic counselors and clinical geneticists independently evaluated ChatGPT 4's responses to these questions. We also conducted a comparative analysis with ChatGPT 3.5, utilizing descriptive statistics and using R for data analysis.

Results: ChatGPT 4 demonstrated improvements over 3.5 in context recognition, relevance, and informativeness. However, performance variability and concerns about the naturalness of the output were noted. No significant difference in accuracy was found between ChatGPT 3.5 and 4.0. Notably, the effic...

Study Design: This study was conducted to evaluate the performance of ChatGPT 4 (March 23rd, 2023 model) in the context of genetic counseling and education. The evaluation involved a structured questionnaire, which included questions selected from the Brief User Survey (BUS-15) and additional custom questions designed to assess the clinical value of ChatGPT 4's responses.

Questionnaire Development: The questionnaire was built on Qualtrics and comprised twelve questions: seven selected from the BUS-15, preceded by two additional questions that we designed. The initial questions focused on quality and answer relevancy:
1. The overall quality of the Chatbot's response is: (5-point Likert: Very poor to Very good)
2. The Chatbot delivered an answer that provided the relevant information you would include if asked the question. (5-point Likert: Strongly disagree to Strongly agree)
The BUS-15 questions (7-point Likert: Strongly disagree to Strongly agree) focused on:
1. Recogniti...

# A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data
https://doi.org/10.5061/dryad.s4mw6m9cv
This data was captured when evaluating the ability of ChatGPT to address questions patients may ask it about three genetic conditions (BRCA1, HFE, and MLH1). This data is associated with the JAMIA article of the same name, DOI 10.1093/jamia/ocae128.
https://dataintelo.com/privacy-and-policy
The global oil and gas pipeline leak detection market size is projected to experience significant growth, with an expected valuation rising from USD 2.37 billion in 2023 to USD 3.89 billion by 2032, reflecting a healthy compound annual growth rate (CAGR) of 5.6% from 2024 to 2032. This market expansion is largely fueled by the increasing emphasis on safety and environmental regulations, the growing complexity of pipeline networks, and the dire need for efficient and reliable leak detection systems. As governments and organizations worldwide become more aware of and committed to reducing the environmental impacts of fossil fuel extraction and transportation, the demand for advanced leak detection technologies has intensified, driving market growth.
One of the primary factors contributing to the growth of the oil and gas pipeline leak detection market is the stringent regulatory frameworks being implemented globally to prevent environmental disasters. These regulations mandate the installation of sophisticated leak detection systems to minimize the risk of oil spills and gas leaks, which can have catastrophic environmental and economic consequences. The increasing public awareness and pressure on governments to ensure the safety and integrity of oil and gas infrastructure have also played a crucial role in driving the market's expansion. Furthermore, the adoption of best practices and international standards in pipeline monitoring and maintenance is further propelling the demand for innovative and reliable leak detection technologies.
Technological advancements in the oil and gas industry have paved the way for the development of more efficient and accurate leak detection systems. Innovations such as acoustic/ultrasonic sensors, fiber optic technologies, and advanced data analytics are improving the precision and reliability of leak detection, thereby reducing operational risks and potential losses. The integration of Internet of Things (IoT) and artificial intelligence (AI) in pipeline monitoring systems enhances real-time data collection and analysis, enabling prompt detection and response to leaks. These cutting-edge technologies are not only enhancing the effectiveness of leak detection but also reducing the overall costs associated with pipeline monitoring and maintenance, making them increasingly attractive to oil and gas companies.
The growing global energy demand and the expansion of oil and gas pipeline networks, especially in emerging economies, are also driving the need for efficient leak detection systems. As countries endeavor to secure their energy supply and improve infrastructure, significant investments are being made in the construction and maintenance of extensive pipeline networks. This expansion necessitates robust leak detection solutions to ensure the safe and efficient transportation of oil and gas resources. Additionally, the shift towards unconventional oil and gas resources, such as shale gas and deepwater drilling, presents new challenges in leak detection, further increasing the demand for advanced technologies.
Pipeline Leak Detectors play a crucial role in ensuring the safety and efficiency of oil and gas transportation. These detectors are designed to identify leaks quickly and accurately, minimizing the risk of environmental damage and economic loss. By utilizing advanced technologies such as acoustic sensors and fiber optics, pipeline leak detectors can provide real-time monitoring and immediate alerts, allowing operators to respond swiftly to any potential issues. This capability is particularly important in complex pipeline networks, where undetected leaks can lead to significant operational challenges. As the industry continues to evolve, the integration of pipeline leak detectors with digital technologies like AI and IoT is enhancing their effectiveness, offering more precise detection and predictive maintenance capabilities.
The technology segment of the oil and gas pipeline leak detection market encompasses various sophisticated systems, each offering unique advantages in detecting leaks with precision. Acoustic/ultrasonic technology, for instance, stands out for its ability to detect leaks through sound waves. This method is particularly effective in situations where traditional methods may fall short, as it can monitor for changes in noise levels along pipeline routes, indicating potential leaks. The sensitivity of acoustic/ultrasonic systems to sound variations makes th
The data center construction market in Southeast Asia is expected to grow by USD 3.61 billion, recording a CAGR of 12% during 2021-2025. This post-pandemic report on the data center construction market in Southeast Asia has assessed the shift in consumer behavior and has identified and explored the upcoming trends and drivers that vendors can capitalize on to support prompt business decisions. In this analysis report, key drivers such as the increase in investment in data centers are discussed along with emerging growth regions, which will offer immense business opportunities. Our analysts have also identified challenges, such as system integration and interoperability issues, which will impede market growth. With these insights, vendors can recreate their plan of action to capture growth opportunities in the future. The report further entails segmentation by geography (Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia) and construction component (electrical construction, mechanical construction, consulting and other services, and integrating software). The available actionable insights on these segmentations will enable a better understanding of the target audience and changing demand patterns.
Who are the Key Vendors in the Data Center Construction Market In Southeast Asia?
The forecast report for the data center construction market in Southeast Asia provides insights into key vendor profiles and the business strategies they are deploying to reimagine themselves. The profiles include information on the production, competitive landscape, sustainability, and prospects of the leading companies, including:
ABB Ltd.
AECOM
Eaton Corporation Plc
Hewlett Packard Enterprise Development LP
Legrand SA
M+W Group GmbH
Ove Arup & Partners International Ltd.
Rittal GmbH & Co. KG
Schneider Electric SE
Vertiv Holdings Co.
Our analysts have extensively outlined successful business strategies deployed by the key vendors in this market research report. The data center construction market in Southeast Asia is fragmented, and the vendors are deploying various organic and inorganic growth strategies to compete in the market.
To make the most of the opportunities, vendors should focus on fast-growing segments while maintaining their positions in the slow-growing segments. The report further offers well-structured marketing strategies to overcome the negative post-COVID-19 impact, if any, on each product and service segment.
Which are the Key Regional Markets for Data Center Construction Market In Southeast Asia?
The report offers an up-to-date analysis of the geographical composition of the market. Singapore will record a fast growth rate during 2021-2025, owing to which the region should offer several growth opportunities to market vendors. The rise in IoT solutions will significantly influence the growth of the data center construction market in Southeast Asia in this region. From the statistical study of the geographic landscape, you can interpret and understand the competitive intelligence and regional opportunities in store for vendors for 2021-2025.
35% of the market's growth will originate from Singapore during the forecast period. Singapore, Malaysia, Thailand, Indonesia, and the Rest of South-East Asia are the key markets for data center construction in the region. This report provides estimations of the contribution of all regions to the growth of the data center construction market in Southeast Asia.
Data Center Construction Market In Southeast Asia Scope
| Report Coverage | Details |
|-----------------|---------|
| Page number | 120 |
| Base year | 2020 |
| Forecast period | 2021-2025 |
| Growth momentum & CAGR | Accelerate at a CAGR of 12% |
| Market growth 2021-2025 | USD 3.61 billion |
| Market structure | Fragmented |
| YoY growth (%) | 9.45 |
| Regional analysis | Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia |
| Performing market contribution | Singapore at 35% |
| Key consumer countries | Singapore, Malaysia, Thailand, Indonesia, and Rest of South-East Asia |
| Competitive landscape | Leading companies, competitive strategies, consumer engagement scope |
| Companies profiled | ABB Ltd., AECOM, Eaton Corporation Plc, Hewlett Packard Enterprise Development LP, Legrand SA, M+W Group GmbH, Ove Arup & Partners International Ltd., Rittal GmbH & Co. KG, Schneider Electric SE, and Vertiv Holdings Co. |
| Market dynamics | Parent market a |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Geoparsing with Large Language Models
The .zip file included in this repository contains all the code and data required to reproduce the results from our paper. Note, however, that in order to run the OpenAI models, users will require an OpenAI API key and sufficient API credits.
Data
The data used for the paper are in the datasets and results folders.
**Datasets:** This contains the XML files (LGL and GeoVirus) and JSON files (News2024) used to benchmark the models. It also contains all the data used to fine-tune the GPT-3.5 model, the prompt templates sent to the LLMs, and other data used for mapping and data creation.
**Results:** This contains the results for the models on the three datasets. The folder is separated by dataset, with a single .csv file giving the results for each model on each dataset. The .csv file is structured so that each row contains either a predicted toponym and an associated true toponym (along with assigned spatial coordinates) if the model correctly identified a toponym; otherwise, the true toponym columns are empty for false positives and the predicted columns are empty for false negatives.
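Given that row structure, precision, recall, and F1 can be recomputed from any of the results .csv files along the lines of the sketch below; the column names are assumptions for illustration, not the package's documented headers.

```python
import pandas as pd

def prf(csv_path: str,
        pred_col: str = "predicted_toponym",   # assumed column name
        true_col: str = "true_toponym") -> tuple[float, float, float]:
    df = pd.read_csv(csv_path)
    tp = (df[pred_col].notna() & df[true_col].notna()).sum()  # correct identifications
    fp = (df[pred_col].notna() & df[true_col].isna()).sum()   # true columns empty
    fn = (df[pred_col].isna() & df[true_col].notna()).sum()   # predicted columns empty
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```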
Code
The code is split into two separate folders: gpt_geoparser and notebooks.
**GPT_Geoparser:** This contains the classes and methods used to process the XML and JSON articles (data.py), interact with the Nominatim API for geocoding (gazetteer.py), interact with the OpenAI API (gpt_handler.py), process the outputs from the GPT models (geoparser.py), and analyse the results (analysis.py).
**Notebooks:** This series of notebooks can be used to reproduce the results given in the paper. The file names are reasonably descriptive of what each notebook does within the context of the paper.
Code/software
Requirements
Numpy
Pandas
Geopy
Scikit-learn
lxml
openai
matplotlib
Contextily
Shapely
Geopandas
tqdm
huggingface_hub
Gnews
Access information
Other publicly accessible locations of the data:
The LGL and GeoVirus datasets can also be obtained here.
Abstract
Geoparsing, the process of associating textual data with geographic locations, is a key challenge in natural language processing. The often ambiguous and complex nature of geospatial language makes geoparsing a difficult task, requiring sophisticated language modelling techniques. Recent developments in Large Language Models (LLMs) have demonstrated their impressive capability in natural language modelling, suggesting suitability for a wide range of complex linguistic tasks. In this paper, we evaluate the performance of four LLMs - GPT-3.5, GPT-4o, Llama-3.1-8b and Gemma-2-9b - in geographic information extraction by testing them on three geoparsing benchmark datasets: GeoVirus, LGL, and a novel dataset, News2024, composed of geotagged news articles published outside the models' training window. We demonstrate that, through techniques such as fine-tuning and retrieval-augmented generation, LLMs significantly outperform existing geoparsing models. The best performing models achieve a toponym extraction F1 score of 0.985 and toponym resolution accuracy within 161 km of 0.921. Additionally, we show that the spatial information encoded within the embedding space of these models may explain their strong performance in geographic information extraction. Finally, we discuss the spatial biases inherent in the models' predictions and emphasize the need for caution when applying these techniques in certain contexts.
Methods
This contains the data and codes required to reproduce the results from our paper. The LGL and GeoVirus datasets are pre-existing datasets, with references given in the manuscript. The News2024 dataset was constructed specifically for the paper.
To construct the News2024 dataset, we first created a list of 50 cities from around the world with populations greater than 1,000,000. We then used the GNews Python package (https://pypi.org/project/gnews/) to find a news article for each location, published between 2024-05-01 and 2024-06-30 (inclusive). Of these articles, 47 were found to contain toponyms; the three rejected articles referred to businesses that share a name with a city and did not otherwise mention any place names.
We used a semi-autonomous approach to geotagging the articles. The articles were first processed using a DistilBERT model fine-tuned for named entity recognition. This provided a first estimate of the toponyms within the text. A human reviewer then read the articles, accepted or rejected the machine tags, and added any tags missing from the machine tagging process. We then used OpenStreetMap to obtain geographic coordinates for each location and to identify the toponym type (e.g., city, town, village, river). We also flagged whether a toponym was acting as a geo-political entity, as these were removed from the analysis process. In total, 534 toponyms were identified in the 47 news articles.
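The GNews retrieval step described above can be reproduced along these lines (a hedged sketch: the city list is abbreviated and the constructor arguments follow the package's documentation):

```python
from gnews import GNews

CITIES = ["Tokyo", "Lagos", "São Paulo"]  # abbreviated; the study used 50 cities >1M population

# Restrict results to the publication window outside the models' training data
google_news = GNews(language="en", start_date=(2024, 5, 1), end_date=(2024, 6, 30), max_results=1)

for city in CITIES:
    articles = google_news.get_news(city)
    if articles:
        print(city, "->", articles[0]["title"], articles[0]["url"])
```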
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Excerpts from student responses to the question "What would you have done differently in regards to career preparation?" that connect to the inductive code of "better prepared to apply to graduate/professional schools or jobs." Relevant sections of the student response are bolded and underlined.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This replication package contains the necessary tools, data, and scripts for reproducing the results of our paper: "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot". Below is a detailed description of the directory structure and the contents of this package.
The replication package is organized into two main directories:

- assets: This directory contains all .csv files used as input for the scripts and the outputted .csv files used to perform the manual and automated analyses for RQ1 and RQ2.
- script: This directory contains all scripts for RQ1 and RQ2.

In the following, we describe the content of each directory:

assets

This directory contains the tools and resources required for our study.

- dataset: Contains the main datasets used in the study.
  - annotationStore.csv: Input dataset for our analyses, originating from the CODESEARCHNET dataset.
  - queries.csv: .csv file containing the queries used for the experiments, filtered from the CODESEARCHNET dataset.
- data: Contains the datasets and results of all analyses.
  - queries.csv: General input queries. Among its columns is the prompt template ending in "developer. Then give me a code snippet about:".
  - queries_filled.csv: Similar to the previous file, but also containing the output produced by the LLM-based assistants.
  - copilot || gemini: Contains the data related to the specific LLM. These two subdirectories have the same internal structure.
    - queries.csv: The queries_filled.csv file, filtered for the specific LLM.
    - queries_noTrivial.csv: Contains only the queries with at least one nontrivial generated snippet.
    - external_links.csv: External links extracted from the LLM's output.
    - external_links_filled.csv: Snippets extracted from the external links.
    - manual_analysis: Manual analysis results.
      - manual_analysis.csv: The manual analysis results for all queries.
      - manual_analysis_noTrivial.csv: As in the previous file, but only the queries with at least one nontrivial generated code snippet.
    - clone_detector: Output and intermediate files for clone detection with Copilot data.
      - copilot_tokens || gemini_tokens: Contains the output of the tokenization of the generated code snippets and of the code snippets extracted from the external links.
      - merged_llm_ext_link.csv: All possible pairs (Cartesian product) of (code snippet extracted from the external links, generated code snippet). This file is the input of the clone detection tool.
      - clone_detection_output.csv: Contains the clone detection results.
    - cosine_sim: Cosine similarity results.
      - cosine_sim_output.csv: Contains the cosine similarity results.
    - quant_analysis: Quantitative analysis results.
      - topN_links_se.csv: Contains the top-N links extracted from the search engine.
      - merged_clone_cosine.csv: Contains the merged results of the clone detection and cosine similarity.
Dataset Card for "llama2-sst2-finetuning"
Dataset Description
The Llama2-sst2-fine-tuning dataset is designed for supervised fine-tuning of LLaMA V2 based on the GLUE SST2 sentiment analysis classification task. We provide two subsets: training and validation. To ensure the effectiveness of fine-tuning, we convert the data into the prompt template for LLaMA V2 supervised fine-tuning, where the data will follow this format:
[INST] <
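The template above is truncated in the source. For reference, here is a hedged sketch of converting an SST2 example into the widely used LLaMA V2 chat layout (the dataset's exact template may differ):

```python
SYSTEM = "Classify the sentiment of the sentence as positive or negative."

def to_llama2_prompt(sentence: str, label: str) -> str:
    # Standard LLaMA V2 chat layout; the dataset's exact template may differ.
    return (
        f"<s>[INST] <<SYS>>\n{SYSTEM}\n<</SYS>>\n\n"
        f"{sentence} [/INST] {label} </s>"
    )

print(to_llama2_prompt("a gorgeous, witty, seductive movie.", "positive"))
```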
http://reference.data.gov.uk/id/open-government-licence
The NHS Information Centre 5 Day Payment Target; Better Payment Practice Code (BPPC)
This information shows the Additional Monitor Returns Report (prompt payment: analysis of the duration between invoice receipt and invoice payment, in working days).