MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Independent Jailbreak Datasets for LLM Guardrail Evaluation
Constructed for the thesis: “Contamination Effects: How Training Data Leakage Affects Red Team Evaluation of LLM Jailbreak Detection”. The effectiveness of LLM guardrails is commonly evaluated using open-source red-teaming tools. However, this study reveals that significant data contamination exists between the training sets of binary jailbreak classifiers (ProtectAI, Katanemo, TestSavantAI, etc.) and the test prompts used in… See the full description on the dataset page: https://huggingface.co/datasets/Simsonsun/JailbreakPrompts.
ABSTRACT: Context: Large Language Models (LLMs) have revolutionized natural language generation and understanding. However, they raise significant data privacy concerns, especially when sensitive data is processed and stored by third parties. Goal: This paper investigates the perception of software development team members regarding data privacy when using LLMs in their professional activities. Additionally, we examine the challenges faced and the practices adopted by these practitioners. Method: We conducted a survey with 78 ICT practitioners from five regions of Brazil. Results: Software development team members have basic knowledge about data privacy and the LGPD (Brazil's General Data Protection Law), but most have never received formal training on LLMs and possess only basic knowledge about them. Their main concerns include the leakage of sensitive data and the misuse of personal data. To mitigate risks, they avoid using sensitive data and implement anonymization techniques. The primary challenges practitioners face are ensuring transparency in the use of LLMs and minimizing data collection. Software development team members consider current legislation inadequate for protecting data privacy in the context of LLM use. Conclusions: The results reveal a need to improve knowledge and practices related to data privacy in the context of LLM use. According to software development team members, organizations need to invest in training, develop new tools, and adopt more robust policies to protect user data privacy. They advocate for a multifaceted approach that combines education, technology, and regulation to ensure the safe and responsible use of LLMs.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandaLM aims to provide reproducible and automated comparisons between different large language models (LLMs). Given the same context, PandaLM compares the responses of different LLMs and provides a reason for its decision, along with a reference answer. The target audience for PandaLM includes organizations that hold confidential data and research labs with limited funds that seek reproducibility. These organizations may not want to disclose their data to third parties, or may be unable to afford either the risk of confidential data leaking through third-party APIs or the cost of hiring human annotators. With PandaLM, they can perform evaluations without compromising data security or incurring high costs, and obtain reproducible results. To demonstrate the reliability and consistency of our tool, we have created a diverse human-annotated test dataset of approximately 1,000 samples, where the contexts and the labels are all created by humans. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. More papers and features are coming soon.
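The F1-score that anchors these comparisons is the harmonic mean of precision and recall over the binary evaluation labels; a minimal sketch (the confusion-matrix counts below are invented for illustration, not PandaLM's actual results):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one evaluator scored against human labels
score = f1_score(tp=80, fp=10, fn=20)
```

Reporting a model's F1 as a percentage of GPT-3.5's F1 then reduces to a simple ratio of two such scores.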
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluation of end-to-end real clinical scenarios. These benchmark limitations in turn obstruct progress on LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Mobile On-Device LLM market size reached USD 1.62 billion in 2024, demonstrating robust momentum driven by surging demand for privacy-centric and real-time AI applications. The market is projected to expand at a CAGR of 29.4% during the forecast period, with the total market size anticipated to reach USD 14.13 billion by 2033. This remarkable growth trajectory is primarily attributed to the rapid proliferation of AI-powered mobile devices, increasing user awareness regarding data privacy, and continuous advancements in edge computing and model optimization techniques.
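Headline figures like these follow standard compound-growth arithmetic; a minimal sketch of the two formulas involved (the function names are ours, not the report's, and which base year and horizon the report compounds over are its own assumptions):

```python
def project(base: float, cagr: float, years: int) -> float:
    """Value after compounding `base` at annual rate `cagr` for `years` years."""
    return base * (1 + cagr) ** years

def implied_cagr(base: float, final: float, years: int) -> float:
    """Annual growth rate implied by a start value, an end value, and a horizon."""
    return (final / base) ** (1 / years) - 1
```

For example, `project(1.62, 0.294, n)` gives the projected market size (in USD billions) after `n` years of 29.4% annual growth from the 2024 base.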
One of the primary growth factors catalyzing the Mobile On-Device LLM market is the escalating demand for AI-driven functionalities that do not compromise user privacy. As consumers and enterprises become more vigilant about data breaches and regulatory compliance, on-device large language models (LLMs) offer a compelling solution by processing sensitive data locally rather than transmitting it to external servers. This capability not only minimizes latency and enhances user experience but also aligns with global data protection mandates such as GDPR and CCPA. Furthermore, the integration of LLMs directly into mobile hardware is enabling a new generation of smart applications—from personalized virtual assistants to advanced text generation—fueling widespread adoption across both consumer and enterprise segments.
Technological advancements in model compression, quantization, and hardware acceleration are also pivotal in driving the market forward. The evolution of small and medium-sized LLMs, tailored for resource-constrained environments like smartphones and wearables, has dramatically improved inference efficiency without sacrificing performance. Leading semiconductor manufacturers are embedding AI accelerators within chipsets, empowering devices to handle complex natural language processing (NLP) tasks in real time. This synergy between hardware and software is reducing power consumption, extending battery life, and making sophisticated AI capabilities accessible even on mid-tier devices. As a result, the addressable market for on-device LLMs is rapidly expanding beyond flagship smartphones to encompass tablets, wearables, and a diverse array of IoT endpoints.
Another significant growth driver is the surge in demand for hyper-personalized experiences across applications such as content recommendation, predictive text, translation, and contextual search. On-device LLMs enable seamless, always-available AI services that adapt to individual user preferences without persistent internet connectivity. This is particularly valuable in regions with unreliable network infrastructure or stringent data localization requirements. Additionally, enterprises are leveraging on-device AI to enhance productivity, automate workflows, and strengthen endpoint security, further accelerating market penetration. As organizations across healthcare, education, and retail sectors invest in digital transformation, the scope for on-device LLM deployment is set to broaden considerably over the coming years.
From a regional perspective, the Asia Pacific region is emerging as a dominant force in the Mobile On-Device LLM market, driven by rapid smartphone adoption, burgeoning digital ecosystems, and a thriving manufacturing base for consumer electronics. North America and Europe are also witnessing strong uptake, propelled by high consumer spending, robust enterprise digitalization, and a favorable regulatory environment for AI innovation. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by increasing investments in mobile infrastructure and growing awareness of the benefits of on-device AI. The interplay of these regional trends is shaping a highly dynamic and competitive global market landscape.
The Mobile On-Device LLM market is segmented by model type into Small Language Models, Medium Language Models, and Large Language Models, each catering to distinct device capabilities and application requirements. Small Language Models (SLMs), typically comprising fewer than 1 billion parameters, are engineered for ultra-low latency and minimal resource consumption, making them ideal for wearables, entry-level smartphones, and IoT devices. Their compact size enables efficient operation even on devices with limited memory and processing power.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AI Agent Evasion Dataset Overview The AI Agent Evasion Dataset is a comprehensive collection of 1000 prompts designed to train and evaluate large language models (LLMs) against advanced attacks targeting AI-driven systems, such as chatbots, APIs, and voice assistants. It addresses vulnerabilities outlined in the OWASP LLM Top 10, including prompt injection, data leakage, and unauthorized command execution. The dataset balances 70% malicious prompts (700 entries) with 30% benign prompts (300… See the full description on the dataset page: https://huggingface.co/datasets/darkknight25/AI_Agent_Evasion_Dataset.
According to our latest research, the global Local LLM Inference Board (Robot) market size reached USD 2.18 billion in 2024, reflecting robust industry momentum. The market is projected to expand at a CAGR of 19.7% from 2025 to 2033, reaching a forecasted value of USD 10.65 billion by 2033. This remarkable growth is driven by advancements in edge AI hardware, the proliferation of intelligent robotics across industries, and the increasing demand for real-time, on-device large language model (LLM) inference. The market’s upward trajectory is further fueled by the convergence of artificial intelligence, robotics, and next-generation computing platforms, creating substantial opportunities for both established players and innovative startups.
The accelerating integration of artificial intelligence into robotics is a primary growth factor for the Local LLM Inference Board (Robot) market. As industries demand higher autonomy, responsiveness, and intelligence from robotic systems, the need for robust, low-latency, and power-efficient LLM inference at the edge is intensifying. Local LLM inference boards empower robots to process and understand complex language-based tasks in real time, without relying on cloud connectivity. This capability is crucial for applications such as collaborative industrial robots, healthcare assistants, and autonomous vehicles, where latency, privacy, and reliability are paramount. The rapid evolution of transformer models and efficient AI chipsets has made it feasible to deploy sophisticated LLMs directly on robotic hardware, further accelerating market adoption.
Another significant driver is the growing emphasis on data privacy, security, and compliance across regulated sectors such as healthcare, manufacturing, and automotive. Local LLM inference boards enable organizations to process sensitive data on-premises or at the edge, minimizing the risk of data breaches and ensuring compliance with stringent data protection regulations. This is particularly critical in healthcare robotics, where patient data confidentiality is non-negotiable, and in manufacturing environments, where intellectual property and operational data must remain secure. The ability to deliver advanced AI-powered functionalities without transmitting data to external servers is a compelling value proposition, positioning local inference solutions as a preferred choice for enterprises with strict privacy requirements.
The market’s expansion is also being propelled by advancements in hardware acceleration technologies and the growing ecosystem of software frameworks optimized for on-device LLM deployment. The emergence of specialized AI inference boards, featuring high-performance GPUs, TPUs, and NPUs, has significantly improved the efficiency and scalability of local LLM processing. Additionally, the availability of robust software stacks, model compression techniques, and toolchains designed for edge deployment has lowered the barriers for integrating LLMs into diverse robotic platforms. This synergy between hardware and software innovation is catalyzing the development of next-generation robots capable of natural language understanding, contextual reasoning, and adaptive interaction in dynamic environments.
Regionally, Asia Pacific is emerging as the dominant market for Local LLM Inference Boards, driven by the rapid adoption of robotics in manufacturing, logistics, and consumer electronics. North America and Europe are also witnessing strong growth, fueled by technological innovation, robust R&D investments, and early adoption across healthcare and automotive sectors. The Middle East & Africa and Latin America are gradually catching up, supported by government initiatives and increasing investments in smart automation. The regional landscape is characterized by diverse application scenarios, regulatory frameworks, and ecosystem maturity, shaping the competitive dynamics and growth opportunities for market participants.
How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
The performance of large language models on programming tasks is impressive, but many datasets suffer from data leakage, particularly benchmarks like HumanEval and MBPP. To tackle this, we introduce the XCoder-Complexity-Scorer, which controls code instruction-tuning data quality across three key dimensions: instruction complexity, response quality, and diversity. We also train a Unit Test… See the full description on the dataset page: https://huggingface.co/datasets/banksy235/XCoder-80K.
SciEval is a comprehensive and multi-disciplinary evaluation benchmark designed to assess the performance of large language models (LLMs) in the scientific domain. It addresses several critical issues related to evaluating LLMs for scientific research.
Here are the key features of SciEval:
Multi-Dimensional Evaluation: SciEval systematically evaluates scientific research ability across four dimensions based on Bloom's taxonomy. These dimensions cover various aspects of scientific understanding and reasoning.
Objective and Subjective Questions: Unlike existing benchmarks that primarily rely on pre-collected objective questions, SciEval includes both objective and subjective questions. This approach ensures a more comprehensive evaluation of LLMs' abilities.
Dynamic Subset: To prevent potential data leakage, SciEval introduces a "dynamic" subset based on scientific principles. This subset dynamically adapts to evaluate LLMs' performance without compromising the integrity of the evaluation process.