As of mid-February 2025, the Chinese AI chatbot DeepSeek had around ** million daily active users. When DeepSeek released its research paper illustrating the capabilities of their chatbot, a global audience became aware of the company. As a result, the number of daily active users skyrocketed.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
A chatbot from Chinese AI lab DeepSeek sent shockwaves through the market in January, due to its ability to perform mathematics, coding and reasoning at a similar level to ChatGPT and other top-tier...
In January 2025, deepseek.com attracted a total of 278 million visits. Male users accounted for over two-thirds of them. Having developed its advanced large language model at a fraction of the usual cost, the Chinese company DeepSeek has rapidly emerged as a significant player in the global AI industry. Its chatbot app hit 20 million daily active users in just three weeks.
At the end of January 2025, DeepSeek recorded a spike in its web traffic after media reports highlighted the company's efficient and affordable large language model, which disrupted the global AI landscape. Younger internet users have exhibited strong enthusiasm. That month, around 57 percent of deepseek.com's visitors were between 18 and 34 years old.
At the end of January 2025, the Chinese AI company DeepSeek made global headlines with its cost-effective large language model (LLM), which rivals industry leaders like OpenAI's GPT-4o, sending shockwaves through the global tech community. The web traffic of deepseek.com surged to 278 million visits on desktop and mobile in January 2025, compared to only 12 million visits in the previous month. The company's home country contributed almost a quarter of the desktop traffic, followed by the United States and Brazil.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
grammar-correction-deepseek-v9-10k
Grammar correction dataset using DeepSeek v9 with GPT prompts for training conversational models
Dataset Description
This dataset contains conversational data for grammar correction tasks, with system prompts, user inputs, and assistant responses.
Dataset Structure
Each example contains:
messages: List of conversation messages with roles (system/user/assistant) and content
source: Source identifier for the dataset… See the full description on the dataset page: https://huggingface.co/datasets/stimuler/grammar-correction-deepseek-v9-10k.
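As a rough illustration of the structure described above, the following Python sketch builds one record with a messages list and a source field and pulls out the user input and the assistant correction. The record contents, the "example-source" value, and the helper name by_role are hypothetical; only the field names and the three roles come from the description.

```python
# Hypothetical record: the sentences and the source value are invented for illustration.
example = {
    "messages": [
        {"role": "system", "content": "Correct the grammar of the user's sentence."},
        {"role": "user", "content": "She go to school yesterday."},
        {"role": "assistant", "content": "She went to school yesterday."},
    ],
    "source": "example-source",
}


def by_role(record, role):
    """Return the contents of all messages with the given role, in order."""
    return [m["content"] for m in record["messages"] if m["role"] == role]


user_inputs = by_role(example, "user")
corrections = by_role(example, "assistant")
print(user_inputs[0], "->", corrections[0])
```

Keeping the lookup keyed on the role rather than on list position should keep it working for records that omit the system message.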
https://choosealicense.com/licenses/cc/
R1 Dataset Collection
Aggregated high-quality English prompts and model-generated responses from DeepSeek R1 and DeepSeek R1-0528.
Dataset Summary
The R1 Dataset Collection combines multiple public DeepSeek-generated instruction-response corpora into a single, cleaned, English-only JSONL file. Each example consists of a <|user|> prompt and a <|assistant|> response in one "text" field. This release includes:
~21,000 examples from the DeepSeek-R1-0528 Distilled Custom… See the full description on the dataset page: https://huggingface.co/datasets/Hugodonotexit/math-code-science-deepseek-r1-en.
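The single-field format described above can be split back into a prompt and a response. The Python sketch below is a minimal example that assumes each "text" value holds exactly one <|user|> turn followed by one <|assistant|> turn; the exact whitespace around the markers, the sample line, and the function names split_example and read_jsonl_lines are assumptions, not taken from the dataset card.

```python
import json

USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"


def split_example(text):
    """Split one "text" field into (prompt, response).

    Assumes a single user turn followed by a single assistant turn.
    """
    _, found_user, rest = text.partition(USER_TAG)
    prompt, found_assistant, response = rest.partition(ASSISTANT_TAG)
    if not found_user or not found_assistant:
        raise ValueError("expected a user tag followed by an assistant tag")
    return prompt.strip(), response.strip()


def read_jsonl_lines(lines):
    """Yield (prompt, response) pairs from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield split_example(json.loads(line)["text"])


# Invented sample line in the assumed layout.
sample = ['{"text": "<|user|> What is 2 + 2? <|assistant|> 4"}']
pairs = list(read_jsonl_lines(sample))
print(pairs)
```

To read the real file, the same generator could be fed an open file handle instead of the sample list.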
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
DAG-Reasoning-DeepSeek-R1-0528 is a dataset focused on analysis and reasoning, creating directed acyclic graphs testing the limits of DeepSeek R1 0528's graph-reasoning skills! This dataset contains:
4.08k synthetically generated prompts to create directed acyclic graphs in response to user input, with all responses generated using DeepSeek R1 0528. All responses contain a multi-step thinking process to perform effective… See the full description on the dataset page: https://huggingface.co/datasets/sequelbox/DAG-Reasoning-DeepSeek-R1-0528.
Synthetic DeepSeek Emotional Support - Multi-Turn
100% synthetic emotional support dataset generated by R1 0528 with a new method I'm working on. This is more than just two instances of DeepSeek talking to each other - this new method allows much more genuine and realistic user responses. For now I used this method to generate emotional support conversations but it can easily be applied to other fields in the future. I plan to open-source the entire framework soon - stay tuned.… See the full description on the dataset page: https://huggingface.co/datasets/mrfakename/deepseek-synthetic-emotional-support.
China's most popular search engine, Baidu, has leveraged AI capabilities to cement its dominant status. In March 2025, Baidu AI reportedly had over 290 million monthly active users. Douyin's AI search feature outperformed DeepSeek in terms of monthly active user size.
In January 2025, ChatGPT was the most downloaded generative AI mobile app worldwide, with over 40.5 million downloads. DeepSeek ranked second, with 17.6 million downloads, while the app's domestic version for the Chinese market ranked fifth and added 7.8 million downloads to the AI brand. Google Gemini ranked third with approximately 10 million downloads worldwide during January 2025.
As of February 2025, the leading generative artificial intelligence (AI) smartphone app in South Korea was ChatGPT, with almost 3.9 million monthly users. The Chinese-developed service DeepSeek-R1 still featured on the list despite a ban on new downloads from the South Korean government.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain mixes of comments and literals written in English and the language used to formulate the prompt.
This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.
The data directory contains five subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:
prompt_translation: Contains files with manually translated prompts for each language. Each file is associated with both Python and Java. The structure of each file is as follows:
  id: The ID of the query in the CoderEval benchmark.
  prompt: The original English prompt.
  summary: The original summary.
  code: The original code.
  translation: The translation generated by GPT.
  correction: The manual correction of the GPT-generated translation.
  correction_tag: A list of tags indicating the corrections made to the translation.
  generated_code: This column is initially empty and will contain the code generated from the translated prompt.
generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains the following:
  A file (java_chinese_claude.csv) containing the generated code in the corresponding column.
tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.
quantitative_analysis: Contains all the CSV reports of the static analysis tools and test output for all languages and models. These files are the inputs for the statistical analysis. The directory stats contains all the output tables for the statistical analysis, which are shown in the paper's tables.
qualitative_analysis: Contains files used for the qualitative analysis:
  id: The ID of the query in the CoderEval benchmark.
  generated_code: The code generated by the model.
  comments: The language used for comments.
  identifiers: The language used for identifiers.
  literals: The language used for literals.
  notes: Additional notes.
ablation_study: Contains files for the ablation study. Each file has the following columns:
  id: The ID of the query in the CoderEval benchmark.
  prompt: The prompt used for code generation.
  generated_code, comments, identifiers, and literals: Same as in the qualitative analysis.
results.pdf: This file shows the table containing all the percentages of comments, identifiers and literals extracted from the csv files of the ablation study.
Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:
You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature).
Use a Python code block to write your response.
Comments and identifiers must be in Italian.
For example:
```python
print("Hello World!")
```
The scripts directory contains all the scripts used to perform the generations and analyses. All files are commented. Here is a brief description of each file:
code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.
computeallanalysis.py: This script performs static code analysis on generated code files using different models, languages, and programming languages. It runs various analyses (Flake8, Pylint, Lizard) depending on the programming language: for Python, it runs all three analyses, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script ensures the creation of necessary directories and handles any errors that occur during the analysis process.
createtestjava.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server. It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.
deepseek_model.py: This function sends a request to the DeepSeek API, passing a system and user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.
extractpmdreport.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.
flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.
generatepredictionclaude_java.py: The generatecodefrom_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it's JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.
generatepredictionclaude_python.py: This code defines a function generatecodefrom_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are
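Several of the scripts described above need to pull a code snippet out of a model reply that wraps its answer in a fenced code block, as the system prompt requests. The Python sketch below shows one way this could be done with a regular expression. It is not the authors' implementation: the function name extract_code, the choice to return the first matching block, and the sample reply are all assumptions made for illustration.

```python
import re

# The fence marker is built programmatically so this example can itself
# be shown inside a fenced block.
FENCE = "`" * 3


def extract_code(response, language):
    """Return the body of the first fenced block tagged with `language`, or None."""
    pattern = re.compile(
        re.escape(FENCE) + re.escape(language) + r"[ \t]*\n(.*?)" + re.escape(FENCE),
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(response)
    return match.group(1).strip() if match else None


# Invented reply in the shape the system prompt asks for.
reply = "Here is the code:\n" + FENCE + 'python\nprint("Hello World!")\n' + FENCE
print(extract_code(reply, "python"))
print(extract_code("no code here", "python"))
```

Returning None when no block is found mirrors the behavior described for deepseek_model.py, so a caller can treat a missing snippet as a failed generation.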
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Integration of publicly available datasets related to R1
We integrate all data and remove contaminated data and data with inconsistent formats. By default, version 'V1' is selected, with a total of 2592286 samples.
1 Relevant datasets mentioned in HuggingFace/open_r1:
(1) HuggingFaceH4/numina-deepseek-r1-qwen-7b: A dataset distilled using DeepSeek-R1-Distill-Qwen-7B. Hugging Face downloads: 631.
(2) AI-MO/NuminaMath-TIR: A subset of 70K math-related samples… See the full description on the dataset page: https://huggingface.co/datasets/xiushenghuang/open_r1_dataset.
Bespoke-Stratos-17k-DeepSeekrized
Created by: Seungwoo Ryu
Introduction
This dataset is a modified version of the original HuggingFaceH4/Bespoke-Stratos-17k dataset, reformatted to match the output format of DeepSeek models.
Modifications
The user and assistant fields from the original dataset's messages have been moved to user_modified and agent_modified respectively. The content in the agent_modified field has been transformed to match the DeepSeek model's… See the full description on the dataset page: https://huggingface.co/datasets/tryumanshow/Bespoke-Stratos-17k-DeepSeekrized.
Comparison by model of seconds to output 500 tokens, including reasoning model 'thinking' time; lower is better
In February 2025, ChatGPT was the most popular artificial intelligence (AI) application worldwide, with over 400.61 million monthly active users (MAU). The ByteDance-owned chatbot Doubao had around 81.91 million MAU, making it the most popular Chinese-based tool of this kind. ChatGPT-operated Nova Assistant ranked third with 62.79 million MAU and was followed by Chinese-based DeepSeek with around 61.81 million MAU.
In March 2025, ChatGPT’s mobile app recorded over 64.26 million App Store and Google Play downloads worldwide. Google's Gemini AI Assistant mobile app was released on February 8, 2024, and was initially available in the U.S. market only. In the same month, the app registered around 13.92 million downloads.
Regional preferences shape AI app adoption
ChatGPT has a strong global presence with over 400.61 million monthly active users in February 2025, but regional preferences vary. In the United States, ChatGPT had a 45 percent download market share, compared to Google Gemini's 11 percent. However, Gemini emerged as the preferred generative AI app in India, representing a 52 percent market share. This competitive landscape now also includes Chinese-based players like ByteDance's Doubao and DeepSeek, indicating an even more diverse and evolving AI worldwide ecosystem.
The AI-powered revolution in online search
The global AI market has experienced substantial growth, exceeding 184 billion U.S. dollars in 2024 and projected to surpass 826 billion U.S. dollars by 2030. This expansion is mirrored in user behavior, with around 15 million adults in the United States using AI-powered tools as their first option for online search in 2024. Additionally, 68 percent of U.S. adults reported the use of AI-powered search engines for exploring new topics in 2024, with another 44 percent of respondents utilizing these tools to learn or explain concepts.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for XuHu6736/s1_54k_filter_with_isreasoning
Dataset Description
XuHu6736/s1_54k_filter_with_isreasoning is an enhanced version of the XuHu6736/s1_54k_filter dataset. This version includes additional annotations to assess the suitability of each question for reasoning training. These annotations, isreasoning_score and isreasoning, were generated using the deepseek-v3 model. The purpose of these new fields is to allow users to filter, weight, or specifically… See the full description on the dataset page: https://huggingface.co/datasets/XuHu6736/s1_54k_filter_with_isreasoning.
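The filtering and weighting use case mentioned above might look like the following Python sketch. The dataset card excerpt does not state the field types, so this example assumes isreasoning is a boolean and isreasoning_score is a number where higher means more suitable. The rows, the 0.5 threshold, and the function names are hypothetical.

```python
# Hypothetical rows; the field types and values are assumptions for illustration.
rows = [
    {"question": "q1", "isreasoning_score": 0.9, "isreasoning": True},
    {"question": "q2", "isreasoning_score": 0.2, "isreasoning": False},
    {"question": "q3", "isreasoning_score": 0.6, "isreasoning": True},
]


def filter_for_reasoning(rows, min_score=0.5):
    """Keep rows flagged as reasoning questions whose score meets the threshold."""
    return [
        r for r in rows
        if r["isreasoning"] and r["isreasoning_score"] >= min_score
    ]


def score_weights(rows):
    """Return sampling weights proportional to isreasoning_score (summing to 1)."""
    total = sum(r["isreasoning_score"] for r in rows)
    if total == 0:
        return [0.0 for _ in rows]
    return [r["isreasoning_score"] / total for r in rows]


kept = filter_for_reasoning(rows)
weights = score_weights(kept)
print([r["question"] for r in kept], weights)
```

Filtering first and weighting afterwards keeps low-scoring questions from receiving any sampling mass at all; weighting the full set instead would be the softer alternative.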
Comparison by model of the average of coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench & SciCode)