Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In order to raise the bar for non-English QA, we are releasing a high-quality, human-labeled German QA dataset consisting of 13,722 questions, including a three-way annotated test set. The creation of GermanQuAD is inspired by insights from existing datasets as well as our labeling experience from several industry projects. We combine the strengths of SQuAD, such as high out-of-domain performance, with self-sufficient questions that contain all relevant information for open-domain QA, as in the NaturalQuestions dataset. Unlike in other popular datasets, our training and test sets do not overlap, and they include complex questions that cannot be answered with a single entity or only a few words.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The dataset can be downloaded from the original dataset page.
English text classification datasets are common. Examples for topic classification are the large AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets; examples for sentiment analysis are the commonly used IMDb and Yelp datasets. Non-English datasets, especially German ones, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between English and German, a classifier might be effective on an English dataset but less effective on a German one. German is more highly inflected than English, and long compound words are far more common. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus, each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise.
The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label.
As a result, the dataset can be used for multi-class classification.
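The label rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's extraction script; the function name `topic_label` is hypothetical, and the sketch assumes topic paths are plain slash-separated strings like the example given.

```python
def topic_label(topic_path: str) -> str:
    """Return the second component of a slash-separated topic path.

    Hypothetical helper: assumes the path looks like the example in the
    text, e.g. "Newsroom/Wirtschaft/...".
    """
    parts = topic_path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {topic_path!r}")
    return parts[1]


path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
print(topic_label(path))  # Wirtschaft
```

Raising on paths with fewer than two components is a design choice of this sketch; how the original scripts treat such articles is not stated.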
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset, I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1678 articles, while the smallest class, Kultur, contains only 539 articles. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second most words.
I propose a stratified split of 10% for testing and the remaining articles for training.
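To illustrate what a stratified 10% split means here, the following sketch samples the test portion separately within each class, so that class proportions are roughly preserved. It uses only the standard library and is not the script shipped with the project; the function name, the seed, and the per-class rounding rule are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices), drawing test_fraction of each class.

    Hypothetical helper; seed and rounding are illustrative choices.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels reusing the two class sizes quoted in the text (Web: 1678, Kultur: 539).
labels = ["Web"] * 1678 + ["Kultur"] * 539
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

For benchmark comparisons, the published train.csv and test.csv files should be used rather than a fresh split, since a different seed or rounding rule yields a different test set.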
To use the dataset as a benchmark, please use the train.csv and test.csv files located in the project root.
Python scripts to extract the articles and split them into a training and a test set are available in the code directory of this project.
Make sure to install the requirements.
The original corpus.sqlite3 is required to extract the articles (it can be downloaded in compressed or uncompressed form).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset of local, state, and federal election results in Germany, facilitating research on electoral behavior, representation, and political responsiveness.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality German Call-Center and IVR Dataset for AI & Speech Models
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
German Traffic Sign Recognition is a dataset for object detection tasks - it contains Traffic Signs annotations for 1,102 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers in the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which makes the two datasets compatible for studying transfer learning between them. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. Each of the 500 different spoken words, ranging from 4 to 18 characters in length, has 500 instances and separate MPEG-4 audio and text metadata files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are included. The size of the uncompressed dataset is 16 GB.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Public Domain Newspapers (German)
This dataset contains 13 billion words of OCR text extracted from German historical newspapers.
Dataset Details
Dataset Description
Curated by: Sebastian Majstorovic
Language(s) (NLP): German
License: Dataset: CC0, Texts: Public Domain
Dataset Sources
Repository: https://www.deutsche-digitale-bibliothek.de/newspaper
Copyright & License
The newspapers texts have been… See the full description on the dataset page: https://huggingface.co/datasets/storytracer/German-PD-Newspapers.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The German General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world German usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level German conversations covering a broad spectrum of everyday topics.
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native German speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure.
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level German usage.
Every chat instance is accompanied by structured metadata.
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
All chat records pass through a rigorous QA process to maintain consistency and accuracy.
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is ideal for training and evaluating a wide range of text-based AI systems.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality German Wake Word Dataset for AI & Speech Models. Title: German Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word / Keyphrase collection of…
1,796 Hours – German Speech Dataset by Mobile Phone. The recordings are monologues based on given scripts, covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers and other domains. They are transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (3,442 native German speakers in total), which enhances model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are all GDPR, CCPA, and PIPL compliant.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of the project was to identify and compile the best available historical time series for Germany, and to complement or update them at reasonable expense. Time series were only to be included if data for the entire period from 1834 to 2012 was at least theoretically available. An integral aspect of the project's concept is the combination of data with critical commentaries on the time series by established expert scientists. The following themes are covered (authors in parentheses):
1. Environment, Climate, and Nature (Paul Erker)
2. Population, Households, Families (Georg Fertig/Franz Rothenbacher)
3. Migration (Jochen Oltmer)
4. Education and Science (Volker Müller-Benedict)
5. Health Service (Reinhard Spree)
6. Social Policy (Marcel Boldorf)
7. Public Finance and Taxation (Mark Spoerer)
8. Political Participation (Marc Debus)
9. Crime and Justice (Dietrich Oberwittler)
10. Work, Income, and Standard of Living (Toni Pierenkemper)
11. Culture, Tourism, and Sports (Heike Wolter/Bernd Wedemeyer-Kolwe)
12. Religion (Thomas Großbölting/Markus Goldbeck)
13. National Accounts (Rainer Metz)
14. Prices (Rainer Metz)
15. Money and Credit (Richard Tilly)
16. Transport and Communication (Christopher Kopper)
17. Agriculture (Michael Kopsidis)
18. Business, Industry, and Craft (Alfred Reckendrees)
19. Building and Housing (Günther Schulz)
20. Trade (Markus Lampe/Nikolaus Wolf)
21. Balance of Payments (Nikolaus Wolf)
22. International Comparisons (Herman de Jong/Joerg Baten)
Basically, the structure of the dataset is guided by the tables in the print publication by the Federal Agency. The print publication allows for four to eight tables for each of the 22 chapters, which means the data record is correspondingly made up of 120 tables in total. The inner structure of the dataset is a consequence of a German idiosyncrasy: the numerous territorial changes. To account for this idiosyncrasy, we decided on a four-fold data structure.
Four territorial units with their respective data are therefore differentiated in each table in separate columns:
A: German Confederation/Customs Union/German Reich (1834-1945)
B: German Federal Republic (1949-1989)
C: German Democratic Republic (1949-1989)
D: Germany since reunification (since 1990)
Years in parentheses should be considered a guideline only. It is possible that series for the territory of the old Federal Republic or the new federal states are continued after 1990, or that all-German data from before 1990 were available or were reconstructed.
All time series are identified by a distinct ID consisting of an "x" and a four-digit number (with leading zeros for numbers under 1000). The time series that exclusively contain GDR data are identified with a "c" prefix instead of the "x".
For the four territorial units, the time series are arranged in four blocks side by side within the XLSX files. That means: first come all time series for the territory and the period of the Customs Union and German Reich; the next columns contain, side by side, all time series for the territory of the German Federal Republic / the old federal states; then, if available, those for the territory of the German Democratic Republic / the new federal states; and finally those for reunified Germany. There is at most one row for each year. Years can be missing if no data for the respective year are available in any of the table's time series, but no year will appear twice. The four territorial units and the resulting time periods cause a "stepwise" appearance of the data tables.
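The ID scheme described above (an "x" or "c" prefix followed by a zero-padded four-digit number) can be sketched as a pair of helper functions. This is only an illustration of the stated format; the function names are hypothetical and nothing here is taken from the dataset's own tooling.

```python
import re

# One prefix letter ("x" for regular series, "c" for GDR-only series)
# followed by exactly four ASCII digits.
SERIES_ID = re.compile(r"([xc])([0-9]{4})")


def make_series_id(number: int, gdr_only: bool = False) -> str:
    """Build an ID such as "x0042" (hypothetical helper)."""
    if not 0 <= number <= 9999:
        raise ValueError(f"series number out of range: {number}")
    prefix = "c" if gdr_only else "x"
    return f"{prefix}{number:04d}"


def parse_series_id(series_id: str):
    """Return (number, gdr_only) for a well-formed ID, else raise ValueError."""
    match = SERIES_ID.fullmatch(series_id)
    if match is None:
        raise ValueError(f"not a valid series ID: {series_id!r}")
    prefix, digits = match.groups()
    return int(digits), prefix == "c"


print(make_series_id(42))        # x0042
print(parse_series_id("c0007"))  # (7, True)
```

Such a parser could be used, for example, to separate GDR-only columns from the rest when reading the XLSX tables, assuming the column headers carry these IDs.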
If you find anything missing, unclear, incomprehensible, improvable, etc., please contact me (kontakt@deutschland-in-daten.de).
Further reading:
Rahlf, Thomas, The German Time Series Dataset 1834-2012, in: Journal of Economics and Statistics 236/1 (2016), pp. 129-143. DOI: 10.1515/jbnst-2015-1005
Rahlf, Thomas, Voraussetzungen für eine Historische Statistik von Deutschland (19./20. Jh.), in: Vierteljahrschrift für Sozial- und Wirtschaftsgeschichte 101/3 (2014), pp. 322-352.
Rahlf, Thomas (ed.), Dokumentation zum Zeitreihendatensatz für Deutschland, 1834-2012, Version 01 (= Historical Social Research Transition 26v01), Köln 2015. http://dx.doi.org/10.12759/hsr.trans.26.v01.2015
Rahlf, Thomas (ed.), Deutschland in Daten. Zeitreihen zur Historischen Statistik, Bonn: Bundeszentrale für Politische Bildung, 2015.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Germany recorded a current account surplus of EUR 14,774.93 million in July 2025. This dataset provides actual values, historical data, forecasts, charts, statistics, an economic calendar and news for Germany's current account.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 431 hours of telephone dialogues in German, collected from 590+ native speakers across various topics and domains, with a 95% sentence accuracy rate. It is designed for research in automatic speech recognition (ASR) systems.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in audio transcription and natural language processing (NLP).
The dataset contains diverse audio files that represent different accents and dialects, making it a comprehensive resource for training and evaluating recognition models.
- Audio files: High-quality recordings in WAV format
- Text transcriptions: Accurate and detailed transcripts for each audio segment
- Speaker information: Metadata on native speakers, such as gender
- Topics: Diverse domains such as general conversations and business
This dataset is essential for anyone looking to improve speech recognition technology and develop more effective automatic speech systems.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Traffic German is a dataset for object detection tasks - it contains Football Player Detection annotations for 6,523 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.icpsr.umich.edu/web/ICPSR/studies/43/terms
This data collection contains electoral data at the wahlkreis and staat levels for the Reichstag elections of 1871, 1874, 1877, 1878, 1881, 1884, 1890, 1893, 1898, 1903, 1907, and 1912. The variables for each election provide information on the votes cast for parties, including the Conservative Party, the German Empire Party, the National-Liberals, the Liberal Empire Party, the People's Party, the Social Democrats, the Progress Party, the Catholic Center, the Particularists, the Poles Party, the Protest Party, the Antisemites, the Free-thinking People's Party, the German Reform Party, the Farmers' Union, the Peasants' Union, and splinter parties. Data are also provided on the total population in 1871 and every fifth year between 1875 and 1910, and the proportions of Protestants and of Catholics in the total population for 1871, 1875, 1880, 1885, 1890, 1905, and 1910. Additional variables provide information on the number of eligible voters, valid and invalid votes cast, and voter turnout.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The WMT 2014 English-German dataset is a cornerstone resource for researchers developing and evaluating machine translation (MT) systems. It's widely used in the annual WMT shared task, serving as a standard benchmark to compare different approaches and track progress in the field.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Terms of trade in Germany increased to 103.90 points in July 2025 from 103.60 points in June 2025. This dataset provides actual values, historical data, forecasts, charts, statistics, an economic calendar and news for Germany's terms of trade.