Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".
For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
This dataset requires use of a cost matrix:
| | Good | Bad |
|---|---|---|
| Good | 0 | 1 |
| Bad | 5 | 0 |
The rows represent the actual classification and the columns the predicted classification.
It is worse to classify a customer as good when they are bad (cost 5) than to classify a customer as bad when they are good (cost 1).
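As a minimal sketch of how the cost matrix above might be applied, the following Python snippet scores a set of predictions and picks the cheaper label from an estimated probability of "bad". The function names, the label strings, and the sample predictions are illustrative assumptions, not part of the dataset.

```python
# Cost matrix from the description: key = (actual class, predicted class).
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,
    ("bad", "good"): 5,
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


def min_cost_label(p_bad):
    """Return the label with the lower expected cost, given P(customer is bad)."""
    cost_if_predict_good = p_bad * COST[("bad", "good")]
    cost_if_predict_bad = (1 - p_bad) * COST[("good", "bad")]
    return "good" if cost_if_predict_good < cost_if_predict_bad else "bad"


# Made-up example labels, for illustration only.
actual = ["good", "bad", "bad", "good"]
predicted = ["good", "good", "bad", "bad"]
print(total_cost(actual, predicted))  # 0 + 5 + 0 + 1 = 6
```

Under these costs, predicting "good" is only cheaper when the estimated probability of "bad" is below 1/6, which is why a plain 0.5 threshold is usually not appropriate here.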
https://creativecommons.org/publicdomain/zero/1.0/
This dataset encapsulates the financial records of 12 leading German companies, presenting a detailed compilation of quarterly data from 2017 to 2024. The dataset features top-tier corporations such as Volkswagen AG, Siemens AG, Allianz SE, BMW AG, BASF SE, Deutsche Telekom AG, Daimler AG, SAP SE, Bayer AG, Deutsche Bank AG, Porsche AG, and Merck KGaA. It offers an extensive view into critical financial metrics, fostering comprehensive analysis and modeling of corporate financial health, performance trends, and growth trajectories.
Designed for various analytical applications, this dataset is an invaluable resource for financial forecasting, risk analysis, profitability assessment, and performance benchmarking. Each record represents a single quarter's financial snapshot for a specific company, enabling users to conduct robust time-series analysis and cross-sectional evaluations. The dataset provides granular insights into revenue generation, profitability, asset management, and financial leverage, supporting informed decision-making and strategic planning.
Data Fields and Their Significance:
Company: This field identifies the company associated with the financial data, such as "Volkswagen AG" or "Siemens AG." It categorizes the data for cross-company comparisons and trend analysis of individual organizations.
Period: Representing the specific quarter by its end date in year-month-day format (e.g., "2017-03-31" for Q1 2017), this field is critical for tracking temporal trends in financial performance, allowing users to analyze year-over-year or quarter-over-quarter changes.
Revenue: Captured in billions of Euros, revenue reflects the total sales performance of a company for the given quarter. It provides insights into the company’s market reach and the demand for its products or services during each period.
Net Income: Expressed in billions of Euros, net income denotes the company’s profit after all expenses for the quarter. This metric is a cornerstone of profitability analysis, reflecting the financial efficiency and success of operational strategies.
Liabilities: Recorded in billions of Euros, liabilities represent the total debt and obligations of a company for a specific quarter. This data is essential for understanding the company’s financial leverage and assessing its exposure to financial risks.
Assets: Assets, measured in billions of Euros, encompass all resources owned by a company with economic value. This metric reflects the scale and capacity of the company’s operations and investments, serving as a benchmark for evaluating organizational size and financial resourcefulness.
Equity: Equity is calculated as Assets minus Liabilities and is expressed in billions of Euros. This metric represents the residual value available to shareholders, offering insights into financial stability and value creation within the organization.
ROA (Return on Assets): ROA, expressed as a percentage, is derived from the formula (Net Income / Assets) × 100. It measures the company’s ability to generate profit from its assets, providing a lens into operational efficiency.
ROE (Return on Equity): Calculated as (Net Income / Equity) × 100 and expressed as a percentage, ROE highlights the profitability of a company from shareholders' investments, serving as a key performance indicator.
Debt to Equity Ratio: This ratio, representing the proportion of Liabilities to Equity, sheds light on the company’s capital structure. It is crucial for understanding financial leverage, revealing the balance between debt financing and shareholder equity in the company's operations.
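The derived fields described above (Equity, ROA, ROE, Debt to Equity Ratio) can be recomputed from the base fields. The sketch below assumes all inputs are in billions of Euros; the function name, dictionary keys, and the sample figures are hypothetical and not taken from the dataset.

```python
def derived_metrics(net_income, assets, liabilities):
    """Recompute the derived fields from the base fields (all in billions of EUR)."""
    equity = assets - liabilities  # Equity = Assets - Liabilities
    if assets == 0 or equity == 0:
        raise ValueError("ratios are undefined when assets or equity is zero")
    return {
        "equity": equity,
        "roa_pct": net_income / assets * 100,    # (Net Income / Assets) x 100
        "roe_pct": net_income / equity * 100,    # (Net Income / Equity) x 100
        "debt_to_equity": liabilities / equity,  # Liabilities / Equity
    }


# Made-up quarter: 2 bn net income, 100 bn assets, 60 bn liabilities.
m = derived_metrics(net_income=2.0, assets=100.0, liabilities=60.0)
print(m)
```

Recomputing these values is a cheap consistency check before using the published ratio columns in a model.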
This comprehensive dataset is tailored to meet the needs of analysts, researchers, and industry professionals, facilitating in-depth studies and decision-making processes. By encompassing a diverse range of financial metrics over an extended time frame, it provides a rich foundation for examining the dynamics of corporate performance in one of the world's most robust economies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets on this page were obtained by asking human subjects to assign a similarity or relatedness judgment to a number of German word pairs. The datasets have been used to test the performance of semantic similarity/relatedness measures. All subjects in our experiments were native speakers of German. A judgment of 0 means “fully unsimilar/unrelated”, while a score of 4 means “fully similar/related”. In the comma-separated dataset files, each word pair is on a single line followed by the mean judgment score and the standard deviation.
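A minimal sketch of reading such a file, assuming each line has exactly four comma-separated fields (first word, second word, mean judgment, standard deviation) and that scores use a decimal point. This column layout is an assumption, and the sample rows are invented for illustration.

```python
import csv
import io


def read_pairs(handle):
    """Parse lines of the assumed form: word1,word2,mean,std."""
    rows = []
    for record in csv.reader(handle):
        if len(record) != 4:
            continue  # skip blank or malformed lines
        word1, word2, mean, std = record
        try:
            rows.append((word1, word2, float(mean), float(std)))
        except ValueError:
            continue  # skip a header row or non-numeric scores
    return rows


# Invented sample rows on the 0-4 judgment scale.
sample = io.StringIO("Auto,Wagen,3.75,0.5\nBaum,Telefon,0.25,0.5\n")
pairs = read_pairs(sample)
most_related = max(pairs, key=lambda row: row[2])
print(most_related[:2])  # ('Auto', 'Wagen')
```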
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"GermanFakeNC" is a German Fake News Corpus comprising 490 texts retrieved from German alternative online media sources. Every fake statement in the texts was verified claim by claim against authoritative sources (e.g. local police authorities, scientific studies, the police press office, etc.). Most of the news items date from December 2015 to March 2018.
Steps to reproduce the data are described in the README file.
Please cite:
@inproceedings{TPDL_Vogel19,
  author = {Inna Vogel and Peter Jiang},
  title = {Fake News Detection with the New German Dataset "GermanFakeNC"},
  booktitle = {Digital Libraries for Open Knowledge - 23rd International Conference on Theory and Practice of Digital Libraries, {TPDL} 2019, Oslo, Norway, September 9-12, 2019, Proceedings},
  pages = {288--295},
  year = {2019},
  url = {https://doi.org/10.1007/978-3-030-30760-8\_25},
  doi = {10.1007/978-3-030-30760-8\_25},
}
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.
English text classification datasets are common. Examples include the large AG News, the class-rich 20 Newsgroups, and the large-scale DBpedia ontology datasets for topic classification, as well as the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between English and German, a classifier might be effective on an English dataset but less effective on a German one. German is more highly inflected than English, and long compound words are much more common. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus, each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise.
The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label.
As a result, the dataset can be used for multi-class classification.
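The labelling rule described above (take the second component of the topic path) can be sketched as follows. The function name is hypothetical and this is not the author's extraction script.

```python
def class_label(topic_path):
    """Return the second component of a '/'-separated topic path."""
    parts = topic_path.split("/")
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {topic_path!r}")
    return parts[1]


path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
print(class_label(path))  # Wirtschaft
```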
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1678 articles, while the smallest class, Kultur, contains only 539 articles. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second most words.
I propose a stratified split of 10% for testing and the remaining articles for training.
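To illustrate what a stratified 10% split means, here is a minimal standard-library sketch that holds out roughly 10% of each class. It is not the project's own split script; the function name, the seed, and the toy label list are assumptions, so it will not reproduce the published train.csv and test.csv.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices), sampling test_fraction per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):  # fixed class order keeps the split reproducible
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 30 "Web" articles and 10 "Kultur" articles.
labels = ["Web"] * 30 + ["Kultur"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 36 4
```

Sampling per class keeps the class proportions of the test set close to those of the full dataset, which matters here because the classes are imbalanced.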
To use the dataset as a benchmark dataset, please use the train.csv and test.csv files located in the project root.
Python scripts to extract the articles and split them into a training set and a test set are available in the code directory of this project.
Make sure to install the requirements.
The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
German Backtranslated Paraphrase Dataset
This is a dataset of more than 21 million German paraphrases. These are text pairs that have the same meaning but are expressed in different words. The paraphrases were sourced from various parallel German/English text corpora. The English texts were machine translated back into German to obtain the paraphrases. This dataset can be used, for example, to train semantic text embeddings. To do this, for example, SentenceTransformers and… See the full description on the dataset page: https://huggingface.co/datasets/deutsche-telekom/ger-backtrans-paraphrase.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
German License Plates is a dataset for object detection tasks - it contains License Plates annotations for 1,243 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Multilingual HateCheck
Dataset Description
Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection models in 10 different languages: Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish. For each language, there are 25+ functional tests that correspond to distinct types of hate and challenging non-hate. This allows for targeted diagnostic insights into model performance. For more… See the full description on the dataset page: https://huggingface.co/datasets/Paul/hatecheck-german.
https://www.icpsr.umich.edu/web/ICPSR/studies/43/terms
This data collection contains electoral data at the wahlkreis and staat levels for the Reichstag elections of 1871, 1874, 1877, 1878, 1881, 1884, 1890, 1893, 1898, 1903, 1907, and 1912. The variables for each election provide information on the votes cast for parties, including the Conservative Party, the German Empire Party, the National-Liberals, the Liberal Empire Party, the People's Party, the Social Democrats, the Progress Party, the Catholic Center, the Particularists, the Poles Party, the Protest Party, the Antisemites, the Free-thinking People's Party, the German Reform Party, the Farmers' Union, the Peasants' Union, and splinter parties. Data are also provided on the total population in 1871 and every fifth year between 1875 and 1910, and the proportions of Protestants and of Catholics in the total population for 1871, 1875, 1880, 1885, 1890, 1905, and 1910. Additional variables provide information on the number of eligible voters, valid and invalid votes cast, and voter turnout.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
German-English texts extracted from the website of the Federal Foreign Office Berlin. This includes 53,849 pairs that were translated between October 2013 and the beginning of November 2015 and converted into a .TMX file format.
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) actions SMART 2014/1074 and SMART 2015/1091. For further information on the project: http://lr-coordination.eu.
German (Germany) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts, covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (3,442 native German speakers in total), which is intended to improve model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets are GDPR-, CCPA-, and PIPL-compliant.
https://www.icpsr.umich.edu/web/ICPSR/studies/42/terms
This data collection contains electoral and demographic data at several levels of aggregation (kreis, land/regierungsbezirk, and wahlkreis) for Germany in the Weimar Republic period of 1919-1933. Two datasets are available. Part 1, 1919 Data, presents raw and percentage election returns at the wahlkreis level for the 1919 election to the Nationalversammlung. Information is provided on the number and percentage of eligible voters and the total votes cast for parties such as the German National People's Party, German People's Party, Christian People's Party, German Democratic Party, Social Democratic Party, and Independent Social Democratic Party. Part 2, 1920-1933 Data, consists of returns for elections to the Reichstag, 1920-1933, and for the Reichsprasident elections of 1925 and 1932 (including runoff elections in each year), returns for two national referenda, held in 1926 and 1929, and data pertaining to urban population, religion, and occupations, taken from the German Census of 1925. This second dataset contains data at several levels of aggregation and is a merged file. Crosstemporal discrepancies, such as changes in the names of the geographical units and the disappearance of units, have been adjusted for whenever possible. Variables in this file provide information for the total number and percentage of eligible voters and votes cast for parties, including the German Nationalist People's Party, German People's Party, German Center Party, German Democratic Party, German Social Democratic Party, German Communist Party, Bavarian People's Party, Nationalist-Socialist German Workers' Party (Hitler's movement), German Middle Class Party, German Business and Labor Party, Conservative People's Party, and other parties.
Data are also provided for the total number and percentage of votes cast in the Reichsprasident elections of 1925 and 1932 for candidates Jarres, Held, Ludendorff, Braun, Marx, Hellpach, Thalman, Hitler, Duesterburg, Von Hindenburg, Winter, and others. Additional variables provide information on occupations in the country, including the number of wage earners employed in agriculture, industry and manufacturing, trade and transportation, civil service, army and navy, clergy, public health, welfare, domestic and personal services, and unknown occupations. Other census data cover the total number of wage earners in the labor force and the number of female wage earners employed in all occupations. Also provided is the percentage of the total population living in towns with 5,000 inhabitants or more, and the number and percentage of the population who were Protestants, Catholics, and Jews.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
German Traffic Sign Recognition is a dataset for object detection tasks - it contains Traffic Signs annotations for 1,102 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Enhance your Conversational AI model with our Off-the-Shelf German Language Datasets. Shaip's high-quality audio datasets are a quick and effective solution for model training.
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as CMU Sphinx, ISIP, Julius (github) and HTK (note: HTK has distribution restrictions).
German (Germany) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics. Transcribed with text content, timestamp, speaker's ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (around 500 native speakers), which is intended to improve model performance on real-world, complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets are GDPR-, CCPA-, and PIPL-compliant.
In late May 1939, just three months before the Second World War began in Europe, Germany's workforce was made up of almost 25 million men, 15 million women, and a very small number of foreign workers. The share of German men in the workforce decreased each year thereafter, as more were conscripted into the armed forces, and there were approximately 11 million fewer German male citizens in the workforce by September 1944. The number of German women fluctuated, but remained between 14 and 15 million throughout the given period, and it exceeded the number of German men in 1944. Despite the number of German men in the workforce dropping by 45 percent, the total number of workers in Germany was consistently around 36 million between 1940 and 1944, as this difference was offset by foreign and forced laborers. These workers were mostly drafted from annexed territories in Eastern Europe, and prisoners were transferred from concentration and POW camps to meet the labor demands in various areas of Germany.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are 5000 questions from the LC-QuAD 1.0 dataset, translated into German. Each question is paired with its corresponding SPARQL query for DBpedia 2016-04.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This short article introduces the German Cliometrics Database as the foundation of Jopp and Spoerer (2024), who trace cliometric research on German history. This newly constructed database of every publication which (1) contributes to the historiography of Germany and (2) employs, as a baseline, inferential statistics enables researchers to find cliometric studies related to their own work much more quickly. Even though no full texts are provided along with the data file, the collected abstracts or summaries for every publication in the database allow for some baseline text mining approaches. Along with the remaining information provided, they may also form the basis for broader bibliometric or historiographical studies.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The German SpeechDat-Car database comprises 338 German speakers recorded over the mobile telephone network. This database is partitioned into 17 DVDs and 1 CD. The speech databases made within the SpeechDat-Car project were validated by SPEX, the Netherlands, to assess their compliance with the SpeechDat-Car format and content specifications.
The speech data files are in two formats. The signal data format for the in-car mobile platform recordings is 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order); the channels are multiplexed in a single file, with the channel sequence being 0-1-2-3. The format of the fixed platform audio files is 8 kHz, 8 bit alaw encoding. Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items:
- 2 voice activation keywords
- 1 sequence of 10 isolated digits
- 7 connected digits: 1 sheet number (4+ digits), 1 spontaneous telephone number (9-11 digits), 3 read telephone numbers, 1 credit card number (16 digits), 1 PIN code (6 digits)
- 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date, 1 relative or general date expression
- 2 word spotting phrases using an application word (embedded)
- German data phrases
- 4 isolated digits
- 7 spelled words: 1 spontaneous (own forename or surname), 1 spelling of directory city name, 4 real word/name, 1 artificial name for coverage
- 1 money amount
- 1 natural number
- 7 directory assistance names: 1 spontaneous (own forename or surname), 1 city of birth / growing up (spontaneous), 2 most frequent cities, 2 most frequent company/agency, 1 "forename surname"
- 9 phonetically rich sentences
- 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style)
- 4 phonetically rich words
- 69 application words: 13 mobile phone application words, 22 IVR function keywords, 32 car products keywords, 2 additional common application words
- 2 additional language dependent keywords
- spontaneous sentences
The following age distribution has been obtained: 187 speakers are between 16 and 30, 72 speakers are between 31 and 45, 70 speakers are between 46 and 60, and 9 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
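As a rough sketch of what the in-car signal format described above implies for a reader, the snippet below splits a raw byte buffer into four channels. It assumes a headerless file, sample-by-sample interleaving in the order 0-1-2-3, and 16-bit little-endian unsigned samples as stated in the description; the function name and the toy buffer are illustrative, and the actual files should be checked against their SAM label files.

```python
import struct


def demultiplex(raw, n_channels=4):
    """Split interleaved 16-bit little-endian unsigned samples into channels."""
    bytes_per_frame = 2 * n_channels
    n_frames = len(raw) // bytes_per_frame  # ignore any incomplete trailing frame
    n_samples = n_frames * n_channels
    samples = struct.unpack("<%dH" % n_samples, raw[: n_samples * 2])
    # Channel k holds every n_channels-th sample, starting at offset k.
    return [list(samples[ch::n_channels]) for ch in range(n_channels)]


# Toy buffer: two frames of four channels each.
raw = struct.pack("<8H", 0, 1, 2, 3, 10, 11, 12, 13)
channels = demultiplex(raw)
print(channels)  # [[0, 10], [1, 11], [2, 12], [3, 13]]
```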