Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".
For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
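The two encodings mentioned above can be sketched as follows. This is a minimal illustration, not the procedure Strathclyde University actually used: the helper names `one_hot` and `ordinal` are hypothetical, and the sample codes (`A151` to `A153`, `A171` to `A174`) are assumed to follow the coding scheme of "german.data".

```python
def one_hot(values):
    """Turn a list of categorical codes into indicator (0/1) columns."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def ordinal(values, order):
    """Code an ordered categorical attribute as integers, following `order`."""
    rank = {code: i + 1 for i, code in enumerate(order)}
    return [rank[v] for v in values]


# Unordered attribute: one indicator column per category.
housing = ["A152", "A151", "A153", "A152"]
categories, indicators = one_hot(housing)

# Ordered attribute (such as attribute 17): a single integer column.
job = ["A173", "A171", "A174"]
job_int = ordinal(job, ["A171", "A172", "A173", "A174"])

print(categories, indicators)
print(job_int)
```

Indicator columns avoid imposing an artificial order on unordered categories, while integer coding keeps the order that an ordered attribute already has.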
This dataset requires use of a cost matrix:
Actual \ Predicted | Good | Bad |
---|---|---|
Good | 0 | 1 |
Bad | 5 | 0 |
The rows represent the actual classification and the columns the predicted classification.
It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
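As a rough sketch of how the cost matrix could be applied, the snippet below totals the cost of a set of predictions and derives the cost-minimizing decision for a single customer. The function names and the toy labels are assumptions made for illustration; only the four cost values come from the matrix above.

```python
# Cost of each (actual, predicted) pair, taken from the cost matrix.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,
    ("bad", "good"): 5,
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


def decide(p_bad):
    """Pick the prediction with the lower expected cost, given P(bad)."""
    cost_if_predict_good = p_bad * COST[("bad", "good")]
    cost_if_predict_bad = (1 - p_bad) * COST[("good", "bad")]
    return "good" if cost_if_predict_good < cost_if_predict_bad else "bad"


actual = ["good", "good", "bad", "bad"]
predicted = ["good", "bad", "good", "bad"]
print(total_cost(actual, predicted))
print(decide(0.1), decide(0.2))
```

With these costs, predicting "good" is only cheaper in expectation when the estimated probability of "bad" is below 1/6, which is why a plain 0.5 threshold is a poor fit for this dataset.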
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset classifies people described by a set of attributes as good or bad credit risks. Link to the original dataset: German Credit Data
Dataset Characteristics | # Instances | # Features |
---|---|---|
Multivariate | 1000 | 20 |
Because the original dataset encodes its categorical features as opaque codes, it is very hard to interpret. We have therefore mapped those codes to readable values.
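A mapping of this kind might look like the sketch below. The code-to-label tables are assumptions for illustration: the `A11` to `A14` meanings follow the original UCI attribute documentation as commonly reproduced, the housing labels mirror the values listed for this dataset, and the exact label strings used in the published file may differ. The `decode` helper is hypothetical.

```python
# Illustrative lookup tables; the label strings are assumptions.
MAPPINGS = {
    "checking_acc_status": {
        "A11": "< 0 DM",
        "A12": "0 <= ... < 200 DM",
        "A13": ">= 200 DM",
        "A14": "no checking account",
    },
    "housing": {"A151": "rent", "A152": "own", "A153": "for_free"},
}


def decode(row, mappings):
    """Replace coded values with readable labels; leave other values as-is."""
    return {
        column: mappings.get(column, {}).get(value, value)
        for column, value in row.items()
    }


row = {"checking_acc_status": "A14", "housing": "A152", "duration": "6"}
decoded = decode(row, MAPPINGS)
print(decoded)
```

Columns without a lookup table, such as the numerical `duration`, pass through unchanged.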
Features and explanations
checking_acc_status (categorical) - Status of existing checking account
duration (numerical) - Agreed loan duration in months
cred_hist (categorical) - Credit history status
purpose (categorical) - Loan request purpose
loan_amt (numerical) - Credit amount
saving_acc_bonds (categorical) - Savings account/bonds
present_employment_since (categorical) - Present employment since
installment_rate (numerical) - Installment rate in percentage of disposable income
personal_stat_gender (categorical) - Personal status and sex
other_debtors_guarantors (categorical: co-applicant, guarantor, none)
present_residence_since (numerical)
property (categorical)
age (numerical)
other_installment_plans (categorical: bank, stores, none)
housing (categorical: rent, own, for_free)
num_curr_loans - Number of existing credits at this bank
job (categorical)
num_people_provide_maint (numerical) - Number of people being liable to provide maintenance for
telephone (categorical)
is_foreign_worker (categorical) - Indicates whether the individual is a foreign worker
http://opendatacommons.org/licenses/dbcl/1.0/
The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes out credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.
It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:
Source: UCI
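The author's conversion script is not shown, so the following is only a sketch of what such a conversion could look like. It assumes the space-separated layout of "german.data" (20 attribute fields followed by a class field, 1 = good and 2 = bad); the sample row is illustrative, and the columns kept in `SELECTED` are a hypothetical choice, not the author's actual selection.

```python
import csv
import io

# Hypothetical selection: (output column name, zero-based field position).
SELECTED = [("duration", 1), ("credit_amount", 4), ("age", 12)]
RISK = {"1": "good", "2": "bad"}  # final field of each row


def convert(lines, out):
    """Write the selected fields of each space-separated row as CSV."""
    writer = csv.writer(out)
    writer.writerow([name for name, _ in SELECTED] + ["risk"])
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        writer.writerow([fields[pos] for _, pos in SELECTED] + [RISK[fields[-1]]])


# Illustrative row in the assumed layout.
sample = ["A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1"]
buffer = io.StringIO()
convert(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```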
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality German Call-Center and IVR Dataset for AI & Speech Models
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
German is a dataset for object detection tasks - it contains Sign annotations for 898 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
🇩🇪 German Public Domain 🇩🇪
German-Public Domain, or German-PD, is a large collection that aims to aggregate all German monographs and periodicals in the public domain. As of March 2024, it is the biggest German open corpus.
Dataset summary
The collection contains 260,638 individual texts making up 37,650,706,611 words recovered from multiple sources, including Internet Archive and various European national libraries and cultural heritage institutions. Each parquet file… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/German-PD.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.
English text classification datasets are common. Examples include the big AG News, the class-rich 20 Newsgroups, and the large-scale DBpedia ontology datasets for topic classification, as well as the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.
Due to grammatical differences between English and German, a classifier might be effective on an English dataset but less effective on a German one. German is more highly inflected than English, and long compound words are quite common. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example `Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise`. The 10kGNAD uses the second part of the topic path, here `Wirtschaft`, as the class label. As a result, the dataset can be used for multi-class classification.
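Deriving the class label from a topic path could be sketched like this. The function name `class_label` is hypothetical, and the project's own extraction script may handle edge cases differently.

```python
def class_label(topic_path):
    """Return the second component of a slash-separated topic path."""
    parts = topic_path.split("/")
    if len(parts) < 2:
        raise ValueError(f"topic path has no second component: {topic_path!r}")
    return parts[1]


path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
label = class_label(path)
print(label)
```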
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1678 articles, while the smallest class, Kultur, contains only 539 articles. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second most words.
I propose a stratified split of 10% for testing and the remaining articles for training.
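A stratified split keeps the class proportions roughly the same in both parts, which matters here because the classes are unbalanced. The sketch below shows one way to do it with the standard library only; it is not the project's actual split script, and the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy example: 30 "Web" articles and 10 "Kultur" articles.
labels = ["Web"] * 30 + ["Kultur"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In the toy example, each class contributes 10% of its own articles to the test set, so the smaller class is not left out by chance.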
To use the dataset as a benchmark dataset, please use the `train.csv` and `test.csv` files located in the project root.
Python scripts to extract the articles and split them into a training set and a test set are available in the code directory of this project. Make sure to install the requirements. The original `corpus.sqlite3` is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
High-Quality German Call-Center and IVR Dataset for AI & Speech Models
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
"GermanFakeNC" is a German Fake News Corpus including 490 texts which were retrieved from German alternative online media sources. Every fake statement in the text was verified claim-by-claim by authoritative sources (e.g. from local police authorities, scientific studies, the police press office, etc.). The time interval for most of the news is established from December 2015 to March 2018.
Steps to reproduce the data are described in the README file.
Please cite:
@inproceedings{TPDL_Vogel19,
  author = {Inna Vogel and Peter Jiang},
  title = {Fake News Detection with the New German Dataset "GermanFakeNC"},
  booktitle = {Digital Libraries for Open Knowledge - 23rd International Conference on Theory and Practice of Digital Libraries, {TPDL} 2019, Oslo, Norway, September 9-12, 2019, Proceedings},
  pages = {288--295},
  year = {2019},
  url = {https://doi.org/10.1007/978-3-030-30760-8\_25},
  doi = {10.1007/978-3-030-30760-8\_25},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Traffic German is a dataset for object detection tasks - it contains Football Player Detection annotations for 6,523 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Multilingual HateCheck
Dataset Description
Multilingual HateCheck (MHC) is a suite of functional tests for hate speech detection models in 10 different languages: Arabic, Dutch, French, German, Hindi, Italian, Mandarin, Polish, Portuguese and Spanish. For each language, there are 25+ functional tests that correspond to distinct types of hate and challenging non-hate. This allows for targeted diagnostic insights into model performance. For more details… See the full description on the dataset page: https://huggingface.co/datasets/Paul/hatecheck-german.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset of local, state, and federal election results in Germany, facilitating research on electoral behavior, representation, and political responsiveness.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Public Domain Newspapers (German)
This dataset contains 13 billion words of OCR text extracted from German historical newspapers.
Dataset Details
Dataset Description
Curated by: Sebastian Majstorovic
Language(s) (NLP): German
License: Dataset: CC0, Texts: Public Domain
Dataset Sources
Repository: https://www.deutsche-digitale-bibliothek.de/newspaper
Copyright & License
The newspapers texts have been… See the full description on the dataset page: https://huggingface.co/datasets/storytracer/German-PD-Newspapers.
German Hate Speech Superset
This dataset is a superset (N=50,545) of posts annotated as hateful or not. It results from the preprocessing and merge of all available German hate speech datasets in April 2024. These datasets were identified through a systematic survey of hate speech datasets conducted in early 2024. We only kept datasets that:
* are documented
* are publicly available
* focus on hate speech, defined broadly as "any kind of communication in speech, writing or behavior, that…
See the full description on the dataset page: https://huggingface.co/datasets/manueltonneau/german-hate-speech-superset.
VoxForge is an open speech dataset that was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).
We will make available all submitted audio files under the GPL license, and then 'compile' them into acoustic models for use with Open Source speech recognition engines such as CMU Sphinx, ISIP, Julius (github) and HTK (note: HTK has distribution restrictions).
https://www.icpsr.umich.edu/web/ICPSR/studies/43/terms
This data collection contains electoral data at the wahlkreis and staat levels for the Reichstag elections of 1871, 1874, 1877, 1878, 1881, 1884, 1890, 1893, 1898, 1903, 1907, and 1912. The variables for each election provide information on the votes cast for parties, including the Conservative Party, the German Empire Party, the National-Liberals, the Liberal Empire Party, the People's Party, the Social Democrats, the Progress Party, the Catholic Center, the Particularists, the Poles Party, the Protest Party, the Antisemites, the Free-thinking People's Party, the German Reform Party, the Farmers' Union, the Peasants' Union, and splinter parties. Data are also provided on the total population in 1871 and every fifth year between 1875 and 1910, and the proportions of Protestants and of Catholics in the total population for 1871, 1875, 1880, 1885, 1890, 1905, and 1910. Additional variables provide information on the number of eligible voters, valid and invalid votes cast, and voter turnout.
In late May 1939, just three months before the Second World War began in Europe, Germany's workforce was made up of almost 25 million men, 15 million women, and a very small number of foreign workers. The share of German men in the workforce decreased each year thereafter, as more were conscripted into the armed forces, and there were approximately 11 million fewer German male citizens in the workforce by September 1944. The number of German women fluctuated, but remained between 14 and 15 million throughout the given period, and it exceeded the number of German men in 1944. Despite the number of German men in the workforce dropping by 45 percent, the total number of workers in Germany was consistently around 36 million between 1940 and 1944, as this difference was offset by foreign and forced laborers. These workers were mostly drafted from annexed territories in Eastern Europe, and prisoners were transferred from concentration and POW camps to meet the labor demands in various areas of Germany.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This short article introduces the German Cliometrics Database as the foundation of Jopp and Spoerer (2024), who trace cliometric research on German history. This newly constructed database covers every publication which (1) contributes to the historiography of Germany and (2) employs, as a baseline, inferential statistics. It enables researchers to find cliometric studies related to their own work much more quickly. Even though no full texts are provided along with the data file, the collected abstracts or summaries for every publication in the database allow for some baseline text mining approaches. Along with the remaining information provided, they may also form the basis for broader bibliometric or historiographical studies.
This statistic shows the development of population numbers in Germany from 1990 to 2023. In 2023, the population in Germany, as of December 31 of that year, amounted to 84.67 million people. This was an increase compared to the previous year.
A wide-ranging representative longitudinal study of private households that permits researchers to track yearly changes in the health and economic well-being of older people relative to younger people in Germany from 1984 to the present. Every year, nearly 11,000 households and more than 20,000 persons were sampled by the fieldwork organization TNS Infratest Sozialforschung. The data provide information on all household members, consisting of Germans living in the Old and New German States, Foreigners, and recent Immigrants to Germany. The Panel was started in 1984. Some of the many topics include household composition, occupational biographies, employment, earnings, health and satisfaction indicators. In addition to standard demographic information, the GSOEP questionnaire also contains objective measures (use of time, use of earnings, income, benefit payments, health, etc.) and subjective measures (level of satisfaction with various aspects of life, hopes and fears, political involvement, etc.) of the German population.
The first wave, collected in 1984 in the western states of Germany, contains 5,921 households in two randomly sampled sub-groups: 1) German Sub-Sample: people in private households where the head of household was not of Turkish, Greek, Yugoslavian, Spanish, or Italian nationality; 2) Foreign Sub-Sample: people in private households where the head of household was of Turkish, Greek, Yugoslavian, Spanish, or Italian nationality. In each year since 1984, the GSOEP has attempted to re-interview original sample members unless they leave the country.
A major expansion of the GSOEP was necessitated by German reunification. In June 1990, the GSOEP fielded a first wave of the eastern states of Germany. This sub-sample includes individuals in private households where the head of household was a citizen of the German Democratic Republic. The first wave contains 2,179 households.
In 1994 and 1995, the GSOEP added a sample of immigrants to the western states of Germany from 522 households who arrived after 1984, which in 2006 included 360 households and 684 respondents. In 1998 a new refreshment sample of 1,067 households was selected from the population of private households. In 2000 a sample was drawn using essentially similar selection rules as the original German sub-sample and the 1998 refreshment sample with some modifications. The 2000 sample includes 6,052 households covering 10,890 individuals. Finally, in 2002, an overrepresentation of high-income households was added with 2,671 respondents from 1,224 households, of which 1,801 individuals (689 households) were still included in the year 2006. Data Availability: The data are available to researchers in Germany and abroad in SPSS, SAS, TDA, STATA, and ASCII format for immediate use. Extensive documentation in English and German is available online. The SOEP data are available in German and English, alone or in combination with data from other international panel surveys (e.g., the Cross-National Equivalent Files which contain panel data from Canada, Germany, and the United States). The public use file of the SOEP with anonymous microdata is provided free of charge (plus shipping costs) to universities and research centers. The individual SOEP datasets cannot be downloaded from the DIW Web site due to data protection regulations. Use of the data is subject to special regulations, and data privacy laws necessitate the signing of a data transfer contract with the DIW. The English Language Public Use Version of the GSOEP is distributed and administered by the Department of Policy Analysis and Management, Cornell University. The data are available on CD-ROM from Cornell for a fee. 
Full instructions for accessing GSOEP data may be accessed on the project website, http://www.human.cornell.edu/che/PAM/Research/Centers-Programs/German-Panel/cnef.cfm
* Dates of Study: 1984-present
* Study Features: Longitudinal, International
* Sample Size:
** 1984: 12,290 (GSOEP West)
** 1990: 4,453 (GSOEP East)
** 2000: 20,000+
Links:
* Cornell Project Website: http://www.human.cornell.edu/che/PAM/Research/Centers-Programs/German-Panel/cnef.cfm
* GSOEP ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/00131