Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of much research. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the wdc phones dataset for benchmarking entity matching/record linkage methods found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data type specific similarity metrics. The dataset contains 447 records describing products originating from 17 e-shops which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26 and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
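The feature vectors pair each candidate record pair with similarity scores computed per attribute. The following is a minimal illustrative sketch of such data type specific similarity features; the attribute names ("title", "price") and the choice of metrics are assumptions for demonstration, not the exact CompERBench configuration.

```python
# Illustrative sketch of data type specific similarity features for one record pair.
# Attribute names ("title", "price") and metric choices are assumptions for
# demonstration, not the exact configuration behind the published feature vectors.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased word tokens (for string attributes)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def relative_numeric_sim(a: float, b: float) -> float:
    """1 minus the relative difference (for numeric attributes such as a price)."""
    if a == 0 and b == 0:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))

def pair_features(rec_a: dict, rec_b: dict) -> list:
    """One feature vector for a candidate record pair."""
    return [
        token_jaccard(rec_a.get("title", ""), rec_b.get("title", "")),
        relative_numeric_sim(float(rec_a.get("price", 0.0)), float(rec_b.get("price", 0.0))),
    ]

# Two offers that plausibly describe the same phone.
print(pair_features({"title": "Nokia 3310 blue", "price": 59.0},
                    {"title": "Nokia 3310 Dual SIM blue", "price": 61.5}))
```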
Attribution 1.0 (CC BY 1.0) https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Photogrammetry scan of the inner courtyard of Mannheim Palace (Schloss Mannheim), which now houses the University of Mannheim. Processed using RealityCapture.
Source: Objaverse 1.0 / Sketchfab
🖨️ Additive manufacturing (3D Printing) data is a crucial driver of smart production and a foundational element of Industry 4.0. ISTARI.AI provides verified, scalable Additive Manufacturing data by analyzing how prominently Additive Manufacturing know-how is communicated on company websites. This enables both quantitative benchmarking and qualitative insight into how central Additive Manufacturing is to a company’s offerings - ensuring consistently high data quality and reliability.
📊 The dataset includes: - additive_manufacturing_intensity: Numerical indicator reflecting the prominence of additive manufacturing adoption - additive_manufacturing_intensity_level: Categorized engagement level (from very low to very high) - additive_manufacturing_keywords: Relevant Additive Manufacturing-related keywords found on the company’s website
📊 The Additive Manufacturing Score in Detail: The Additive Manufacturing Score reflects how centrally the topic of additive manufacturing is communicated by the company on its own website and presented as essential to its own business model. It specifically captures evidence of: - Products and services in additive manufacturing - Personnel with skills in additive manufacturing - Strategic positioning of additive manufacturing in the company’s communication
Rather than simple binary classification ("Additive Manufacturing: yes/no"), ISTARI’s WebAI delivers a continuous, nuanced score that distinguishes between marginal mentions of Additive Manufacturing and core Additive Manufacturing-focused business models.
🔍 How do we measure? The webAI AI Agent, developed by ISTARI.AI, reads and analyzes company websites to: - Identify Additive Manufacturing-related keywords - Detect and validate text segments (“paragraphs”) containing Additive Manufacturing-related content - Classify whether a paragraph reflects genuine Additive Manufacturing know-how or simply general information - Calculate a ratio of Additive Manufacturing-know-how paragraphs to total website content, resulting in a numeric Additive Manufacturing Score
This approach ensures a deep contextual analysis of how central Additive Manufacturing is to each company’s external communication and positioning.
🔍 How can the data be interpreted? - 0.0 = No communication of Additive Manufacturing-related know-how - 0.25 = Limited communication; e.g., a consulting firm mentioning "Additive Manufacturing services" among other topics - 2.5+ = High intensity; e.g., a startup exclusively focused on Additive Manufacturing solutions - 3.5+ = Exceptional additive manufacturing focus; typically, AM-first companies or specialized industrial technology providers. An additional categorical interpretation is provided as a helper column, ranging from "very low" to "very high" intensity.
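For convenience, the numeric score can be binned into the categorical helper column. The sketch below is only illustrative: the cut points are assumptions derived from the interpretation guidance above, not the exact bins behind the published additive_manufacturing_intensity_level column.

```python
# Illustrative mapping from the numeric intensity score to a categorical level.
# The cut points are assumptions derived from the interpretation guidance above
# (0.0 / 0.25 / 2.5+ / 3.5+); the exact bins behind the published
# additive_manufacturing_intensity_level column may differ.

def intensity_level(score: float) -> str:
    if score <= 0.0:
        return "very low"   # no communicated Additive Manufacturing know-how
    if score < 0.25:
        return "low"
    if score < 2.5:
        return "medium"     # marginal to moderate communication
    if score < 3.5:
        return "high"       # e.g. a startup focused on Additive Manufacturing
    return "very high"      # AM-first companies, exceptional focus

for s in (0.0, 0.1, 0.3, 2.7, 4.0):
    print(s, "->", intensity_level(s))
```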
✅ Ensuring Data Quality - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy. - Developed together with researchers at the University of Mannheim - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence" - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich - Winner of the Best Paper Award at the R&D Management Conference 2022 - Currently under peer review in a leading international journal
https://www.gesis.org/en/institute/data-usage-terms
The German Internet Panel (GIP) is a long-term study at the University of Mannheim. The GIP examines individual attitudes and preferences that are relevant in political and economic decision-making processes. To this end, more than 3,500 people throughout Germany have been regularly surveyed online every two months since 2012 on a wide range of topics. The GIP is based on a random sample of the general population in Germany between the ages of 16 and 75. The study started in 2012 and was supplemented by new participants in 2014 and 2018. The panel participants were recruited offline. The GIP questionnaires cover a variety of topics that deal with current events.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) from the product category computers. The data is available as training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets because it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products into which typos were introduced. These subsets can be used to measure the performance of methods on specific kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
🤖 Artificial Intelligence (AI) is a key enabler of innovation and a central pillar of digital transformation. ISTARI.AI provides verified, scalable AI intensity data by analyzing how prominently AI know-how is communicated on company websites. This enables both quantitative benchmarking and qualitative insight into how central AI is to a company’s offerings—ensuring consistently high data quality and reliability.
📊 The dataset includes: - ai_intensity: Numerical indicator reflecting the prominence of AI-related know-how - ai_intensity_level: Categorized engagement level (from very low to very high) - ai_keywords: Relevant AI-related keywords found on the company’s website
📊 The AI Intensity Score in Detail: The AI Intensity Score quantifies the degree to which artificial intelligence is communicated as a core capability or business focus on a company’s website. It specifically captures evidence of: - AI-integrated products or services - AI expertise within the workforce - Strategic positioning of AI in the company’s communication
Rather than simple binary classification ("AI: yes/no"), ISTARI’s WebAI delivers a continuous, nuanced score that distinguishes between marginal mentions of AI and core AI-focused business models.
🔍 How do we measure? The webAI AI Agent, developed by ISTARI.AI, reads and analyzes company websites to: - Identify AI-related keywords - Detect and validate text segments (“paragraphs”) containing AI-related content - Classify whether a paragraph reflects genuine AI know-how or simply general information - Calculate a ratio of AI-know-how paragraphs to total website content, resulting in a numeric AI Intensity score
This approach ensures a deep contextual analysis of how central AI is to each company’s external communication and positioning.
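A minimal sketch of the final scoring step described above, assuming the paragraphs have already been classified; the paragraph classifier itself and any scaling applied to the published ai_intensity value are not reproduced here.

```python
# Illustrative sketch of the final scoring step: the share of website paragraphs
# classified as genuine AI know-how. The paragraph classifier itself and any
# scaling applied to the published ai_intensity value are not reproduced here.

def ai_intensity(paragraph_is_ai_know_how: list) -> float:
    """paragraph_is_ai_know_how[i] is True if paragraph i reflects genuine AI know-how."""
    if not paragraph_is_ai_know_how:
        return 0.0
    return sum(paragraph_is_ai_know_how) / len(paragraph_is_ai_know_how)

# Example: 3 of 40 paragraphs on a company website communicate genuine AI know-how.
print(ai_intensity([True] * 3 + [False] * 37))  # 0.075
```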
🔍 How can the data be interpreted? - 0.0 = No communication of AI-related know-how - 0.25 = Limited communication; e.g., a consulting firm mentioning "AI services" among other topics - 2.5+ = High intensity; e.g., a startup exclusively focused on AI solutions - 3.5+ = Exceptional AI focus; typically, AI-first companies or specialized technology providers An additional categorical interpretation is provided as a helper column, ranging from "very low" to "very high" intensity.
✅ Ensuring Data Quality - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy. - Developed together with researchers at the University of Mannheim - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence" - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich - Winner of the Best Paper Award at the R&D Management Conference 2022 - Currently under peer review in a leading international journal
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset shows selected information on patient admissions to a laboratory for molecular pathology in 2024. The dataset contains the following variables: patient number ("Patienten Nr."), age at the admission date ("Alter (zum Stand: Eingangsdatum)"), admission date ("Eingangsdatum"), diagnosis ("Diagnose"), and diagnosis addition ("Diagnose Zusatz"). This project runs as part of the course "Forschungsdatenmanagement" of the BIDS Master at the University of Mannheim, Germany.
💉 Digital Health data is a crucial driver of smart healthcare delivery and a foundational element of Health 4.0.
ISTARI.AI provides verified, scalable Digital Health data by analyzing how prominently Digital Health know-how is communicated on company websites. This enables both quantitative benchmarking and qualitative insight into how central Digital Health is to a company’s offerings - ensuring consistently high data quality and reliability.
📊 The dataset includes: - digital_health_intensity: Numerical measure of Digital Health technology focus - digital_health_intensity_level: Categorical classification of Digital Health intensity (from very low to very high) - digital_health_keywords: Relevant Digital Health-related keywords found on the company’s website
📊 The Digital Health Score in Detail: The Digital Health Score measures how centrally the topic of Digital Health is communicated by the company on its own website and presented as essential to its own business model. Digital Health covers the use of information and communication technology (ICT) in the field of healthcare. It specifically captures evidence of: - E-health: Companies providing digital solutions or products that support patient care using modern ICT, such as e-prescriptions, electronic health records, and online consultations. - Trend Health: Companies offering Digital Health products and services for private consumers, focusing on self-care, prevention, and health monitoring through wearables and apps. - Tech Health: Companies offering innovative Digital Health solutions for professional users, using technologies like AI, robotics, sensors, big data, and 3D printing. - Strategic positioning of Digital Health in the company’s communication
Rather than simple binary classification ("Digital Health: yes/no"), ISTARI’s WebAI delivers a continuous, nuanced score that distinguishes between marginal mentions of Digital Health and core Digital Health-focused business models.
🔍 How do we measure? The webAI AI Agent, developed by ISTARI.AI, reads and analyzes company websites to: - Identify Digital Health-related keywords - Detect and validate text segments (“paragraphs”) containing Digital Health-related content - Classify whether a paragraph reflects genuine Digital Health know-how or simply general information - Calculate a ratio of Digital Health-know-how paragraphs to total website content, resulting in a numeric Digital Health Score
This approach ensures a deep contextual analysis of how central Digital Health is to each company’s external communication and positioning.
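A minimal, keyword-only sketch of the first detection step described above; the keyword list is an illustrative assumption, and the real webAI agent additionally validates whether a flagged paragraph reflects genuine Digital Health know-how.

```python
import re

# Illustrative keyword spotting for Digital Health paragraphs. The keyword list is
# an assumption for demonstration; the real webAI agent additionally validates
# whether a flagged paragraph reflects genuine Digital Health know-how.
KEYWORDS = ["e-health", "electronic health record", "telemedicine",
            "health app", "wearable", "e-prescription"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def flag_paragraphs(paragraphs):
    """Return (paragraph, matched keywords) for every paragraph mentioning a keyword."""
    hits = []
    for p in paragraphs:
        matched = sorted({m.group(0).lower() for m in PATTERN.finditer(p)})
        if matched:
            hits.append((p, matched))
    return hits

sample = ["We build wearables and a health app for remote patient monitoring.",
          "Our office is located in Mannheim."]
for paragraph, keywords in flag_paragraphs(sample):
    print(keywords, "->", paragraph)
```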
🔍 How can the data be interpreted? - 0.0 = No communication of Digital Health-related know-how - 0.25 = Limited communication; e.g., a consulting firm mentioning "Digital Health services" among other topics - 2.5+ = High intensity; e.g., a startup that is particularly focused on Digital Health solutions - 3.5+ = Exceptional Digital Health focus An additional categorical interpretation is provided as a helper column, ranging from "very low" to "very high" intensity.
✅ Ensuring Data Quality - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy. - Developed together with researchers at the University of Mannheim - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence" - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich - Winner of the Best Paper Award at the R&D Management Conference 2022 - Currently under peer review in a leading international journal
https://doi.org/10.4121/resource:terms_of_use
This dataset comprises event logs (XES = Extensible Event Stream) of activities of daily living performed by several individuals. The event logs were derived from sensor data collected in different scenarios and cover activities such as sleeping, meal preparation, and washing. They show the differing behavior of people in their own homes but also common patterns. The attached event logs were created with Fluxicon Disco (http://fluxicon.com/disco/).
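Because the logs follow the XES standard, they can also be inspected programmatically; below is a minimal sketch using the open-source pm4py library, with a placeholder file name for one of the downloaded logs.

```python
# Minimal sketch: inspect one of the XES event logs programmatically with the
# open-source pm4py library. The file name is a placeholder for a downloaded log.
import pm4py

log = pm4py.read_xes("activities_of_daily_living.xes")  # placeholder path
df = pm4py.convert_to_dataframe(log)

# Frequency of activities of daily living (e.g. sleeping, meal preparation, washing).
print(df["concept:name"].value_counts().head(10))

# Number of cases in the log.
print("cases:", df["case:concept:name"].nunique())
```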
https://choosealicense.com/licenses/other/
BabelEdits
BabelEdits is a benchmark designed to evaluate cross-lingual knowledge editing (CKE) in Large Language Models (LLMs). It enables robust and effective evaluation across 60 languages by combining high-quality entity translations from BabelNet with marker-based translation. BabelEdits is also accompanied by a modular CKE method, BabelReFT, which supports multilingual edit propagation while preserving downstream model performance.
Dataset Summary
As LLMs… See the full description on the dataset page: https://huggingface.co/datasets/umanlp/babeledits.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of much research. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the amazon-google products dataset for benchmarking entity matching/record linkage methods found at: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolutio... The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing as well as their corresponding feature vectors are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 1,363 records describing products from amazon which are matched against 3,226 product records from google. The gold standards have manual annotations for 1,298 matching and 6,306 non-matching pairs. The total number of attributes used to describe the product records is 4 and the attribute density is 0.75. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
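Because the splits and feature vectors are fixed, different matchers can be evaluated on exactly the same pairs. A minimal evaluation sketch is shown below; the file and column names are placeholders rather than the exact names used in the repository.

```python
# Illustrative sketch: train on the fixed training feature vectors and report
# precision/recall/F1 on the fixed test split, so that results are comparable
# across runs. File and column names are placeholders, not the exact names
# used in the repository.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

train = pd.read_csv("train_feature_vectors.csv")  # placeholder file name
test = pd.read_csv("test_feature_vectors.csv")    # placeholder file name

feature_cols = [c for c in train.columns if c not in ("pair_id", "label")]
clf = LogisticRegression(max_iter=1000).fit(train[feature_cols], train["label"])

pred = clf.predict(test[feature_cols])
p, r, f1, _ = precision_recall_fscore_support(test["label"], pred,
                                              average="binary", pos_label=1)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```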
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This new dataset is designed to solve this great NLP task and is crafted with a lot of care.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Automated and statistical methods for estimating latent political traits and classes from textual data hold great promise, since virtually every political act involves the production of text. Statistical models of natural language features, however, are heavily laden with unrealistic assumptions about the process that generates this data, including the stochastic process of text generation, the functional link between political variables and observed text, and the nature of the variables (and dimensions) on which observed text should be conditioned. While acknowledging statistical models of latent traits to be “wrong”, political scientists nonetheless treat their results as sufficiently valid to be useful. In this paper, we address the issue of substantive validity in the face of potential model failure, in the context of unsupervised scaling methods of latent traits. We critically examine one popular parametric measurement model of latent traits for text and then compare its results to systematic human judgments of the texts as a benchmark for validity.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, for each training set, a set of pair ids for a possible validation split (stratified random draw) is available. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
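The provided validation ids can be applied to a training set with a simple filter; a minimal sketch follows, in which the file names, the id file format, and the pair_id column are assumptions for illustration.

```python
# Illustrative sketch: carve the provided validation split out of a training set
# using the shipped id list. File names, the id file format, and the pair_id
# column are assumptions for illustration.
import json
import pandas as pd

pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # placeholder
with open("computers_valid_medium_ids.json") as f:                  # placeholder
    valid_ids = set(json.load(f))

valid = pairs[pairs["pair_id"].isin(valid_ids)]
train = pairs[~pairs["pair_id"].isin(valid_ids)]
print(len(train), "training pairs,", len(valid), "validation pairs")
```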
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Replication code and data for "A Common Left-Right Scale for Voters and Parties in Europe"
Youth in Europe Study (YES!) is the name of the survey study within the project "Children of Immigrants Longitudinal Survey in Four European Countries" (CILS4EU). It is an international research project aimed at gaining more insight into the current living conditions and opinions of young people. The pupil questionnaire covers several areas, including: School and results, Feelings and opinions, Health, Friends and relationship building, Family Relationships and Leisure.
In each participating country, approximately 5,000 pupils attending 8th grade (or corresponding) were interviewed by means of a questionnaire. In Sweden, approximately 130 schools were randomly selected. The first survey in 2011 was followed by another survey in 2012 (when pupils were in 9th grade), one in 2013 (when respondents had finished compulsory school and entered upper secondary education, the labour market or other activities), and another in 2016.
The survey is conducted in Sweden, Germany, the Netherlands and England. Youth in Europe (YES!) is a joint initiative of researchers from Stockholm University, the University of Mannheim, University of Utrecht, Tilburg University, and University of Oxford.
Purpose:
The purpose of the study is to answer questions on young people’s living conditions and to compare these between countries, e.g.:
What role do school, family and friends play for youth in Europe? What are the hobbies, interests and issues they are engaged in? How do the educational careers of young people with and without an immigration background proceed? What are their educational and occupational goals? What can be done in order to improve the educational chances of all young people?
Youth in Europe Study (YES!) is the name of the survey study within the project "Children of Immigrants Longitudinal Survey in Four European Countries" (CILS4EU). It is an international research project conducted in Sweden, England, the Netherlands and Germany. The study covers several areas, such as school, health, friends, family, leisure, and feelings and opinions. It comprises approximately 5,000 pupils in each country, about 19,000 pupils in total. The Swedish part is based on a sample of approximately 130 compulsory schools. The study is longitudinal at its core; since one of its aims is to study the choice of upper secondary education, the first survey of 8th-grade pupils in 2011 was followed up in 2012 (when the pupils were in 9th grade), in 2013 (when respondents were in their first year of upper secondary school, had started working, or had another occupation), and most recently in 2016. The study is a collaboration between the Swedish Institute for Social Research (SOFI) at Stockholm University and the universities of Mannheim, Utrecht, Tilburg and Oxford. Purpose: The purpose of the study is to answer questions about young people's living conditions and to compare these between countries, e.g.: What role do school, family and friends play for youth in Europe? What does young people's leisure time look like? How do the educational careers of young people with and without an immigration background develop? What are their future plans regarding education and work? What can be done to improve young people's opportunities for education?
The Immigration Policies in Comparison (IMPIC) project provides a set of sophisticated quantitative indices to measure immigration policies in most OECD countries and for the time period 1980-2018. For more information see the project webpage: http://www.impic-project.eu/. An earlier version has been prepublished there.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of much research. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the fodors-zagats restaurants dataset for benchmarking entity matching/record linkage methods found at: https://hpi.de/en/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html#c11471 The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing as well as their corresponding feature vectors are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 533 records describing restaurants from fodors.com which are matched against 331 restaurant records from zagat.com. The gold standards have manual annotations for 112 matching and 488 non-matching pairs. The total number of attributes used to describe the records is 5 and the attribute density is 100%. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
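Attribute density can be read here as the share of attribute values that are actually filled across all records and attributes; under that assumption (and with a placeholder file name), it can be recomputed as follows.

```python
# Illustrative sketch: recompute attribute density as the share of non-empty
# attribute values over all records x attributes. This reading of "attribute
# density" is an assumption, and the file name is a placeholder.
import pandas as pd

records = pd.read_csv("fodors_zagats_records.csv")  # placeholder file name
attribute_cols = [c for c in records.columns if c != "record_id"]

filled = records[attribute_cols].notna().sum().sum()
density = filled / (len(records) * len(attribute_cols))
print(f"attribute density: {density:.2f}")  # a fully filled table yields 1.00 (i.e. 100%)
```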
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data version: 3.3.0
Authors:
Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
W. Keith Robinson (Wake Forest University, School of Law)
Michael Seeligson (Southern Methodist University, Cox School of Business)
1. Notes on Data Construction
2. Citation and Code
3. Description of the Data Files
3.1. File List
3.2. List of Variables for Files with Claim-Level Information
3.3. List of Variables for Files with Patent-Level Information
4. Coming Soon!
1. Notes on Data Construction
This is version 3.3.0 of the patccat data (patent claim classification by algorithmic text analysis).
Patent claims define an invention. A patent application is required to have one or more claims that distinctly claim the subject matter which the patent applicant regards as her invention or discovery. We construct a classifier of patent claims that identifies three distinct claim types: process claims, product claims, and product-by-process claims.
For this classification, we combine information obtained from both the preamble and the body of a claim. The preamble is a general description of the invention (e.g., a method, an apparatus, or a device), whereas the body identifies steps and elements (specifying in detail the invention laid out in the preamble) that the applicant is claiming as the invention. The combination of the preamble type and the body type provides us with a more detailed and more accurate classification of claims than other approaches in the literature. This approach also accounts for unconventional drafting approaches. We eventually validate our classification using close to 10,000 manually classified claims.
The data files contain the results of our classification. We provide claim-level information for each independent claim of U.S. utility patents granted between 1836 and 2020. We also provide patent-level information, i.e., the counts of different claim types for a given patent.
For a detailed description of our classification approach, please take a look at the accompanying paper (Ganglmair, Robinson, and Seeligson 2022).
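The variable lists below also document a keyword-based "simple approach" (processSimple, processPreamble, processBody). The sketch below is a much-simplified illustration of that keyword logic only; it is not the authors' full classifier, which combines preamble and body types, and the split at "comprising" is an assumption for demonstration.

```python
# Much-simplified sketch of the keyword-based "simple approach" reflected in the
# processSimple/processPreamble/processBody variables: flag "method"/"process"
# occurrences. The full classifier described above additionally combines preamble
# and body types and is not reproduced here; the split at "comprising" is an
# assumption for demonstration.
import re

PROCESS_TERMS = re.compile(r"\b(method|process)\b", re.IGNORECASE)

def split_claim(text):
    """Naive preamble/body split at the first occurrence of 'comprising'."""
    preamble, _, body = text.partition("comprising")
    return preamble.strip(), body.strip()

def simple_process_flags(claim_text):
    preamble, body = split_claim(claim_text)
    return {
        "processPreamble": int(bool(PROCESS_TERMS.search(preamble))),
        "processBody": int(bool(PROCESS_TERMS.search(body))),
        "processSimple": int(bool(PROCESS_TERMS.search(claim_text))),
    }

claim = ("A method for classifying patent claims, comprising "
         "extracting the preamble and labeling each line of the body.")
print(simple_process_flags(claim))  # {'processPreamble': 1, 'processBody': 0, 'processSimple': 1}
```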
2. Citation and Code
Please cite the following paper when using the data in your own work:
Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.
In the paper, we document the use of process claims in the U.S. over the last century, using the patccat data. We show an increase in the annual share of process claims of about 25 percentage points (from below 10% in 1920). This rise in process intensity of patents is not limited to a few patent classes, but we observe it across a broad spectrum of technologies. Process intensity varies by applicant type: companies file more process-intense patents than individuals, and U.S. applicants file more process-intense patents than foreign applicants. We further show that patents with higher process intensity are more valuable but are not necessarily cited more often. Last, process claims are on average shorter than product claims (with the gap narrowing since the 1970s).
We would love to see how other researchers use the data and eventually learn from it. If you have a discussion paper or a publication in which you use the data, please send us a copy at patccat.data@gmail.com.
We will publish the R code used to construct the data on GitHub with the next data version (version 3.4.0). Contact us at b.ganglmair@gmail.com if you would like to take a look at an earlier version of the code.
3. Description of the Data Files
The data files contain claim-level information for independent claims of 10,140,848 U.S. utility patents granted between 1836 and 2020. The files further contain patent-level information for U.S. utility patents.
3.1. File List
claims-patccat-v3-3-sample.csv | claim-level information for independent claims of a sample of 1000 patents issued between 1976 and 2020 |
claims-patccat-v3-3-1836-1919.csv | claim-level information for independent claims of 1,038,041 patents issued between 1836 and 1919 |
claims-patccat-v3-3-1920-2020.csv | claim-level information for independent claims of 9,102,807 patents issued between 1920 and 2020 |
patents-patccat-v3-3-sample.csv | patent-level information for a sample of 1000 patents issued between 1976 and 2020 |
patents-patccat-v3-3-1836-1919.csv | patent-level information for 1,038,041 patents issued between 1836 and 1919 |
patents-patccat-v3-3-1920-2020.csv | patent-level information for 9,102,807 patents issued between 1920 and 2020 |
3.2. List of Variables for Files with Claim-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
PatentClaim | patent claim identifier; 8-digit patent number and 4-digit claim number (Ex: 01234567-0001) |
singleLine | =1 if claim is published in single-line format |
singleReformat | outcome code of reformatting of single-line claims |
Jepson | =1 if claim is a Jepson claim |
JepsonReformat | outcome code of reformatting of Jepson claims |
inBegin | =1 if claim begins with the word "in" |
wordsPreamble | number of words in the claim preamble |
wordsBody | number of words in the claim body |
dependentClaims | number of dependent claims that refer to this independent claim |
isMeansPreamble | =1 if term "means" is used in the preamble |
isMeansBody | =1 if term "means" is used in the body |
isMeans | =1 if term "means" is used anywhere in the claim (~ means-plus-function claim) |
processPreamble | =1 if terms "method" or "process" are used in the preamble |
processBody | =1 if terms "method" or "process" are used in the body |
processSimple | =1 if terms "method" or "process" are used anywhere in the claim (for simple approach of process claim classification) |
claimType | claim type of full classification (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
preambleType | preamble type |
preambleTerm | keyword used to classify preamble type |
preambleTermAlt | alternative keyword (if preambleTerm were not used) |
preambleTextStub | first 15 words of the preamble |
bodyType | body type |
bodyLinesStep | number of steps in the body |
bodyLinesElement | number of elements in the body |
bodyLinesTotal | total number of identified lines in the body |
label | 2-character label of the preamble-body combination; classification table maps label to claim type |
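The patent-level files are essentially aggregates of these claim-level variables. The sketch below rebuilds the per-patent claim-type counts from the sample claim-level file listed above, using only the documented PatentClaim and claimType columns.

```python
# Sketch: aggregate a claim-level file into per-patent claim-type counts,
# mirroring the patent-level files. Uses the sample file from the file list above
# and only the documented PatentClaim and claimType columns.
import pandas as pd

claims = pd.read_csv("claims-patccat-v3-3-sample.csv")

# PatentClaim is "<8-digit patent number>-<4-digit claim number>".
claims["patent_id"] = claims["PatentClaim"].str.split("-").str[0]

counts = (claims.pivot_table(index="patent_id", columns="claimType",
                             values="PatentClaim", aggfunc="count", fill_value=0)
                .rename(columns={0: "noCategory", 1: "processClaims",
                                 2: "productClaims", 3: "prodByProcessClaims"}))
counts["claims"] = counts.sum(axis=1)  # total number of independent claims
print(counts.head())
```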
3.3. List of Variables for Files with Patent-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
patent_id | U.S. patent number (8-digit patent number) |
claims | number of independent claims (the sum of the four claim types: 0, 1, 2, and 3) |
noCategory | number of claims without a classified type |
processClaims | number of process claims |
productClaims | number of product claims |
prodByProcessClaims | number of product-by-process claims |
firstClaim | type of the first claim (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
simpleProcessClaims | number of process claims by simple approach (terms "method" or "process" anywhere in the claim) |
simpleProcessPreamble | number of process claims by simple approach (terms "method" or "process" in the preamble) |
meansClaims | number of means-plus-function claims |
meansFirst | =1 if first claim is a means-plus-function claim |
JepsonClaims | number of Jepson claims |
JepsonFirst | =1 if first claim is a Jepson claim |
Note: The following variables/fields are currently empty (March 30, 2020); they will be populated with data version 3.4.0.
preambleTerm
preambleTermAlt
preambleTextStub
bodyLinesStep
bodyLinesElement
bodyLinesTotal
Note: We will release the data for patents issued in 2021 with data version 3.4.0.
4. Coming Soon!
We are working on a number of extensions of the patccat data.
- With data version 3.4.0, we plan to release data for all published U.S. patent applications (2001 through 2021)
- In late spring/early summer 2022, we will release data for patents issued by the European Patent Office (EPO) [Update: March 28, 2023: see https://doi.org/10.5281/zenodo.7776092]