Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance the reproducibility and comparability of these experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation, and testing, as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 447 records describing products originating from 17 e-shops, which are matched against a product catalog of 50 products. The gold standard has manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
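The description does not spell out which similarity metrics back the feature vectors, so the following is a minimal sketch of how such data-type-specific features might be computed for a record pair, assuming token-set Jaccard for string attributes and a normalized absolute difference for numeric ones; the attribute names and the two example records are made up.

```python
import pandas as pd

def jaccard(a, b) -> float:
    """Token-set Jaccard similarity for string attributes."""
    if a is None or b is None:
        return 0.0
    ta, tb = set(str(a).lower().split()), set(str(b).lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def numeric_sim(a, b) -> float:
    """Normalized absolute-difference similarity for numeric attributes."""
    if pd.isna(a) or pd.isna(b):
        return 0.0
    denom = max(abs(a), abs(b), 1e-9)
    return 1.0 - min(abs(a - b) / denom, 1.0)

def feature_vector(rec_a: dict, rec_b: dict, str_attrs, num_attrs) -> list:
    """One similarity feature per shared attribute, chosen by data type."""
    feats = [jaccard(rec_a.get(c), rec_b.get(c)) for c in str_attrs]
    feats += [numeric_sim(rec_a.get(c), rec_b.get(c)) for c in num_attrs]
    return feats

# Hypothetical usage with two phone offers:
a = {"brand": "nokia", "title": "nokia lumia 635 8gb", "price": 119.0}
b = {"brand": "Nokia", "title": "Lumia 635 (8 GB)", "price": 124.9}
print(feature_vector(a, b, ["brand", "title"], ["price"]))
```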
📍 Looking for high-quality oil & gas industry data? ISTARI.AI offers tailored POI datasets to fit your exact business needs – whether you’re looking for all oil & gas exploration/refining operations, equipment manufacturers, consultants, sub-suppliers, service providers, or any other specific type of location-based business.
📊 Our POI data includes:
- Organizational structure & key personnel
- Products, services & partnerships
- Verified contact & domain info
- Tech stack & business descriptions
- Detailed geographic data (address, region, country)
We don’t offer one-size-fits-all datasets – instead, you tell us what you need.
This flexibility makes our data ideal for use cases in:
- Location-based services & apps
- Market analysis & competitive intelligence
- Retail expansion & site planning
- Ad targeting & geofencing
- Lead generation & B2B outreach
All POI data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency. With ISTARI.AI, you receive structured POI datasets ready for direct integration into your systems.
✅ Ensuring Data Quality
- The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy.
- Developed together with researchers at the University of Mannheim
- Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
- Co-authored by scholars from the University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich
https://www.gesis.org/en/institute/data-usage-terms
The German Internet Panel (GIP) is a long-term study at the University of Mannheim. The GIP examines individual attitudes and preferences that are relevant in political and economic decision-making processes. To this end, more than 3,500 people throughout Germany have been regularly surveyed online every two months since 2012 on a wide range of topics. The GIP is based on a random sample of the general population in Germany between the ages of 16 and 75. The study started in 2012 and was supplemented by new participants in 2014 and 2018. The panel participants were recruited offline. The GIP questionnaires cover a variety of topics that deal with current events.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the input datasets that were used for evaluating the system LODVec. Moreover, it contains the input for the machine learning tasks, i.e., prediction classes for classification, ratings for regression and the top related entities for a set of movies and basketball players.
The evaluation datasets for movies and music albums were derived from https://www.uni-mannheim.de/dws/research/resources/sw4ml-benchmark/ and belong to the researchers at the University of Mannheim who created them.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) from the product category computers. The data is available as training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets: it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g., products that have no training data in the training set or products into which typos were introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
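As a sketch of how one might consume these splits, here is a deliberately simple pairwise-matching baseline; the file names and the column names (title_left, title_right, and a 0/1 label) are assumptions about the download format, not the challenge's documented schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical file names; the actual release uses its own naming scheme.
train = pd.read_json("computers_train.json.gz", lines=True)
test = pd.read_json("computers_test.json.gz", lines=True)

# Represent each pair by its concatenated offer titles (a deliberately
# simple baseline; real systems use far richer pair features).
def pair_text(df):
    return df["title_left"].fillna("") + " [SEP] " + df["title_right"].fillna("")

vec = TfidfVectorizer(min_df=2)
X_train = vec.fit_transform(pair_text(train))
X_test = vec.transform(pair_text(test))

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print("F1:", f1_score(test["label"], clf.predict(X_test)))
```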
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset comprises event logs (XES = Extensible Event Stream) of activities of daily living performed by several individuals. The event logs were derived from sensor data collected in different scenarios. The activities include, e.g., sleeping, meal preparation, and washing. The event logs show the different behavior of people in their own homes but also common patterns. The attached event logs were created with Fluxicon Disco (http://fluxicon.com/disco/).
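A minimal sketch of inspecting one of these logs programmatically with the pm4py library; the file name is hypothetical (the attached logs were produced with Disco, but XES is a standard interchange format):

```python
import pm4py

# Hypothetical file name for one of the attached XES logs.
log = pm4py.read_xes("activities_of_daily_living.xes")

# Recent pm4py versions return the log as a pandas DataFrame,
# one row per event.
print(log["concept:name"].value_counts())        # activity frequencies
print(log["case:concept:name"].nunique(), "cases")
```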
📍 Looking for high-quality global data on the tourism industry? ISTARI.AI provides comprehensive, ready-to-use datasets covering hotels, tourist agencies, travel agents, travel magazines, bars, and restaurants worldwide – including location, contact, and detailed business information.
📊 Our Tourism data includes:
- Organizational structure & key personnel
- Products, services & partnerships
- Verified contact & domain information
- Technology stack & business descriptions
- Detailed geographic data (address, region, country)
Our datasets are ideal for:
- Location-based services & apps
- Market analysis & competitive intelligence
- Retail expansion & site planning
- Ad targeting & geofencing
- Lead generation & marketing outreach
All data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency.
✅ Ensuring Data Quality
- Developed in close collaboration with academic experts to guarantee expert-level accuracy
- Created together with researchers at the University of Mannheim
- Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
- Co-authored by scholars from the University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich
With ISTARI.AI, you get structured, high-quality tourism datasets from across the globe – ready for direct integration into your systems.
https://choosealicense.com/licenses/other/
BabelEdits
BabelEdits is a benchmark designed to evaluate cross-lingual knowledge editing (CKE) in Large Language Models (LLMs). It enables robust and effective evaluation across 60 languages by combining high-quality entity translations from BabelNet with marker-based translation. BabelEdits is also accompanied by a modular CKE method, BabelReFT, which supports multilingual edit propagation while preserving downstream model performance.
Dataset Summary
As LLMs… See the full description on the dataset page: https://huggingface.co/datasets/umanlp/babeledits.
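A minimal sketch for loading the benchmark from the Hugging Face Hub; whether a configuration name is required, and which splits exist, are assumptions to verify on the dataset page linked above:

```python
from datasets import load_dataset

# Configuration/split names are assumptions; check
# https://huggingface.co/datasets/umanlp/babeledits for the actual ones.
ds = load_dataset("umanlp/babeledits")
print(ds)  # available splits and row counts

first_split = next(iter(ds))
print(ds[first_split][0])  # peek at one example
```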
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance the reproducibility and comparability of these experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the amazon-google products dataset for benchmarking entity matching/record linkage methods found at: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolutio... The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation, and testing, as well as their corresponding feature vectors, are provided. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 1,363 records describing products from Amazon, which are matched against 3,226 product records from Google. The gold standard has manual annotations for 1,298 matching and 6,306 non-matching pairs. The total number of attributes used to describe the product records is 4, and the attribute density is 0.75. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
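Since fixed splits and precomputed feature vectors are shipped with the task, a matching baseline reduces to supervised classification on those vectors. A minimal sketch, with hypothetical file and column names (sim_-prefixed feature columns and a binary label):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical file names; the actual archive layout may differ.
train = pd.read_csv("train_features.csv")
test = pd.read_csv("test_features.csv")

X_cols = [c for c in train.columns if c.startswith("sim_")]
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(train[X_cols], train["label"])

p, r, f1, _ = precision_recall_fscore_support(
    test["label"], clf.predict(test[X_cols]), average="binary")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```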
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Automated and statistical methods for estimating latent political traits and classes from textual data hold great promise, since virtually every political act involves the production of text. Statistical models of natural language features, however, are heavily laden with unrealistic assumptions about the process that generates this data, including the stochastic process of text generation, the functional link between political variables and observed text, and the nature of the variables (and dimensions) on which observed text should be conditioned. While acknowledging statistical models of latent traits to be "wrong", political scientists nonetheless treat their results as sufficiently valid to be useful. In this paper, we address the issue of substantive validity in the face of potential model failure, in the context of unsupervised scaling methods of latent traits. We critically examine one popular parametric measurement model of latent traits for text and then compare its results to systematic human judgments of the texts as a benchmark for validity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) for four product categories: computers, cameras, watches, and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation, and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, for each training set there are sets of IDs available for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived via weak supervision from shared product identifiers on the Web. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
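A minimal sketch of applying one of the provided validation ID sets to carve a validation split out of a training set; the file names and the pair_id column are assumptions about the download format:

```python
import pandas as pd

# Hypothetical file names for one category/size combination.
train = pd.read_json("computers_train_medium.json.gz", lines=True)
valid_ids = set(pd.read_csv("computers_valid_medium_ids.csv")["pair_id"])

# Pairs whose IDs appear in the provided set form the validation split.
valid = train[train["pair_id"].isin(valid_ids)]
train = train[~train["pair_id"].isin(valid_ids)]
print(len(train), "training pairs,", len(valid), "validation pairs")
```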
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance the reproducibility and comparability of these experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the fodors-zagats restaurants dataset for benchmarking entity matching/record linkage methods found at: https://hpi.de/en/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html#c11471 The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation, and testing, as well as their corresponding feature vectors, are provided. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 533 records describing restaurants from fodors.com, which are matched against 331 restaurant records from zagat.com. The gold standard has manual annotations for 112 matching and 488 non-matching pairs. The total number of attributes used to describe the records is 5, and the attribute density is 100%. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
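Attribute density is reported throughout these descriptions but not defined here; a common reading, assumed in this sketch, is the fraction of non-null cells over all records and attributes:

```python
import pandas as pd

def attribute_density(df: pd.DataFrame) -> float:
    """Fraction of non-null cells across all records and attributes."""
    return df.notna().to_numpy().mean()

# Toy example: 3 records, 2 attributes, one missing value -> 5/6.
toy = pd.DataFrame({"name": ["A", "B", "C"], "phone": ["1", None, "3"]})
print(attribute_density(toy))  # 0.833...
```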
Youth in Europe Study (YES!) is the name of the survey study within the project "Children of Immigrants Longitudinal Survey in Four European Countries" (CILS4EU). It is an international research project aimed at gaining more insight into the current living conditions and opinions of young people. The pupil questionnaire covers several areas, including: school and results, feelings and opinions, health, friends and relationship building, family relationships, and leisure. In each participating country, approximately 5,000 pupils attending 8th grade (or the corresponding level) were interviewed by means of a questionnaire, for a total of roughly 19,000 pupils. In Sweden, approximately 130 schools were randomly selected. The study is longitudinal by design: the first survey in 2011 was followed by another in 2012 (when pupils were in 9th grade), one in 2013 (when respondents had finished compulsory school and entered upper secondary education, the labour market, or other activities), and another in 2016. The survey is conducted in Sweden, Germany, the Netherlands, and England. YES! is a joint initiative of researchers from Stockholm University (the Swedish Institute for Social Research, SOFI), the University of Mannheim, Utrecht University, Tilburg University, and the University of Oxford. Purpose: The purpose of the study is to answer questions on young people’s living conditions and to compare these between countries, e.g.: Which role do school, family, and friends play for youth in Europe? What are the hobbies, interests, and issues they are engaged in? How do educational careers of young people with and without an immigration background proceed? What are their educational and occupational goals? What can be done to improve the educational chances of all young people?
📍 Looking for high-quality Point of Interest (POI) data for Germany? ISTARI.AI offers tailored POI datasets to fit your exact business needs – whether you’re looking for all restaurants, gyms, electricians, or any other specific type of location-based business.
📊 Our POI data includes:
- Accurate location data (address, coordinates)
- Contact information (phone numbers, websites, email addresses where available)
- Structured business attributes (opening hours, business category, service offerings, and more)
We don’t offer one-size-fits-all datasets - instead, you tell us what you need. Whether it’s a national dataset of all fitness centers, a list of car repair shops in a specific region, or just all vegan restaurants in major German cities, we generate the dataset based on your POI category and geographic scope.
This flexibility makes our data ideal for use cases in:
- Location-based services & apps
- Market analysis & competitive intelligence
- Retail expansion & site planning
- Ad targeting & geofencing
- Lead generation & B2B outreach
All POI data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency.
Tell us your POI requirements - we’ll handle the rest. With ISTARI.AI, you receive structured POI datasets ready for direct integration into your systems.
✅ Ensuring Data Quality
- The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy.
- Developed together with researchers at the University of Mannheim
- Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
- Co-authored by scholars from the University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich
We study the impact of working from home on (i) infection risk in German regions and (ii) output, using an input-output (IO) model of the German economy. We find that working from home is very effective in reducing infection risk: regions whose industry structure allows for a larger fraction of work to be done from home experienced far fewer Covid-19 cases and fatalities. Moreover, confinement is significantly more costly in terms of induced output loss in regions where the share of workers who can work from home is lower. When phasing out confinement, home office should be maintained as long as possible, to allow those workers who cannot work from home to go back to work while keeping infection risk minimal. Finally, systemic industries (with high multipliers and/or high value added per worker) should be given priority, especially those where home office is not possible.
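For readers unfamiliar with the machinery behind such multiplier arguments, the standard Leontief quantity model solves x = (I - A)^{-1} d for gross output x given a technology matrix A and final demand d. A minimal sketch with a made-up three-sector economy (illustrative numbers, not the paper's data):

```python
import numpy as np

# Leontief quantity model: gross output x solves x = A x + d,
# i.e. x = (I - A)^{-1} d. A and d below are made up.
A = np.array([[0.10, 0.30, 0.05],
              [0.20, 0.10, 0.20],
              [0.05, 0.15, 0.10]])   # input requirements per unit of output
d = np.array([100.0, 50.0, 80.0])   # final demand by sector

L = np.linalg.inv(np.eye(3) - A)    # Leontief inverse
x = L @ d
print("gross output by sector:", x)
print("output multipliers:", L.sum(axis=0))  # column sums of the inverse
```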
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
!!! This is the EPO/European version of the patccat classifier of patent claims. !!!
Note: We use the same approach that we use for USPTO patents. For a detailed description, see https://doi.org/10.5281/zenodo.6395307.
Data version: 3.4.0
Authors:
Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
W. Keith Robinson (Wake Forest University, School of Law)
Michael Seeligson (Southern Methodist University, Cox School of Business)
Please cite the following paper when using the data in your own work:
Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data version: 3.3.0
Authors:
Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
W. Keith Robinson (Wake Forest University, School of Law)
Michael Seeligson (Southern Methodist University, Cox School of Business)
1. Notes on Data Construction
2. Citation and Code
3. Description of the Data Files
3.1. File List
3.2. List of Variables for Files with Claim-Level Information
3.3. List of Variables for Files with Patent-Level Information
4. Coming Soon!
1. Notes on Data Construction
This is version 3.3.0 of the patccat data (patent claim classification by algorithmic text analysis).
Patent claims define an invention. A patent application is required to have one or more claims that distinctly claim the subject matter which the patent applicant regards as her invention or discovery. We construct a classifier of patent claims that identifies three distinct claim types: process claims, product claims, and product-by-process claims.
For this classification, we combine information obtained from both the preamble and the body of a claim. The preamble is a general description of the invention (e.g., a method, an apparatus, or a device), whereas the body identifies steps and elements (specifying in detail the invention laid out in the preamble) that the applicant is claiming as the invention. The combination of the preamble type and the body type provides us with a more detailed and more accurate classification of claims than other approaches in the literature. This approach also accounts for unconventional drafting approaches. We eventually validate our classification using close to 10,000 manually classified claims.
The data files contain the results of our classification. We provide claim-level information for each independent claim of U.S. utility patents granted between 1836 and 2020. We also provide patent-level information, i.e., the counts of different claim types for a given patent.
For a detailed description of our classification approach, please take a look at the accompanying paper (Ganglmair, Robinson, and Seeligson 2022).
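As a flavor of the keyword side of the approach, the sketch below implements the simple rule encoded by the processSimple, processPreamble, and processBody variables documented in Section 3.2 (the terms "method" or "process" appearing in the claim); splitting preamble from body at the first comma is a crude stand-in for the paper's actual claim parsing, used here only for illustration:

```python
import re

def simple_claim_flags(claim_text: str) -> dict:
    """Keyword flags mirroring processPreamble/processBody/processSimple.

    The preamble/body split on the first comma is a rough heuristic,
    not the classifier's real parsing of claim structure.
    """
    preamble, _, body = claim_text.partition(",")
    has = lambda s: bool(re.search(r"\b(method|process)\b", s, re.I))
    return {
        "processPreamble": has(preamble),
        "processBody": has(body),
        "processSimple": has(claim_text),
    }

claim = ("A method for classifying patent claims, comprising: "
         "extracting the preamble; and labeling the body.")
print(simple_claim_flags(claim))
```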
2. Citation and Code
Please cite the following paper when using the data in your own work:
Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.
In the paper, we document the use of process claims in the U.S. over the last century, using the patccat data. We show an increase in the annual share of process claims of about 25 percentage points (from below 10% in 1920). This rise in process intensity of patents is not limited to a few patent classes, but we observe it across a broad spectrum of technologies. Process intensity varies by applicant type: companies file more process-intense patents than individuals, and U.S. applicants file more process-intense patents than foreign applicants. We further show that patents with higher process intensity are more valuable but are not necessarily cited more often. Last, process claims are on average shorter than product claims (with the gap narrowing since the 1970s).
We would love to see how other researchers use the data and eventually learn from it. If you have a discussion paper or a publication in which you use the data, please send us a copy at patccat.data@gmail.com.
We will release the R code used to construct the data on GitHub with the next data version (version 3.4.0). Contact us at b.ganglmair@gmail.com if you would like to take a look at an earlier version of the code.
3. Description of the Data Files
The data files contain claim-level information for independent claims of 10,140,848 U.S. utility patents granted between 1836 and 2020. The files further contain patent-level information for U.S. utility patents.
3.1. File List
claims-patccat-v3-3-sample.csv | claim-level information for independent claims of a sample of 1,000 patents issued between 1976 and 2020 |
claims-patccat-v3-3-1836-1919.csv | claim-level information for independent claims of 1,038,041 patents issued between 1836 and 1919 |
claims-patccat-v3-3-1920-2020.csv | claim-level information for independent claims of 9,102,807 patents issued between 1920 and 2020 |
patents-patccat-v3-3-sample.csv | patent-level information for a sample of 1,000 patents issued between 1976 and 2020 |
patents-patccat-v3-3-1836-1919.csv | patent-level information for 1,038,041 patents issued between 1836 and 1919 |
patents-patccat-v3-3-1920-2020.csv | patent-level information for 9,102,807 patents issued between 1920 and 2020 |
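A minimal sketch of loading a claim-level file and joining it to the corresponding patent-level file, assuming the CSVs sit in the working directory:

```python
import pandas as pd

claims = pd.read_csv("claims-patccat-v3-3-sample.csv",
                     dtype={"PatentClaim": str})
patents = pd.read_csv("patents-patccat-v3-3-sample.csv",
                      dtype={"patent_id": str})

# PatentClaim is an 8-digit patent number plus a 4-digit claim number
# (e.g., 01234567-0001), so the patent number is the part before "-".
claims["patent_id"] = claims["PatentClaim"].str.split("-").str[0]
merged = claims.merge(patents, on="patent_id", how="left")
print(merged[["PatentClaim", "claimType", "claims"]].head())
```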
3.2. List of Variables for Files with Claim-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
PatentClaim | patent claim identifier; 8-digit patent number and 4-digit claim number (Ex: 01234567-0001) |
singleLine | =1 if claim is published in single-line format |
singleReformat | outcome code of reformatting of single-line claims |
Jepson | =1 if claim is a Jepson claim |
JepsonReformat | outcome code of reformatting of Jepson claims |
inBegin | =1 if claim begins with the word "in" |
wordsPreamble | number of words in the claim preamble |
wordsBody | number of words in the claim body |
dependentClaims | number of dependent claims that refer to this independent claim |
isMeansPreamble | =1 if term "means" is used in the preamble |
isMeansBody | =1 if term "means" is used in the body |
isMeans | =1 if term "means" is used anywhere in the claim (~ means-plus-function claim) |
processPreamble | =1 if terms "method" or "process" are used in the preamble |
processBody | =1 if terms "method" or "process" are used in the body |
processSimple | =1 if terms "method" or "process" are used anywhere in the claim (for simple approach of process claim classification) |
claimType | claim type of full classification (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
preambleType | preamble type |
preambleTerm | keyword used to classify preamble type |
preambleTermAlt | alternative keyword (if preambleTerm had not been used) |
preambleTextStub | first 15 words of the preamble |
bodyType | body type |
bodyLinesStep | number of steps in the body |
bodyLinesElement | number of elements in the body |
bodyLinesTotal | total number of identified lines in the body |
label | 2-character label of the preamble-body combination; classification table maps label to claim type |
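Using the claimType codes above, the share of each claim type in a file can be tabulated in a few lines (a sketch against the sample file):

```python
import pandas as pd

# Codes documented above: 1 = process; 2 = product;
# 3 = product-by-process; 0 = no type.
TYPE_LABELS = {0: "no type", 1: "process", 2: "product",
               3: "product-by-process"}

claims = pd.read_csv("claims-patccat-v3-3-sample.csv")
shares = claims["claimType"].map(TYPE_LABELS).value_counts(normalize=True)
print(shares.round(3))
```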
3.3. List of Variables for Files with Patent-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
patent_id | U.S. patent number (8-digit patent number) |
claims | number of independent claims (the sum of the four claim types: 0, 1, 2, and 3) |
noCategory | number of claims without a classified type |
processClaims | number of process claims |
productClaims | number of product claims |
prodByProcessClaims | number of product-by-process claims |
firstClaim | type of the first claim (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
simpleProcessClaims | number of process claims by simple approach (terms "method" or "process" anywhere in the claim) |
simpleProcessPreamble | number of process claims by simple approach (terms "method" or "process" in the preamble) |
meansClaims | number of means-plus-function claims |
meansFirst | =1 if first claim is a means-plus-function claim |
JepsonClaims | number of Jepson claims |
JepsonFirst | =1 if first claim is a Jepson claim |
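The patent-level counts are, by construction, aggregations of the claim-level classification (claims = noCategory + processClaims + productClaims + prodByProcessClaims), so they can be rebuilt as a consistency check. A sketch, again against the sample files:

```python
import pandas as pd

claims = pd.read_csv("claims-patccat-v3-3-sample.csv",
                     dtype={"PatentClaim": str})
claims["patent_id"] = claims["PatentClaim"].str.split("-").str[0]

# Count independent claims of each type per patent.
counts = (claims.pivot_table(index="patent_id", columns="claimType",
                             aggfunc="size", fill_value=0)
          .rename(columns={0: "noCategory", 1: "processClaims",
                           2: "productClaims", 3: "prodByProcessClaims"}))
counts["claims"] = counts.sum(axis=1)
print(counts.head())
```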
Note: The following variables/fields are currently empty (March 30, 2020); we will populate these variables/fields with data version 3.4.0.
preambleTerm
preambleTermAlt
preambleTextStub
bodyLinesStep
bodyLinesElement
bodyLinesTotal
Note: We will release the data for patents issued in 2021 with data version 3.4.0.
4. Coming Soon!
We are working on a number of extensions of the patccat data.
- With data version 3.4.0, we plan to release data for all published U.S. patent applications (2001 through 2021)
- In late spring/early summer 2022, we will release data for patents issued by the European Patent Office (EPO) [Update: March 28, 2023: see https://doi.org/10.5281/zenodo.7776092]
📍 Looking for high-quality mining industry data? ISTARI.AI offers tailored POI datasets to fit your exact business needs – whether you’re looking for all mining operations, equipment manufacturers, consultants, sub-suppliers, service providers, or any other specific type of location-based business.
📊 Our POI data includes:
- Organizational structure & key personnel
- Products, services & partnerships
- Verified contact & domain info
- Tech stack & business descriptions
- Detailed geographic data (address, region, country)
We don’t offer one-size-fits-all datasets – instead, you tell us what you need.
This flexibility makes our data ideal for use cases in:
- Location-based services & apps
- Market analysis & competitive intelligence
- Retail expansion & site planning
- Ad targeting & geofencing
- Lead generation & B2B outreach
All POI data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency. With ISTARI.AI, you receive structured POI datasets ready for direct integration into your systems.
✅ Ensuring Data Quality
- The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy.
- Developed together with researchers at the University of Mannheim
- Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
- Co-authored by scholars from the University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich
PSID data extract for computing per capita white-to-Black wealth gaps and active saving rates of Black and white Americans during 1984-2019.