Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of much research. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the wdc phones dataset for benchmarking entity matching/record linkage methods found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data type specific similarity metrics. The dataset contains 447 records describing products originating from 17 e-shops which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26 and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
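The feature vectors pair each candidate record pair with similarity scores computed per attribute. The following is a minimal illustrative sketch of such data type specific similarity features; the attribute names ("title", "price") and the choice of metrics are assumptions for demonstration, not the exact CompERBench configuration.

```python
# Illustrative sketch of data type specific similarity features for one record pair.
# Attribute names ("title", "price") and metric choices are assumptions for
# demonstration, not the exact configuration behind the published feature vectors.

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased word tokens (for string attributes)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def relative_numeric_sim(a: float, b: float) -> float:
    """1 minus the relative difference (for numeric attributes such as a price)."""
    if a == 0 and b == 0:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))

def pair_features(rec_a: dict, rec_b: dict) -> list:
    """One feature vector for a candidate record pair."""
    return [
        token_jaccard(rec_a.get("title", ""), rec_b.get("title", "")),
        relative_numeric_sim(float(rec_a.get("price", 0.0)), float(rec_b.get("price", 0.0))),
    ]

# Two offers that plausibly describe the same phone.
print(pair_features({"title": "Nokia 3310 blue", "price": 59.0},
                    {"title": "Nokia 3310 Dual SIM blue", "price": 61.5}))
```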
Attribution 1.0 (CC BY 1.0) https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Photogrammetry scan of the inner courtyard of Mannheim Palace (Schloss Mannheim), which now houses the University of Mannheim. Processed using RealityCapture.
Source: Objaverse 1.0 / Sketchfab
🖨️ Additive manufacturing (3D Printing) data is a crucial driver of smart production and a foundational element of Industry 4.0. ISTARI.AI provides verified, scalable Additive Manufacturing data by analyzing how prominently Additive Manufacturing know-how is communicated on company websites. This enables both quantitative benchmarking and qualitative insight into how central Additive Manufacturing is to a company’s offerings - ensuring consistently high data quality and reliability.
📊 The dataset includes: - additive_manufacturing_intensity: Numerical indicator reflecting the prominence of additive manufacturing adoption - additive_manufacturing_intensity_level: Categorized engagement level (from very low to very high) - additive_manufacturing_keywords: Relevant Additive Manufacturing-related keywords found on the company’s website
📊 The Additive Manufacturing Score in Detail: The Additive Manufacturing Score reflects how centrally the topic of additive manufacturing is communicated by the company on its own website and presented as essential to its own business model. It specifically captures evidence of: - Products and services in additive manufacturing - Personnel with skills in additive manufacturing - Strategic positioning of additive manufacturing in the company’s communication
Rather than simple binary classification ("Additive Manufacturing: yes/no"), ISTARI’s WebAI delivers a continuous, nuanced score that distinguishes between marginal mentions of Additive Manufacturing and core Additive Manufacturing-focused business models.
🔍 How do we measure? The webAI AI Agent, developed by ISTARI.AI, reads and analyzes company websites to: - Identify Additive Manufacturing-related keywords - Detect and validate text segments (“paragraphs”) containing Additive Manufacturing-related content - Classify whether a paragraph reflects genuine Additive Manufacturing know-how or simply general information - Calculate a ratio of Additive Manufacturing-know-how paragraphs to total website content, resulting in a numeric Additive Manufacturing Score
This approach ensures a deep contextual analysis of how central Additive Manufacturing is to each company’s external communication and positioning.
🔍 How can the data be interpreted? - 0.0 = No communication of Additive Manufacturing-related know-how - 0.25 = Limited communication; e.g., a consulting firm mentioning "Additive Manufacturing services" among other topics - 2.5+ = High intensity; e.g., a startup exclusively focused on Additive Manufacturing solutions - 3.5+ = Exceptional additive manufacturing focus; typically, AM-first companies or specialized industrial technology providers. An additional categorical interpretation is provided as a helper column, ranging from "very low" to "very high" intensity.
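For convenience, the numeric score can be binned into the categorical helper column. The sketch below is only illustrative: the cut points are assumptions derived from the interpretation guidance above, not the exact bins behind the published additive_manufacturing_intensity_level column.

```python
# Illustrative mapping from the numeric intensity score to a categorical level.
# The cut points are assumptions derived from the interpretation guidance above
# (0.0 / 0.25 / 2.5+ / 3.5+); the exact bins behind the published
# additive_manufacturing_intensity_level column may differ.

def intensity_level(score: float) -> str:
    if score <= 0.0:
        return "very low"   # no communicated Additive Manufacturing know-how
    if score < 0.25:
        return "low"
    if score < 2.5:
        return "medium"     # marginal to moderate communication
    if score < 3.5:
        return "high"       # e.g. a startup focused on Additive Manufacturing
    return "very high"      # AM-first companies, exceptional focus

for s in (0.0, 0.1, 0.3, 2.7, 4.0):
    print(s, "->", intensity_level(s))
```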
✅ Ensuring Data Quality - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy. - Developed together with researchers at the University of Mannheim - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence" - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich - Winner of the Best Paper Award at the R&D Management Conference 2022 - Currently under peer review in a leading international journal
https://www.gesis.org/en/institute/data-usage-terms
The German Internet Panel (GIP) is a long-term study at the University of Mannheim. The GIP examines individual attitudes and preferences that are relevant in political and economic decision-making processes. To this end, more than 3,500 people throughout Germany have been regularly surveyed online every two months since 2012 on a wide range of topics. The GIP is based on a random sample of the general population in Germany between the ages of 16 and 75. The study started in 2012 and was supplemented by new participants in 2014 and 2018. The panel participants were recruited offline. The GIP questionnaires cover a variety of topics that deal with current events.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) from the product category computers. The data is available as training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets because it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products into which typos were introduced. These subsets can be used to measure the performance of methods on specific kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.
🤖 Artificial Intelligence (AI) is a key enabler of innovation and a central pillar of digital transformation. ISTARI.AI provides verified, scalable AI intensity data by analyzing how prominently AI know-how is communicated on company websites. This enables both quantitative benchmarking and qualitative insight into how central AI is to a company’s offerings—ensuring consistently high data quality and reliability.
📊 The dataset includes: - ai_intensity: Numerical indicator reflecting the prominence of AI-related know-how - ai_intensity_level: Categorized engagement level (from very low to very high) - ai_keywords: Relevant AI-related keywords found on the company’s website
📊 The AI Intensity Score in Detail: The AI Intensity Score quantifies the degree to which artificial intelligence is communicated as a core capability or business focus on a company’s website. It specifically captures evidence of: - AI-integrated products or services - AI expertise within the workforce - Strategic positioning of AI in the company’s communication
Rather than simple binary classification ("AI: yes/no"), ISTARI’s WebAI delivers a continuous, nuanced score that distinguishes between marginal mentions of AI and core AI-focused business models.
🔍 How do we measure? The webAI AI Agent, developed by ISTARI.AI, reads and analyzes company websites to: - Identify AI-related keywords - Detect and validate text segments (“paragraphs”) containing AI-related content - Classify whether a paragraph reflects genuine AI know-how or simply general information - Calculate a ratio of AI-know-how paragraphs to total website content, resulting in a numeric AI Intensity score
This approach ensures a deep contextual analysis of how central AI is to each company’s external communication and positioning.
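A minimal sketch of the final scoring step described above, assuming the paragraphs have already been classified; the paragraph classifier itself and any scaling applied to the published ai_intensity value are not reproduced here.

```python
# Illustrative sketch of the final scoring step: the share of website paragraphs
# classified as genuine AI know-how. The paragraph classifier itself and any
# scaling applied to the published ai_intensity value are not reproduced here.

def ai_intensity(paragraph_is_ai_know_how: list) -> float:
    """paragraph_is_ai_know_how[i] is True if paragraph i reflects genuine AI know-how."""
    if not paragraph_is_ai_know_how:
        return 0.0
    return sum(paragraph_is_ai_know_how) / len(paragraph_is_ai_know_how)

# Example: 3 of 40 paragraphs on a company website communicate genuine AI know-how.
print(ai_intensity([True] * 3 + [False] * 37))  # 0.075
```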
🔍 How can the data be interpreted? - 0.0 = No communication of AI-related know-how - 0.25 = Limited communication; e.g., a consulting firm mentioning "AI services" among other topics - 2.5+ = High intensity; e.g., a startup exclusively focused on AI solutions - 3.5+ = Exceptional AI focus; typically, AI-first companies or specialized technology providers An additional categorical interpretation is provided as a helper column, ranging from "very low" to "very high" intensity.
✅ Ensuring Data Quality - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy. - Developed together with researchers at the University of Mannheim - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence" - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich - Winner of the Best Paper Award at the R&D Management Conference 2022 - Currently under peer review in a leading international journal
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset shows selected information on patient admissions to a laboratory for molecular pathology in 2024. The dataset contains the following variables: patient number ("Patienten Nr."), age at the admission date ("Alter (zum Stand: Eingangsdatum)"), admission date ("Eingangsdatum"), diagnosis ("Diagnose"), and diagnosis addition ("Diagnose Zusatz"). This project runs as part of the course "Forschungsdatenmanagement" of the BIDS Master at the University of Mannheim, Germany.
💉 Digital Health data is a crucial driver of smart healthcare delivery and a foundational element of Health 4.0.
ISTARI.AI provides verified, scalable Digital Health data by analyzing how prominently Digital Health know-how is communicated on company websites. This enables both quantitative benchmarking and qualitative insight into how central Digital Health is to a company’s offerings - ensuring consistently high data quality and reliability.
📊 The dataset includes: - digital_health_intensity: Numerical measure of Digital Health technology focus - digital_health_intensity_level: Categorical classification of Digital Health intensity (from very low to very high) - digital_health_keywords: Relevant Digital Health-related keywords found on the company’s website
📊 The Digital Health Score in Detail: The Digital Health Score measures how centrally the topic of Digital Health is communicated by the company on its own website and presented as essential to its own business model. Digital Health covers the use of information and communication technology (ICT) in the field of healthcare. It specifically captures evidence of: - E-health: Companies providing digital solutions or products that support patient care using modern ICT, such as e-prescriptions, electronic health records, and online consultations. - Trend Health: Companies offering Digital Health products and services for private consumers, focusing on self-care, prevention, and health monitoring through wearables and apps. - Tech Health: Companies offering innovative Digital Health solutions for professional users, using technologies like AI, robotics, sensors, big data, and 3D printing. - Strategic positioning of Digital Health in the company’s communication
Rather than simple binary classification ("Digital Health: yes/no"), ISTARI’s WebAI delivers a continuous, nuanced score that distinguishes between marginal mentions of Digital Health and core Digital Health-focused business models.
🔍 How do we measure? The webAI AI Agent, developed by ISTARI.AI, reads and analyzes company websites to: - Identify Digital Health-related keywords - Detect and validate text segments (“paragraphs”) containing Digital Health-related content - Classify whether a paragraph reflects genuine Digital Health know-how or simply general information - Calculate a ratio of Digital Health-know-how paragraphs to total website content, resulting in a numeric Digital Health Score
This approach ensures a deep contextual analysis of how central Digital Health is to each company’s external communication and positioning.
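A minimal, keyword-only sketch of the first detection step described above; the keyword list is an illustrative assumption, and the real webAI agent additionally validates whether a flagged paragraph reflects genuine Digital Health know-how.

```python
import re

# Illustrative keyword spotting for Digital Health paragraphs. The keyword list is
# an assumption for demonstration; the real webAI agent additionally validates
# whether a flagged paragraph reflects genuine Digital Health know-how.
KEYWORDS = ["e-health", "electronic health record", "telemedicine",
            "health app", "wearable", "e-prescription"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def flag_paragraphs(paragraphs):
    """Return (paragraph, matched keywords) for every paragraph mentioning a keyword."""
    hits = []
    for p in paragraphs:
        matched = sorted({m.group(0).lower() for m in PATTERN.finditer(p)})
        if matched:
            hits.append((p, matched))
    return hits

sample = ["We build wearables and a health app for remote patient monitoring.",
          "Our office is located in Mannheim."]
for paragraph, keywords in flag_paragraphs(sample):
    print(keywords, "->", paragraph)
```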
🔍 How can the data be interpreted? - 0.0 = No communication of Digital Health-related know-how - 0.25 = Limited communication; e.g., a consulting firm mentioning "Digital Health services" among other topics - 2.5+ = High intensity; e.g., a startup that is particularly focused on Digital Health solutions - 3.5+ = Exceptional Digital Health focus An additional categorical interpretation is provided as a helper column, ranging from "very low" to "very high" intensity.
✅ Ensuring Data Quality - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy. - Developed together with researchers at the University of Mannheim - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence" - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich - Winner of the Best Paper Award at the R&D Management Conference 2022 - Currently under peer review in a leading international journal
https://doi.org/10.4121/resource:terms_of_use
This dataset comprises event logs (XES = Extensible Event Stream) of activities of daily living performed by several individuals. The event logs were derived from sensor data collected in different scenarios and cover activities such as sleeping, meal preparation, and washing. They show the differing behavior of people in their own homes but also common patterns. The attached event logs were created with Fluxicon Disco (http://fluxicon.com/disco/).
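Because the logs follow the XES standard, they can also be inspected programmatically; below is a minimal sketch using the open-source pm4py library, with a placeholder file name for one of the downloaded logs.

```python
# Minimal sketch: inspect one of the XES event logs programmatically with the
# open-source pm4py library. The file name is a placeholder for a downloaded log.
import pm4py

log = pm4py.read_xes("activities_of_daily_living.xes")  # placeholder path
df = pm4py.convert_to_dataframe(log)

# Frequency of activities of daily living (e.g. sleeping, meal preparation, washing).
print(df["concept:name"].value_counts().head(10))

# Number of cases in the log.
print("cases:", df["case:concept:name"].nunique())
```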
https://choosealicense.com/licenses/other/
BabelEdits
BabelEdits is a benchmark designed to evaluate cross-lingual knowledge editing (CKE) in Large Language Models (LLMs). It enables robust and effective evaluation across 60 languages by combining high-quality entity translations from BabelNet with marker-based translation. BabelEdits is also accompanied by a modular CKE method, BabelReFT, which supports multilingual edit propagation while preserving downstream model performance.
Dataset Summary
As LLMs… See the full description on the dataset page: https://huggingface.co/datasets/umanlp/babeledits.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of much research. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the amazon-google products dataset for benchmarking entity matching/record linkage methods found at: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolutio... The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing as well as their corresponding feature vectors are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 1,363 records describing products from amazon which are matched against 3,226 product records from google. The gold standards have manual annotations for 1,298 matching and 6,306 non-matching pairs. The total number of attributes used to describe the product records is 4 and the attribute density is 0.75. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
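Because the splits and feature vectors are fixed, different matchers can be evaluated on exactly the same pairs. A minimal evaluation sketch is shown below; the file and column names are placeholders rather than the exact names used in the repository.

```python
# Illustrative sketch: train on the fixed training feature vectors and report
# precision/recall/F1 on the fixed test split, so that results are comparable
# across runs. File and column names are placeholders, not the exact names
# used in the repository.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

train = pd.read_csv("train_feature_vectors.csv")  # placeholder file name
test = pd.read_csv("test_feature_vectors.csv")    # placeholder file name

feature_cols = [c for c in train.columns if c not in ("pair_id", "label")]
clf = LogisticRegression(max_iter=1000).fit(train[feature_cols], train["label"])

pred = clf.predict(test[feature_cols])
p, r, f1, _ = precision_recall_fscore_support(test["label"], pred,
                                              average="binary", pos_label=1)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```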
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This new dataset is designed to solve this great NLP task and is crafted with a lot of care.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Automated and statistical methods for estimating latent political traits and classes from textual data hold great promise, since virtually every political act involves the production of text. Statistical models of natural language features, however, are heavily laden with unrealistic assumptions about the process that generates this data, including the stochastic process of text generation, the functional link between political variables and observed text, and the nature of the variables (and dimensions) on which observed text should be conditioned. While acknowledging statistical models of latent traits to be “wrong”, political scientists nonetheless treat their results as sufficiently valid to be useful. In this paper, we address the issue of substantive validity in the face of potential model failure, in the context of unsupervised scaling methods of latent traits. We critically examine one popular parametric measurement model of latent traits for text and then compare its results to systematic human judgments of the texts as a benchmark for validity.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, for each training set, a set of pair ids for a possible validation split (stratified random draw) is available. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
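The provided validation ids can be applied to a training set with a simple filter; a minimal sketch follows, in which the file names, the id file format, and the pair_id column are assumptions for illustration.

```python
# Illustrative sketch: carve the provided validation split out of a training set
# using the shipped id list. File names, the id file format, and the pair_id
# column are assumptions for illustration.
import json
import pandas as pd

pairs = pd.read_json("computers_train_medium.json.gz", lines=True)  # placeholder
with open("computers_valid_medium_ids.json") as f:                  # placeholder
    valid_ids = set(json.load(f))

valid = pairs[pairs["pair_id"].isin(valid_ids)]
train = pairs[~pairs["pair_id"].isin(valid_ids)]
print(len(train), "training pairs,", len(valid), "validation pairs")
```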
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Replication code and data for "A Common Left-Right Scale for Voters and Parties in Europe"
Youth in Europe Study (YES!) is the name of the survey study within the project "Children of Immigrants Longitudinal Survey in Four European Countries" (CILS4EU). It is an international research project aimed at gaining more insight into the current living conditions and opinions of young people. The pupil questionnaire covers several areas, including: School and results, Feelings and opinions, Health, Friends and relationship building, Family Relationships and Leisure.
In each participating country, approximately 5,000 pupils attending 8th grade (or corresponding) were interviewed by means of a questionnaire. In Sweden, approximately 130 schools were randomly selected. The first survey in 2011 was followed by another survey in 2012 (when pupils were in 9th grade), one in 2013 (when respondents had finished compulsory school and entered upper secondary education, the labour market or other activities), and another in 2016.
The survey is conducted in Sweden, Germany, the Netherlands and England. Youth in Europe (YES!) is a joint initiative of researchers from Stockholm University, the University of Mannheim, University of Utrecht, Tilburg University, and University of Oxford.
Purpose:
The purpose of the study is to answer questions on young people’s living conditions and to compare these between countries, e.g.:
What role do school, family and friends play for youth in Europe? What are the hobbies, interests and issues they are engaged in? How do the educational careers of young people with and without an immigration background proceed? What are their educational and occupational goals? What can be done in order to improve the educational chances of all young people?
Youth in Europe Study (YES!) is the name of the survey study within the project "Children of Immigrants Longitudinal Survey in Four European Countries" (CILS4EU). It is an international research project conducted in Sweden, England, the Netherlands and Germany. The study covers several areas, such as school, health, friends, family, leisure, and feelings and opinions. It comprises approximately 5,000 pupils in each country, about 19,000 pupils in total. The Swedish part is based on a sample of approximately 130 compulsory schools. The study is longitudinal at its core; since one of its aims is to study the choice of upper secondary education, the first survey of 8th-grade pupils in 2011 was followed up in 2012 (when the pupils were in 9th grade), in 2013 (when respondents were in their first year of upper secondary school, had started working, or had another occupation), and most recently in 2016. The study is a collaboration between the Swedish Institute for Social Research (SOFI) at Stockholm University and the universities of Mannheim, Utrecht, Tilburg and Oxford. Purpose: The purpose of the study is to answer questions about young people's living conditions and to compare these between countries, e.g.: What role do school, family and friends play for youth in Europe? What does young people's leisure time look like? How do the educational careers of young people with and without an immigration background develop? What are their future plans regarding education and work? What can be done to improve young people's opportunities for education?
The Immigration Policies in Comparison (IMPIC) project provides a set of sophisticated quantitative indices to measure immigration policies in most OECD countries and for the time period 1980-2018. For more information see the project webpage: http://www.impic-project.eu/. An earlier version has been prepublished there.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of much research. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as of correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of such experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the fodors-zagats restaurants dataset for benchmarking entity matching/record linkage methods found at: https://hpi.de/en/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html#c11471 The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing as well as their corresponding feature vectors are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 533 records describing restaurants from fodors.com which are matched against 331 restaurant records from zagat.com. The gold standards have manual annotations for 112 matching and 488 non-matching pairs. The total number of attributes used to describe the records is 5 and the attribute density is 100%. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
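Attribute density can be read here as the share of attribute values that are actually filled across all records and attributes; under that assumption (and with a placeholder file name), it can be recomputed as follows.

```python
# Illustrative sketch: recompute attribute density as the share of non-empty
# attribute values over all records x attributes. This reading of "attribute
# density" is an assumption, and the file name is a placeholder.
import pandas as pd

records = pd.read_csv("fodors_zagats_records.csv")  # placeholder file name
attribute_cols = [c for c in records.columns if c != "record_id"]

filled = records[attribute_cols].notna().sum().sum()
density = filled / (len(records) * len(attribute_cols))
print(f"attribute density: {density:.2f}")  # a fully filled table yields 1.00 (i.e. 100%)
```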
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data version: 3.3.0
Authors:
Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
W. Keith Robinson (Wake Forest University, School of Law)
Michael Seeligson (Southern Methodist University, Cox School of Business)
1. Notes on Data Construction
2. Citation and Code
3. Description of the Data Files
3.1. File List
3.2. List of Variables for Files with Claim-Level Information
3.3. List of Variables for Files with Patent-Level Information
4. Coming Soon!
1. Notes on Data Construction
This is version 3.3.0 of the patccat data (patent claim classification by algorithmic text analysis).
Patent claims define an invention. A patent application is required to have one or more claims that distinctly claim the subject matter which the patent applicant regards as her invention or discovery. We construct a classifier of patent claims that identifies three distinct claim types: process claims, product claims, and product-by-process claims.
For this classification, we combine information obtained from both the preamble and the body of a claim. The preamble is a general description of the invention (e.g., a method, an apparatus, or a device), whereas the body identifies steps and elements (specifying in detail the invention laid out in the preamble) that the applicant is claiming as the invention. The combination of the preamble type and the body type provides us with a more detailed and more accurate classification of claims than other approaches in the literature. This approach also accounts for unconventional drafting approaches. We eventually validate our classification using close to 10,000 manually classified claims.
The data files contain the results of our classification. We provide claim-level information for each independent claim of U.S. utility patents granted between 1836 and 2020. We also provide patent-level information, i.e., the counts of different claim types for a given patent.
For a detailed description of our classification approach, please take a look at the accompanying paper (Ganglmair, Robinson, and Seeligson 2022).
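The variable lists below also document a keyword-based "simple approach" (processSimple, processPreamble, processBody). The sketch below is a much-simplified illustration of that keyword logic only; it is not the authors' full classifier, which combines preamble and body types, and the split at "comprising" is an assumption for demonstration.

```python
# Much-simplified sketch of the keyword-based "simple approach" reflected in the
# processSimple/processPreamble/processBody variables: flag "method"/"process"
# occurrences. The full classifier described above additionally combines preamble
# and body types and is not reproduced here; the split at "comprising" is an
# assumption for demonstration.
import re

PROCESS_TERMS = re.compile(r"\b(method|process)\b", re.IGNORECASE)

def split_claim(text):
    """Naive preamble/body split at the first occurrence of 'comprising'."""
    preamble, _, body = text.partition("comprising")
    return preamble.strip(), body.strip()

def simple_process_flags(claim_text):
    preamble, body = split_claim(claim_text)
    return {
        "processPreamble": int(bool(PROCESS_TERMS.search(preamble))),
        "processBody": int(bool(PROCESS_TERMS.search(body))),
        "processSimple": int(bool(PROCESS_TERMS.search(claim_text))),
    }

claim = ("A method for classifying patent claims, comprising "
         "extracting the preamble and labeling each line of the body.")
print(simple_process_flags(claim))  # {'processPreamble': 1, 'processBody': 0, 'processSimple': 1}
```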
2. Citation and Code
Please cite the following paper when using the data in your own work:
Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.
In the paper, we document the use of process claims in the U.S. over the last century, using the patccat data. We show an increase in the annual share of process claims of about 25 percentage points (from below 10% in 1920). This rise in process intensity of patents is not limited to a few patent classes, but we observe it across a broad spectrum of technologies. Process intensity varies by applicant type: companies file more process-intense patents than individuals, and U.S. applicants file more process-intense patents than foreign applicants. We further show that patents with higher process intensity are more valuable but are not necessarily cited more often. Last, process claims are on average shorter than product claims (with the gap narrowing since the 1970s).
We would love to see how other researchers use the data and eventually learn from it. If you have a discussion paper or a publication in which you use the data, please send us a copy at patccat.data@gmail.com.
We will publish the R code used to construct the data on GitHub with the next data version (version 3.4.0). Contact us at b.ganglmair@gmail.com if you would like to take a look at an earlier version of the code.
3. Description of the Data Files
The data files contain claim-level information for independent claims of 10,140,848 U.S. utility patents granted between 1836 and 2020. The files further contain patent-level information for U.S. utility patents.
3.1. File List
claims-patccat-v3-3-sample.csv | claim-level information for independent claims of a sample of 1000 patents issued between 1976 and 2020 |
claims-patccat-v3-3-1836-1919.csv | claim-level information for independent claims of 1,038,041 patents issued between 1836 and 1919 |
claims-patccat-v3-3-1920-2020.csv | claim-level information for independent claims of 9,102,807 patents issued between 1920 and 2020 |
patents-patccat-v3-3-sample.csv | patent-level information for a sample of 1000 patents issued between 1976 and 2020 |
patents-patccat-v3-3-1836-1919.csv | patent-level information for 1,038,041 patents issued between 1836 and 1919 |
patents-patccat-v3-3-1920-2020.csv | patent-level information for 9,102,807 patents issued between 1920 and 2020 |
3.2. List of Variables for Files with Claim-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
PatentClaim | patent claim identifier; 8-digit patent number and 4-digit claim number (Ex: 01234567-0001) |
singleLine | =1 if claim is published in single-line format |
singleReformat | outcome code of reformatting of single-line claims |
Jepson | =1 if claim is a Jepson claim |
JepsonReformat | outcome code of reformatting of Jepson claims |
inBegin | =1 if claim begins with the word "in" |
wordsPreamble | number of words in the claim preamble |
wordsBody | number of words in the claim body |
dependentClaims | number of dependent claims that refer to this independent claim |
isMeansPreamble | =1 if term "means" is used in the preamble |
isMeansBody | =1 if term "means" is used in the body |
isMeans | =1 if term "means" is used anywhere in the claim (~ means-plus-function claim) |
processPreamble | =1 if terms "method" or "process" are used in the preamble |
processBody | =1 if terms "method" or "process" are used in the body |
processSimple | =1 if terms "method" or "process" are used anywhere in the claim (for simple approach of process claim classification) |
claimType | claim type of full classification (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
preambleType | preamble type |
preambleTerm | keyword used to classify preamble type |
preambleTermAlt | alternative keyword (if preambleTerm were not used) |
preambleTextStub | first 15 words of the preamble |
bodyType | body type |
bodyLinesStep | number of steps in the body |
bodyLinesElement | number of elements in the body |
bodyLinesTotal | total number of identified lines in the body |
label | 2-character label of the preamble-body combination; classification table maps label to claim type |
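The patent-level files are essentially aggregates of these claim-level variables. The sketch below rebuilds the per-patent claim-type counts from the sample claim-level file listed above, using only the documented PatentClaim and claimType columns.

```python
# Sketch: aggregate a claim-level file into per-patent claim-type counts,
# mirroring the patent-level files. Uses the sample file from the file list above
# and only the documented PatentClaim and claimType columns.
import pandas as pd

claims = pd.read_csv("claims-patccat-v3-3-sample.csv")

# PatentClaim is "<8-digit patent number>-<4-digit claim number>".
claims["patent_id"] = claims["PatentClaim"].str.split("-").str[0]

counts = (claims.pivot_table(index="patent_id", columns="claimType",
                             values="PatentClaim", aggfunc="count", fill_value=0)
                .rename(columns={0: "noCategory", 1: "processClaims",
                                 2: "productClaims", 3: "prodByProcessClaims"}))
counts["claims"] = counts.sum(axis=1)  # total number of independent claims
print(counts.head())
```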
3.3. List of Variables for Files with Patent-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
patent_id | U.S. patent number (8-digit patent number) |
claims | number of independent claims (the sum of the four claim types: 0, 1, 2, and 3) |
noCategory | number of claims without a classified type |
processClaims | number of process claims |
productClaims | number of product claims |
prodByProcessClaims | number of product-by-process claims |
firstClaim | type of the first claim (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
simpleProcessClaims | number of process claims by simple approach (terms "method" or "process" anywhere in the claim) |
simpleProcessPreamble | number of process claims by simple approach (terms "method" or "process" in the preamble) |
meansClaims | number of means-plus-function claims |
meansFirst | =1 if first claim is a means-plus-function claim |
JepsonClaims | number of Jepson claims |
JepsonFirst | =1 if first claim is a Jepson claim |
Note: The following variables/fields are currently empty (March 30, 2020); they will be populated with data version 3.4.0.
preambleTerm
preambleTermAlt
preambleTextStub
bodyLinesStep
bodyLinesElement
bodyLinesTotal
Note: We will release the data for patents issued in 2021 with data version 3.4.0.
4. Coming Soon!
We are working on a number of extensions of the patccat data.
- With data version 3.4.0, we plan to release data for all published U.S. patent applications (2001 through 2021)
- In late spring/early summer 2022, we will release data for patents issued by the European Patent Office (EPO) [Update: March 28, 2023: see https://doi.org/10.5281/zenodo.7776092]