44 datasets found
  1. Web Data Commons Phones Dataset, Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    delimited
    Updated Nov 23, 2020
    Cite
    Anna Primpeli; Christian Bizer (2020). Web Data Commons Phones Dataset, Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127243V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

    Dataset Description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods (original dataset: http://webdatacommons.org/productcorpus/index.html#toc4). The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data type specific similarity metrics. The dataset contains 447 records describing products from 17 e-shops, which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, while the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
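
    As a rough illustration of what "feature vectors built using data type specific similarity metrics" can look like, the sketch below compares a record pair attribute by attribute. It is not the authors' implementation: the function names, the choice of `difflib` for strings, the relative-difference score for numbers, the `-1.0` marker for missing values, and the example attributes are all assumptions made for this example.

```python
from difflib import SequenceMatcher

MISSING = -1.0  # assumed sentinel for attributes absent in either record


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_sim(a: float, b: float) -> float:
    """1 minus the relative difference, clipped to [0, 1] (stand-in metric)."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / largest)


def feature_vector(left: dict, right: dict, schema: dict) -> list:
    """Return one similarity score per attribute, chosen by its data type."""
    features = []
    for attribute, kind in schema.items():
        a, b = left.get(attribute), right.get(attribute)
        if a is None or b is None:
            features.append(MISSING)
        elif kind == "numeric":
            features.append(numeric_sim(float(a), float(b)))
        else:
            features.append(string_sim(str(a), str(b)))
    return features


# Hypothetical schema and records, for illustration only.
schema = {"brand": "string", "memory_gb": "numeric", "color": "string"}
offer = {"brand": "Apple", "memory_gb": 64}
catalog_entry = {"brand": "apple", "memory_gb": 32, "color": "black"}
vector = feature_vector(offer, catalog_entry, schema)
```

    With a low attribute density such as the 0.25 reported above, most entries of such a vector would be the missing-value marker, so how missing values are encoded matters for any classifier trained on it.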

  2. ISTARI.AI | Points of Interest Dataset (POI) | Oil & Gas Industry | Verified...

    • datarade.ai
    Updated Aug 14, 2025
    Cite
    Istari.AI (2025). ISTARI.AI | Points of Interest Dataset (POI) | Oil & Gas Industry | Verified Company Profiles filtered from over 40M+ companies [Dataset]. https://datarade.ai/data-products/istari-ai-points-of-interest-dataset-poi-oil-gas-indu-istari-ai
    Explore at:
    Dataset updated
    Aug 14, 2025
    Dataset provided by
    Istari.AI
    Area covered
    Togo, Gibraltar, Aruba, Denmark, Cook Islands, Poland, Bulgaria, Heard Island and McDonald Islands, Sint Eustatius and Saba, France
    Description

    📍 Looking for high-quality oil & gas industry data? ISTARI.AI offers tailored POI datasets to fit your exact business needs – whether you’re looking for all oil & gas exploration/refining operations, equipment manufacturers, consultants, sub-suppliers, service providers, or any other specific type of location-based business.

    📊 Our POI data includes:
    - Organizational structure & key personnel
    - Products, services & partnerships
    - Verified contact & domain info
    - Tech stack & business descriptions
    - Detailed geographic data (address, region, country)

    We don’t offer one-size-fits-all datasets – instead, you tell us what you need.

    This flexibility makes our data ideal for use cases in:
    - Location-based services & apps
    - Market analysis & competitive intelligence
    - Retail expansion & site planning
    - Ad targeting & geofencing
    - Lead generation & B2B outreach

    All POI data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency. With ISTARI.AI, you receive structured POI datasets ready for direct integration into your systems.

    ✅ Ensuring Data Quality
    - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy.
    - Developed together with researchers at the University of Mannheim
    - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
    - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich

  3. German Internet Panel, Welle 66 (Juli 2023)

    • search.gesis.org
    Updated Jul 26, 2024
    + more versions
    Cite
    German Internet Panel, Universität Mannheim (2024). German Internet Panel, Welle 66 (Juli 2023) [Dataset]. http://doi.org/10.4232/1.14370
    Explore at:
    Available download formats: (72162), (65025)
    Dataset updated
    Jul 26, 2024
    Dataset provided by
    GESIS
    GESIS search
    Authors
    German Internet Panel, Universität Mannheim
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Time period covered
    Jul 1, 2023 - Jul 31, 2023
    Area covered
    Germany
    Description

    The German Internet Panel (GIP) is a long-term study at the University of Mannheim. The GIP examines individual attitudes and preferences that are relevant in political and economic decision-making processes. To this end, more than 3,500 people throughout Germany have been regularly surveyed online every two months since 2012 on a wide range of topics. The GIP is based on a random sample of the general population in Germany between the ages of 16 and 75. The study started in 2012 and was supplemented by new participants in 2014 and 2018. The panel participants were recruited offline. The GIP questionnaires cover a variety of topics that deal with current events.

  4. LODVec Evaluation Datasets and Experiments

    • zenodo.org
    bin, zip
    Updated Dec 3, 2024
    Cite
    Michalis Mountantonakis (2024). LODVec Evaluation Datasets and Experiments [Dataset]. http://doi.org/10.5281/zenodo.14266984
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michalis Mountantonakis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the input datasets that were used for evaluating the system LODVec. Moreover, it contains the input for the machine learning tasks, i.e., prediction classes for classification, ratings for regression and the top related entities for a set of movies and basketball players.

    The evaluation datasets for movies and music albums were derived from https://www.uni-mannheim.de/dws/research/resources/sw4ml-benchmark/, and belong to these researchers (University of Mannheim).

    https://demos.isl.ics.forth.gr/lodvec/

  5. Data from: Product Datasets from the MWPD2020 Challenge at the ISWC2020...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) [Dataset]. http://doi.org/10.3886/E127482V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) from the product category computers. The data is available in the form of training, validation and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets because it contains more very hard matching cases as well as a variety of matching challenges for a subset of the pairs, e.g. products without training data in the training set or products which have had typos introduced. These can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with the schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below.

  6. xscitldr

    • huggingface.co
    Updated May 14, 2023
    + more versions
    Cite
    NLP & IR Group @ University of Mannheim (2023). xscitldr [Dataset]. https://huggingface.co/datasets/umanlp/xscitldr
    Explore at:
    Dataset updated
    May 14, 2023
    Dataset authored and provided by
    NLP & IR Group @ University of Mannheim
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This new dataset is designed to solve this great NLP task and is crafted with a lot of care.

  7. Activities of daily living of several individuals

    • narcis.nl
    • datasetcatalog.nlm.nih.gov
    • +1 more
    Updated Nov 3, 2015
    Cite
    Timo Sztyler; J. (Josep) Carmona (2015). Activities of daily living of several individuals [Dataset]. http://doi.org/10.4121/uuid:01eaba9f-d3ed-4e04-9945-b8b302764176
    Explore at:
    Available download formats (media types): application/x-gzip, application/zip, text/plain
    Dataset updated
    Nov 3, 2015
    Dataset provided by
    University of Mannheim, Germany
    Authors
    Timo Sztyler; J. (Josep) Carmona
    Description

    This dataset comprises event logs (XES = Extensible Event Stream) regarding the activities of daily living performed by several individuals. The event logs were derived from sensor data collected in different scenarios. The activities include, e.g., sleeping, meal preparation, and washing. The event logs show the different behavior of people in their own homes but also common patterns. The attached event logs were created with Fluxicon Disco (http://fluxicon.com/disco/).

  8. ISTARI.AI | Points of Interest Dataset (POI) | Tourism Industry | Verified...

    • datarade.ai
    Updated Aug 14, 2025
    + more versions
    Cite
    Istari.AI (2025). ISTARI.AI | Points of Interest Dataset (POI) | Tourism Industry | Verified Profiles filtered from over 40M+ companies [Dataset]. https://datarade.ai/data-products/istari-ai-points-of-interest-dataset-poi-tourism-indust-istari-ai
    Explore at:
    Available download formats: .json, .csv, .xls, .txt, .parquet, .pdf
    Dataset updated
    Aug 14, 2025
    Dataset provided by
    Istari.AI
    Area covered
    Central African Republic, Guam, Nepal, Guernsey, Maldives, Israel, Burundi, Equatorial Guinea, Norfolk Island, Zimbabwe
    Description

    📍 Looking for high-quality global data on the tourism industry? ISTARI.AI provides comprehensive, ready-to-use datasets covering hotels, tourist agencies, travel agents, travel magazines, bars, and restaurants worldwide – including location, contact, and detailed business information.

    📊 Our Tourism data includes:
    - Organizational structure & key personnel
    - Products, services & partnerships
    - Verified contact & domain information
    - Technology stack & business descriptions
    - Detailed geographic data (address, region, country)

    Our datasets are ideal for:
    - Location-based services & apps
    - Market analysis & competitive intelligence
    - Retail expansion & site planning
    - Ad targeting & geofencing
    - Lead generation & marketing outreach

    All data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency.

    ✅ Ensuring Data Quality
    - Developed in close collaboration with academic experts to guarantee expert-level accuracy
    - Created together with researchers at the University of Mannheim
    - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
    - Co-authored by scholars from the University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich

    With ISTARI.AI, you get structured, high-quality tourism datasets from across the globe – ready for direct integration into your systems.

  9. babeledits

    • huggingface.co
    Updated Aug 1, 2025
    Cite
    NLP & IR Group @ University of Mannheim (2025). babeledits [Dataset]. https://huggingface.co/datasets/umanlp/babeledits
    Explore at:
    Dataset updated
    Aug 1, 2025
    Dataset authored and provided by
    NLP & IR Group @ University of Mannheim
    License

    https://choosealicense.com/licenses/other/

    Description

    BabelEdits

    BabelEdits is a benchmark designed to evaluate cross-lingual knowledge editing (CKE) in Large Language Models (LLMs). It enables robust and effective evaluation across 60 languages by combining high-quality entity translations from BabelNet with marker-based translation. BabelEdits is also accompanied by a modular CKE method, BabelReFT, which supports multilingual edit propagation while preserving downstream model performance.

    Dataset Summary

    As LLMs… See the full description on the dataset page: https://huggingface.co/datasets/umanlp/babeledits.

  10. Amazon-Google, Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    Updated Nov 23, 2020
    Cite
    Anna Primpeli; Christian Bizer (2020). Amazon-Google, Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127241V1
    Explore at:
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

    Dataset Description: An augmented version of the Amazon-Google products dataset for benchmarking entity matching/record linkage methods (original dataset: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolutio...). The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing as well as their corresponding feature vectors are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 1,363 records describing products from Amazon, which are matched against 3,226 product records from Google. The gold standards have manual annotations for 1,298 matching and 6,306 non-matching pairs. The total number of attributes used to describe the product records is 4, while the attribute density is 0.75. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html

  11. Replication data for: Validating estimates of latent traits from textual...

    • data.niaid.nih.gov
    • dataverse.harvard.edu
    Updated Oct 1, 2014
    Cite
    Will Lowe; Kenneth Benoit (2014). Replication data for: Validating estimates of latent traits from textual data using human judgement as a benchmark [Dataset]. http://doi.org/10.7910/DVN/QWPDQJ
    Explore at:
    Available download formats: zip, text/plain; charset=us-ascii
    Dataset updated
    Oct 1, 2014
    Dataset provided by
    London School of Economics and Political Science
    University of Mannheim
    Authors
    Will Lowe; Kenneth Benoit
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Ireland
    Description

    Automated and statistical methods for estimating latent political traits and classes from textual data hold great promise, since virtually every political act involves the production of text. Statistical models of natural language features, however, are heavily laden with unrealistic assumptions about the process that generates this data, including the stochastic process of text generation, the functional link between political variables and observed text, and the nature of the variables (and dimensions) on which observed text should be conditioned. While acknowledging statistical models of latent traits to be “wrong”, political scientists nonetheless treat their results as sufficiently valid to be useful. In this paper, we address the issue of substantive validity in the face of potential model failure, in the context of unsupervised scaling methods of latent traits. We critically examine one popular parametric measurement model of latent traits for text and then compare its results to systematic human judgments of the texts as a benchmark for validity.

  12. Data from: Web Data Commons Training and Test Sets for Large-Scale Product...

    • linkagelibrary.icpsr.umich.edu
    • da-ra.de
    Updated Nov 26, 2020
    + more versions
    Cite
    Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
    Explore at:
    Dataset updated
    Nov 26, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Ralph Peeters; Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
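
    The "stratified random draw" behind the validation id sets can be pictured with the short sketch below. It is only an illustration under stated assumptions: the function name, the 20% fraction, the seed, and the toy labels are hypothetical and not taken from the dataset; the published id files should be used when reproducing results.

```python
import random
from collections import defaultdict


def stratified_validation_ids(labels: dict, fraction: float = 0.2, seed: int = 42) -> set:
    """Draw the same fraction of pair ids from every label class.

    `labels` maps a pair id to its label (e.g. 1 = match, 0 = no match).
    The fraction and seed are illustrative defaults, not dataset settings.
    """
    by_label = defaultdict(list)
    for pair_id, label in labels.items():
        by_label[label].append(pair_id)

    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    validation = set()
    for ids in by_label.values():
        ids.sort()  # stable order before sampling, independent of dict order
        sample_size = round(len(ids) * fraction)
        validation.update(rng.sample(ids, sample_size))
    return validation


# Toy training set: 20 matching and 80 non-matching pairs (made-up numbers).
toy_labels = {pair_id: (1 if pair_id < 20 else 0) for pair_id in range(100)}
validation_ids = stratified_validation_ids(toy_labels)
training_ids = set(toy_labels) - validation_ids
```

    Drawing per class keeps the match/no-match ratio of the validation split equal to that of the full training set, which matters when matches are the minority class.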

  13. Restaurants (Fodors-Zagats), Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    delimited
    Updated Nov 23, 2020
    Cite
    Anna Primpeli; Christian Bizer (2020). Restaurants (Fodors-Zagats), Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127242V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

    Dataset Description: An augmented version of the Fodors-Zagats restaurants dataset for benchmarking entity matching/record linkage methods (original dataset: https://hpi.de/en/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html#c11471). The augmented version adds a fixed set of non-matching pairs to the original dataset. In addition, fixed splits for training, validation and testing as well as their corresponding feature vectors are provided. The feature vectors are built using data type specific similarity metrics. The dataset contains 533 records describing restaurants from fodors.com, which are matched against 331 restaurant records from zagat.com. The gold standards have manual annotations for 112 matching and 488 non-matching pairs. The total number of attributes used to describe the records is 5, while the attribute density is 100%. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html

  14. Youth in Europe Study (YES!) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jul 7, 2024
    + more versions
    Cite
    (2024). Youth in Europe Study (YES!) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/52740ac2-dd63-522f-9de7-1e2460ce0eaa
    Explore at:
    Dataset updated
    Jul 7, 2024
    Description

    Youth in Europe Study (YES!) is the name of the survey study within the project "Children of Immigrants Longitudinal Survey in Four European Countries" (CILS4EU). It is an international research project aimed at gaining more insight into the current living conditions and opinions of young people. The pupil questionnaire covers several areas, including: school and results, feelings and opinions, health, friends and relationship building, family relationships, and leisure. In each participating country, approximately 5,000 pupils attending 8th grade (or corresponding) were interviewed by means of a questionnaire, about 19,000 pupils in total. In Sweden, approximately 130 schools were randomly selected. The study is longitudinal: because one of its aims is to study the choice of upper secondary education, the first survey in 2011 was followed by another survey in 2012 (when pupils were in 9th grade), one in 2013 (when respondents had finished compulsory school and entered upper secondary education, the labour market or another occupation), and another in 2016. The survey is conducted in Sweden, Germany, the Netherlands and England. Youth in Europe (YES!) is a joint initiative of researchers from the Swedish Institute for Social Research (SOFI) at Stockholm University, the University of Mannheim, University of Utrecht, Tilburg University, and University of Oxford.

    Purpose: The purpose of the study is to answer questions on young people’s living conditions and to compare these between countries, e.g.: What role do school, family and friends play for youth in Europe? What are the hobbies, interests and issues they are engaged in? How do educational careers of young people with and without immigration background proceed? What are their educational and occupational goals? What can be done in order to improve educational chances of all young people?

  15. ISTARI.AI | Points of Interest Dataset (POI) | Germany | 35+ Attributes |...

    • datarade.ai
    Updated Aug 7, 2025
    Cite
    Istari.AI (2025). ISTARI.AI | Points of Interest Dataset (POI) | Germany | 35+ Attributes | 40M+ Verified Company Profiles [Dataset]. https://datarade.ai/data-products/istari-ai-points-of-interest-dataset-poi-germany-35-istari-ai
    Explore at:
    Available download formats: .json, .csv, .xls, .parquet
    Dataset updated
    Aug 7, 2025
    Dataset provided by
    Istari.AI
    Area covered
    Germany
    Description

    📍 Looking for high-quality Point of Interest (POI) data for Germany? ISTARI.AI offers tailored POI datasets to fit your exact business needs – whether you’re looking for all restaurants, gyms, electricians, or any other specific type of location-based business.

    📊 Our POI data includes:
    - Accurate location data (address, coordinates)
    - Contact information (phone numbers, websites, email addresses where available)
    - Structured business attributes (opening hours, business category, service offerings, and more)

    We don’t offer one-size-fits-all datasets - instead, you tell us what you need. Whether it’s a national dataset of all fitness centers, a list of car repair shops in a specific region, or just all vegan restaurants in major German cities, we generate the dataset based on your POI category and geographic scope.

    This flexibility makes our data ideal for use cases in:
    - Location-based services & apps
    - Market analysis & competitive intelligence
    - Retail expansion & site planning
    - Ad targeting & geofencing
    - Lead generation & B2B outreach

    All POI data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency.

    Tell us your POI requirements - we’ll handle the rest. With ISTARI.AI, you receive structured POI datasets ready for direct integration into your systems.

    ✅ Ensuring Data Quality
    - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy.
    - Developed together with researchers at the University of Mannheim
    - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
    - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich

  16. Data from: The Costs and Benefits of Home Office during the Covid-19...

    • openicpsr.org
    delimited, stata
    Updated Nov 1, 2020
    Cite
    Jan Schymik; Harald Fadinger (2020). The Costs and Benefits of Home Office during the Covid-19 Pandemic - Evidence from Infections and an Input-Output Model for Germany [Dataset]. https://www.openicpsr.org/openicpsr/project/124902/view?path=/openicpsr/124902/fcr:versions/V2/---Replication-Package-for-Web.zip&type=file
    Available download formats: stata, delimited
    Dataset updated
    Nov 1, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Jan Schymik; Harald Fadinger
    Area covered
    Germany
    Dataset funded by
    Deutsche Forschungsgemeinschaft (Germany)
    Description

    We study the impact of working from home on (i) infection risk in German regions and (ii) output using an input-output (IO) model of the German economy. We find that working from home is very effective in reducing infection risk: regions whose industry structure allows for a larger fraction of work to be done from home experienced far fewer Covid-19 cases and fatalities. Moreover, confinement is significantly more costly in terms of induced output loss in regions where the share of workers who can work from home is lower. When phasing out confinement, home office should be maintained as long as possible, to allow those workers who cannot work from home to go back to work, while keeping infection risk minimal. Finally, systemic industries (with high multipliers and/or high value added per worker) should be given priority, especially those where home office is not possible.

  17. The patccat classifier for patent claims - EPO edition

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 28, 2023
    Cite
    Ganglmair, Bernhard (2023). The patccat classifier for patent claims - EPO edition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7776092
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Ganglmair, Bernhard
    Seeligson, Michael
    Robinson, W. Keith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    !!! This is the EPO/European version of the patccat classifier of patent claims. !!!

    Note: We use the same approach that we use for USPTO patents. For a detailed description, see https://doi.org/10.5281/zenodo.6395307.

    Data version: 3.4.0

    Authors:
    Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
    W. Keith Robinson (Wake Forest University, School of Law)
    Michael Seeligson (Southern Methodist University, Cox School of Business)

    Please cite the following paper when using the data in your own work:

    Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.

  18. patccat: A classifier for patent claims

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Mar 11, 2025
    Cite
    Bernhard Ganglmair; W. Keith Robinson; Michael Seeligson (2025). patccat: A classifier for patent claims [Dataset]. http://doi.org/10.5281/zenodo.6395308
    Available download formats: csv
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bernhard Ganglmair; W. Keith Robinson; Michael Seeligson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data version: 3.3.0

    Authors:
    Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
    W. Keith Robinson (Wake Forest University, School of Law)
    Michael Seeligson (Southern Methodist University, Cox School of Business)


    1. Notes on Data Construction
    2. Citation and Code
    3. Description of the Data Files
    3.1. File List
    3.2. List of Variables for Files with Claim-Level Information
    3.3. List of Variables for Files with Patent-Level Information
    4. Coming Soon!


    1. Notes on Data Construction

    This is version 3.3.0 of the patccat data (patent claim classification by algorithmic text analysis).

    Patent claims define an invention. A patent application is required to have one or more claims that distinctly claim the subject matter which the patent applicant regards as her invention or discovery. We construct a classifier of patent claims that identifies three distinct claim types: process claims, product claims, and product-by-process claims.

    For this classification, we combine information obtained from both the preamble and the body of a claim. The preamble is a general description of the invention (e.g., a method, an apparatus, or a device), whereas the body identifies steps and elements (specifying in detail the invention laid out in the preamble) that the applicant is claiming as the invention. The combination of the preamble type and the body type provides us with a more detailed and more accurate classification of claims than other approaches in the literature. This approach also accounts for unconventional drafting approaches. We eventually validate our classification using close to 10,000 manually classified claims.

    The data files contain the results of our classification. We provide claim-level information for each independent claim of U.S. utility patents granted between 1836 and 2020. We also provide patent-level information, i.e., the counts of different claim types for a given patent.

    For a detailed description of our classification approach, please take a look at the accompanying paper (Ganglmair, Robinson, and Seeligson 2022).

    2. Citation and Code

    Please cite the following paper when using the data in your own work:

    Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.

    In the paper, we document the use of process claims in the U.S. over the last century, using the patccat data. We show an increase in the annual share of process claims of about 25 percentage points (from below 10% in 1920). This rise in process intensity of patents is not limited to a few patent classes, but we observe it across a broad spectrum of technologies. Process intensity varies by applicant type: companies file more process-intense patents than individuals, and U.S. applicants file more process-intense patents than foreign applicants. We further show that patents with higher process intensity are more valuable but are not necessarily cited more often. Last, process claims are on average shorter than product claims (with the gap narrowing since the 1970s).

    We would love to see how other researchers use the data and eventually learn from it. If you have a discussion paper or a publication in which you use the data, please send us a copy at patccat.data@gmail.com.

    We will release the R code used to construct the data on GitHub with the next data version (version 3.4.0). Contact us at b.ganglmair@gmail.com if you would like to take a look at an earlier version of the code.


    3. Description of the Data Files

    The data files contain claim-level information for independent claims of 10,140,848 U.S. utility patents granted between 1836 and 2020. The files further contain patent-level information for U.S. utility patents.

    3.1. File List

    File list
    claims-patccat-v3-3-sample.csv: claim-level information for independent claims of a sample of 1000 patents issued between 1976 and 2020
    claims-patccat-v3-3-1836-1919.csv: claim-level information for independent claims of 1,038,041 patents issued between 1836 and 1919
    claims-patccat-v3-3-1920-2020.csv: claim-level information for independent claims of 9,102,807 patents issued between 1920 and 2020
    patents-patccat-v3-3-sample.csv: patent-level information for a sample of 1000 patents issued between 1976 and 2020
    patents-patccat-v3-3-1836-1919.csv: patent-level information for 1,038,041 patents issued between 1836 and 1919
    patents-patccat-v3-3-1920-2020.csv: patent-level information for 9,102,807 patents issued between 1920 and 2020
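
As a small illustration of how the full files partition the data by issue year, the sketch below picks the file that should contain a given year and checks that the two per-period patent counts add up to the stated total of 10,140,848. The file names and counts are copied from the file list; the helper `pick_file` is hypothetical and not part of the patccat release.

```python
# Patent counts per period, copied from the file list (version 3.3.0).
PATENT_COUNTS = {"1836-1919": 1_038_041, "1920-2020": 9_102_807}


def pick_file(level: str, year: int) -> str:
    """Return the name of the full data file covering `year`.

    `level` is "claims" (claim-level) or "patents" (patent-level).
    Hypothetical helper for illustration only.
    """
    if level not in ("claims", "patents"):
        raise ValueError(f"unknown level: {level!r}")
    if 1836 <= year <= 1919:
        span = "1836-1919"
    elif 1920 <= year <= 2020:
        span = "1920-2020"
    else:
        raise ValueError(f"year {year} is outside the 1836-2020 coverage")
    return f"{level}-patccat-v3-3-{span}.csv"


# The two periods together should cover all patents in the data.
total_patents = sum(PATENT_COUNTS.values())
print(pick_file("claims", 1900), total_patents)
```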


    3.2. List of Variables for Files with Claim-Level Information

    For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).

    List of Variables (Claim-Level Information)
    PatentClaim: patent claim identifier; 8-digit patent number and 4-digit claim number (Ex: 01234567-0001)
    singleLine: =1 if claim is published in single-line format
    singleReformat: outcome code of reformatting of single-line claims
    Jepson: =1 if claim is a Jepson claim
    JepsonReformat: outcome code of reformatting of Jepson claims
    inBegin: =1 if claim begins with the word "in"
    wordsPreamble: number of words in the claim preamble
    wordsBody: number of words in the claim body
    dependentClaims: number of dependent claims that refer to this independent claim
    isMeansPreamble: =1 if term "means" is used in the preamble
    isMeansBody: =1 if term "means" is used in the body
    isMeans: =1 if term "means" is used anywhere in the claim (~ means-plus-function claim)
    processPreamble: =1 if terms "method" or "process" are used in the preamble
    processBody: =1 if terms "method" or "process" are used in the body
    processSimple: =1 if terms "method" or "process" are used anywhere in the claim (for simple approach of process claim classification)
    claimType: claim type of full classification (1 = process; 2 = product; 3 = product-by-process; 0 = no type)
    preambleType: preamble type
    preambleTerm: keyword used to classify preamble type
    preambleTermAlt: alternative keyword (if preambleTerm were not used)
    preambleTextStub: first 15 words of the preamble
    bodyType: body type
    bodyLinesStep: number of steps in the body
    bodyLinesElement: number of elements in the body
    bodyLinesTotal: total number of identified lines in the body
    label: 2-character label of the preamble-body combination; classification table maps label to claim type
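
To make the identifier format and the type codes concrete, here is a minimal Python sketch that splits a PatentClaim value into its patent and claim numbers, decodes claimType, and mimics the "simple approach" flag (processSimple). The function names are hypothetical, and the keyword matching (whole words, case-insensitive) is an assumption; the authors' own code may match terms differently.

```python
import re
from typing import NamedTuple

# Codes as documented for the claimType variable.
CLAIM_TYPES = {0: "no type", 1: "process", 2: "product", 3: "product-by-process"}

# 8-digit patent number, hyphen, 4-digit claim number (e.g. 01234567-0001).
_PATENT_CLAIM = re.compile(r"^(\d{8})-(\d{4})$")


class ClaimId(NamedTuple):
    patent_number: str  # kept as a string to preserve leading zeros
    claim_number: int


def parse_patent_claim(value: str) -> ClaimId:
    """Split a PatentClaim identifier into patent number and claim number."""
    match = _PATENT_CLAIM.match(value.strip())
    if match is None:
        raise ValueError(f"not a PatentClaim identifier: {value!r}")
    return ClaimId(match.group(1), int(match.group(2)))


def process_simple(claim_text: str) -> int:
    """Return 1 if "method" or "process" appears anywhere in the claim text.

    Assumption: whole-word, case-insensitive matching; plural forms such as
    "methods" are therefore not matched by this sketch.
    """
    found = re.search(r"\b(method|process)\b", claim_text, re.IGNORECASE)
    return int(found is not None)


claim = parse_patent_claim("01234567-0001")
print(claim.patent_number, claim.claim_number, CLAIM_TYPES[1])
```

Keeping the patent number as a string avoids losing the leading zeros that the 8-digit format relies on.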

    3.3. List of Variables for Files with Patent-Level Information

    For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).

    List of Variables (Patent-Level Information)
    patent_id: U.S. patent number (8-digit patent number)
    claims: number of independent claims (the sum of the four claim types: 0, 1, 2, and 3)
    noCategory: number of claims without a classified type
    processClaims: number of process claims
    productClaims: number of product claims
    prodByProcessClaims: number of product-by-process claims
    firstClaim: type of the first claim (1 = process; 2 = product; 3 = product-by-process; 0 = no type)
    simpleProcessClaims: number of process claims by simple approach (terms "method" or "process" anywhere in the claim)
    simpleProcessPreamble: number of process claims by simple approach (terms "method" or "process" in the preamble)
    meansClaims: number of means-plus-function claims
    meansFirst: =1 if first claim is a means-plus-function claim
    JepsonClaims: number of Jepson claims
    JepsonFirst: =1 if first claim is a Jepson claim
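
The relationship between the claim-level and patent-level files can be sketched as a simple aggregation: count claims per patent by claimType and sum the four counts to obtain claims. The column names follow the variable lists; the function name and the sample rows are made up for illustration and are not taken from the data.

```python
from collections import Counter, defaultdict

# Patent-level column that holds the count for each claimType code.
COLUMN_FOR_TYPE = {
    0: "noCategory",
    1: "processClaims",
    2: "productClaims",
    3: "prodByProcessClaims",
}


def patent_level_counts(claim_rows):
    """Aggregate claim-level rows (dicts) into per-patent claim-type counts."""
    counts = defaultdict(Counter)
    for row in claim_rows:
        # The patent number is the part of PatentClaim before the hyphen.
        patent_id = row["PatentClaim"].split("-")[0]
        counts[patent_id][int(row["claimType"])] += 1

    result = {}
    for patent_id, by_type in counts.items():
        record = {col: by_type.get(code, 0) for code, col in COLUMN_FOR_TYPE.items()}
        # "claims" is documented as the sum of the four claim types.
        record["claims"] = sum(record.values())
        result[patent_id] = record
    return result


# Illustrative rows only; values as strings, as a CSV reader would yield them.
sample_rows = [
    {"PatentClaim": "01234567-0001", "claimType": "1"},
    {"PatentClaim": "01234567-0005", "claimType": "2"},
    {"PatentClaim": "01234567-0009", "claimType": "1"},
    {"PatentClaim": "07654321-0001", "claimType": "0"},
]
summary = patent_level_counts(sample_rows)
print(summary["01234567"])
```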


    Note: The following variables/fields are currently empty (March 30, 2020); we will populate these variables/fields with data version 3.4.0.

    preambleTerm
    preambleTermAlt
    preambleTextStub
    bodyLinesStep
    bodyLinesElement
    bodyLinesTotal

    Note: We will release the data for patents issued in 2021 with data version 3.4.0.


    4. Coming Soon!

    We are working on a number of extensions of the patccat data.

    - With data version 3.4.0, we plan to release data for all published U.S. patent applications (2001 through 2021)
    - In late spring/early summer 2022, we will release data for patents issued by the European Patent Office (EPO) [Update: March 28, 2023: see https://doi.org/10.5281/zenodo.7776092]

  19. ISTARI.AI | Points of Interest Dataset (POI) | Mining Industry in Australia...

    • datarade.ai
    Updated Aug 14, 2025
    Cite
    Istari.AI (2025). ISTARI.AI | Points of Interest Dataset (POI) | Mining Industry in Australia | Verified Company Profiles filtered from over 1.7 M companies [Dataset]. https://datarade.ai/data-products/istari-ai-points-of-interest-dataset-poi-mining-industr-istari-ai
    Available download formats: .json, .csv, .xls, .txt, .parquet, .pdf
    Dataset updated
    Aug 14, 2025
    Dataset provided by
    Istari.AI
    Area covered
    Australia
    Description

    📍 Looking for high-quality mining industry data? ISTARI.AI offers tailored POI datasets to fit your exact business needs – whether you’re looking for all mining operations, equipment manufacturers, consultants, sub-suppliers, service providers, or another specific type of location-based business.

    📊 Our POI data includes:
    - Organizational structure & key personnel
    - Products, services & partnerships
    - Verified contact & domain info
    - Tech stack & business descriptions
    - Detailed geographic data (address, region, country)

    We don’t offer one-size-fits-all datasets – instead, you tell us what you need.

    This flexibility makes our data ideal for use cases in:
    - Location-based services & apps
    - Market analysis & competitive intelligence
    - Retail expansion & site planning
    - Ad targeting & geofencing
    - Lead generation & B2B outreach

    All POI data is machine-generated, frequently updated, and sourced from publicly available web data, ensuring high freshness and consistency. With ISTARI.AI, you receive structured POI datasets ready for direct integration into your systems.

    ✅ Ensuring Data Quality
    - The webAI AI Agent was developed in close collaboration with academic experts to guarantee expert-level accuracy.
    - Developed together with researchers at the University of Mannheim
    - Validated in the award-winning academic study: "When is AI Adoption Contagious? Epidemic Effects and Relational Embeddedness in the Inter-Firm Diffusion of Artificial Intelligence"
    - Co-authored by scholars from University of Mannheim, University of Giessen, University of Hohenheim, and ETH Zurich

  20. Wealth of two nations: The U.S. racial wealth gap, 1860-2020

    • openicpsr.org
    Updated May 22, 2022
    + more versions
    Cite
    Ellora Derenoncourt; Chi Hyun Kim; Moritz Kuhn; Moritz Schularick (2022). Wealth of two nations: The U.S. racial wealth gap, 1860-2020 [Dataset]. http://doi.org/10.3886/E170941V2
    Dataset updated
    May 22, 2022
    Dataset provided by
    University of Mannheim
    Princeton University
    Kiel Institute for the World Economy, Sciences Po
    University of Bonn
    Authors
    Ellora Derenoncourt; Chi Hyun Kim; Moritz Kuhn; Moritz Schularick
    Area covered
    United States
    Description

    PSID data extract for computing per capita white-to-Black wealth gaps and active saving rates of Black and white Americans during 1984-2019.
