Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We release two datasets as part of the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data, which is co-located with the 19th International Semantic Web Conference (https://iswc2020.semanticweb.org/, 2-6 Nov 2020 in Athens, Greece). The datasets belong to two shared tasks related to product data mining on the Web: (1) product matching (linking) and (2) product classification. The event is organised by The University of Sheffield, The University of Mannheim and Amazon, and is open to anyone. Teams whose systems beat the baseline of the respective task will be invited to write a paper describing their method and system, and to present it as a poster (and potentially also a short talk) at the ISWC2020 conference. The winner of each task will be awarded a prize of 500 euros (partly sponsored by Peak Indicators, https://www.peakindicators.com/).
The challenge comprises two tasks: product matching and product categorisation.
i) Product matching deals with identifying product offers on different websites that refer to the same real-world product (e.g., the same iPhone X model offered under different names/offer titles, and with different descriptions, on various websites). A multi-million product offer corpus (16M offers) containing product offer clusters is released for the generation of training data. A validation set of 1.1K offer pairs and a test set of 600 offer pairs will also be released. The goal of this task is to classify each offer pair in these datasets as a match (i.e., the two offers refer to the same product) or a non-match.
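As a rough illustration of the task (not the official baseline), a pair of offer titles can be labelled match/non-match by thresholding a simple text similarity. The field names, the TF-IDF representation and the 0.5 threshold below are all assumptions for the sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_offer_pairs(pairs, threshold=0.5):
    """Label each (title_a, title_b) pair as 'match' or 'non-match'
    by thresholding the TF-IDF cosine similarity of the two titles.
    Illustrative only; real systems would also use descriptions, prices, etc."""
    titles = [title for pair in pairs for title in pair]
    vectors = TfidfVectorizer().fit_transform(titles)
    labels = []
    for i in range(0, vectors.shape[0], 2):
        sim = cosine_similarity(vectors[i], vectors[i + 1])[0, 0]
        labels.append("match" if sim >= threshold else "non-match")
    return labels
```

A lexical baseline like this is easy to beat with the released training clusters, which allow supervised pair classifiers to be trained.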
ii) Product classification deals with assigning predefined product category labels (which can span multiple levels) to product instances (e.g., an iPhone X is a 'Smartphone' and also 'Electronics'). A training set of 10K product offers, a validation set of 3K product offers and a test set of 3K product offers will be released. Each dataset contains product offers with their metadata (e.g., name, description, URL) and three classification labels, each corresponding to a level in the GS1 Global Product Classification taxonomy. The goal is to classify these product offers into the predefined category labels.
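Since each offer carries one label per taxonomy level, a simple approach is to train one text classifier per level. The toy offers and labels below are invented for illustration and are not drawn from the GS1 taxonomy or the released data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy offers and level-1 labels; the real data provides three GS1 levels,
# so one classifier per level (or a hierarchical model) would be trained.
offers = [
    "apple iphone x smartphone 64gb",
    "samsung galaxy smartphone 128gb",
    "nike air running shoes",
    "adidas trainers sneakers",
]
level1 = ["Electronics", "Electronics", "Clothing", "Clothing"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(offers, level1)
```

Per-level flat classifiers ignore the taxonomy structure; hierarchical models that condition lower levels on the predicted upper level are a common refinement.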
All datasets are built based on structured data that was extracted from the Common Crawl (https://commoncrawl.org/) by the Web Data Commons project (http://webdatacommons.org/). Datasets can be found at: https://ir-ischool-uos.github.io/mwpd/
The challenge will also release utility code (in Python) for processing the above datasets and scoring system outputs, along with the following language resources for product-related data mining tasks: a text corpus of 150 million product offer descriptions, and word embeddings trained on that corpus.
For details of the challenge please visit https://ir-ischool-uos.github.io/mwpd/
Dr Ziqi Zhang (Information School, The University of Sheffield)
Prof. Christian Bizer (Institute of Computer Science and Business Informatics, The University of Mannheim)
Dr Haiping Lu (Department of Computer Science, The University of Sheffield)
Dr Jun Ma (Amazon Inc., Seattle, US)
Prof. Paul Clough (Information School, The University of Sheffield & Peak Indicators)
Ms Anna Primpeli (Institute of Computer Science and Business Informatics, The University of Mannheim)
Mr Ralph Peeters (Institute of Computer Science and Business Informatics, The University of Mannheim)
Mr Abdulkareem Alqusair (Information School, The University of Sheffield)
To contact the organising committee please use the Google discussion group https://groups.google.com/forum/#!forum/mwpd2020
This data package contains supplemental material for the TSE submission under review: A Socio-technical Perspective on Software Vulnerabilities: A Causal Analysis. The restricted-access requirement will be lifted upon approval of the manuscript.
The comprehensive explanation of this dataset can be found at: https://sailuh.github.io/causal_commit_flow_docs
The following briefly describes the contents of the folders. The analysis presented in the manuscript requires the following:
Git Log
Mailing List
Software Vulnerabilities (NVD Feed)
This data is provided to Kaiaulu, a tool for mining software repositories. The data specifications and configuration parameters are defined in the OpenSSL project configuration file (.yml), which is also included in this package.
An R notebook in Kaiaulu, taking the dataset above plus the project configuration file as input, can then perform the first analysis step:
https://github.com/sailuh/kaiaulu/blob/master/vignettes/issue_social_smell_showcase.Rmd
The file 1_openssl_social_smells_timeline.csv is generated as output of this R notebook and is included in the causal_model folder of this package. The remaining files in this folder, numbered 2 through 16, describe transformation steps using Excel, Python scripts, and Tetrad (also an open-source tool). These steps are described conceptually in the manuscript, and in more detail in the comprehensive explanation of this dataset linked at the start.
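Before reproducing the downstream transformation steps, it can help to sanity-check the notebook's CSV output. The column names are project-specific and documented at the link above, so the sketch below only inspects the file generically:

```python
import pandas as pd

def summarize_timeline(path_or_buffer):
    """Load a smell-timeline CSV and report its row count and column names.
    Generic inspection only; see the dataset documentation for the actual schema."""
    df = pd.read_csv(path_or_buffer)
    return {"rows": len(df), "columns": df.columns.tolist()}
```

For example, `summarize_timeline("1_openssl_social_smells_timeline.csv")` would report the shape of the notebook output before it is passed to the Excel/Python/Tetrad steps.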