31 datasets found

Public benchmark dataset for Conformance Checking in Process Mining
figshare.unimelb.edu.au
melbourne.figshare.com
xml
Updated Jan 30, 2022
Cite
Daniel Reissner (2022). Public benchmark dataset for Conformance Checking in Process Mining [Dataset]. http://doi.org/10.26188/5cd91d0d3adaa
Available download formats: xml
Unique identifier
https://doi.org/10.26188/5cd91d0d3adaa
Dataset updated
Jan 30, 2022
Dataset provided by
The University of Melbourne
Authors
Daniel Reissner
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a variety of publicly available real-life event logs. For each event log, we derived two Petri nets using two state-of-the-art process miners: Inductive Miner (IM) and Split Miner (SM). Each event log-Petri net pair is intended for evaluating the scalability of existing conformance checking techniques. We used this dataset to evaluate the scalability of the S-Component approach for measuring fitness. The dataset contains tables of descriptive statistics for both process models and event logs. In addition, it includes time performance results, measured in milliseconds, for several approaches in both multi-threaded and single-threaded executions. Last, the dataset contains a cost comparison of the different approaches and reports the degree of over-approximation of the S-Components approach. The compared conformance checking techniques are described here: https://arxiv.org/abs/1910.09767.

Update: The dataset has been extended with the BPIC18 and BPIC19 event logs. BPIC19 is actually a collection of four different processes and was therefore split into four event logs. For each of the five additional event logs, two process models were again mined with Inductive Miner and Split Miner. We used the extended dataset to test the scalability of our tandem repeats approach for measuring fitness. The dataset now contains updated tables of log and model statistics, as well as tables of the experiments measuring execution time and raw fitness cost of various fitness approaches. The compared conformance checking techniques are described here: https://arxiv.org/abs/2004.01781.

Update: The dataset has also been used to measure the scalability of a new Generalization measure based on concurrent and repetitive patterns. A concurrency oracle is used in tandem with partial orders to identify concurrent patterns in the log, which are tested against parallel blocks in the process model. Tandem repeats are used with various trace reductions and extensions to define repetitive patterns in the log, which are tested against loops in the process model. Each pattern is assigned a partial fulfillment. The generalization is then the average of the pattern fulfillments, weighted by the number of traces in which each pattern has been observed. The dataset now includes the time results and a breakdown of Generalization values.
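The weighted average behind the Generalization measure can be illustrated with a short sketch. The function name `generalization`, the input shape, and the numbers in the example are hypothetical and are not taken from the dataset or from the authors' implementation.

```python
from typing import Iterable, Tuple


def generalization(patterns: Iterable[Tuple[float, int]]) -> float:
    """Average of pattern fulfillments weighted by trace counts.

    Each item is (fulfillment in [0, 1], number of traces in which the
    pattern was observed). The input shape and the error raised for an
    empty input are assumptions made for this sketch.
    """
    pairs = list(patterns)
    total_traces = sum(count for _, count in pairs)
    if total_traces == 0:
        raise ValueError("no observed patterns to average over")
    weighted = sum(fulfillment * count for fulfillment, count in pairs)
    return weighted / total_traces


# Hypothetical example: one concurrent and one repetitive pattern.
example = [
    (1.0, 30),  # fully fulfilled, observed in 30 traces
    (0.5, 10),  # half fulfilled, observed in 10 traces
]
print(generalization(example))  # (1.0*30 + 0.5*10) / 40 = 0.875
```

A pattern seen in many traces therefore pulls the overall value towards its own fulfillment more strongly than a rare pattern does.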
Datasets obtained from the Brazilian Federal Government's Open Data Portal -...
figshare.com
zip
Updated Sep 20, 2024
Cite
Gyslla de Vasconcelos; Flavia Bernardini; Jose Viterbo (2024). Datasets obtained from the Brazilian Federal Government's Open Data Portal - dados.gov for application in process mining tools [Dataset]. http://doi.org/10.6084/m9.figshare.25514884.v5
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.25514884.v5
Dataset updated
Sep 20, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Gyslla de Vasconcelos; Flavia Bernardini; Jose Viterbo
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a study to assess the application of process mining techniques to data from Brazilian public services, made available on open data portals, with the aim of identifying bottlenecks and improvement opportunities in government processes. The datasets were obtained from the Brazilian Federal Government's Open Data Portal: dados.gov

Categorization:
(1) event log
(2) there is a complete date
(3) list of data or information table
(4) documents
(5) no file found
(6) link to another portal

Link to the Brazilian portal: https://dados.gov.br/home

List of content made available:
open-data-sample.zip: all the files obtained from the representative sample of the study
open-data-sample.xls: table categorizing the datasets obtained and classifying them as relevant for testing in the process mining tools
dataset137.csv: dataset with undergraduate degree records tested in the Disco, Celonis and ProM tools
dataset258.csv: dataset with software registration requests tested in the Disco, Celonis and ProM tools
dataset356.csv: dataset with public tender inspector registrations tested in the Disco, Celonis and ProM tools
Criteria for evaluating and qualifying public datasets obtained from the...
data.mendeley.com
Updated May 19, 2025
Cite
Gyslla Vasconcelos (2025). Criteria for evaluating and qualifying public datasets obtained from the Brazilian Federal Government's Open Data Portal - dados.gov [Dataset]. http://doi.org/10.17632/x8sgcykthn.2
Unique identifier
https://doi.org/10.17632/x8sgcykthn.2
Dataset updated
May 19, 2025
Authors
Gyslla Vasconcelos
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These criteria (file 1) were drawn up empirically, based on the practical challenges faced during the development of the thesis research and on tests carried out with various datasets applied to process mining tools. The criteria were prepared to create a ranking of the selected and published datasets (https://doi.org/10.6084/m9.figshare.25514884.v3), classifying them according to their score. The criteria are divided into informative (In), importance (I), difficulty (D) and ease (F) of handling (file 2). The datasets were selected (file 3) and, for ranking, calculations were made (file 5) to normalize the values for standardization (file 4). This data is part of a study on the application of process mining techniques to Brazilian public service data, available on the open data portal dados.gov.
Annotated UI Element Dataset for Desktop Environments
data.niaid.nih.gov
Updated Sep 9, 2024
Cite
González Enríquez, José (2024). Annotated UI Element Dataset for Desktop Environments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10822751
Dataset updated
Sep 9, 2024
Dataset provided by
Jiménez-Ramírez, Andrés
González Enríquez, José
Martínez-Rojas, Antonio
Rodríguez-Ruíz, Antonio
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introducing a specialized dataset containing high-resolution screenshots from various desktop environments, focusing on annotating individual UI components. This dataset is designed to enhance the accuracy of UI element identification and classification within desktop applications, enabling the extraction of hierarchical structures.

Desktop UI Detection Dataset.zip

This resource contains a set of 100 general-purpose screenshots intended for the training set.

Test Desktop UI Detection Dataset.zip

This resource is aimed at evaluating the models trained with the previous screenshots, using a set of captures from a specific business process.

The images have been organized into six groups (G), each representing a unique set of screenshots from the same type of application. These groups are distinguished by their levels of complexity and the depth of their UI hierarchies:

G1: PDF Reader. A combination of web and native applications for interacting with PDF documents.

G2: Public Administration Courses Manager. A web-based application used in the student admission process for public learning programs (these images are not provided due to privacy issues concerning the business process).

G3: Customer Relationship Management (CRM) System. A web-based application for managing business data.

G4: Email Client. A mix of web and native applications used for managing email communications.

G5: File Explorer. A native application for navigating the file system in Windows.

G6: Learning Management System (LMS). A web-based application used to manage courses and students for a given educational institution.

Two folders are provided, each containing all the groups mentioned. These folders represent captures with the application from the corresponding group in fullscreen, or captures with the application from the corresponding group overlapping another application randomly selected from one of the remaining groups (Overlapped).
Helpdesk
data.mendeley.com
Updated Dec 1, 2016
Cite
Ilya Verenich (2016). Helpdesk [Dataset]. http://doi.org/10.17632/39bp3vv62t.1
Unique identifier
https://doi.org/10.17632/39bp3vv62t.1
Dataset updated
Dec 1, 2016
Authors
Ilya Verenich
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains events from the ticketing management process of the help desk of an Italian software company. The process consists of 9 activities, and all cases start with the insertion of a new ticket into the ticketing management system. Each case ends when the issue is resolved and the ticket is closed. The log contains 3804 process instances (a.k.a. "cases") and 13710 events.
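To make the relation between cases and events concrete, here is a minimal sketch that counts events per case in a toy log and then applies the same ratio to the totals quoted above. The case IDs, activity names, and the `(case_id, activity)` tuple layout are hypothetical and do not come from the Helpdesk log itself.

```python
from collections import Counter
from typing import Iterable, Tuple


def events_per_case(log: Iterable[Tuple[str, str]]) -> Counter:
    """Count events for each case ID in a list of (case_id, activity) pairs."""
    return Counter(case_id for case_id, _ in log)


# Toy log with made-up case IDs and activity names (not from the dataset).
toy_log = [
    ("case-1", "Insert ticket"),
    ("case-1", "Resolve ticket"),
    ("case-1", "Closed"),
    ("case-2", "Insert ticket"),
    ("case-2", "Closed"),
]
counts = events_per_case(toy_log)
print(len(counts), sum(counts.values()))  # 2 cases, 5 events

# The same ratio applied to the totals quoted in the description.
CASES, EVENTS = 3804, 13710
print(round(EVENTS / CASES, 2))  # about 3.6 events per case on average
```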
Dataset Public Opinion of UAE and Sentiment Analysis Process
zenodo.org
Updated Oct 11, 2024
Cite
Tia Mariatul Kibtiah (2024). Dataset Public Opinion of UAE and Sentiment Analysis Process [Dataset]. http://doi.org/10.5281/zenodo.13918113
Unique identifier
https://doi.org/10.5281/zenodo.13918113
Dataset updated
Oct 11, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Tia Mariatul Kibtiah
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 11, 2024
Area covered
United Arab Emirates
Description
This article examines UAE public opinion towards Indonesia during President Jokowi's administration. In the digital era, International Relations (IR) studies must recognize that public opinion influences a country's foreign policy, including in the economic field. The research team crawled raw data from Twitter (X) to gauge UAE public opinion. The raw data was then processed with an SVM machine-learning classifier to measure the positive, negative, and neutral sentiment of the UAE public regarding Indonesia. The resulting opinion measures were then compared with UAE-Indonesia economic cooperation to determine whether public opinion and the level of economic cooperation between the two countries are related. The data in this study therefore comprise the UAE public opinion collection, which involved tracking and analyzing public sentiment in various UAE-based media outlets, and the sentiment analysis process.
Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...
catalog.data.gov
datadiscoverystudio.org
+5 more
Updated Apr 10, 2025
Cite
Dashlink (2025). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://catalog.data.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis that fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.
Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...
data.niaid.nih.gov
Updated Oct 21, 2024
Cite
Thakur, Ph.D., Nirmalya (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13896352
Dataset updated
Oct 21, 2024
Dataset authored and provided by
Thakur, Ph.D., Nirmalya
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset:

N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

Abstract

The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

The Instagram posts in this dataset are present in 161 different languages out of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), Turkish (4632 posts)
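As a quick check on how concentrated the language distribution is, the counts quoted above can be turned into percentages of the 500,153 posts reported in the abstract. This is plain arithmetic on the published figures; the helper name `share` is only illustrative.

```python
TOTAL_POSTS = 500_153  # total number of posts stated in the abstract

# Top 10 languages and their post counts, as listed in the description.
TOP_LANGUAGES = {
    "English": 343_041,
    "Spanish": 30_220,
    "Hindi": 15_832,
    "Portuguese": 15_779,
    "Indonesian": 11_491,
    "Tamil": 9_592,
    "Arabic": 9_416,
    "German": 7_822,
    "Italian": 5_162,
    "Turkish": 4_632,
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


for language, count in TOP_LANGUAGES.items():
    print(f"{language:<11}{count:>8}{share(count, TOTAL_POSTS):>7.1f}%")

top10_total = sum(TOP_LANGUAGES.values())
print(f"Top 10 combined: {top10_total} posts, "
      f"{share(top10_total, TOTAL_POSTS):.1f}% of the dataset")
```

English alone accounts for roughly two thirds of the posts, which matters for any cross-language comparison of sentiment because the remaining languages have far smaller samples.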

There are 535,021 distinct hashtags in this dataset with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), #coronavirusoutbreak (34567 posts)

The following is a description of the attributes present in this dataset

Post ID: Unique ID of each Instagram post

Post Description: Complete description of each post in the language in which it was originally published

Date: Date of publication in MM/DD/YYYY format

Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API

Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API

Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
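The attribute list above maps naturally onto a typed record. The sketch below assumes the data is available as CSV and that the column headers match the attribute names exactly; neither is confirmed by the description, and the sample row is invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime

# Sentiment classes named in the description.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class InstagramPost:
    post_id: str
    description: str
    published: date
    language_code: str
    full_language: str
    sentiment: str


def parse_row(row: dict) -> InstagramPost:
    """Convert one CSV row into a typed record (column names are assumed)."""
    sentiment = row["Sentiment"].strip().lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment label: {row['Sentiment']!r}")
    return InstagramPost(
        post_id=row["Post ID"],
        description=row["Post Description"],
        # The description states that dates use the MM/DD/YYYY format.
        published=datetime.strptime(row["Date"], "%m/%d/%Y").date(),
        language_code=row["Language code"],
        full_language=row["Full Language"],
        sentiment=sentiment,
    )


# Invented sample row, used only to exercise the parser.
SAMPLE_CSV = (
    "Post ID,Post Description,Date,Language code,Full Language,Sentiment\n"
    "abc123,Stay safe everyone,03/15/2020,en,English,positive\n"
)
posts = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(posts[0].published.isoformat())  # 2020-03-15
```

Parsing the date up front avoids the common mistake of reading MM/DD/YYYY values as DD/MM/YYYY when the day is 12 or lower.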

Open Research Questions

This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

How does sentiment toward COVID-19 vary across different languages?

How has public sentiment toward COVID-19 evolved from 2020 to the present?

How do cultural differences affect social media discourse about COVID-19 across various languages?

How has COVID-19 impacted mental health, as reflected in social media posts across different languages?

How effective were public health campaigns in shifting public sentiment in different languages?

What patterns of vaccine hesitancy or support are present in different languages?

How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?

What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?

How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?

What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
OpenScience Slovenia document metadata dataset
data.mendeley.com
narcis.nl
Updated Nov 5, 2019
Cite
Mladen Borovič (2019). OpenScience Slovenia document metadata dataset [Dataset]. http://doi.org/10.17632/7wh9xvvmgk.1
Unique identifier
https://doi.org/10.17632/7wh9xvvmgk.1
Dataset updated
Nov 5, 2019
Authors
Mladen Borovič
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Area covered
Slovenia
Description
The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents which include undergraduate and postgraduate theses, research and professional articles, along with other academic document types. The data within the dataset was collected as a part of the establishment of the Slovenian Open-Access Infrastructure which defined a unified document collection process and cataloguing for universities in Slovenia within the infrastructure repositories. The data was collected from several already established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields, representing attributes that describe documents. These attributes include document titles, keywords, abstracts, typologies, authors, issue years and other identifiers such as URL and UDC. The potential of this dataset lies especially in text mining and text classification tasks and can also be used in development or benchmarking of content-based recommender systems on real-world data.
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...
zenodo.org
data.niaid.nih.gov
+2 more
csv
Updated Jul 20, 2024
Cite
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.11711230
Dataset updated
Jul 20, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 15, 2024
Area covered
YouTube
Description
Please cite the following paper when using this dataset:

N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

Abstract

This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
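The entry above says the positive, negative, and neutral classes were assigned with VADER. VADER reports a compound score between -1 and 1, and its documentation suggests cutoffs of 0.05 and -0.05 for positive and negative text. Whether the dataset authors used exactly these cutoffs is not stated in the description, so the threshold below is an assumption, and `label_from_compound` is a hypothetical helper that does not call the VADER library itself.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score to a three-way sentiment label.

    The 0.05 default follows the cutoff conventionally suggested for VADER;
    it is an assumption here, not a documented property of this dataset.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores, used only to show the three outcomes.
for score in (0.62, -0.40, 0.01):
    print(score, label_from_compound(score))
```

Scores inside the band between the two cutoffs are treated as neutral, so widening the threshold moves borderline titles and descriptions out of the positive and negative classes.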
COVID-19 Sentiment: 500K Instagram Posts (2020-24)
kaggle.com
Updated Oct 21, 2024
Cite
Nirmalya Thakur, PhD (2024). COVID-19 Sentiment: 500K Instagram Posts (2020-24) [Dataset]. http://doi.org/10.34740/kaggle/dsv/9687126
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9687126
Dataset updated
Oct 21, 2024
Dataset provided by
Kaggle http://kaggle.com/
Authors
Nirmalya Thakur, PhD
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Please cite the following paper when using this dataset:

N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

Abstract

The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

The Instagram posts in this dataset are present in 161 different languages out of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), Turkish (4632 posts)

There are 535,021 distinct hashtags in this dataset with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), #coronavirusoutbreak (34567 posts)

The following is a description of the attributes present in this dataset

Post ID: Unique ID of each Instagram post

Post Description: Complete description of each post in the language in which it was originally published

Date: Date of publication in MM/DD/YYYY format

Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API

Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API

Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral

Open Research Questions

This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

How does sentiment toward COVID-19 vary across different languages?

How has public sentiment toward COVID-19 evolved from 2020 to the present?

How do cultural differences affect social media discourse about COVID-19 across various languages?

How has COVID-19 impacted mental health, as reflected in social media posts across different languages?

How effective were public health campaigns in shifting public sentiment in different languages?

What patterns of vaccine hesitancy or support are present in different languages?

How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?

What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?

How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?

What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...
figshare.com
xlsx
Updated Oct 12, 2024
Cite
Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.27072247.v1
Dataset updated
Oct 12, 2024
Dataset provided by
figshare
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into:

one of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral

hate or not hate

anxiety/stress detected or no anxiety/stress detected

These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.

The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

The following is a description of the attributes present in this dataset:

Post ID: Unique ID of each Instagram post

Post Description: Complete description of each post in the language in which it was originally published

Date: Date of publication in MM/DD/YYYY format

Language: Language of the post as detected using the Google Translate API

Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.

Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral

Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate

Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
SyROCCo dataset
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 25, 2024
Cite
Zheng Fang; Miguel Arana-Catania; Felix-Anselm van Lier; Juliana Outes Velarde; Harry Bregazzi; Mara Airoldi; Eleanor Carter; Rob Procter (2024). SyROCCo dataset [Dataset]. http://doi.org/10.5281/zenodo.12204304
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.12204304
Dataset updated
Jun 25, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Zheng Fang; Miguel Arana-Catania; Felix-Anselm van Lier; Juliana Outes Velarde; Harry Bregazzi; Mara Airoldi; Eleanor Carter; Rob Procter
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The peer-reviewed publication for this dataset has been published in Data & Policy, and can be accessed here: https://arxiv.org/abs/2406.16527 Please cite this when using the dataset.

This dataset has been produced as a result of the “Systematic Review of Outcomes Contracts using Machine Learning” (SyROCCo) project. The goal of the project was to apply machine learning techniques to a systematic review process of outcomes-based contracting (OBC). The purpose of the systematic review was to gather and curate, for the first time, all of the existing evidence on OBC. We aimed to map the current state of the evidence, synthesise key findings from across the published studies, and provide accessible insights to our policymaker and practitioner audiences.

OBC is a model for the provision of public services wherein a service provider receives payment, in-part or in-full, only upon the achievement of pre-agreed outcomes.

The data used to conduct the review consists of 1,952 individual studies of OBC. They include peer reviewed journal articles, book chapters, doctoral dissertations, and assorted ‘grey literature’ - that is, reports and evaluations produced outside of traditional academic publications. Those studies were manually filtered by experts on the topic from an initial search of over 11,000 results.

The full text of the articles was obtained from their PDF versions and preprocessed. This involved normalising the text format and removing acknowledgements and bibliographic references.

The corpus was then connected to the INDIGO Impact Bond Dataset. Projects and organisations mentioned in this latter dataset were searched for in the article’s corpus to relate both datasets.

Other types of information identified in the texts were: 1) financial mechanisms (type of outcomes-based instrument), using a list of terms related to those financial mechanisms based on prior discussions with a policy advisory group (Picker et al., 2021); 2) references to the 17 Sustainable Development Goals (SDGs) defined by the United Nations General Assembly in the 2030 Agenda; and 3) country names mentioned in each article and the income levels of those countries, according to the World Classification of Income Levels 2022 by the World Bank.

Three machine learning techniques were applied to the corpus:

Policy areas identification. A query-driven topic model (QDTM) (Fang et al., 2021) was used to determine the probability of an article belonging to different policy areas (health, education, homelessness, criminal justice, employment and training, child and family welfare, and agriculture and environment), using all text of the article as input. The QDTM is a semi-supervised machine learning algorithm that allows users to specify their prior knowledge in the form of simple queries in words or phrases and return query-related topics.

Named Entity Recognition. Three named entity recognition models were applied: “en_core_web_lg” and “en_core_web_trf” models from the python package ‘spaCy’ and the “ner-ontonotes-large” English model from ‘Flair’. “en_core_web_trf” is based on the RoBERTa-base transformer model. ‘Flair’ is a bi-LSTM character-based model. All models were trained on the “OntoNotes 5” data source (Marcus et al., 2011) and are able to identify geographical locations, organisation names, and laws and regulations. An ensemble method was adopted, considering the entities that appear simultaneously in the results of any two models as the correct entities.

Semantic text similarity. We calculated the similarity score between articles. The 10,000 most frequently mentioned words were first extracted from all the articles’ titles and abstracts and the text vectorization technique TF*IDF was applied to convert each article’s abstract into an importance score vector based on these words. Using these numerical vectors, the cosine similarity between different articles was calculated.
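The ensemble rule from the Named Entity Recognition step and the TF*IDF cosine similarity described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names, the `log(N / df)` IDF variant, and the use of a single pre-tokenised word list per article (rather than separate titles and abstracts) are assumptions made for the example.

```python
import math
from collections import Counter


def ensemble_entities(model_outputs):
    """Keep entities that appear in the results of at least two models."""
    votes = Counter()
    for entities in model_outputs:
        votes.update(set(entities))  # each model votes at most once per entity
    return {entity for entity, n in votes.items() if n >= 2}


def tfidf_vectors(docs, vocab_size=10_000):
    """Build one sparse TF-IDF vector (word -> weight) per tokenised document.

    The vocabulary is limited to the `vocab_size` most frequent words.
    The IDF variant log(N / df) is an assumption; the source does not name one.
    """
    term_counts = Counter(word for doc in docs for word in doc)
    vocab = {word for word, _ in term_counts.most_common(vocab_size)}
    doc_freq = Counter(w for doc in docs for w in set(doc) if w in vocab)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(w for w in doc if w in vocab)
        vectors.append({w: c * math.log(n_docs / doc_freq[w]) for w, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy inputs, for illustration only.
ner_results = [{"World Bank", "Kenya"}, {"World Bank"}, {"Kenya", "UN"}]
agreed = ensemble_entities(ner_results)

articles = [
    ["social", "impact", "bond"],
    ["social", "impact", "bond", "health"],
    ["carbon", "tax"],
]
vectors = tfidf_vectors(articles)
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

In this toy run, "UN" is dropped because only one model returned it, and the two articles that share vocabulary score higher than the pair with no words in common.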

The SyROCCo Dataset includes references to the 1952 studies of OBCs mentioned above and the results of the previous processing steps and techniques. Each entry of the dataset contains the following information.

The basic information of each document is its title, abstract, authors, published years, DOI and Article ID:

Title: Title of the document.

Abstract: Text of the abstract.

Authors: Authors of a study.

Published Years: Published Years of a study.

DOI: DOI link of a study.

Article ID: ID of the document selected during the screening process.

The probability of a study belonging to each policy area:

policy_sector_health: The probability that a study belongs to the policy sector “health”.

policy_sector_education: The probability that a study belongs to the policy sector “education”.

policy_sector_homelessness: The probability that a study belongs to the policy sector “homelessness”.

policy_sector_criminal: The probability that a study belongs to the policy sector “criminal”.

policy_sector_employment: The probability that a study belongs to the policy sector “employment”.

policy_sector_child: The probability that a study belongs to the policy sector “child”.

policy_sector_environment: The probability that a study belongs to the policy sector “environment”.

Other types of information such as financial mechanisms, Sustainable Development Goals, and different types of named entities:

financial_mechanisms: Financial mechanisms mentioned in a study.

top_financial_mechanisms: The financial mechanisms mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

top_sgds: Sustainable Development Goals mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

top_countries: Country names mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions. This entry is also used to determine the income level of the mentioned countries.

top_Project: Indigo projects mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

top_GPE: Geographical locations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

top_LAW: Relevant laws and regulations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.

top_ORG: Organisations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
SPAMID-PAIR
data.mendeley.com
Updated Sep 23, 2022
+ more versions
Cite
Antonius Rachmat C (2022). SPAMID-PAIR [Dataset]. http://doi.org/10.17632/fj5pbdf95t.1
Explore at:
Unique identifier
https://doi.org/10.17632/fj5pbdf95t.1
Dataset updated
Sep 23, 2022
Authors
Antonius Rachmat C
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Post-comment pairs were collected from the accounts of 13 selected Indonesian public figures (artists) and public accounts, each with more than 15 million followers and categorized as famous artists. The data were collected from Instagram using an online tool and Selenium. Two people labeled all pairs as experts, for a total of 72,874 records. The data contain Unicode text (UTF-8) and emojis scraped from posts and comments, without account profile information.

It contains several fields:
- igid: Account ID
- comment: Comment of a post
- post: Post from an ID
- emoji: Whether the data contains emojis or not (1 or 0)
- spam: Whether the data is spam or not (1 or 0)
- lengthcomment: The character length of the comment
- lengthpost: The character length of the post
- countemojicomment: Number of emoji symbol characters in comments
- countemojicommentuniq: Number of emoji symbol characters in comments (unique)
- countemojipost: Number of emoji symbol characters in posts
- countemojipostuniq: Number of emoji symbol characters in the post (unique)
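As a rough illustration of how the derived fields listed above relate to the raw post and comment text, the sketch below recomputes them for a single pair. How the dataset's author actually detected emojis is not stated, so the check used here (Unicode category "So", i.e. "Symbol, other") is an assumption and only approximates emoji detection. The helper names are hypothetical, and setting the `emoji` flag when either the post or the comment contains an emoji is also an assumption.

```python
import unicodedata


def is_emoji_like(ch):
    # Assumption: approximate "emoji" as Unicode category "So" (Symbol, other).
    return unicodedata.category(ch) == "So"


def derive_fields(post, comment):
    """Recompute the length and emoji-count fields for one post-comment pair."""
    comment_emojis = [ch for ch in comment if is_emoji_like(ch)]
    post_emojis = [ch for ch in post if is_emoji_like(ch)]
    return {
        # Assumption: flag is 1 if either text contains an emoji-like character.
        "emoji": int(bool(comment_emojis or post_emojis)),
        "lengthcomment": len(comment),  # length in Unicode code points
        "lengthpost": len(post),
        "countemojicomment": len(comment_emojis),
        "countemojicommentuniq": len(set(comment_emojis)),
        "countemojipost": len(post_emojis),
        "countemojipostuniq": len(set(post_emojis)),
    }


# Hypothetical example pair, written with escapes to keep the source ASCII-only.
row = derive_fields(
    post="Happy day \u2600",
    comment="nice \U0001F600\U0001F600\u2764",
)
```

Counts produced this way may differ from the published columns for emoji built from several code points (for example, sequences joined with zero-width joiners), since each code point is inspected separately.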
Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium...
catalog.data.gov
gimi9.com
Updated Mar 4, 2021
Cite
U.S. EPA Office of Research and Development (ORD) (2021). Data Mining Applied to Life Cycle Inventory Modeling for Cumene and Sodium Hydroxide Manufacturing, Version 1, 09/2018 [Dataset]. https://catalog.data.gov/dataset/data-mining-applied-to-life-cycle-inventory-modeling-for-cumene-and-sodium-hydroxide-ma-09
Explore at:
Dataset updated
Mar 4, 2021
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
This file contains the life cycle inventories (LCIs) developed for an associated journal article. Potential users of the data are referred to the journal article for a full description of the modeling methodology. LCIs were developed for cumene and sodium hydroxide manufacturing using data mining with metadata-based data preprocessing. The inventory data were collected from US EPA's 2012 Chemical Data Reporting database, 2011 National Emissions Inventory, 2011 Toxics Release Inventory, 2011 Electronic Greenhouse Gas Reporting Tool, 2011 Discharge Monitoring Report, and the 2011 Biennial Report generated from the RCRAinfo hazardous waste tracking system. The U.S. average cumene gate-to-gate inventories are provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 8 facilities reporting public production volumes of cumene in the U.S., totaling 2,609,309,687 kilograms of cumene produced that year. The U.S. average sodium hydroxide gate-to-gate inventories are also provided without (baseline) and with process allocation applied using metadata-based filtering. In 2011, there were 24 facilities reporting public production volumes of sodium hydroxide in the U.S., totaling 3,878,021,614 kilograms of sodium hydroxide produced that year. Process allocation was only conducted for the top 12 facilities producing sodium hydroxide, which represent 97% of the public production of sodium hydroxide. The data have not been compiled in the formal Federal Commons LCI Template to avoid users interpreting the template to mean the data have been fully reviewed according to LCA standards and can be directly applied to all types of assessments and decision needs without additional review by industry and potential stakeholders.
This dataset is associated with the following publication: Meyer, D.E., S. Cashman, and A. Gaglione. Improving the reliability of chemical manufacturing life cycle inventory constructed using secondary data. JOURNAL OF INDUSTRIAL ECOLOGY. Berkeley Electronic Press, Berkeley, CA, USA, 25(1): 20-35, (2021).
Mining and Industrial Facilities
environment-saskatchewan.hub.arcgis.com
geohub.saskatchewan.ca
Updated Oct 30, 2023
Cite
Government of Saskatchewan (2023). Mining and Industrial Facilities [Dataset]. https://environment-saskatchewan.hub.arcgis.com/datasets/mining-and-industrial-facilities/about
Explore at:
Dataset updated
Oct 30, 2023
Dataset authored and provided by
Government of Saskatchewan
License
https://gisappl.saskatchewan.ca/Html5Ext/Resources/GOS_Standard_Unrestricted_Use_Data_Licence_v2.0.pdf
Description
The Ministry of Environment manages a dataset of mining and industrial facilities it regulates. This content will help increase awareness and transparency regarding these activities in the province. These include agricultural processing facilities, mining facilities, power generation facilities, oil and gas processing facilities, and industrial waste management facilities. For further information, please contact the Ministry of Environment Inquiry Centre (Toll Free) 1-800-567-4224, centre.inquiry@gov.sk.ca or visit the link on saskatchewan.ca. Locations are approximate and do not capture the entire facilities footprint. Information on this map is provided as a public service by the Government of Saskatchewan. We cannot guarantee that all information is current and accurate. Users should verify the information before acting on it. The Saskatchewan Government does not assume any responsibility for any damages caused by (mis)use of this map.
Mining and Industrial Facilities
catalogue.arctic-sdi.org
gimi9.com
+1 more
Updated Jun 5, 2022
Cite
(2022). Mining and Industrial Facilities [Dataset]. https://catalogue.arctic-sdi.org/geonetwork/srv/search?keyword=Facilities
Explore at:
Dataset updated
Jun 5, 2022
Description
Saskatchewan Mining and Industrial facilities with permits managed by Saskatchewan Ministry of Environment (Environmental Protection Branch). The Ministry of Environment manages a dataset of mining and industrial facilities it regulates. This content will help increase awareness and transparency regarding these activities in the province. These include agricultural processing facilities, mining facilities, power generation facilities, oil and gas processing facilities, and industrial waste management facilities. For further information, please contact the Ministry of Environment Inquiry Centre (Toll Free) 1-800-567-4224, centre.inquiry@gov.sk.ca or visit the link on saskatchewan.ca. Locations are approximate and do not capture the entire facilities footprint. Information on this map is provided as a public service by the Government of Saskatchewan. We cannot guarantee that all information is current and accurate. Users should verify the information before acting on it. The Saskatchewan Government does not assume any responsibility for any damages caused by (mis)use of this map.
Mining and Industrial Facilities - Catalogue - Canadian Urban Data Catalogue...
data.urbandatacentre.ca
beta.data.urbandatacentre.ca
Updated Oct 1, 2024
Cite
(2024). Mining and Industrial Facilities - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-1961c113-b8c5-28c6-1b7d-10970961c2ad
Explore at:
Dataset updated
Oct 1, 2024
Area covered
Canada
Description
Saskatchewan Mining and Industrial facilities with permits managed by Saskatchewan Ministry of Environment (Environmental Protection Branch). The Ministry of Environment manages a dataset of mining and industrial facilities it regulates. This content will help increase awareness and transparency regarding these activities in the province. These include agricultural processing facilities, mining facilities, power generation facilities, oil and gas processing facilities, and industrial waste management facilities. For further information, please contact the Ministry of Environment Inquiry Centre (Toll Free) 1-800-567-4224, centre.inquiry@gov.sk.ca or visit the link on saskatchewan.ca. Locations are approximate and do not capture the entire facilities footprint. Information on this map is provided as a public service by the Government of Saskatchewan. We cannot guarantee that all information is current and accurate. Users should verify the information before acting on it. The Saskatchewan Government does not assume any responsibility for any damages caused by (mis)use of this map.
Mt. Weld rare earths project : incorporating mining and beneficiation at Mt....
data.nsw.gov.au
researchdata.edu.au
pdf
Updated Mar 13, 2024
Cite
NSW Department of Planning, Housing and Infrastructure (2024). Mt. Weld rare earths project : incorporating mining and beneficiation at Mt. Weld and secondary processing at Meenaar : public environmental review [Dataset]. https://data.nsw.gov.au/data/dataset/mt-weld-rare-earths-project-incorporating-mining-and-beneficiation-at-mt-weld-and-secondary-pr034bf
Explore at:
Available download formats: pdf
Dataset updated
Mar 13, 2024
Dataset provided by
Department of Planning, Housing and Infrastructure https://www.nsw.gov.au/departments-and-agencies/department-of-planning-housing-and-infrastructure
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Earth
Description
Environmental Impact Statement: Mt. Weld rare earths project : incorporating mining and beneficiation at Mt. Weld and secondary processing at Meenaar : public environmental review
Critical minerals advanced projects, mines and processing facilities in...
open.canada.ca
catalogue.arctic-sdi.org
esri rest, fgdb/gdb +4
Updated Feb 20, 2025
+ more versions
Cite
Natural Resources Canada (2025). Critical minerals advanced projects, mines and processing facilities in Canada [Dataset]. https://open.canada.ca/data/en/dataset/22b2db8a-dc12-47f2-9737-99d3da921751
Explore at:
Available download formats: esri rest, mxd, shp, html, fgdb/gdb, wms
Dataset updated
Feb 20, 2025
Dataset provided by
Ministry of Natural Resources of Canada https://www.nrcan.gc.ca/
License
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Time period covered
Jan 1, 2024 - Dec 31, 2024
Area covered
Canada
Description
This dataset contains primary processing facilities (e.g., smelters and refineries), mines and advanced projects related to Canada’s 34 critical minerals. Advanced projects are those with mineral reserves or resources (measured or indicated), the potential viability of which is supported by a preliminary economic assessment or a prefeasibility/feasibility study. These sites process, produce or consider producing at least one of Canada's critical minerals, but other minerals and metals may also be present. This dataset contains links that direct to non-Government of Canada websites that are not subject to the Privacy Act, the Official Languages Act or the Standard on Web Accessibility. Please see our terms and conditions for more information (https://www.nrcan.gc.ca/terms-and-conditions/10847). Primary processing facilities and mines data are sourced from Map 900A, Principal mineral areas, producing mines, and oil and gas fields in Canada. Data on advanced critical minerals projects are produced and published annually by Natural Resources Canada, in collaboration with provinces and territories. Data are compiled from a variety of public sources. Natural Resources Canada does not assume responsibility for errors or omissions. Please report any recommended revisions.

Cite

Daniel Reissner (2022). Public benchmark dataset for Conformance Checking in Process Mining [Dataset]. http://doi.org/10.26188/5cd91d0d3adaa

Public benchmark dataset for Conformance Checking in Process Mining

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: xml

Unique identifier

https://doi.org/10.26188/5cd91d0d3adaa

Dataset updated

Jan 30, 2022

Dataset provided by

The University of Melbourne

Authors

Daniel Reissner

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset contains a variety of publicly available real-life event logs. We derived two types of Petri nets for each event log with two state-of-the-art process miners: Inductive Miner (IM) and Split Miner (SM). Each event log-Petri net pair is intended for evaluating the scalability of existing conformance checking techniques. We used this dataset to evaluate the scalability of the S-Component approach for measuring fitness. The dataset contains tables of descriptive statistics of both process models and event logs. In addition, this dataset includes the results in terms of time performance measured in milliseconds for several approaches for both multi-threaded and single-threaded executions. Last, the dataset contains a cost-comparison of different approaches and reports on the degree of over-approximation of the S-Components approach. The description of the compared conformance checking techniques can be found here: https://arxiv.org/abs/1910.09767.

Update: The dataset has been extended with the event logs of the BPIC18 and BPIC19 logs. BPIC19 is actually a collection of four different processes and thus was split into four event logs. For each of the additional five event logs, again, two process models have been mined with inductive and split miner. We used the extended dataset to test the scalability of our tandem repeats approach for measuring fitness. The dataset now contains updated tables of log and model statistics as well as tables of the conducted experiments measuring execution time and raw fitness cost of various fitness approaches. The description of the compared conformance checking techniques can be found here: https://arxiv.org/abs/2004.01781.

Update: The dataset has also been used to measure the scalability of a new Generalization measure based on concurrent and repetitive patterns. A concurrency oracle is used in tandem with partial orders to identify concurrent patterns in the log that are tested against parallel blocks in the process model. Tandem repeats are used with various trace reductions and extensions to define repetitive patterns in the log that are tested against loops in the process model. Each pattern is assigned a partial fulfillment. The generalization is then the average of pattern fulfillments weighted by the trace counts for which the patterns have been observed. The dataset now includes the time results and a breakdown of Generalization values for the dataset.
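The weighted average described for the Generalization measure can be written out as a short sketch. The description only states that pattern fulfillments are averaged with trace counts as weights, so the function name, the input layout (a list of `(fulfillment, trace_count)` pairs), and the example numbers below are illustrative assumptions rather than the author's implementation.

```python
def generalization(patterns):
    """Average of pattern fulfillments, weighted by observed trace counts.

    `patterns` is a list of (fulfillment, trace_count) pairs, where
    fulfillment is assumed to lie in [0, 1] (hypothetical input layout).
    """
    total_traces = sum(count for _, count in patterns)
    if total_traces == 0:
        raise ValueError("at least one pattern must have a non-zero trace count")
    weighted_sum = sum(fulfillment * count for fulfillment, count in patterns)
    return weighted_sum / total_traces


# Hypothetical example: one fully fulfilled pattern seen in 30 traces and
# one half-fulfilled pattern seen in 10 traces.
score = generalization([(1.0, 30), (0.5, 10)])  # (30 + 5) / 40 = 0.875
```

Weighting by trace count means a pattern observed in many traces influences the score more than a rarely observed one.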
