License: Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project investigating the potential correlation between the amount of time an individual spends on social media and its impact on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The project was developed in a Jupyter notebook on Google Colab:
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
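The pipeline described above (survey answers in, a "should seek professional help" prediction out) can be sketched with scikit-learn. Everything below is illustrative: the feature columns, the labelling rule, and all values are invented, not taken from the actual survey.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey: the columns and the labelling rule are
# invented here; the real project uses the team's survey questions.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 200)            # daily hours on social media
distraction = rng.integers(1, 6, 200)      # a 1-5 Likert-scale answer
y = (hours + distraction > 8).astype(int)  # toy "should seek help" label

X = np.column_stack([hours, distraction])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

Any of the listed libraries would slot into this shape: Pandas for loading the survey CSV, Matplotlib/Seaborn for exploring the answers, and scikit-learn for the model itself.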
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.
Usage notes:
Misinformation detection, classification, tracking, and prediction.
Misinformation sentiment analysis.
Rumor veracity classification and comment stance classification.
Rumor tracking and social network analysis.
Data pre-processing and data analysis code is available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see the full information in our GitHub repository.
Cite us:
Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.
@article{cheng2021covid,
  title={A COVID-19 Rumor Dataset},
  author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
  journal={Frontiers in Psychology},
  volume={12},
  pages={1566},
  year={2021},
  publisher={Frontiers}
}
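A minimal sketch of the veracity-classification use case. The texts and labels below are invented for illustration; the real rumor texts and annotations ship with the dataset (see the GitHub repository).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy texts and labels standing in for the dataset's rumor entries.
texts = ["garlic cures covid", "vaccines reduce severe illness",
         "5g towers spread the virus", "masks lower transmission risk"] * 5
labels = [0, 1, 0, 1] * 5  # 0 = false rumor, 1 = true statement

# TF-IDF features + logistic regression is a common baseline for this task.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["garlic cures covid"])[0])
```

The same pipeline shape applies to the other listed tasks (stance classification, sentiment analysis) by swapping the label column.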
GitHub User Analysis 2019 for Graph Dataset
This is a large social network of GitHub developers, collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted from each user's location, starred repositories, employer, and e-mail address. The task associated with the graph is binary node classification: predict whether a GitHub user is a web developer or a machine learning developer. This target feature was derived from the job title of each user.
GitHub User Analysis 2019 for Graph Dataset tasks:
1. Can you predict whether a 2019 GitHub user is a software engineer or an AI engineer based on their user analysis and posting tendency?
2. Can you predict whether a 2019 GitHub user would follow an AI researcher based on their user analysis and posting tendency?
3. Can you predict whether a 2019 GitHub user would produce good publications based on their user analysis and posting tendency?
Try to visualize the user analysis and tendency data, and look for patterns in it.
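The node-classification task on the mutual-follower graph can be sketched without any graph library: classify an unlabelled node by the majority label of its labelled neighbours. The edge list and labels below are made up; the real dataset ships its own edge and feature files.

```python
from collections import Counter, defaultdict

# Toy mutual-follower graph; edges and labels are invented for illustration.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = {0: "web", 1: "web", 2: "web", 4: "ml", 5: "ml"}  # node 3 unlabelled

adj = defaultdict(set)
for u, v in edges:          # mutual follower edges are undirected
    adj[u].add(v)
    adj[v].add(u)

def predict(node):
    # Majority label among the labelled neighbours of `node`.
    votes = Counter(labels[n] for n in adj[node] if n in labels)
    return votes.most_common(1)[0][0]

print(predict(3))  # neighbours 4, 5 ("ml") outvote neighbour 2 ("web")
```

Real baselines for this dataset typically use the vertex features and a graph neural network, but the neighbourhood-vote idea above is the simplest instance of the task.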
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.
Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".
Slides from a talk about this dataset, given at Strata in March 2018, are also available.
Further reading on this dataset is available in Chapter 6 of the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning, and in lesson 9 of Essential Machine Learning and AI with Python and Jupyter Notebook.
You can watch a breakdown of using cluster analysis on the Pragmatic AI YouTube channel
Learn to deploy a Kaggle project into a production machine learning service (scikit-learn + Flask + container) by reading Python for DevOps: Learn Ruthlessly Effective Automation, Chapter 14: MLOps and Machine Learning Engineering.
Use social media to predict a winning season with this notebook: https://github.com/noahgift/core-stats-datascience/blob/master/Lesson2_7_Trends_Supervized_Learning.ipynb
Learn to use the cloud for data analysis.
Data sources include ESPN, Basketball-Reference, Twitter, FiveThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.
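A typical first step with this dataset is joining the on-court performance table with the salary (or Twitter-engagement) table on the player name and checking the correlation. The column names and numbers below are assumptions for illustration, not verified against the actual CSVs.

```python
import pandas as pd

# Illustrative frames; real files would be loaded with pd.read_csv(...).
stats = pd.DataFrame({"PLAYER": ["A", "B", "C"], "WINS_RPM": [10.2, 5.1, 1.3]})
salary = pd.DataFrame({"PLAYER": ["A", "B", "C"], "SALARY": [25e6, 12e6, 3e6]})

# Join on the shared player column, then correlate performance with pay.
df = stats.merge(salary, on="PLAYER")
print(df["WINS_RPM"].corr(df["SALARY"]))
```

The same merge pattern extends to the Wikipedia-traffic and Twitter-engagement tables, which is how the valuation/attention analyses in the linked articles are built.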
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) a focus on medical news articles and blog posts, as opposed to social media posts or political discussions; (2) the provision of multiple modalities (besides the full-texts of the articles, there are also images and videos), enabling research on multimodal approaches; (3) the mapping of articles to fact-checked claims (with manual as well as predicted labels); and (4) source credibility labels for 95% of all articles, plus other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and a descriptive analysis of the dataset in the form of Jupyter notebooks.
To obtain access to the full dataset (in CSV format), please request access by following the instructions provided below.
Note: Please also check our MultiClaim dataset, which provides a more recent, larger, and highly multilingual dataset of fact-checked claims, social media posts, and relations between them.
References
If you use this dataset in any publication, project, tool, or other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
To create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
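The combination of claim presence and article stance into a partial article-claim pair veracity, as described above, can be sketched as a small function. The label vocabulary used here ("supports", "rejects", "neutral") and the rating strings are assumptions for illustration, not the dataset's exact schema.

```python
# Hedged sketch: derive a partial article-claim veracity from the claim's
# fact-checked rating, the claim-presence label, and the article-stance label.
def article_claim_veracity(claim_rating, claim_present, stance):
    """Return a partial veracity for the article-claim pair, or None."""
    if not claim_present or stance == "neutral":
        return None  # no conclusion can be drawn for this pair
    if stance == "supports":
        return claim_rating  # the article endorses the claim's rating
    if stance == "rejects":
        return {"true": "false", "false": "true"}.get(claim_rating)
    return None

print(article_claim_veracity("false", True, "supports"))
```

Note that, consistent with the methodology above, this labels only the article-claim pair, never the article as a whole.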
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files:
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
The dataset provides specifically these entity
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
This is the repository of all the research data for the PhD thesis of doctoral candidate Nan BAI from the Faculty of Architecture and the Built Environment at Delft University of Technology, entitled 'Sensing the Cultural Significance with AI for Social Inclusion: A Computational Spatiotemporal Network-based Framework of Heritage Knowledge Documentation using User-Generated', to be defended on October 5th, 2023.
Social Inclusion has been growing as a goal in heritage management. Whereas the 2011 UNESCO Recommendation on the Historic Urban Landscape (HUL) called for tools of knowledge documentation, social media already functions as a platform for online communities to actively involve themselves in heritage-related discussions. Such discussions happen both in “baseline scenarios” when people calmly share their experiences about the cities they live in or travel to, and in “activated scenarios” when radical events trigger their emotions. To organize, process, and analyse the massive unstructured multi-modal (mainly images and texts) user-generated data from social media efficiently and systematically, Artificial Intelligence (AI) is shown to be indispensable. This thesis explores the use of AI in a methodological framework to include the contribution of a larger and more diverse group of participants with user-generated data. It is an interdisciplinary study integrating methods and knowledge from heritage studies, computer science, social sciences, network science, and spatial analysis. AI models were applied, nurtured, and tested, helping to analyse the massive information content to derive the knowledge of cultural significance perceived by online communities. The framework was tested in case study cities including Venice, Paris, Suzhou, Amsterdam, and Rome for the baseline and/or activated scenarios. The AI-based methodological framework proposed in this thesis is shown to be able to collect information in cities and map the knowledge of the communities about cultural significance, fulfilling the expectation and requirement of HUL, useful and informative for future socially inclusive heritage management processes.
Some parts of this data are published as GitHub repositories:
WHOSe Heritage
The data of Chapter_3_Lexicon is published at https://github.com/zzbn12345/WHOSe_Heritage, which also hosts the code for the paper WHOSe Heritage: Classification of UNESCO World Heritage Statements of "Outstanding Universal Value" Documents with Soft Labels, published in Findings of EMNLP 2021 (https://aclanthology.org/2021.findings-emnlp.34/).
Heri Graphs
The data of Chapter_4_Datasets is published at https://github.com/zzbn12345/Heri_Graphs, which also hosts the code and dataset for the paper Heri-Graphs: A Dataset Creation Framework for Multi-modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media, published in the ISPRS International Journal of Geo-Information. It covers the collection, preprocessing, and rearrangement of data related to heritage values and attributes in three cities with canal-related UNESCO World Heritage properties: Venice, Suzhou, and Amsterdam.
Stones Venice
The data of Chapter_5_Mapping is published at https://github.com/zzbn12345/Stones_Venice, which also hosts the code and dataset for the paper Screening the stones of Venice: Mapping social perceptions of cultural significance through graph-based semi-supervised classification, published in the ISPRS Journal of Photogrammetry and Remote Sensing. It covers the mapping of cultural significance in the city of Venice.
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
There is a lot more that we can attain from social media sentiment and data than mere likes and shares, especially where health care is concerned. This dataset is part of the data collected for the Vaccine Hesitancy challenge on JOGL. We believe it is important to capture the views and trends of the public, and social media sites like Twitter provide a good window into this area.
We collected all tweets containing the search string "vaccination". Along with the tweet text, we downloaded the date and time when the tweet was published and the location of the user (if provided). We also downloaded the user id, follower ids, and friend ids. The followers of a user A are those users who receive messages from user A; the friends of user A are those users from whom user A receives messages. Thus, information flows from a user to their followers. We collected tweets using the open-source tool TWINT (https://github.com/twintproject) and a Python script.
In contrast to the open Twitter Search API, which only allows one to query tweets posted within the last seven days, TWINT makes it possible to collect a much larger sample of Twitter posts, spanning several years. We queried TWINT for different key terms related to the topic of vaccination, ranging from the year 2006 to the 30th of November 2019, and stored the results in an aggregated CSV file.
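The aggregation step described above (filter collected tweets to the stated date window, write one CSV) might look like the sketch below. The field names and example rows are invented; only the date window comes from the description.

```python
import csv
from datetime import date

# Invented stand-in for the collected tweets; real rows come from TWINT.
tweets = [
    {"date": "2005-12-31", "text": "before the window"},
    {"date": "2019-06-01", "text": "vaccination debate"},
    {"date": "2019-12-05", "text": "after the window"},
]

# Keep only tweets inside the stated collection window.
start, end = date(2006, 1, 1), date(2019, 11, 30)
kept = [t for t in tweets if start <= date.fromisoformat(t["date"]) <= end]

# Store the filtered rows in one aggregated CSV file.
with open("vaccination_tweets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "text"])
    writer.writeheader()
    writer.writerows(kept)

print(len(kept))
```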
We wouldn't be here without the help of others.
To our knowledge, no active program is currently carrying out qualitative analysis of Twitter data for sentiment associated with vaccination. However, a number of studies have been carried out to analyse Twitter for social media trends on vaccination.
The dataset can be used for analyses including:
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
The InstaFake Dataset comprises anonymized Instagram user data collected by Fatih Cagatay Akyon and Esat Kalfaoglu over the second half of 2018. We are releasing this dataset publicly to aid the research community in making advancements in machine-learning-based social media analysis.
Dataset with annotated 12-lead ECG records. The exams were taken in 811 counties in the state of Minas Gerais, Brazil by the Telehealth Network of Minas Gerais (TNMG) between 2010 and 2016, and organized by the CODE (Clinical Outcomes in Digital Electrocardiography) group.
Requesting access
Researchers affiliated with educational or research institutions may request access to this dataset. Requests will be analyzed on an individual basis and should contain: the name of the PI and host organisation; contact details (including your name and email); and the scientific purpose of the data access request. If approved, a data user agreement will be forwarded to the researcher who made the request (through the email that was provided). After the agreement has been signed (by the researcher or by the research institution), access to the dataset will be granted.
Openly available subset: A subset of this dataset (with 15% of the patients) is openly available. See: "CODE-15%: a large scale annotated dataset of 12-lead ECGs" (https://doi.org/10.5281/zenodo.4916206).
Content
The folder contains:
A column-separated file containing basic patient attributes.
The ECG waveforms in the WFDB format.
Additional references
The dataset is described in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network", https://www.nature.com/articles/s41467-020-15432-4. Related publications also using this dataset are:
[1] G. Paixao et al., “Validation of a Deep Neural Network Electrocardiographic-Age as a Mortality Predictor: The CODE Study,” Circulation, vol. 142, no. Suppl_3, pp. A16883–A16883, Nov. 2020, doi: 10.1161/circ.142.suppl_3.16883.
[2] A. L. P. Ribeiro et al., “Tele-electrocardiography and big data: The CODE (Clinical Outcomes in Digital Electrocardiography) study,” Journal of Electrocardiology, Sep. 2019, doi: 10/gf7pwg.
[3] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. P. Ribeiro, and W. Meira Jr, “Explaining end-to-end ECG automated diagnosis using contextual features,” in Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), Ghent, Belgium, Sep. 2020, vol. 12461, pp. 204–219, doi: 10.1007/978-3-030-67670-4_13.
[4] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. Ribeiro, and W. Meira Jr, “Explaining black-box automated electrocardiogram classification to cardiologists,” in 2020 Computing in Cardiology (CinC), 2020, vol. 47, doi: 10.22489/CinC.2020.452.
[5] G. M. M. Paixão et al., “Evaluation of mortality in bundle branch block patients from an electronic cohort: Clinical Outcomes in Digital Electrocardiography (CODE) study,” Journal of Electrocardiology, Sep. 2019, doi: 10/dcgk.
[6] G. M. M. Paixão et al., “Evaluation of Mortality in Atrial Fibrillation: Clinical Outcomes in Digital Electrocardiography (CODE) Study,” Global Heart, vol. 15, no. 1, p. 48, Jul. 2020, doi: 10.5334/gh.772.
[7] G. M. M. Paixão et al., “Electrocardiographic Predictors of Mortality: Data from a Primary Care Tele-Electrocardiography Cohort of Brazilian Patients,” Hearts, vol. 2, no. 4, Dec. 2021, doi: 10.3390/hearts2040035.
[8] G. M. Paixão et al., “ECG-age from artificial intelligence: A new predictor for mortality? The CODE (Clinical Outcomes in Digital Electrocardiography) study,” Journal of the American College of Cardiology, vol. 75, no. 11 Supplement 1, p. 3672, 2020, doi: 10.1016/S0735-1097(20)34299-6.
[9] E. M. Lima et al., “Deep neural network estimated electrocardiographic-age as a mortality predictor,” Nature Communications, vol. 12, 2021, doi: 10.1038/s41467-021-25351-7.
[10] W. Meira Jr, A. L. P. Ribeiro, D. M. Oliveira, and A. H. Ribeiro, “Contextualized Interpretable Machine Learning for Medical Diagnosis,” Communications of the ACM, 2020, doi: 10.1145/3416965.
[11] A. H. Ribeiro et al., “Automatic diagnosis of the 12-lead ECG using a deep neural network,” Nature Communications, vol. 11, no. 1, p. 1760, 2020, doi: 10/drkd.
[12] A. H. Ribeiro et al., “Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network,” Machine Learning for Health (ML4H) Workshop at NeurIPS, 2018.
[13] A. H. Ribeiro et al., “Automatic 12-lead ECG classification using a convolutional network ensemble,” 2020, doi: 10.22489/CinC.2020.130.
[14] V. Sangha et al., “Automated Multilabel Diagnosis on Electrocardiographic Images and Signals,” medRxiv, Sep. 2021, doi: 10.1101/2021.09.22.21263926.
[15] S. Biton et al., “Atrial fibrillation risk prediction from the 12-lead ECG using digital biomarkers and deep representation learning,” European Heart Journal - Digital Health, 2021, doi: 10.1093/ehjdh/ztab071.
Code
The following GitHub repositories perform analyses that use this dataset:
https://github.com/antonior92/automatic-ecg-diagnosis
https://github.com/antonior92/ecg-age-prediction
Related Datasets
CODE-test: An annotated 12-lead ECG dataset (https://doi.org/10.5281/zenodo.3765780)
CODE-15%: a large scale annotated dataset of 12-lead ECGs (https://doi.org/10.5281/zenodo.4916206)
Sami-Trop: 12-lead ECG traces with age and mortality annotations (https://doi.org/10.5281/zenodo.4905618)
Ethics declarations
The CODE Study was approved by the Research Ethics Committee of the Universidade Federal de Minas Gerais, protocol 49368496317.7.0000.5149.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found that the federal forest inventory of Lower Saxony, Germany, represents an unseen treasure of annotated samples for training data generation. The respective 20-cm color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking, as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples which are supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.
Version history
v1.0.2 - Minor bug fix multi label JSON file
v1.0.1 - Minor bug fixes in multi label JSON file and description file
v1.0.0 - First release
Citation
Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth Syst. Sci. Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitLab and GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark) and the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) (https://github.com/DFKI/treesatai_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. An example entry for an image sample with respective proportions of about 94% Abies and 6% Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]
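A small sketch of parsing such an entry. The JSON structure is copied from the example above; the 0.05 area-share cut-off is an illustrative choice for turning the proportions into multi-labels, not one prescribed by the archive.

```python
import json

# One entry of the multi-label JSON, copied from the example above.
labels_json = ('{"Abies_alba_3_834_WEFL_NLF.tif": '
               '[["Abies", 0.93771], ["Larix", 0.06229]]}')
labels = json.loads(labels_json)

threshold = 0.05  # illustrative cut-off on the area share per genus
for filename, genera in labels.items():
    kept = [genus for genus, share in genera if share >= threshold]
    print(filename, kept)
```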
The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs, or aerial imagery from different time steps are very welcome. This helps the research community develop better deep learning and machine learning models for forest applications. If you have questions or want to share code/results/publications using the archive, feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TUB Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Project publications
Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth System Science Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.
Schulz, C., Förster, M., Vulova, S. V., Rocha, A. D., and Kleinschmit, B.: Spectral-temporal traits in Sentinel-1 C-band SAR and Sentinel-2 multispectral remote sensing time series for 61 tree species in Central Europe. Remote Sensing of Environment, 307, 114162, https://doi.org/10.1016/j.rse.2024.114162, 2024.
Conference contributions
Ahlswede, S. Madam, N.T., Schulz, C., Kleinschmit, B., and Demіr, B.: Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.48550/arXiv.2201.07495, 2022.
Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.1109/IGARSS46834.2022.9884173, 2022.
Schulz, C., Förster, M., Vulova, S., and Kleinschmit, B.: The temporal fingerprints of common European forest types from SAR and optical remote sensing data, AGU Fall Meeting, New Orleans, USA, 2021.
Kleinschmit, B., Förster, M., Schulz, C., Arias, F., Demir, B., Ahlswede, S., Aksoy, A.K., Ha Minh, T., Hees, J., Gava, C., Helber, P., Bischke, B., Habelitz, P., Frick, A., Klinke, R., Gey, S., Seidel, D., Przywarra, S., Zondag, R., and Odermatt B.: Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests, Living Planet Symposium, Bonn, Germany, 2022.
Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series, ForestSAT, Berlin, Germany, 2022.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
One of the aims of the Media Content Analysis Lab (MCAL) is to provide an overview of the field of content analysis research. To this end, the MCALentory is an inventory of content analytical studies in the Netherlands, 2000-2023. This inventory serves multiple purposes. First, it can be used as a source of inspiration for designing future content analytical studies, for example in terms of the operationalization of key concepts. Second, the archive can be used for replication studies and meta-analyses. Third, the data can potentially be used as training data for machine learning algorithms.
To get an overview of what is in the data and what you can do with it, please first read the "data story" Introduction to MCAL by Annelien Van Remoortere. More examples can be found here. Below we describe how this dataset was made; you can skip this if you are not interested in the "making of" story.
In a first step, we systematically collected data (in this case scientific articles, see description below). In the next step, we coded the collected material on several features related to the content, type of content analysis conducted, reporting of reliability/quality of content analysis, and the availability of corpora and datasets. Special attention was devoted to the degree to which authors adhere to the FAIR principles.
To make an inventory of existing media content analysis studies that focus (at least partly) on the Netherlands, we first selected the top 30 communication science journals according to Web of Science (in 2021). We selected papers published from 2000 until 2023. All papers that looked at traditional media outlets (television, newspapers), new media (online news outlets) and social media (Twitter, Instagram, Facebook) in the Netherlands were included. In total, we collected and annotated 196 articles.
For every journal, a search was performed in Google Scholar using the query: "content analysis" media netherlands source:"selected journal" site:site of journal. Next, all Google Scholar hits were manually checked based on the title, abstract and, if there was still doubt, the method section of the paper. The results are available in CSV format, see the most recent Mcalentory.csv file in the Assets section of this site. For questions about the data collection methods and the content of this file please contact annelien.vanremoortere@wur.nl or rens.vliegenthart@wur.nl.
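As a minimal, hypothetical sketch of working with the inventory (the column names below are illustrative, not the actual Mcalentory.csv header), the CSV can be loaded and summarized with pandas:

```python
import io
import pandas as pd

# Illustrative stand-in for Mcalentory.csv; the real file and its
# column names are in the Assets section of this site.
csv_text = (
    "journal,year,media_type,analysis_type\n"
    "Journalism Studies,2015,newspapers,manual\n"
    "New Media & Society,2021,Twitter,automated\n"
    "New Media & Society,2022,Instagram,manual\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Count annotated articles per type of content analysis.
print(df["analysis_type"].value_counts())
```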
The controlled vocabularies created for this project have all been published in the https://w3id.org/odissei/cv/ namespace. The other MCAL RDF knowledge graphs published here use the mcal: URI prefix https://w3id.org/odissei/ns/mcal/. For an explanation of how URIs starting with https://w3id.org/odissei/ are redirected, see https://github.com/odissei-data/w3id.org/tree/master/odissei.
The dataset consists of the following named graphs, see the Graphs section on this site:
- The main dataset as mcal:graph/mcalentory. The TriplyETL code to convert the CSV file above into RDF is publicly available on the ODISSEI GitHub. This graph has been validated against these constraints expressed in SHACL.
- A controlled vocabulary https://w3id.org/odissei/cv/contentFeature/v0.1/ for the content features has been generated from this Google sheet, for the conversion code see the same codebase as above.
- https://w3id.org/odissei/cv/contentAnalysisType/v0.1/, a copy of the manually created vocabulary for content analysis types
- https://w3id.org/odissei/cv/researchQuestionType/v0.1/, a copy of the manually created vocabulary for research questions types
The project is now wrapping up, and the data as represented here should be correct to the best of our knowledge. For comments or questions on the MCAL knowledge graph, feel free to contact jacco.van.ossenbruggen@vu.nl or angelica@odissei-data.nl.
For those with an account on this platform: This dataset is published by the most recent TriplyETL pipeline listed here: internal link
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
General information on the data set
The data set was generated at the ZeMA testbed. A working cycle lasts 2.8 s and consists of a forward stroke, a waiting time, and a return stroke. The data set does not contain entire working cycles; only one second of the return stroke of each working cycle is included.
Structure of the data
Allocation of the pages to the sensors
page 1: microphone
page 2: acceleration plain bearing
page 3: acceleration piston rod
page 4: acceleration ball bearing
page 5: axial force
page 6: pressure
page 7: velocity
page 8: active current
page 9: motor current phase 1
page 10: motor current phase 2
page 11: motor current phase 3
Remark
The datasets are not in SI units. For conversion, you can use the PDF documentation.
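The page-to-sensor allocation above can be captured directly in code. A minimal sketch follows; the conversion function uses placeholder `scale` and `offset` parameters, since the real per-sensor conversion factors are given only in the PDF documentation:

```python
# Mapping of data-set pages to sensors, as listed above.
PAGE_TO_SENSOR = {
    1: "microphone",
    2: "acceleration plain bearing",
    3: "acceleration piston rod",
    4: "acceleration ball bearing",
    5: "axial force",
    6: "pressure",
    7: "velocity",
    8: "active current",
    9: "motor current phase 1",
    10: "motor current phase 2",
    11: "motor current phase 3",
}

def to_si(raw_value, scale, offset=0.0):
    """Convert a raw sensor reading to SI units.

    `scale` and `offset` are placeholders; look up the real
    conversion factors for each sensor in the PDF documentation.
    """
    return raw_value * scale + offset

print(PAGE_TO_SENSOR[6])  # -> pressure
```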
Further information
For an introduction and tutorial to this data, a set of Jupyter notebooks is available here. These notebooks contain Python code and document example machine learning tasks and analyses of this data set. In the near future, they will be extended to also include uncertainties in the input data.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Description This dataset contains a collection of short reviews extracted from Letterboxd, a popular movie social networking site. Each review is a brief commentary, consisting of user-generated text expressing opinions and sentiments about various movies. The dataset covers a wide range of films from different genres and time periods, making it a valuable resource for sentiment analysis, natural language processing, and other related tasks.
Dataset Highlights - Short reviews from real users: The dataset comprises genuine reviews shared by users, reflecting diverse opinions and emotions towards the movies they have watched. - Movie diversity: The reviews cover a vast array of films, including classics, recent releases, and cult favorites, providing a rich and diverse dataset for analysis. - Potential Applications: This dataset can be used for sentiment analysis, emotion detection, and other text-based machine learning tasks, enabling researchers and practitioners to gain insights into movie reception and audience preferences.
Scraping code: here
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
The WWE Superstar Popularity Prediction Dataset is a comprehensive collection of professional wrestling data designed for machine learning, data analysis, and sports analytics projects. This dataset captures the complete ecosystem of WWE superstars, their careers, and performance metrics in a unified structure.
There are no WWE Superstar datasets currently uploaded here on Kaggle, which is why I wanted to create a new dataset for the current WWE roster that can be used to determine each superstar's popularity tier (main eventer, mid-carder, or jobber) from their career statistics and performance metrics using machine learning algorithms.
This is my first published dataset. Upvotes are really appreciated.
The full GitHub repository containing all notebooks with clear documentation: GitHub
This dataset was specifically designed to meet the needs of modern data science projects:
Fighters: 70+ WWE superstars with detailed profiles
Fights: Career match statistics and performance metrics
Events: Pay-per-view main events and championship history
Context: Brand affiliations, weight classes, and career timelines
Raw Data Structure: Contains natural variations and real-world data challenges
Missing Value Opportunities: Some fields intentionally sparse for cleaning practice
Data Type Diversity: Mixed numerical, categorical, and encoded features
Outlier Detection: Natural variations in career statistics
Current Roster: 2025 WWE superstars from RAW and SmackDown
Active Champions: Current title holders and recent changes
Modern Metrics: Social media integration and digital presence
Career Progression: Ongoing career tracking
Machine Learning: Classification, regression, clustering
Data Analysis: Statistical analysis and trend identification
Data Visualization: Rich feature set for comprehensive charts
Sports Analytics: Talent evaluation and performance prediction
The dataset combines multiple data domains into a single, unified structure:
| Domain | Features Included | Description |
|---|---|---|
| Fighter Profiles | wrestler_name, age, weight_class, brand | Personal and physical attributes |
| Career Statistics | total_matches, years_active, win_percentage | Long-term performance metrics |
| Championship History | world_title_reigns, secondary_titles, tag_titles | Success and achievement tracking |
| Event Participation | main_evented_ppv, avg_matches_per_month | Schedule and exposure metrics |
| Popularity Metrics | social_media_followers, current_champion | Modern success indicators |
# Physical and Career Profile
['wrestler_name', 'age', 'weight_class', 'brand',
'debut_year', 'years_active', 'experience_level']
# In-Ring Performance
['total_matches', 'career_win_percentage', 'avg_matches_per_month',
'main_evented_ppv', 'current_champion']
# Championship Success
['world_title_reigns', 'secondary_title_reigns', 'tag_title_reigns',
'total_title_reigns', 'title_impact']
# Success Metrics
['social_media_followers_millions', 'popularity_tier',
'main_evented_ppv', 'current_champion']
This dataset provides realistic data cleaning scenarios:
# Raw categorical features needing encoding
['brand', 'weight_class', 'popularity_tier', 'experience_level']
# Derived features to create
df['title_impact'] = (df['world_title_reigns'] * 2 +
                      df['secondary_title_reigns'] * 1.5 +
                      df['tag_title_reigns'])
df['career_longevity'] = df['years_active'] / df['age']
Career length variations (R-Truth: 25 years)
Match frequency differences
Chronological validation (debut_year + years_active)
Statistical boundary checks
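The derived features and validation checks above can be sketched with pandas on a couple of toy rows (the column names follow the feature table; the row values are invented for illustration):

```python
import pandas as pd

# Toy rows; the real data has 70+ wrestlers with these columns.
df = pd.DataFrame({
    "wrestler_name": ["A", "B"],
    "debut_year": [2000, 2015],
    "years_active": [25, 9],
    "world_title_reigns": [2, 0],
    "secondary_title_reigns": [1, 2],
    "tag_title_reigns": [0, 1],
})

# Derived feature, as defined above.
df["title_impact"] = (df["world_title_reigns"] * 2 +
                      df["secondary_title_reigns"] * 1.5 +
                      df["tag_title_reigns"])

# Chronological validation: debut_year + years_active should not
# exceed the dataset year (2025 roster).
df["chronology_ok"] = df["debut_year"] + df["years_active"] <= 2025

print(df[["wrestler_name", "title_impact", "chronology_ok"]])
```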
# Primary: Popularity Tier Prediction
target = 'popularity_tier' # ['Main Event...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Sensor data set, radial forging at AFRC testbed
General information on the data set
Radial forging is widely used in industry to manufacture components for a broad range of sectors, including automotive, medical, aerospace, rail and industrial. The Advanced Forming Research Centre (AFRC) at the University of Strathclyde, Glasgow, houses a GFM SKK10/R radial forge that has been used as a testbed for this project. Using two pairs of hammers operating at 1200 strokes/min and providing a maximum forging force of 150 tons per hammer, the radial forge is capable of processing a range of metals, including steel, titanium and Inconel. Both hollow and solid material can be formed, with the added benefit of creating internal features on hollow parts using a mandrel. Parts can be formed at a range of temperatures from ambient temperature to 1200 °C.
For the provided data set, a total of 81 parts were forged over one day of operation. A machine failure occurred during the forging of part number 70, and this part was re-run once the malfunction had been fixed. Each forged part was then measured using a coordinate measuring machine (CMM) to provide dimensional output relative to a target specification and tolerances. The CMM records 18 dimensional measurements.
The aim of the measurement setup is to predict the quality (in terms of dimensional properties) of the forged part from the sensor measurements during the forging process.
Structure of the data
The sensor readings for the forging of the parts are provided in 81 csv files in the folder “Scope Traces”, named “Scope0001.csv” to “Scope0081.csv”. Each file contains the readings (columns) against time (rows). The first column displays the clock times (in milliseconds).
A commentary on the sensors is provided in the file “ForgedPartDataStructureSummaryv3.xlsx” (NOTE: Some columns do not have sensor descriptions as this information is not available).
The CMM data is provided in the file “CMMData.xlsx”.
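A minimal sketch of working with a scope trace is shown below. The toy CSV stands in for one "ScopeXXXX.csv" file: the first column is clock time in milliseconds, and the sensor column names here are hypothetical (the real ones are described in ForgedPartDataStructureSummaryv3.xlsx):

```python
import io
import pandas as pd

# Toy stand-in for one scope-trace file.
trace = pd.read_csv(io.StringIO(
    "time_ms,sensor_1,sensor_2\n"
    "0,0.10,5.0\n"
    "1,0.12,5.1\n"
    "2,0.11,4.9\n"
))

# Summarize each sensor channel per part; per-part features like
# these could then be regressed against the 18 CMM measurements.
features = trace.drop(columns="time_ms").agg(["mean", "max"])
print(features)
```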
Further Information
For an introduction and tutorial to this data, a set of Jupyter notebooks is available here:
https://github.com/harislulic/Strathcylde_AFRC_machine_learning_tutorials/releases/tag/v2.0
These notebooks contain Python code and document example machine learning tasks and analyses of this data set.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.
The social networks of developers who starred popular machine learning and web development repositories (those with at least 10 stars) up to August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest component (with at least 10 users) of each graph.
Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification).
The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict from the ego-net whether the ego user plays a single game or multiple games. Players who play a single game usually have a denser ego-net.
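The density intuition behind the Twitch task can be sketched in a few lines of plain Python (toy edge list, not actual Twitch data):

```python
# Toy ego-net: mutual friendships among four users; density is the
# ratio of existing edges to possible edges in an undirected graph.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
nodes = {n for e in edges for n in e}

n = len(nodes)
density = 2 * len(edges) / (n * (n - 1))
print(density)  # 4 edges among 4 nodes -> 4/6 ≈ 0.667
```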
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed, undirected, or multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general-purpose STL (Standard Template Library)-like library developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.