License: Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project investigating the potential correlation between the amount of time an individual spends on social media and its impact on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The project was developed in a Jupyter notebook on Google Colab:
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
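The pipeline described above (survey answers in, a "should seek professional help" prediction out) can be sketched with scikit-learn. Everything below is illustrative: the feature columns, the labelling rule, and all values are invented, not taken from the actual survey.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey: the columns and the labelling rule are
# invented here; the real project uses the team's survey questions.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 200)            # daily hours on social media
distraction = rng.integers(1, 6, 200)      # a 1-5 Likert-scale answer
y = (hours + distraction > 8).astype(int)  # toy "should seek help" label

X = np.column_stack([hours, distraction])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

Any of the listed libraries would slot into this shape: Pandas for loading the survey CSV, Matplotlib/Seaborn for exploring the answers, and scikit-learn for the model itself.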
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
A COVID-19 misinformation / fake news / rumor / disinformation dataset collected from online social media and news websites.
Usage notes:
Misinformation detection, classification, tracking, and prediction.
Misinformation sentiment analysis.
Rumor veracity classification and comment stance classification.
Rumor tracking and social network analysis.
Data pre-processing and data analysis code is available at https://github.com/MickeysClubhouse/COVID-19-rumor-dataset. Please see the full information in our GitHub repository.
Cite us:
Cheng, Mingxi, et al. "A COVID-19 Rumor Dataset." Frontiers in Psychology 12 (2021): 1566.
@article{cheng2021covid,
  title={A COVID-19 Rumor Dataset},
  author={Cheng, Mingxi and Wang, Songli and Yan, Xiaofeng and Yang, Tianqi and Wang, Wenshuo and Huang, Zehao and Xiao, Xiongye and Nazarian, Shahin and Bogdan, Paul},
  journal={Frontiers in Psychology},
  volume={12},
  pages={1566},
  year={2021},
  publisher={Frontiers}
}
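A minimal sketch of the veracity-classification use case. The texts and labels below are invented for illustration; the real rumor texts and annotations ship with the dataset (see the GitHub repository).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy texts and labels standing in for the dataset's rumor entries.
texts = ["garlic cures covid", "vaccines reduce severe illness",
         "5g towers spread the virus", "masks lower transmission risk"] * 5
labels = [0, 1, 0, 1] * 5  # 0 = false rumor, 1 = true statement

# TF-IDF features + logistic regression is a common baseline for this task.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["garlic cures covid"])[0])
```

The same pipeline shape applies to the other listed tasks (stance classification, sentiment analysis) by swapping the label column.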
GitHub User Analysis 2019 for Graph Dataset
This is a large social network of GitHub developers, collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted from each user's location, starred repositories, employer, and e-mail address. The task associated with the graph is binary node classification: predict whether a GitHub user is a web developer or a machine learning developer. This target feature was derived from the job title of each user.
GitHub User Analysis 2019 for Graph Dataset tasks:
1. Can you predict whether a 2019 GitHub user is a software engineer or an AI engineer based on their user analysis and posting tendency?
2. Can you predict whether a 2019 GitHub user would follow an AI researcher based on their user analysis and posting tendency?
3. Can you predict whether a 2019 GitHub user would produce good publications based on their user analysis and posting tendency?
Try to visualize the user analysis and tendency data, and look for patterns in it.
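The node-classification task on the mutual-follower graph can be sketched without any graph library: classify an unlabelled node by the majority label of its labelled neighbours. The edge list and labels below are made up; the real dataset ships its own edge and feature files.

```python
from collections import Counter, defaultdict

# Toy mutual-follower graph; edges and labels are invented for illustration.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = {0: "web", 1: "web", 2: "web", 4: "ml", 5: "ml"}  # node 3 unlabelled

adj = defaultdict(set)
for u, v in edges:          # mutual follower edges are undirected
    adj[u].add(v)
    adj[v].add(u)

def predict(node):
    # Majority label among the labelled neighbours of `node`.
    votes = Counter(labels[n] for n in adj[node] if n in labels)
    return votes.most_common(1)[0][0]

print(predict(3))  # neighbours 4, 5 ("ml") outvote neighbour 2 ("web")
```

Real baselines for this dataset typically use the vertex features and a graph neural network, but the neighbourhood-vote idea above is the simplest instance of the task.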
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
This data set contains combined on-court performance data for NBA players in the 2016-2017 season, alongside salary, Twitter engagement, and Wikipedia traffic data.
Further information can be found in a series of articles for IBM Developerworks: "Explore valuation and attendance using data science and machine learning" and "Exploring the individual NBA players".
Slides from a talk about this dataset, given at Strata in March 2018, are also available.
Further reading on this dataset is available in Chapter 6 of the book Pragmatic AI: An Introduction to Cloud-Based Machine Learning, and in lesson 9 of Essential Machine Learning and AI with Python and Jupyter Notebook.
You can watch a breakdown of using cluster analysis on the Pragmatic AI YouTube channel
Learn to deploy a Kaggle project into a production machine learning service (scikit-learn + Flask + container) by reading Python for DevOps: Learn Ruthlessly Effective Automation, Chapter 14: MLOps and Machine Learning Engineering.
Use social media to predict a winning season with this notebook: https://github.com/noahgift/core-stats-datascience/blob/master/Lesson2_7_Trends_Supervized_Learning.ipynb
Learn to use the cloud for data analysis.
Data sources include ESPN, Basketball-Reference, Twitter, FiveThirtyEight, and Wikipedia. The source code for this dataset (in Python and R) can be found on GitHub. Links to more writing can be found at noahgift.com.
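A typical first step with this dataset is joining the on-court performance table with the salary (or Twitter-engagement) table on the player name and checking the correlation. The column names and numbers below are assumptions for illustration, not verified against the actual CSVs.

```python
import pandas as pd

# Illustrative frames; real files would be loaded with pd.read_csv(...).
stats = pd.DataFrame({"PLAYER": ["A", "B", "C"], "WINS_RPM": [10.2, 5.1, 1.3]})
salary = pd.DataFrame({"PLAYER": ["A", "B", "C"], "SALARY": [25e6, 12e6, 3e6]})

# Join on the shared player column, then correlate performance with pay.
df = stats.merge(salary, on="PLAYER")
print(df["WINS_RPM"].corr(df["SALARY"]))
```

The same merge pattern extends to the Wikipedia-traffic and Twitter-engagement tables, which is how the valuation/attention analyses in the linked articles are built.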
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) a focus on medical news articles and blog posts, as opposed to social media posts or political discussions; (2) the provision of multiple modalities (besides the full-texts of the articles, there are also images and videos), enabling research on multimodal approaches; (3) the mapping of articles to fact-checked claims (with manual as well as predicted labels); and (4) source credibility labels for 95% of all articles, plus other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and a descriptive analysis of the dataset in the form of Jupyter notebooks.
To obtain access to the full dataset (in CSV format), please request access by following the instructions provided below.
Note: Please also check our MultiClaim dataset, which provides a more recent, larger, and highly multilingual dataset of fact-checked claims, social media posts, and relations between them.
References
If you use this dataset in any publication, project, tool, or other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
To create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
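The combination of claim presence and article stance into a partial article-claim pair veracity, as described above, can be sketched as a small function. The label vocabulary used here ("supports", "rejects", "neutral") and the rating strings are assumptions for illustration, not the dataset's exact schema.

```python
# Hedged sketch: derive a partial article-claim veracity from the claim's
# fact-checked rating, the claim-presence label, and the article-stance label.
def article_claim_veracity(claim_rating, claim_present, stance):
    """Return a partial veracity for the article-claim pair, or None."""
    if not claim_present or stance == "neutral":
        return None  # no conclusion can be drawn for this pair
    if stance == "supports":
        return claim_rating  # the article endorses the claim's rating
    if stance == "rejects":
        return {"true": "false", "false": "true"}.get(claim_rating)
    return None

print(article_claim_veracity("false", True, "supports"))
```

Note that, consistent with the methodology above, this labels only the article-claim pair, never the article as a whole.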
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files:
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
The dataset provides specifically these entity
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
This is the repository of all the research data for the PhD thesis of doctoral candidate Nan BAI from the Faculty of Architecture and the Built Environment at Delft University of Technology, entitled 'Sensing the Cultural Significance with AI for Social Inclusion: A Computational Spatiotemporal Network-based Framework of Heritage Knowledge Documentation using User-Generated', to be defended on October 5th, 2023.
Social Inclusion has been growing as a goal in heritage management. Whereas the 2011 UNESCO Recommendation on the Historic Urban Landscape (HUL) called for tools of knowledge documentation, social media already functions as a platform for online communities to actively involve themselves in heritage-related discussions. Such discussions happen both in “baseline scenarios” when people calmly share their experiences about the cities they live in or travel to, and in “activated scenarios” when radical events trigger their emotions. To organize, process, and analyse the massive unstructured multi-modal (mainly images and texts) user-generated data from social media efficiently and systematically, Artificial Intelligence (AI) is shown to be indispensable. This thesis explores the use of AI in a methodological framework to include the contribution of a larger and more diverse group of participants with user-generated data. It is an interdisciplinary study integrating methods and knowledge from heritage studies, computer science, social sciences, network science, and spatial analysis. AI models were applied, nurtured, and tested, helping to analyse the massive information content to derive the knowledge of cultural significance perceived by online communities. The framework was tested in case study cities including Venice, Paris, Suzhou, Amsterdam, and Rome for the baseline and/or activated scenarios. The AI-based methodological framework proposed in this thesis is shown to be able to collect information in cities and map the knowledge of the communities about cultural significance, fulfilling the expectation and requirement of HUL, useful and informative for future socially inclusive heritage management processes.
Some parts of this data are published as GitHub repositories:
WHOSe Heritage
The data of Chapter_3_Lexicon is published at https://github.com/zzbn12345/WHOSe_Heritage, which also hosts the code for the paper WHOSe Heritage: Classification of UNESCO World Heritage Statements of "Outstanding Universal Value" Documents with Soft Labels, published in Findings of EMNLP 2021 (https://aclanthology.org/2021.findings-emnlp.34/).
Heri Graphs
The data of Chapter_4_Datasets is published at https://github.com/zzbn12345/Heri_Graphs, which also hosts the code and dataset for the paper Heri-Graphs: A Dataset Creation Framework for Multi-modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media, published in the ISPRS International Journal of Geo-Information. It covers the collection, preprocessing, and rearrangement of data related to heritage values and attributes in three cities with canal-related UNESCO World Heritage properties: Venice, Suzhou, and Amsterdam.
Stones Venice
The data of Chapter_5_Mapping is published at https://github.com/zzbn12345/Stones_Venice, which also hosts the code and dataset for the paper Screening the stones of Venice: Mapping social perceptions of cultural significance through graph-based semi-supervised classification, published in the ISPRS Journal of Photogrammetry and Remote Sensing. It covers the mapping of cultural significance in the city of Venice.
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
There is a lot more that we can attain from social media sentiment and data than mere likes and shares, especially where health care is concerned. This dataset is part of the data collected for the Vaccine Hesitancy challenge on JOGL. We believe it is important to capture the views and trends of the public, and social media sites like Twitter provide a good window into this area.
We collected all tweets containing the search string "vaccination". Along with the tweet text, we downloaded the date and time when the tweet was published and the location of the user (if provided). We also downloaded the user id, follower ids, and friend ids. The followers of a user A are those users who receive messages from user A; the friends of user A are those users from whom user A receives messages. Thus, information flows from a user to their followers. We collected tweets using the open-source tool TWINT (https://github.com/twintproject) and a Python script.
In contrast to the open Twitter Search API, which only allows one to query tweets posted within the last seven days, TWINT makes it possible to collect a much larger sample of Twitter posts, spanning several years. We queried TWINT for different key terms related to the topic of vaccination, ranging from the year 2006 to the 30th of November 2019, and stored the results in an aggregated CSV file.
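The aggregation step described above (filter collected tweets to the stated date window, write one CSV) might look like the sketch below. The field names and example rows are invented; only the date window comes from the description.

```python
import csv
from datetime import date

# Invented stand-in for the collected tweets; real rows come from TWINT.
tweets = [
    {"date": "2005-12-31", "text": "before the window"},
    {"date": "2019-06-01", "text": "vaccination debate"},
    {"date": "2019-12-05", "text": "after the window"},
]

# Keep only tweets inside the stated collection window.
start, end = date(2006, 1, 1), date(2019, 11, 30)
kept = [t for t in tweets if start <= date.fromisoformat(t["date"]) <= end]

# Store the filtered rows in one aggregated CSV file.
with open("vaccination_tweets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "text"])
    writer.writeheader()
    writer.writerows(kept)

print(len(kept))
```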
We wouldn't be here without the help of others.
To our knowledge, no active program is currently carrying out qualitative analysis of Twitter data for sentiment associated with vaccination. However, a number of studies have been carried out to analyse Twitter for social media trends on vaccination.
The dataset can be used for analyses including:
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
The InstaFake Dataset comprises anonymized Instagram user data collected by Fatih Cagatay Akyon and Esat Kalfaoglu over the second half of 2018. We are releasing this dataset publicly to aid the research community in making advancements in machine-learning-based social media analysis.
Dataset with annotated 12-lead ECG records. The exams were taken in 811 counties in the state of Minas Gerais, Brazil by the Telehealth Network of Minas Gerais (TNMG) between 2010 and 2016, and organized by the CODE (Clinical Outcomes in Digital Electrocardiography) group.
Requesting access
Researchers affiliated with educational or research institutions may request access to this dataset. Requests will be analyzed on an individual basis and should contain: the name of the PI and host organisation; contact details (including your name and email); and the scientific purpose of the data access request. If approved, a data user agreement will be forwarded to the researcher who made the request (through the email that was provided). After the agreement has been signed (by the researcher or by the research institution), access to the dataset will be granted.
Openly available subset: A subset of this dataset (with 15% of the patients) is openly available. See: "CODE-15%: a large scale annotated dataset of 12-lead ECGs" (https://doi.org/10.5281/zenodo.4916206).
Content
The folder contains:
A column-separated file containing basic patient attributes.
The ECG waveforms in the WFDB format.
Additional references
The dataset is described in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network", https://www.nature.com/articles/s41467-020-15432-4. Related publications also using this dataset are:
[1] G. Paixao et al., “Validation of a Deep Neural Network Electrocardiographic-Age as a Mortality Predictor: The CODE Study,” Circulation, vol. 142, no. Suppl_3, pp. A16883–A16883, Nov. 2020, doi: 10.1161/circ.142.suppl_3.16883.
[2] A. L. P. Ribeiro et al., “Tele-electrocardiography and big data: The CODE (Clinical Outcomes in Digital Electrocardiography) study,” Journal of Electrocardiology, Sep. 2019, doi: 10/gf7pwg.
[3] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. P. Ribeiro, and W. Meira Jr, “Explaining end-to-end ECG automated diagnosis using contextual features,” in Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), Ghent, Belgium, Sep. 2020, vol. 12461, pp. 204–219, doi: 10.1007/978-3-030-67670-4_13.
[4] D. M. Oliveira, A. H. Ribeiro, J. A. O. Pedrosa, G. M. M. Paixao, A. L. Ribeiro, and W. Meira Jr, “Explaining black-box automated electrocardiogram classification to cardiologists,” in 2020 Computing in Cardiology (CinC), 2020, vol. 47, doi: 10.22489/CinC.2020.452.
[5] G. M. M. Paixão et al., “Evaluation of mortality in bundle branch block patients from an electronic cohort: Clinical Outcomes in Digital Electrocardiography (CODE) study,” Journal of Electrocardiology, Sep. 2019, doi: 10/dcgk.
[6] G. M. M. Paixão et al., “Evaluation of Mortality in Atrial Fibrillation: Clinical Outcomes in Digital Electrocardiography (CODE) Study,” Global Heart, vol. 15, no. 1, p. 48, Jul. 2020, doi: 10.5334/gh.772.
[7] G. M. M. Paixão et al., “Electrocardiographic Predictors of Mortality: Data from a Primary Care Tele-Electrocardiography Cohort of Brazilian Patients,” Hearts, vol. 2, no. 4, Dec. 2021, doi: 10.3390/hearts2040035.
[8] G. M. Paixão et al., “ECG-age from artificial intelligence: A new predictor for mortality? The CODE (Clinical Outcomes in Digital Electrocardiography) study,” Journal of the American College of Cardiology, vol. 75, no. 11 Supplement 1, p. 3672, 2020, doi: 10.1016/S0735-1097(20)34299-6.
[9] E. M. Lima et al., “Deep neural network estimated electrocardiographic-age as a mortality predictor,” Nature Communications, vol. 12, 2021, doi: 10.1038/s41467-021-25351-7.
[10] W. Meira Jr, A. L. P. Ribeiro, D. M. Oliveira, and A. H. Ribeiro, “Contextualized Interpretable Machine Learning for Medical Diagnosis,” Communications of the ACM, 2020, doi: 10.1145/3416965.
[11] A. H. Ribeiro et al., “Automatic diagnosis of the 12-lead ECG using a deep neural network,” Nature Communications, vol. 11, no. 1, p. 1760, 2020, doi: 10/drkd.
[12] A. H. Ribeiro et al., “Automatic Diagnosis of Short-Duration 12-Lead ECG using a Deep Convolutional Network,” Machine Learning for Health (ML4H) Workshop at NeurIPS, 2018.
[13] A. H. Ribeiro et al., “Automatic 12-lead ECG classification using a convolutional network ensemble,” 2020, doi: 10.22489/CinC.2020.130.
[14] V. Sangha et al., “Automated Multilabel Diagnosis on Electrocardiographic Images and Signals,” medRxiv, Sep. 2021, doi: 10.1101/2021.09.22.21263926.
[15] S. Biton et al., “Atrial fibrillation risk prediction from the 12-lead ECG using digital biomarkers and deep representation learning,” European Heart Journal - Digital Health, 2021, doi: 10.1093/ehjdh/ztab071.
Code
The following GitHub repositories perform analyses that use this dataset:
https://github.com/antonior92/automatic-ecg-diagnosis
https://github.com/antonior92/ecg-age-prediction
Related Datasets
CODE-test: An annotated 12-lead ECG dataset (https://doi.org/10.5281/zenodo.3765780)
CODE-15%: a large scale annotated dataset of 12-lead ECGs (https://doi.org/10.5281/zenodo.4916206)
Sami-Trop: 12-lead ECG traces with age and mortality annotations (https://doi.org/10.5281/zenodo.4905618)
Ethics declarations
The CODE Study was approved by the Research Ethics Committee of the Universidade Federal de Minas Gerais, protocol 49368496317.7.0000.5149.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found that the federal forest inventory of Lower Saxony, Germany, represents an unseen treasure of annotated samples for training data generation. The respective 20-cm color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking, as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples which are supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.
Version history
v1.0.2 - Minor bug fix multi label JSON file
v1.0.1 - Minor bug fixes in multi label JSON file and description file
v1.0.0 - First release
Citation
Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth Syst. Sci. Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitLab and GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark) and the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) (https://github.com/DFKI/treesatai_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. An example entry for an image sample with respective proportions of about 94% Abies and 6% Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]
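A small sketch of parsing such an entry. The JSON structure is copied from the example above; the 0.05 area-share cut-off is an illustrative choice for turning the proportions into multi-labels, not one prescribed by the archive.

```python
import json

# One entry of the multi-label JSON, copied from the example above.
labels_json = ('{"Abies_alba_3_834_WEFL_NLF.tif": '
               '[["Abies", 0.93771], ["Larix", 0.06229]]}')
labels = json.loads(labels_json)

threshold = 0.05  # illustrative cut-off on the area share per genus
for filename, genera in labels.items():
    kept = [genus for genus, share in genera if share >= threshold]
    print(filename, kept)
```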
The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs, or aerial imagery from different time steps are very welcome. This helps the research community develop better deep learning and machine learning models for forest applications. If you have questions or want to share code/results/publications using the archive, feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TUB Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Project publications
Ahlswede, S., Schulz, C., Gava, C., Helber, P., Bischke, B., Förster, M., Arias, F., Hees, J., Demir, B., and Kleinschmit, B.: TreeSatAI Benchmark Archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing, Earth System Science Data, 15, 681–695, https://doi.org/10.5194/essd-15-681-2023, 2023.
Schulz, C., Förster, M., Vulova, S. V., Rocha, A. D., and Kleinschmit, B.: Spectral-temporal traits in Sentinel-1 C-band SAR and Sentinel-2 multispectral remote sensing time series for 61 tree species in Central Europe. Remote Sensing of Environment, 307, 114162, https://doi.org/10.1016/j.rse.2024.114162, 2024.
Conference contributions
Ahlswede, S. Madam, N.T., Schulz, C., Kleinschmit, B., and Demіr, B.: Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.48550/arXiv.2201.07495, 2022.
Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, https://doi.org/10.1109/IGARSS46834.2022.9884173, 2022.
Schulz, C., Förster, M., Vulova, S., and Kleinschmit, B.: The temporal fingerprints of common European forest types from SAR and optical remote sensing data, AGU Fall Meeting, New Orleans, USA, 2021.
Kleinschmit, B., Förster, M., Schulz, C., Arias, F., Demir, B., Ahlswede, S., Aksoy, A.K., Ha Minh, T., Hees, J., Gava, C., Helber, P., Bischke, B., Habelitz, P., Frick, A., Klinke, R., Gey, S., Seidel, D., Przywarra, S., Zondag, R., and Odermatt B.: Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests, Living Planet Symposium, Bonn, Germany, 2022.
Schulz, C., Förster, M., Vulova, S., Gränzig, T., and Kleinschmit, B.: Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series, ForestSAT, Berlin, Germany, 2022.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
One of the aims of the Media Content Analysis Lab (MCAL) is to provide an overview of the field of content analysis research. To this end, the MCALentory is an inventory of content analytical studies in the Netherlands, 2000-2023. This inventory serves multiple purposes. First, it can be used as a source of inspiration for designing future content analytical studies, for example in terms of the operationalization of key concepts. Second, the archive can be used for replication studies and meta-analyses. Third, the data can potentially be used as training data for machine learning algorithms.
To get an overview of what is in the data and what you can do with it, please first read the "data story" Introduction to MCAL by Annelien Van Remoortere. More examples can be found here. Below we describe how this dataset was made; you can skip this if you are not interested in the "making of" story.
In a first step, we systematically collected data (in this case scientific articles, see description below). In the next step, we coded the collected material on several features related to the content, type of content analysis conducted, reporting of reliability/quality of content analysis, and the availability of corpora and datasets. Special attention was devoted to the degree to which authors adhere to the FAIR principles.
To make an inventory of existing media content analysis studies that focus (at least partly) on the Netherlands, we first selected the top 30 communication science journals according to Web of Science (in 2021). We selected papers published from 2000 until 2023. All papers that looked at traditional media outlets (television, newspapers), new media (online news outlets) and social media (Twitter, Instagram, Facebook) in the Netherlands were included. In total, we collected and annotated 196 articles.
For every journal, a search was performed in Google Scholar using the query: "content analysis" media netherlands source:"selected journal" site:site of journal. Next, all Google Scholar hits were manually checked based on the title, abstract and, if there was still doubt, the method section of the paper. The results are available in CSV format, see the most recent Mcalentory.csv file in the Assets section of this site. For questions about the data collection methods and the content of this file please contact annelien.vanremoortere@wur.nl or rens.vliegenthart@wur.nl.
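As a minimal, hypothetical sketch of working with the inventory (the column names below are illustrative, not the actual Mcalentory.csv header), the CSV can be loaded and summarized with pandas:

```python
import io
import pandas as pd

# Illustrative stand-in for Mcalentory.csv; the real file and its
# column names are in the Assets section of this site.
csv_text = (
    "journal,year,media_type,analysis_type\n"
    "Journalism Studies,2015,newspapers,manual\n"
    "New Media & Society,2021,Twitter,automated\n"
    "New Media & Society,2022,Instagram,manual\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Count annotated articles per type of content analysis.
print(df["analysis_type"].value_counts())
```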
The controlled vocabularies created for this project have all been published in the https://w3id.org/odissei/cv/ namespace. The other MCAL RDF knowledge graphs published here use the mcal: URI prefix https://w3id.org/odissei/ns/mcal/. For an explanation of how URIs starting with https://w3id.org/odissei/ are redirected, see https://github.com/odissei-data/w3id.org/tree/master/odissei.
The dataset consists of the following named graphs, see the Graphs section on this site:
- The main dataset as mcal:graph/mcalentory. The TriplyETL code to convert the CSV file above into RDF is publicly available on the ODISSEI GitHub. This graph has been validated against these constraints expressed in SHACL.
- A controlled vocabulary https://w3id.org/odissei/cv/contentFeature/v0.1/ for the content features has been generated from this Google sheet, for the conversion code see the same codebase as above.
- https://w3id.org/odissei/cv/contentAnalysisType/v0.1/, a copy of the manually created vocabulary for content analysis types
- https://w3id.org/odissei/cv/researchQuestionType/v0.1/, a copy of the manually created vocabulary for research questions types
The project is now wrapping up, and the data as represented here should be correct to the best of our knowledge. For comments or questions on the MCAL knowledge graph, feel free to contact jacco.van.ossenbruggen@vu.nl or angelica@odissei-data.nl.
For those with an account on this platform: This dataset is published by the most recent TriplyETL pipeline listed here: internal link
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
General information on the data set
The data set was generated at the ZeMA testbed. A working cycle lasts 2.8 s and consists of a forward stroke, a waiting time, and a return stroke. The data set does not contain entire working cycles; only one second of the return stroke of each working cycle is included.
Structure of the data
Allocation of the pages to the sensors
page 1: microphone
page 2: acceleration plain bearing
page 3: acceleration piston rod
page 4: acceleration ball bearing
page 5: axial force
page 6: pressure
page 7: velocity
page 8: active current
page 9: motor current phase 1
page 10: motor current phase 2
page 11: motor current phase 3
Remark
The datasets are not in SI units. For conversion, you can use the PDF documentation.
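The page-to-sensor allocation above can be captured directly in code. A minimal sketch follows; the conversion function uses placeholder `scale` and `offset` parameters, since the real per-sensor conversion factors are given only in the PDF documentation:

```python
# Mapping of data-set pages to sensors, as listed above.
PAGE_TO_SENSOR = {
    1: "microphone",
    2: "acceleration plain bearing",
    3: "acceleration piston rod",
    4: "acceleration ball bearing",
    5: "axial force",
    6: "pressure",
    7: "velocity",
    8: "active current",
    9: "motor current phase 1",
    10: "motor current phase 2",
    11: "motor current phase 3",
}

def to_si(raw_value, scale, offset=0.0):
    """Convert a raw sensor reading to SI units.

    `scale` and `offset` are placeholders; look up the real
    conversion factors for each sensor in the PDF documentation.
    """
    return raw_value * scale + offset

print(PAGE_TO_SENSOR[6])  # -> pressure
```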
Further information
For an introduction and tutorial to this data, a set of Jupyter notebooks is available here. These notebooks contain Python code and document example machine learning tasks and analyses of this data set. In the near future, they will be extended to also include uncertainties in the input data.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Description This dataset contains a collection of short reviews extracted from Letterboxd, a popular movie social networking site. Each review is a brief commentary, consisting of user-generated text expressing opinions and sentiments about various movies. The dataset covers a wide range of films from different genres and time periods, making it a valuable resource for sentiment analysis, natural language processing, and other related tasks.
Dataset Highlights - Short reviews from real users: The dataset comprises genuine reviews shared by users, reflecting diverse opinions and emotions towards the movies they have watched. - Movie diversity: The reviews cover a vast array of films, including classics, recent releases, and cult favorites, providing a rich and diverse dataset for analysis. - Potential Applications: This dataset can be used for sentiment analysis, emotion detection, and other text-based machine learning tasks, enabling researchers and practitioners to gain insights into movie reception and audience preferences.
Scraping code: here
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
The WWE Superstar Popularity Prediction Dataset is a comprehensive collection of professional wrestling data designed for machine learning, data analysis, and sports analytics projects. This dataset captures the complete ecosystem of WWE superstars, their careers, and performance metrics in a unified structure.
There are no WWE Superstar datasets currently uploaded here on Kaggle, which is why I wanted to create a new dataset for the current WWE roster that can be used to determine each superstar's popularity tier (main eventer, mid-carder, or jobber) from their career statistics and performance metrics using machine learning algorithms.
This is my first published dataset. Upvotes are really appreciated.
The full GitHub repository containing all notebooks with clear documentation: GitHub
This dataset was specifically designed to meet the needs of modern data science projects:
Fighters: 70+ WWE superstars with detailed profiles
Fights: Career match statistics and performance metrics
Events: Pay-per-view main events and championship history
Context: Brand affiliations, weight classes, and career timelines
Raw Data Structure: Contains natural variations and real-world data challenges
Missing Value Opportunities: Some fields intentionally sparse for cleaning practice
Data Type Diversity: Mixed numerical, categorical, and encoded features
Outlier Detection: Natural variations in career statistics
Current Roster: 2025 WWE superstars from RAW and SmackDown
Active Champions: Current title holders and recent changes
Modern Metrics: Social media integration and digital presence
Career Progression: Ongoing career tracking
Machine Learning: Classification, regression, clustering
Data Analysis: Statistical analysis and trend identification
Data Visualization: Rich feature set for comprehensive charts
Sports Analytics: Talent evaluation and performance prediction
The dataset combines multiple data domains into a single, unified structure:
| Domain | Features Included | Description |
|---|---|---|
| Fighter Profiles | wrestler_name, age, weight_class, brand | Personal and physical attributes |
| Career Statistics | total_matches, years_active, win_percentage | Long-term performance metrics |
| Championship History | world_title_reigns, secondary_titles, tag_titles | Success and achievement tracking |
| Event Participation | main_evented_ppv, avg_matches_per_month | Schedule and exposure metrics |
| Popularity Metrics | social_media_followers, current_champion | Modern success indicators |
# Physical and Career Profile
['wrestler_name', 'age', 'weight_class', 'brand',
'debut_year', 'years_active', 'experience_level']
# In-Ring Performance
['total_matches', 'career_win_percentage', 'avg_matches_per_month',
'main_evented_ppv', 'current_champion']
# Championship Success
['world_title_reigns', 'secondary_title_reigns', 'tag_title_reigns',
'total_title_reigns', 'title_impact']
# Success Metrics
['social_media_followers_millions', 'popularity_tier',
'main_evented_ppv', 'current_champion']
This dataset provides realistic data cleaning scenarios:
# Raw categorical features needing encoding
['brand', 'weight_class', 'popularity_tier', 'experience_level']
# Derived features to create
df['title_impact'] = (df['world_title_reigns'] * 2 +
                      df['secondary_title_reigns'] * 1.5 +
                      df['tag_title_reigns'])
df['career_longevity'] = df['years_active'] / df['age']
Career length variations (R-Truth: 25 years)
Match frequency differences
Chronological validation (debut_year + years_active)
Statistical boundary checks
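The derived features and validation checks above can be sketched with pandas on a couple of toy rows (the column names follow the feature table; the row values are invented for illustration):

```python
import pandas as pd

# Toy rows; the real data has 70+ wrestlers with these columns.
df = pd.DataFrame({
    "wrestler_name": ["A", "B"],
    "debut_year": [2000, 2015],
    "years_active": [25, 9],
    "world_title_reigns": [2, 0],
    "secondary_title_reigns": [1, 2],
    "tag_title_reigns": [0, 1],
})

# Derived feature, as defined above.
df["title_impact"] = (df["world_title_reigns"] * 2 +
                      df["secondary_title_reigns"] * 1.5 +
                      df["tag_title_reigns"])

# Chronological validation: debut_year + years_active should not
# exceed the dataset year (2025 roster).
df["chronology_ok"] = df["debut_year"] + df["years_active"] <= 2025

print(df[["wrestler_name", "title_impact", "chronology_ok"]])
```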
# Primary: Popularity Tier Prediction
target = 'popularity_tier' # ['Main Event...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Sensor data set, radial forging at AFRC testbed
General information on the data set
Radial forging is widely used in industry to manufacture components for a broad range of sectors, including automotive, medical, aerospace, rail and industrial. The Advanced Forming Research Centre (AFRC) at the University of Strathclyde, Glasgow, houses a GFM SKK10/R radial forge that has been used as a testbed for this project. Using two pairs of hammers operating at 1200 strokes/min and providing a maximum forging force of 150 tons per hammer, the radial forge is capable of processing a range of metals, including steel, titanium and Inconel. Both hollow and solid material can be formed, with the added benefit of creating internal features on hollow parts using a mandrel. Parts can be formed at a range of temperatures from ambient temperature to 1200 °C.
For the provided data set, a total of 81 parts were forged over one day of operation. A machine failure occurred during the forging of part number 70, and this part was re-run once the malfunction had been fixed. Each forged part was then measured using a coordinate measuring machine (CMM) to provide dimensional output relative to a target specification and tolerances. The CMM records 18 dimensional measurements.
The aim of the measurement setup is to predict the quality (in terms of dimensional properties) of the forged part from the sensor measurements during the forging process.
Structure of the data
The sensor readings for the forging of the parts are provided in 81 csv files in the folder “Scope Traces”, named “Scope0001.csv” to “Scope0081.csv”. Each file contains the readings (columns) against time (rows). The first column displays the clock times (in milliseconds).
A commentary on the sensors is provided in the file “ForgedPartDataStructureSummaryv3.xlsx” (NOTE: Some columns do not have sensor descriptions as this information is not available).
The CMM data is provided in the file “CMMData.xlsx”.
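A minimal sketch of working with a scope trace is shown below. The toy CSV stands in for one "ScopeXXXX.csv" file: the first column is clock time in milliseconds, and the sensor column names here are hypothetical (the real ones are described in ForgedPartDataStructureSummaryv3.xlsx):

```python
import io
import pandas as pd

# Toy stand-in for one scope-trace file.
trace = pd.read_csv(io.StringIO(
    "time_ms,sensor_1,sensor_2\n"
    "0,0.10,5.0\n"
    "1,0.12,5.1\n"
    "2,0.11,4.9\n"
))

# Summarize each sensor channel per part; per-part features like
# these could then be regressed against the 18 CMM measurements.
features = trace.drop(columns="time_ms").agg(["mean", "max"])
print(features)
```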
Further Information
For an introduction and tutorial to this data, a set of Jupyter notebooks is available here:
https://github.com/harislulic/Strathcylde_AFRC_machine_learning_tutorials/releases/tag/v2.0
These notebooks contain Python code and document example machine learning tasks and analyses of this data set.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The ego-nets of Eastern European users collected from the music streaming service Deezer in February 2020. Nodes are users and edges are mutual follower relationships. The related task is the prediction of gender for the ego node in the graph.
The social networks of developers who starred popular machine learning and web development repositories (those with at least 10 stars) up to August 2019. Nodes are users and links are follower relationships. The task is to decide whether a social network belongs to web or machine learning developers. We only included the largest component (with at least 10 users) of each graph.
Discussion and non-discussion based threads from Reddit which we collected in May 2018. Nodes are Reddit users who participate in a discussion and links are replies between them. The task is to predict whether a thread is discussion based or not (binary classification).
The ego-nets of Twitch users who participated in the partnership program in April 2018. Nodes are users and links are friendships. The binary classification task is to predict from the ego-net whether the ego user plays a single game or multiple games. Players who play a single game usually have a denser ego-net.
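The density intuition behind the Twitch task can be sketched in a few lines of plain Python (toy edge list, not actual Twitch data):

```python
# Toy ego-net: mutual friendships among four users; density is the
# ratio of existing edges to possible edges in an undirected graph.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
nodes = {n for e in edges for n in e}

n = len(nodes)
density = 2 * len(edges) / (n * (n - 1))
print(density)  # 4 edges among 4 nodes -> 4/6 ≈ 0.667
```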
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed, undirected, or multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general-purpose STL (Standard Template Library)-like library developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.