License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset captures cultural engagement, social behavior, and interaction patterns of individuals in smart communities. Designed with privacy at its core, it aggregates anonymized data from smart devices, social media activity, and event participation logs.
It includes behavioral metrics such as event attendance frequency, social interactions, and cultural practices, along with contextual data like language usage, time-based activity patterns, and anonymized location zones. Privacy features, such as user consent and anonymization flags, ensure ethical data usage.
The dataset supports the development of culturally aware recommendation systems and can be used for tasks like event participation prediction and personalized cultural content recommendation.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed for cutting-edge NLP research in resume parsing, job classification, and ATS system development. Below are extensive details and several ready-made diagrams you can include in your Kaggle upload (just save and upload as “Additional Files” or use them in your dataset description).
| Field | Description | Example/Data Type |
|---|---|---|
| ResumeID | Unique, anonymized string | "DIS4JE91Z..." (string) |
| Category | Tech job category/label | "DevOps Engineer" |
| Name | Anonymized (Faker-generated) name | "Jordan Patel" |
| Email | Anonymized email address | "jpatel@example.com" |
| Phone | Anonymized phone number | "+1-555-343-2123" |
| Location | City, country or region (anonymized) | "Austin, TX, USA" |
| Summary | Professional summary/intro | String (3-6 sentences) |
| Skills | List or comma-separated tech/soft skills | "Python, Kubernetes..." |
| Experience | Work chronology, organizations, bullet-point details | String (multiline) |
| Education | Universities, degrees, certs | String (multiline) |
| Source | "real", "template", "llm", "faker" | String |
Diagram: Dataset Schema Overview with Field Descriptions and Data Types (https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/a5b5a057-7265-4428-9827-0a4c92f88d19/0e26c38c.png)
Composition by Data Source:
Diagram: Composition of Tech Resume Dataset by Data Source (https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/a5aafe90-c5b6-4d07-ad9c-cf5244266561/5723c094.png)
Role Cluster Diversity:
Diagram: Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset (https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/8c6ba5d6-f676-4213-b4f7-16a133081e00/e9cc61b6.png)
Alternative: Dataset by Source Type (Pie Chart):
Diagram: Resume Dataset Composition by Source Type (https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/2325f133-7fe5-4294-9a9d-4db19be3584f/b85a47bd.png)
Each line in tech_resumes_dataset.jsonl is a single, fully structured resume object:
import json

# Load all resumes; each line is one JSON object
with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
    resumes = [json.loads(line) for line in f]
# Each record is now a Python dictionary
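For quick exploration, the parsed records can be dropped into a pandas DataFrame; a minimal sketch, assuming the JSONL keys match the field names in the schema table above:

```python
import json
import pandas as pd

# Parse the JSONL file into a DataFrame (field names as documented in the schema table)
with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
    df = pd.DataFrame([json.loads(line) for line in f])

# Example: resume counts per job category and per data source
print(df['Category'].value_counts().head(10))
print(df['Source'].value_counts())
```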
If you use this dataset, credit it as “[your Kaggle dataset URL]” and mention original sources (ResumeAtlas, Resume_Classification, Kaggle Resume Dataset, and synthetic methodology as described).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In modern anesthesia, multiple medical devices are used simultaneously to comprehensively monitor real-time vital signs in order to optimize patient care and improve surgical outcomes. However, interpreting the dynamic changes of time-series biosignals and their correlations is a difficult task even for experienced anesthesiologists. Recent advanced machine learning technologies have shown promising results in biosignal analysis; however, research and development in this area has been relatively slow due to the lack of biosignal datasets for machine learning. The VitalDB (Vital Signs DataBase) is an open dataset created specifically to facilitate machine learning studies related to monitoring vital signs in surgical patients. This dataset contains high-resolution multi-parameter data from 6,388 cases, including 486,451 waveform and numeric data tracks of 196 intraoperative monitoring parameters, 73 perioperative clinical parameters, and 34 time-series laboratory result parameters. All data is stored in the public cloud after anonymization. The dataset can be freely accessed and analysed using application programming interfaces and a Python library. The VitalDB public dataset is expected to be a valuable resource for biosignal research and development.
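As a rough illustration of programmatic access, the clinical-information table can be pulled straight into pandas; the endpoint URL below is an assumption based on the VitalDB open-dataset web API documentation as I recall it, so verify it against the official docs before relying on it:

```python
import pandas as pd

# Assumed endpoint of the VitalDB open-dataset web API: the clinical
# information table (one row per case) is served as CSV. Verify the URL
# against the official VitalDB documentation.
CASES_URL = "https://api.vitaldb.net/cases"

cases = pd.read_csv(CASES_URL)
print(cases.shape)         # expected on the order of 6,388 cases x clinical parameters
print(cases.columns[:10])  # first few perioperative clinical parameters
```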
This dataset contains the raw data used in the article "Using AI to detect misinformation and emotions on Telegram: a comparison with the media", accepted for publication in index.comunicación. The data includes:
• Telegram dataset (tg_messages.csv): 54,456 posts extracted from 33 public Telegram channels between 23 July and 16 November 2023, related to the political debate around the Amnesty Law in Spain. Each entry includes message metadata such as channel, date, views, and content.
• News headlines dataset (Titulares.csv): 46,022 news headlines mentioning "amnesty", extracted from 377 Spanish national media outlets indexed in MediaCloud, during the same period.
• Analysis scripts: Available upon request or pending publication in the article's supplementary materials.
The data was used for topic modelling, sentiment and emotion detection with NLP techniques based on Python libraries like BERTopic and pysentimiento. All data is anonymized and publicly accessible or derived from open sources.
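A minimal sketch of the kind of analysis described, using BERTopic for topic modelling and pysentimiento for Spanish sentiment; the column name `content` is assumed from the metadata description above and may differ in the actual CSV:

```python
import pandas as pd
from bertopic import BERTopic
from pysentimiento import create_analyzer

# Column name "content" is an assumption based on the metadata description above
df = pd.read_csv("tg_messages.csv")
texts = df["content"].dropna().astype(str).tolist()

# Topic modelling over the Telegram posts
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(texts)

# Spanish sentiment analysis on one example post
analyzer = create_analyzer(task="sentiment", lang="es")
print(analyzer.predict(texts[0]))
```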
The files are of two formats: .npy and .stc.
.npy files can be read using the NumPy module in Python, e.g.:
import numpy as np
data = np.load('file_name.npy')
https://numpy.org/doc/stable/reference/generated/numpy.load.html
.stc files can be read using the MNE module in Python, e.g.:
from mne import read_source_estimate
stc = read_source_estimate('stc_name-lh.stc')
Note that reading in the data from just one hemisphere file will automatically read the data for the other one too.
https://mne.tools/stable/generated/mne.read_source_estimate.html
The data file is based on a copy of the Hugging Face data file buruzaemon/amazon_reviews_multi, which is itself a copy of the original data file defunct-datasets/amazon_reviews_multi. The dataset was published by the Open Data on AWS community.
In our modification, we removed unnecessary columns (thereby anonymizing the data file) and added columns describing the string lengths of the individual columns; see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:
The original *.jsonl data format was changed to the more modern *.parquet format (see Apache Arrow).
The data file was created for the purpose of testing the Hugging Face Summarization tutorial, because the older version of the dataset is not compatible with the new version of the datasets library.
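A minimal loading sketch with the datasets library; the parquet file names are placeholders, so adjust them to the actual split files in this repository:

```python
from datasets import load_dataset

# File names are placeholders; point them at the actual parquet splits in this dataset
data_files = {
    "train": "train.parquet",
    "validation": "validation.parquet",
    "test": "test.parquet",
}
ds = load_dataset("parquet", data_files=data_files)
print(ds)
print(ds["train"][0])  # one review record with the columns listed further below
```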
This dataset is comprehensive; derived datasets for the tutorial can be found here:
"We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.
Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source
Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus
The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.
id: record id
stars: An int between 1-5 indicating the number of stars.
review_body: The text body of the review.
review_title: The text title of the review.
language: The string identifier of the review language.
product_category: String representation of the product's category.
lenght_review_body: text length of review_body
lenght_review_title: text length of review_title
lenght_product_category: text length of product_category
This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source
The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform to the Amazon Community Guidelines. source
Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.
The dataset includes:
These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.
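As a taste of the libraries mentioned, here is a minimal sketch for inspecting one DICOM file with pydicom; the file name is hypothetical:

```python
import pydicom

# Hypothetical file name; substitute any DICOM file from the dataset
ds = pydicom.dcmread("example.dcm")

print(ds.Modality)           # imaging modality, e.g. CT or MR
print(ds.pixel_array.shape)  # image as a NumPy array (requires pixel data to be present)
```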
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description: The database contains ultrasound images of thyroid nodules that were finally included in the study. As the aim of this study was to identify nodules as benign or malignant, all nodules were placed in two zip files according to their pathological nature: benign_after.zip and malignant_after.zip. After unzipping, each archive contains several folders named in the form "pathological nature + number"; each folder corresponds to one thyroid nodule and contains its ultrasound images collected in a single examination.
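A minimal sketch for walking the unzipped folders and loading the ultrasound images with OpenCV; the folder names follow the description above, while the image file extensions are an assumption:

```python
import os
import cv2

# "benign_after" and "malignant_after" follow the description above;
# image extensions are assumed and may need adjusting.
for label_dir in ["benign_after", "malignant_after"]:
    for nodule_folder in sorted(os.listdir(label_dir)):  # one folder per nodule
        folder_path = os.path.join(label_dir, nodule_folder)
        for fname in sorted(os.listdir(folder_path)):
            if fname.lower().endswith((".png", ".jpg", ".jpeg", ".bmp")):
                img = cv2.imread(os.path.join(folder_path, fname), cv2.IMREAD_GRAYSCALE)
                print(label_dir, nodule_folder, fname, img.shape)
```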
Ethical Approval: This retrospective study was approved by the institutional Ethics Committees of the First Affiliated Hospital of Jinan University, and the requirement for informed consent was waived.
Sensitive Information Protection: All sensitive information contained in the images, including the patient's personal information, the hospital visited, and the time of the visit, has been removed using the OpenCV (cv2) library in Python for the purpose of anonymization.
Processing pipeline and analysis steps: All the annotations in the images and clips were eliminated before review. US images were evaluated in a blinded fashion, with no US or pathology reports available, by two board-certified radiologists (with more than 10 years of experience in thyroid sonography) independently. Nodule size was measured as the maximal dimension on US images and the five gray-scale US categories were reviewed according to the ACR TI-RADS lexicon (5): composition, echogenicity, shape, margin, and echogenic foci. In the ACR TI-RADS, the TI-RADS risk level for nodules was determined by the total score of the five US categories, ranging from TR1 (benign) to TR5 (highly suspicious).
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Overview
Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.
This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!
There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.
Challenge
This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.
In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.
The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!
Submissions will be evaluated on the following:
Composition - Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.
Documentation - Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.
To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.
While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.
How to Participate
To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.
No submission is necessary for the Weekly Kernels Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.
Timeline
All dates are 11:59 PM UTC.
Submission deadline: December 3rd
Winners announced: December 10th
Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th
All kernels are evaluated after the deadline.
Rules
To be eligible to win a prize in either of the above prize tracks, you must be:
a registered account holder at Kaggle.com;
the older of 18 years old or the age of majority in your jurisdiction of residence; and
not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan.
Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.
Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.
Survey Methodology ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Hyperreal Talk (Polish clear web message board) messages data.
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).
This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The Hyperreal Talk forum emerges as a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and the broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
The data was collected in two periods, i.e., in September 2023 and November 2023.
Data Content
The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains accessible posts without needing to log in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.
The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
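To illustrate the kind of regex-plus-hashing step described, a simplified sketch follows; this is not the project's actual script, which is linked on GitHub below:

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def hash_identifier(match: re.Match) -> str:
    # Irreversibly replace an identifier with a short SHA-256 digest
    return "ID_" + hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]

def anonymize(post: str) -> str:
    # Substitute every matched identifier (here: email addresses) with its hash
    return EMAIL_RE.sub(hash_identifier, post)

print(anonymize("Contact me at someone@example.com for details"))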
The dataset consists of the following files:
Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.
Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Similar to the first type, these files are organized into directories corresponding to the website’s folder structure.
A .csv file that lists all the messages, including file names and the content of each post.
Accessibility and Usage
The data can be accessed without any restrictions.
Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”
Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:
https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710
https://github.com/HaitaoShi/Scrapy_hyperreal
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
The CMT1A-BioStampNPoint2023 dataset provides data from a wearable sensor accelerometry study conducted for studying gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). In addition to individuals with CMT1A, the dataset also includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants, with a substantial fraction (9) of the controls also participating in the in-home study protocol. For the CMT1A participants, data is provided for 15 participants for the baseline visit and associated home recording duration; additionally, for a subset of 12 of these participants, data is also provided for a 12-month longitudinal visit and associated home recording duration. For controls, no longitudinal data is provided as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc, Lexington, MA), three of which were attached to each participant for gathering data over a roughly one-day interval. For additional details, see the description in the "README.md" included with the dataset.
Methods
The dataset contains data from wearable sensors and clinical data. The wearable sensor data was acquired using wearable sensors, and the clinical data was extracted from the clinical record. The sensor data has not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for the aforementioned time anonymization and for standard formatting. Additional details of the time anonymization are provided in the file "README.md" included with the dataset.
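A minimal illustration of the kind of start-time anonymization described (not the authors' actual program): absolute timestamps are replaced by offsets from the start of the recording, so the wall-clock start time is no longer recoverable.

```python
import pandas as pd

# Toy accelerometry frame with absolute timestamps (illustrative only)
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-03-01 09:15:00", "2020-03-01 09:15:01", "2020-03-01 09:15:02"]),
    "accel_x": [0.01, 0.02, -0.01],
})

# Replace absolute times with seconds elapsed since the start of the recording
df["t_seconds"] = (df["timestamp"] - df["timestamp"].iloc[0]).dt.total_seconds()
df = df.drop(columns=["timestamp"])
print(df)
```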
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Images for two separate cohorts of fifteen head and neck cancer patients diagnosed with oral or oropharyngeal squamous cell carcinoma. Images for each patient were acquired before the start of radiotherapy.
The first cohort, labeled "heterogeneous" (HET), consisted of patients with images acquired at different institutions and is termed "heterogeneous" because of the variety of acquisition scanners and parameters used in image generation. HET cohort images include T2-weighted MRI only.
The second cohort, labeled "homogeneous" (HOM), consisted of patients from a single prospective clinical trial with the same imaging protocol (NCT04265430, PA16-0302) and is termed "homogeneous" because of the consistency in both scanner and acquisition parameters used in image generation. HOM cohort images include T2-weighted MRI for all patients and Dixon T1-weighted water-enhanced MRI and CT for a subset of 5 patients.
For each image, regions of interest (ROIs) of various healthy tissue types and anatomical locations were manually contoured in the same relative area for five slices by one observer (medical student) using Velocity AI v.3.0.1 (Atlanta, GA, USA), verified by a physician expert (radiologist), and exported as DICOM-RT Structure Set files. The ROIs were:
1. cerebrospinal fluid inferior (CSF_inf)
2. cerebrospinal fluid middle (CSF_mid)
3. cerebrospinal fluid superior (CSF_sup)
4. cheek fat left (Fat_L)
5. cheek fat right (Fat_R)
6. nape fat inferior (NapeFat_inf)
7. nape fat middle (NapeFat_mid)
8. nape fat superior (NapeFat_sup)
9. neck fat (NeckFat)
10. masseter left (Masseter_L)
11. masseter right (Masseter_R)
12. rectus capitis posterior major (RCPM)
13. skull
14. cerebellum
All images were retrospectively acquired from the University of Texas MD Anderson Cancer Center clinical databases in agreement with an Institutional Review Board approved protocol designed to collect data from patients with multiple imaging acquisitions (RCR03-0800). The protocol included a waiver of informed consent.
DICOM data was anonymized using an in-house Python script that implements the RSNA CTP DICOM Anonymizer software. All files have had any DICOM header info and metadata containing PHI removed or replaced with dummy entries.
Note: A zipped file was uploaded; once unzipped, there should be two separate folders corresponding to each cohort, which can then be used as inputs to the GitHub code at: https://github.com/kwahid/MRI_Intensity_Standardization.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 200+ images of DJI Agras crop sprayer drones taken by private drone operators throughout South America and the Caribbean. The images are classified by generation and drone model, as shown in the table below.
| Gen | Model | Arms | Rotors | Nozzle Type | Nozzles | Images |
|---|---|---|---|---|---|---|
| 02 | DJI Agras T16 | 6 | 6 | Pressure (Flat fan) | 8 | 15 |
| 02 | DJI Agras T20 | 6 | 6 | Pressure (Flat fan) | 8 | 44 |
| 03 | DJI Agras T10 | 4 | 4 | Pressure (Flat fan) | 4 | 1 |
| 03 | DJI Agras T30 | 6 | 6 | Pressure (Flat fan) | 16 | 75 |
| 04 | DJI Agras T20P | 4 | 4 | Centrifugal | 2 | 8 |
| 04 | DJI Agras T40 | 4 | 8 | Centrifugal | 2 | 67 |
| 05 | DJI Agras T50 | 4 | 8 | Centrifugal | 2/4 | 17 |
A couple of technical notes:
* The tank size in liters is given in the model name after the letter T, e.g., the T16 has a 16-liter tank, the T30 has a 30-liter tank, and so on. An exception to this rule is the T50, which has a standard tank size of 40 liters and the option to install a 50-liter tank.
* Each rotor is equipped with two propeller blades. Hence, the total number of propeller blades on a drone is twice the number of rotors.
This dataset is obviously too small to train models from scratch, but it is ideal to test fine-tuning methods or few-shot learning methods. Here are a few ideas:
* Combine this dataset with one containing camera drones, i.e., small drones used for photography and videography (e.g., DJI Phantom, Mavic, Inspire, Matrice; Autel EVO; etc.). Fine-tune a model to distinguish crop sprayer drones from camera drones.
* Fine-tune a model to classify drones by nozzle type: flat fan pressure nozzles (T16/T20/T10/T30) vs. centrifugal nozzles (T20P/T40/T50).
* Fine-tune a model to classify by number of arms: 6-arm models (T16/T20/T30) vs. 4-arm models (T10/T20P/T40/T50).
The majority of the images in this dataset come from WhatsApp group chat conversations and were taken with various smartphone cameras. A small number of images were taken by me using my own smartphone when I worked as a crop spraying services provider.
To ensure anonymization, all faces and identifying information (e.g., logos, truck license plates) were blurred using Gaussian kernels.
Additionally, during metadata cleaning, all Exif metadata (including ICC color profiles) was removed. However, all images were originally captured in sRGB or close-to-sRGB color spaces. As a result, standard image viewers (e.g., Ubuntu's default viewer) render them without visible changes. You can safely assume sRGB when loading the images.
If you are using Python libraries such as PIL, PyTorch, or Keras, you can ensure consistent color handling by explicitly converting images to RGB mode and treating pixel values as standard 0–255 sRGB values.
Using PIL (standalone)
```python
import numpy as np
from PIL import Image

img = Image.open("path/to/image.jpg").convert("RGB")  # Force sRGB interpretation
img_array = np.array(img) / 255.0  # Normalize if needed
```
Using PyTorch with torchvision
```python
import torch
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.ToTensor(),                      # Converts to [0, 1] and permutes (H, W, C) to (C, H, W)
    transforms.ConvertImageDtype(torch.float),  # Ensure float32 dtype
])

img = Image.open("path/to/image.jpg").convert("RGB")
tensor = transform(img)
```
Using Keras
```python
from tensorflow.keras.preprocessing.image import load_img, img_to_array

img = load_img("path/to/image.jpg", color_mode='rgb')
img_array = img_to_array(img) / 255.0  # Normalize if required by your model
```
For suggestions, questions, or feedback, you can reach me at luis.i.reyes.castro@gmail.com. In case you download this dataset from Kaggle, you can find the original repository here.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Dopek.eu (Polish clear web and dark web message board) messages data.
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Clear web and dark web message board called dopek.eu (https://dopek.eu/).
This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The dopek.eu forum emerges as a pivotal online space on the Polish internet, serving as a hub for trading, discussions, and the exchange of knowledge and experiences concerning the use of the so-called new psychoactive substances (designer drugs). The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
The data was collected in October 2023.
Data Content
The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.
The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
The dataset consists of the following types of files:
Zipped .txt files (dopek.zip) containing all messages (posts).
A .csv file that lists all the messages, including file names and the content of each post.
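A minimal sketch for iterating over the posts inside dopek.zip without fully extracting it; the file layout mirrors the board's folder structure described above:

```python
import zipfile

# Stream the .txt post files directly from the archive
with zipfile.ZipFile("dopek.zip") as zf:
    txt_files = [n for n in zf.namelist() if n.endswith(".txt")]
    print(f"{len(txt_files)} post files found")
    with zf.open(txt_files[0]) as f:
        print(f.read().decode("utf-8", errors="replace")[:500])  # preview of the first post
```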
Accessibility and Usage
The data can be accessed without any restrictions.
Attached is a .txt file detailing the tree of folders for “dopek.zip”.
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the posts, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload provides Open Data associated with the publication "A machine learning tool to improve prediction of mediastinal lymph node metastases in non-small cell lung cancer using routinely obtainable [18F]FDG-PET/CT parameters" by Rogasch JMM et al. (2022).
The upload contains the anonymized dataset with 10 features necessary for the final GBM model that was presented in the publication. However, the original full dataset with 40 features was excluded from this Open Data repository because it may not comply with strict rules of data anonymization. The full dataset can be obtained from the corresponding author (julian.rogasch@charite.de) upon reasonable request.
Besides the dataset, this upload provides the original Python and R scripts that were used, as well as their output.
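For orientation only, here is a generic sketch of fitting a gradient-boosting classifier on the 10-feature table with scikit-learn; this is not the authors' model or code (their original Python/R scripts are included in the upload), and the file and column names are placeholders:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder file/column names; see "content_description_2022_11_19.txt" for the real ones
df = pd.read_csv("anonymized_dataset.csv")
X = df.drop(columns=["label"])  # the 10 model features
y = df["label"]                 # e.g. mediastinal lymph node metastasis yes/no

gbm = GradientBoostingClassifier(random_state=42)
print(cross_val_score(gbm, X, y, cv=5, scoring="roc_auc").mean())
```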
A description of all files can be found in "content_description_2022_11_19.txt".
A user-friendly web tool that implements the final machine learning model can be found here: PET_LN_calculator
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 1,050 records and represents anonymized training data collected from college athletes involved in various sports programs. It includes biometric signals, physical performance metrics, recovery patterns, and personalized training feedback. Data was gathered from wearable devices, fitness logs, and observational assessments during regular training sessions.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview
The COVID-19 Patient Recovery Dataset is a synthetic collection of anonymized records for around 70,000 COVID-19 patients. It aims to assist with classification tasks in machine learning and epidemiological research. The dataset includes detailed clinical and demographic information, such as symptoms, existing health issues, vaccination status, COVID-19 variants, treatment details, and outcomes related to recovery or mortality. This dataset is great for predicting patient recovery (recovered), mortality (death), disease severity (severity), or the need for intensive care (icu_admission) using algorithms like Logistic Regression, Random Forest, XGBoost, or Neural Networks. It also allows for exploratory data analysis (EDA), statistical modeling, and time-series studies to find patterns in COVID-19 outcomes.
The data is synthetic and reflects realistic trends found in public health data, based on sources like WHO reports. It ensures privacy and follows ethical guidelines. Dates are provided in Excel serial format, meaning 44447 corresponds to September 8, 2021, and can be converted to standard dates using Python’s datetime or Excel. With 70,000 records and 28 columns, this dataset serves as a valuable resource for data scientists, researchers, and students interested in health-related machine learning or pandemic trends.
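For example, converting the Excel serial dates in Python (using the 1899-12-30 epoch noted under Key Column Details below):

```python
from datetime import datetime, timedelta

def excel_serial_to_date(serial: float) -> datetime:
    # Excel serial dates count days from 1899-12-30
    return datetime(1899, 12, 30) + timedelta(days=serial)

print(excel_serial_to_date(44447))  # 2021-09-08, i.e. September 8, 2021

# Vectorized alternative for a whole pandas column:
# df["date_reported"] = pd.to_datetime(df["date_reported"], unit="D", origin="1899-12-30")
```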
Data Source and Collection
Source: Synthetic data based on public health patterns from sources like the World Health Organization (WHO). It includes placeholder URLs.
Collection Period: Simulated from early 2020 to mid-2022, covering the Alpha, Delta, and Omicron waves.
Number of Records: 70,000.
File Format: CSV, which works with Pandas, R, Excel, and more.
Data Quality Notes:
About 5% of the values are missing in fields like symptoms_2, symptoms_3, treatment_given_2, and date.
There are rare inconsistencies, such as between recovery/death flags and dates, which may need some preprocessing.
Unique, anonymized patient IDs.
| Column Name | Data Type |
|---|---|
| patient_id | String |
| country | String |
| region/state | String |
| date_reported | Integer |
| age | Integer |
| gender | String |
| comorbidities | String |
| symptoms_1 | String |
| symptoms_2 | String |
| symptoms_3 | String |
| severity | String |
| hospitalized | Integer |
| icu_admission | Integer |
| ventilator_support | Integer |
| vaccination_status | String |
| variant | String |
| treatment_given_1 | String |
| treatment_given_2 | String |
| days_to_recovery | Integer |
| recovered | Integer |
| death | Integer |
| date_of_recovery | Integer |
| date_of_death | Integer |
| tests_conducted | Integer |
| test_type | String |
| hospital_name | String |
| doctor_assigned | String |
| source_url | String |
Key Column Details
patient_id: Unique identifier (e.g., P000001).
country: Reporting country (e.g., India, USA, Brazil, Germany, China, Pakistan, South Africa, UK).
region/state: Sub-national region (e.g., Sindh, California, São Paulo, Beijing).
date_reported, date_of_recovery, date_of_death: Excel serial dates (convert using datetime(1899,12,30) + timedelta(days=value)).
age: Patient age (1–100 years).
gender: Male or Female.
comorbidities: Pre-existing conditions (e.g., Diabetes, Hypertension, Cancer, Heart Disease, Asthma, None).
symptoms_1, symptoms_2, symptoms_3: Reported symptoms (e.g., Cough, Fever, Fatigue, Loss of Smell, Sore Throat, or empty).
severity: Case severity (Mild, Moderate, Severe, Critical).
hospitalized, icu_admission, ventilator_support: Binary (1 = Yes, 0 = No).
vaccination_status: None, Partial, Full, or Booster.
variant: COVID-19 variant (Omicron, Delta, Alpha).
treatment_given_1, treatment_given_2: Treatments administered (e.g., Antibiotics, Remdesivir, Oxygen, Steroids, Paracetamol, or empty).
days_to_recovery: Days from report to recovery (5–30, or empty if not recovered).
recovered, death: Binary outcomes (1 = Yes, 0 = No; generally mutually exclusive).
tests_conducted: Number of tests (1–5).
test_type: PCR or Antigen.
hospital_name: Fictional hospital (e.g., Aga Khan, Mayo Clinic, NHS Trust).
doctor_assigned: Fictional doctor name (e.g., Dr. Smith, Dr. Müller).
source_url: Placeholder.
Summary Statistics
Total Patients: 70,000.
Age: Mean ~50 years, Min 1, Max 100, evenly distributed.
Gender: ~50% Male, ~50% Female.
Top Countries: USA (20%), India (18%), Brazil (15%), China (12%), Germany (10%).
Comorbidities: Diabetes (25%), Hypertension (20%), Cancer (15%), Heart Disease (15%), Asthma (10%), None (15%).
Severity: Mild (60%), Moderate (25%), Severe (10%), Critical (5%).
Recovery Rate: ~60% recovered (recovered=1), ~30% deceased (death=1), ~10% unresolved (both 0).
Vaccination: None (40%), Full (30%), Partial (15%), Booster (15%).
Variants: Omicron (50%), Delt...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.
It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.
🔍 Dataset Features:

| Column Name | Description |
|---|---|
| Patient_ID | Unique anonymized identifier |
| Admission_Date | Patient’s hospital admission date |
| Temperature_C | Body temperature in degrees Celsius |
| BP_Systolic | Systolic blood pressure (mmHg) |
| BP_Diastolic | Diastolic blood pressure (mmHg) |
| Heart_Rate | Beats per minute |
| WBC_Count | White blood cell count (x10⁹/L) |
| Lactate_mmol_L | Lactate level in mmol/L |
| Sepsis_Flag | Binary indicator (1 = Suspected Sepsis, 0 = Normal) |
| Ward | Hospital ward/unit |
| Doctor_On_Duty | Attending physician name (anonymized) |
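As one starting point for the ML use cases below, a minimal scikit-learn sketch using the columns above; the CSV file name is a placeholder:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("sepsis_early_warning.csv")  # placeholder file name

features = ["Temperature_C", "BP_Systolic", "BP_Diastolic", "Heart_Rate", "WBC_Count", "Lactate_mmol_L"]
X = df[features]
y = df["Sepsis_Flag"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```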
🎯 Use Cases: Build Power BI dashboards for hospital early warning systems
Train ML classification models to detect early signs of sepsis
Create patient monitoring tools with Python or R
Explore the relationship between vitals & sepsis onset
Perform feature engineering for risk scoring systems
📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.
👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.
✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here
This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights
Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.
Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡
Let me know if you’d like help starting a Power BI or Python model for this!
This dataset contains Twitter posts with both text and images for depression detection research.
twitter_depression_dataset/
├── images/
│ ├── depresi/ # Images from depression-related posts
│ └── nondepresi/ # Images from non-depression posts
├── metadata/
│ ├── full_dataset.csv # Complete dataset with all metadata
│ └── image_labels.csv # Simple image-label mapping
└── dataset-metadata.json # Kaggle dataset metadata
full_text: Original tweet text
label_text: Text label (0=non-depression, 1=depression)
label_image: Image label (0=non-depression, 1=depression)
kaggle_output_path: Path to image file in this dataset
created_at: Tweet creation timestamp
favorite_count: Number of likes

import pandas as pd
# Load the dataset
df = pd.read_csv('metadata/full_dataset.csv')
# Load image-label mapping
labels = pd.read_csv('metadata/image_labels.csv')
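A small follow-up sketch pairing each post's image with its label using the documented columns (kaggle_output_path, label_image); PIL is assumed for image loading, and the paths are assumed to resolve from the dataset root:

```python
import pandas as pd
from PIL import Image

df = pd.read_csv('metadata/full_dataset.csv')

# Open one image and read its label (1 = depression, 0 = non-depression)
row = df.iloc[0]
img = Image.open(row["kaggle_output_path"])
print(row["label_image"], img.size)
```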
If you use this dataset, please cite appropriately and ensure ethical use for research purposes.
This dataset contains social media data. Please ensure:
- Respectful use for research purposes
- Proper anonymization in publications
- Compliance with platform terms of service
- Consideration of mental health sensitivity
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
💎 Where Precision Medicine Meets the Vibrance of Data Science
Unlock insights, drive innovations, and explore healthcare analytics with a colorful, interactive, and thematically bold dataset.
This dataset delivers fully anonymized laboratory test results with a visually rich and research-ready design. Each element—from clear unit descriptions to color-coded status flags—is crafted for maximum clarity and engagement.
💡 Ideal For:
Format: CSV – ready to use with Python, R, Excel, Tableau, Power BI, or any BI/ML platform.
| Column | Description |
|---|---|
| Date | Test date (YYYY-MM-DD) |
| Test_Name | Laboratory test name |
| Result | Measured value (numeric or qualitative) |
| Unit | Measurement unit abbreviation |
| Reference_Range | Official normal range |
| Status | Normal / High / Low indicator (⚪🟢🔴) |
| Comment | Short medical interpretation |
| Min_Reference | Lower bound of reference range |
| Max_Reference | Upper bound of reference range |
| Unit_Description | Expanded description of the unit |
| Recommended_Followup | Suggested monitoring or medical action |
Here’s what some of the common units mean in a medical context:
These units help clinicians determine how much of a substance is present and compare it with healthy reference ranges.
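A small sketch of how the Status flag can be recomputed from the numeric reference bounds, using the column names in the table above; the CSV file name is a placeholder:

```python
import pandas as pd

df = pd.read_csv("lab_results.csv")  # placeholder file name

# Recompute a status flag from the numeric reference bounds
result = pd.to_numeric(df["Result"], errors="coerce")  # qualitative results become NaN
df["Status_check"] = "Normal"
df.loc[result < df["Min_Reference"], "Status_check"] = "Low"
df.loc[result > df["Max_Reference"], "Status_check"] = "High"
print(df[["Test_Name", "Result", "Reference_Range", "Status", "Status_check"]].head())
```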
This dataset is for educational and research purposes only. It is not intended for actual medical diagnosis or treatment.
CC0 1.0 Public Domain Dedication – Free to use, share, remix, and adapt.
Crafted to inspire data-driven healthcare solutions, this dataset empowers researchers, educators, and developers to transform raw lab results into vivid, interactive, and actionable insights.