20 datasets found
  1. Cultural Behavior and Engagement Dataset

    • kaggle.com
    zip
    Updated Apr 22, 2025
    Cite
    Python Developer (2025). Cultural Behavior and Engagement Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/cultural-behavior-and-engagement-dataset
    Explore at:
    Available download formats: zip (135575 bytes)
    Dataset updated
    Apr 22, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset captures cultural engagement, social behavior, and interaction patterns of individuals in smart communities. Designed with privacy at its core, it aggregates anonymized data from smart devices, social media activity, and event participation logs.

    It includes behavioral metrics such as event attendance frequency, social interactions, and cultural practices, along with contextual data like language usage, time-based activity patterns, and anonymized location zones. Privacy features, such as user consent and anonymization flags, ensure ethical data usage.

    The dataset supports the development of culturally aware recommendation systems and can be used for tasks like event participation prediction and personalized cultural content recommendation.

  2. Resume_Dataset

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    RayyanKauchali0 (2025). Resume_Dataset [Dataset]. https://www.kaggle.com/datasets/rayyankauchali0/resume-dataset
    Explore at:
    Available download formats: zip (3616108 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    RayyanKauchali0
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tech Resume Dataset (3,500+ Samples):

    This dataset is designed for cutting-edge NLP research in resume parsing, job classification, and ATS system development. Below are extensive details and several ready-made diagrams you can include in your Kaggle upload (just save and upload as “Additional Files” or use them in your dataset description).

    Dataset Composition and Sourcing

    • Total Resumes: 3,500+
    • Sources:
      • Real Data: 2,047 resumes (58.5%) from ResumeAtlas and reputable open repositories; all records strictly anonymized.
      • Template-Based Synthetic: 573 resumes featuring varied narratives and realistic achievements for classic, modern, and professional styles.
      • LLM-Generated Variations: 460 unique samples using structured prompts to diversify skills, summaries, and career tracks, focusing on AI, ML, and data.
      • Faker-Seeded Synthetic: 420 resumes, especially for junior/support/cloud/network tracks, populated with robust Faker-generated work and education fields.
    • Role Coverage:
      • 15 major technology clusters (Software Engineering, DevOps, Cloud, AI/ML, Security, Data Engineering, QA, UI/UX, and more)
      • At least 200 samples per primary role group for label balance
      • 60+ subcategories reflecting granular tech job roles

    Key Dataset Fields (JSONL Schema)

    Field | Description | Example / Data Type
    ResumeID | Unique, anonymized string | "DIS4JE91Z..." (string)
    Category | Tech job category/label | "DevOps Engineer"
    Name | Anonymized (Faker-generated) name | "Jordan Patel"
    Email | Anonymized email address | "jpatel@example.com"
    Phone | Anonymized phone number | "+1-555-343-2123"
    Location | City, country or region (anonymized) | "Austin, TX, USA"
    Summary | Professional summary/intro | String (3-6 sentences)
    Skills | List or comma-separated tech/soft skills | "Python, Kubernetes..."
    Experience | Work chronology, organizations, bullet-point details | String (multiline)
    Education | Universities, degrees, certs | String (multiline)
    Source | "real", "template", "llm", "faker" | String

    Figure: Dataset Schema Overview with Field Descriptions and Data Types

    Technical Validation & Quality Assurance

    • Formatting:
      • Uniform schema, right-tab alignment for dates (MMM-YYYY)
      • Standard ATS/NLP-friendly section headers
    • De-duplication:
      • All records checked with BERT/MinHash for uniqueness; near-duplicates with cosine similarity >0.9 removed (see the sketch after this list)
    • PII Scrubbing:
      • Names, contacts, locations anonymized with Python Faker
    • Role/Skill Taxonomy:
      • Job titles & skills mapped to ESCO, O*NET, NIST NICE, CNCF lexicons for research alignment
    • Quality Checks:
      • Automatic and manual validation for section presence, data type conformity, and format alignment
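
    For illustration, a minimal embedding-based de-duplication pass along these lines might look as follows (a sketch only, not the dataset's actual pipeline; the sentence-transformers model name is an assumption):

    ```python
    # Illustrative near-duplicate removal via embeddings + cosine similarity;
    # not the dataset's actual pipeline. The embedding model is an assumed choice.
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    def deduplicate(texts, threshold=0.9):
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
        emb = model.encode(texts, normalize_embeddings=True)
        sim = cosine_similarity(emb)
        keep, dropped = [], set()
        for i in range(len(texts)):
            if i in dropped:
                continue
            keep.append(i)
            for j in range(i + 1, len(texts)):
                if sim[i, j] > threshold:  # near-duplicate: drop the later record
                    dropped.add(j)
        return [texts[i] for i in keep]
    ```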

    Role & Source Coverage Visualizations

    Composition by Data Source:

    Figure: Composition of Tech Resume Dataset by Data Source

    Role Cluster Diversity:

    Figure: Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset

    Alternative: Dataset by Source Type (Pie Chart):

    Figure: Resume Dataset Composition by Source Type

    Typical Use Cases

    • Resume parsing & sectioning (training for models like BERT, RoBERTa, spaCy)
    • Fine-tuning for NER, job classification (60+ labels), skill extraction, and ATS research
    • Development or benchmarking of AI-powered job matching, candidate ranking, and automated tracking tools
    • ML/data science education and demo pipelines

    How to Use the JSONL File

    Each line in tech_resumes_dataset.jsonl is a single, fully structured resume object:

    import json
    
    with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
      resumes = [json.loads(line) for line in f]
    # Each record is now a Python dictionary
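
    If tabular analysis is more convenient, the parsed records can be loaded into a pandas DataFrame (assuming pandas is installed):

    ```python
    import pandas as pd

    df = pd.DataFrame(resumes)            # one row per resume
    print(df["Category"].value_counts())  # label distribution across the role clusters
    ```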
    

    Citing and Sharing

    If you use this dataset, credit it as “[your Kaggle dataset URL]” and mention original sources (ResumeAtlas, Resume_Classification, Kaggle Resume Dataset, and synthetic methodology as described).

  3. Data from: VitalDB, a high-fidelity multi-parameter vital signs database in...

    • physionet.org
    Updated Sep 21, 2022
    Cite
    Hyung-Chul Lee; Chul-Woo Jung (2022). VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients [Dataset]. http://doi.org/10.13026/czw8-9p62
    Explore at:
    Dataset updated
    Sep 21, 2022
    Authors
    Hyung-Chul Lee; Chul-Woo Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In modern anesthesia, multiple medical devices are used simultaneously to comprehensively monitor real-time vital signs to optimize patient care and improve surgical outcomes. However, interpreting the dynamic changes of time-series biosignals and their correlations is a difficult task even for experienced anesthesiologists. Recent advanced machine learning technologies have shown promising results in biosignal analysis, however, research and development in this area is relatively slow due to the lack of biosignal datasets for machine learning. The VitalDB (Vital Signs DataBase) is an open dataset created specifically to facilitate machine learning studies related to monitoring vital signs in surgical patients. This dataset contains high-resolution multi-parameter data from 6,388 cases, including 486,451 waveform and numeric data tracks of 196 intraoperative monitoring parameters, 73 perioperative clinical parameters, and 34 time-series laboratory result parameters. All data is stored in the public cloud after anonymization. The dataset can be freely accessed and analysed using application programming interfaces and Python library. The VitalDB public dataset is expected to be a valuable resource for biosignal research and development.
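
    As a quick-start illustration (the endpoint URL, track name, and package calls below reflect the public VitalDB documentation and should be treated as assumptions rather than part of this record):

    ```python
    # Minimal access sketch for the VitalDB open dataset (details assumed, not guaranteed).
    import pandas as pd

    # Perioperative clinical parameters, one row per surgical case
    cases = pd.read_csv("https://api.vitaldb.net/cases")
    print(cases.shape)

    # The companion "vitaldb" package (pip install vitaldb) can then load individual
    # waveform/numeric tracks by case ID, for example:
    # import vitaldb
    # art = vitaldb.load_case(1, ["SNUADC/ART"], 1 / 100)  # arterial waveform at 100 Hz
    ```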

  4. Data from: Dataset for: "Using AI to detect misinformation and emotions on...

    • investigacion.ujaen.es
    • zenodo.org
    Updated 2025
    Cite
    Montoro Montarroso, Andrés; Cantón-Correa, Javier; Montoro Montarroso, Andrés; Cantón-Correa, Javier (2025). Dataset for: "Using AI to detect misinformation and emotions on Telegram: a comparison with the media" [Dataset]. https://investigacion.ujaen.es/documentos/6856992d6364e456d3a66e70
    Explore at:
    Dataset updated
    2025
    Authors
    Montoro Montarroso, Andrés; Cantón-Correa, Javier; Montoro Montarroso, Andrés; Cantón-Correa, Javier
    Description

    This dataset contains the raw data used in the article “Using AI to detect misinformation and emotions on Telegram: a comparison with the media”, accepted for publication in index.comunicación. The data includes: • Telegram dataset (tg_messages.csv): 54,456 posts extracted from 33 public Telegram channels between 23 July and 16 November 2023, related to the political debate around the Amnesty Law in Spain. Each entry includes message metadata such as channel, date, views, and content. • News headlines dataset (Titulares.csv): 46,022 news headlines mentioning “amnesty”, extracted from 377 Spanish national media outlets indexed in MediaCloud, during the same period. • Analysis scripts: Available upon request or pending publication in the article’s supplementary materials.

    The data was used for topic modelling, sentiment and emotion detection with NLP techniques based on Python libraries like BERTopic and pysentimiento. All data is anonymized and publicly accessible or derived from open sources.
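
    As an illustration of the kind of workflow described above (a sketch only, not the authors' scripts; the column name "content" and the model settings are assumptions):

    ```python
    # Illustrative topic-modelling and emotion-detection sketch (assumed column names/settings).
    import pandas as pd
    from bertopic import BERTopic
    from pysentimiento import create_analyzer

    msgs = pd.read_csv("tg_messages.csv")
    docs = msgs["content"].dropna().astype(str).tolist()  # "content" column is assumed

    # Topic modelling over the Telegram messages
    topic_model = BERTopic(language="multilingual")
    topics, probs = topic_model.fit_transform(docs)

    # Emotion detection for Spanish-language text
    analyzer = create_analyzer(task="emotion", lang="es")
    print(analyzer.predict(docs[0]).output)
    ```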

  5. Anonymized source data files for figures in: Recurrent processes support a...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Sep 4, 2020
    Cite
    Laura Gwilliams; Jean-Remi King (2020). Anonymized source data files for figures in: Recurrent processes support a cascade of hierarchical decisions [Dataset]. http://doi.org/10.5061/dryad.70rxwdbtr
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 4, 2020
    Dataset provided by
    Dryad
    Authors
    Laura Gwilliams; Jean-Remi King
    Time period covered
    Sep 1, 2020
    Description

    The files are of two formats: .npy and .stc.

    .npy files can be read using the NumPy module in Python, e.g.:

    import numpy as np

    data = np.load('file_name.npy')

    https://numpy.org/doc/stable/reference/generated/numpy.load.html

    .stc files can be read using the MNE module in Python, e.g.:

    from mne import read_source_estimate

    stc = read_source_estimate('stc_name-lh.stc')

    Note that reading in the data from just one hemisphere file will automatically read the data for the other one too.

    https://mne.tools/stable/generated/mne.read_source_estimate.html

  6. edit_amazon_reviews_multi

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Radim Közl (2025). edit_amazon_reviews_multi [Dataset]. https://www.kaggle.com/datasets/radimkzl/edit-amazon-reviews-multi
    Explore at:
    Available download formats: zip (167232183 bytes)
    Dataset updated
    Aug 21, 2025
    Authors
    Radim Közl
    Description

    Dataset Summary

    The data file is based on a copy of the Hugging Face data file buruzaemon/amazon_reviews_multi, which is itself a copy of the original data file defunct-datasets/amazon_reviews_multi. The dataset was originally published by the Open Data on AWS community.

    In our modification, we removed unnecessary columns (thereby anonymizing the data file) and added columns describing the string lengths of the individual columns; see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:

    • train: 95% (199500)
    • validation: 2.5% (5250)
    • test: 2.5% (5250)

    The original *.jsonl data format has been changed to the more modern *.parquet format; see Apache Arrow.

    The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.

    This dataset is the comprehensive version; derived datasets for the tutorial can be found here:

    Description of the original dataset - Hugging Face Datasets

    "We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.

    For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.

    Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source

    Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus

    Languages

    The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.

    Dataset Structure

    • id: record id
    • stars: An int between 1-5 indicating the number of stars.
    • review_body: The text body of the review.
    • review_title: The text title of the review.
    • language: The string identifier of the review language.
    • product_category: String representation of the product's category.
    • lenght_review_body: text length of review_body
    • lenght_review_title: text length of review_title
    • lenght_product_category: text length of product_category
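
    A minimal loading sketch for one of the re-partitioned splits (the parquet file name is hypothetical; adjust it to the files in the download):

    ```python
    # Minimal sketch; the file name below is hypothetical.
    import pandas as pd

    train = pd.read_parquet("train.parquet")
    print(train.columns.tolist())
    print(train["language"].value_counts())
    print(train["lenght_review_body"].describe())  # column name as spelled in the dataset
    ```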

    Social Impact of Dataset

    This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source

    Discussion of Biases of origin dataset

    The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform the Amazon Community Guidelines. source

    Licensing Information

    Licensing of origin dataset

    Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...

  7. Multi-modality medical image dataset for medical image processing in Python...

    • zenodo.org
    zip
    Updated Aug 12, 2024
    Cite
    Candace Moore; Candace Moore; Giulia Crocioni; Giulia Crocioni (2024). Multi-modality medical image dataset for medical image processing in Python lesson [Dataset]. http://doi.org/10.5281/zenodo.13305760
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Candace Moore; Candace Moore; Giulia Crocioni; Giulia Crocioni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of medical imaging files for use in the "Medical Image Processing with Python" lesson, developed by the Netherlands eScience Center.

    The dataset includes:

    1. SimpleITK compatible files: MRI T1 and CT scans (training_001_mr_T1.mha, training_001_ct.mha), digital X-ray (digital_xray.dcm in DICOM format), neuroimaging data (A1_grayT1.nrrd, A1_grayT2.nrrd). Data have been downloaded from here.
    2. MRI data: a T2-weighted image (OBJECT_phantom_T2W_TSE_Cor_14_1.nii in NIfTI-1 format). Data have been downloaded from here.
    3. Example images for the machine learning lesson: chest X-rays (rotatechest.png, other_op.png), cardiomegaly example (cardiomegaly_cc0.png).
    4. Additional anonymized data: TBA

    These files represent various medical imaging modalities and formats commonly used in clinical research and practice. They are intended for educational purposes, allowing students to practice image processing techniques, machine learning applications, and statistical analysis of medical images using Python libraries such as scikit-image, pydicom, and SimpleITK.
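
    For example, the SimpleITK-compatible and DICOM files listed above can be read as follows (a minimal sketch, assuming SimpleITK and pydicom are installed):

    ```python
    import SimpleITK as sitk
    import pydicom

    ct = sitk.ReadImage("training_001_ct.mha")      # CT volume in MetaImage format
    ct_array = sitk.GetArrayFromImage(ct)           # NumPy array, shape (slices, rows, cols)
    print(ct.GetSize(), ct.GetSpacing())

    xray = pydicom.dcmread("digital_xray.dcm")      # digital X-ray in DICOM format
    print(xray.pixel_array.shape)
    ```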

  8. Data from: Thy-Wise: An interpretable machine learning model for the...

    • figshare.com
    zip
    Updated Aug 2, 2022
    Cite
    Zhe Jin; Shufang Pei; Lizhu Ouyang; Lu Zhang; Xiaokai Mo; Qiuying Chen; Jingjing You; Luyan Chen; Bin Zhang; Shuixing Zhang (2022). Thy-Wise: An interpretable machine learning model for the evaluation of thyroid nodules [Dataset]. http://doi.org/10.6084/m9.figshare.20417895.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Zhe Jin; Shufang Pei; Lizhu Ouyang; Lu Zhang; Xiaokai Mo; Qiuying Chen; Jingjing You; Luyan Chen; Bin Zhang; Shuixing Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Description: The database contains ultrasound images of thyroid nodules that were finally included in the study. As the aim of this study was to identify nodules as benign or malignant, all nodules were placed in two zip files according to their pathological nature: benign_after.zip and malignant_after.zip. After unzipping the zip package and opening the folder, you can see several folders named by "pathological nature + number", each folder corresponds to a thyroid nodule and contains its ultrasound images collected in a single examination.
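
    A minimal sketch for iterating over the unzipped folders and reading the ultrasound images (image file extensions are assumptions; adjust to the actual contents):

    ```python
    # Illustrative loader for the benign_after / malignant_after folder structure.
    import os
    import cv2

    images, labels = [], []
    for cohort in ("benign_after", "malignant_after"):
        for nodule_dir, _, files in os.walk(cohort):
            for name in files:
                if name.lower().endswith((".png", ".jpg", ".jpeg", ".bmp")):  # assumed formats
                    img = cv2.imread(os.path.join(nodule_dir, name), cv2.IMREAD_GRAYSCALE)
                    if img is not None:
                        images.append(img)
                        labels.append(cohort.split("_")[0])  # "benign" or "malignant"
    print(len(images), "ultrasound images loaded")
    ```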

    Ethical Approval: This retrospective study was approved by the institutional Ethics Committees of the First Affiliated Hospital of Jinan University, and the requirement for informed consent was waived.

    Sensitive Information Protection: All sensitive information contained in the images, including the patient's personal information, the hospital visited, and the time of the visit, has been removed using the cv2 (OpenCV) library in Python for the purpose of anonymization.

    Processing pipeline and analysis steps: All the annotations in the images and clips were eliminated before review. US images were evaluated in a blinded fashion, with no US or pathology reports available, by two board-certified radiologists (with more than 10 years of experience in thyroid sonography) independently. Nodule size was measured as the maximal dimension on US images and the five gray-scale US categories were reviewed according to the ACR TI-RADS lexicon (5): composition, echogenicity, shape, margin, and echogenic foci. In the ACR TI-RADS, the TI-RADS risk level for nodules was determined by the total score of the five US categories, ranging from TR1 (benign) to TR5 (highly suspicious).

  9. 2025 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Jan 28, 2025
    Cite
    Hina Ismail (2025). 2025 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/sonialikhan/2025-kaggle-machine-learning-and-data-science-survey
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hina Ismail
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.

    This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!

    There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

    Challenge

    This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.

    In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

    The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

    Submissions will be evaluated on the following:

    • Composition - Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
    • Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.
    • Documentation - Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

    To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

    While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.

    How to Participate

    To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

    No submission is necessary for the Weekly Kernels Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.

    Timeline

    All dates are 11:59 PM UTC.

    Submission deadline: December 3rd

    Winners announced: December 10th

    Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th

    All kernels are evaluated after the deadline.

    Rules

    To be eligible to win a prize in either of the above prize tracks, you must be:

    • a registered account holder at Kaggle.com;
    • the older of 18 years old or the age of majority in your jurisdiction of residence; and
    • not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan.

    Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.

    Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.

    Survey Methodology ...

  10. Hyperreal Talk (Polish clear web message board) messages data

    • data.niaid.nih.gov
    Updated Mar 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek (2024). Hyperreal Talk (Polish clear web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810250
    Explore at:
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    University of Edinburgh
    Kazimierz Wielki University in Bydgoszcz
    Authors
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Hyperreal Talk (Polish clear web message board) messages data.

    2. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    3. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).

    2. Purpose

    This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

    The Hyperreal Talk forum emerges as a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and the broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.

    3. Collection Method

    The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

    4. Collection Date

    The data was collected in two periods, i.e., in September 2023 and November 2023.

    Data Content

    1. Data Description

    The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains accessible posts without needing to log in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.

    2. Data Cleaning, Processing, and Anonymization

    The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
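
    An illustrative sketch of this kind of regex-plus-hashing step (simplified assumptions only; the project's actual scripts are linked under Related Documentation below):

    ```python
    # Simplified illustration of regex-based scrubbing with irreversible hashing;
    # not the project's actual code.
    import hashlib
    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def hash_identifier(match: re.Match) -> str:
        # Replace an identifier with a short, irreversible hash
        return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]

    def anonymize(text: str) -> str:
        return EMAIL.sub(hash_identifier, text)

    print(anonymize("Contact me at someone@example.com"))
    ```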

    3. File Formats and Variables/Fields

    The dataset consists of the following files:

    Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.

    Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Similar to the first type, these files are organized into directories corresponding to the website’s folder structure.

    A .csv file that lists all the messages, including file names and the content of each post.

    Accessibility and Usage

    1. Access Conditions

    The data can be accessed without any restrictions.

    2. Related Documentation

    Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”

    Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:

    https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710

    https://github.com/HaitaoShi/Scrapy_hyperreal

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.

  11. CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jun 8, 2023
    Cite
    Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger (2023). CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry dataset from three wearable sensor study [Dataset]. http://doi.org/10.5061/dryad.p5hqbzktr
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    University of Rochester
    University of Rochester Medical Center
    Authors
    Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The CMT1A-BioStampNPoint2023 dataset provides data from a wearable sensor accelerometry study conducted for studying gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). In addition to individuals with CMT1A, the dataset also includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants, with a substantial fraction (9) of the controls also participating in the in-home study protocol. For the CMT1A participants, data is provided for 15 participants for the baseline visit and associated home recording duration and, additionally, for a subset of 12 of these participants data is also provided for a 12-month longitudinal visit and associated home recording duration. For controls, no longitudinal data is provided as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc, Lexington, MA), three of which were attached to each participant for gathering data over a roughly one-day interval. For additional details, see the description in the "README.md" included with the dataset.

    Methods

    The dataset contains wearable sensor data and clinical data. The wearable sensor data was acquired using the wearable sensors, and the clinical data was extracted from the clinical record. The sensor data has not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for the aforementioned time anonymization and for standard formatting. Additional details of the time anonymization are provided in the file "README.md" included with the dataset.

  12. Anonymized Imaging Data for Two Small Cohorts of Head and Neck Cancer...

    • figshare.com
    zip
    Updated Jan 6, 2022
    Cite
    Kareem Wahid (2022). Anonymized Imaging Data for Two Small Cohorts of Head and Neck Cancer Patients [Dataset]. http://doi.org/10.6084/m9.figshare.13525481.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 6, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kareem Wahid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Images for two separate cohorts of fifteen head and neck cancer patients diagnosed with oral or oropharyngeal squamous cell carcinoma. Images for each patient were acquired before the start of radiotherapy.

    The first cohort, labeled "heterogeneous" (HET), consisted of patients with images acquired at different institutions and is termed "heterogeneous" because of the variety of acquisition scanners and parameters used in image generation. HET cohort images include T2-weighted MRI only. The second cohort, labeled "homogeneous" (HOM), consisted of patients from a single prospective clinical trial with the same imaging protocol (NCT04265430, PA16-0302) and is termed "homogeneous" because of the consistency in both scanner and acquisition parameters used in image generation. HOM cohort images include T2-weighted MRI for all patients and Dixon T1-w Water Enhanced MRI and CT for a subset of 5 patients.

    For each image, regions of interest (ROIs) of various healthy tissue types and anatomical locations were manually contoured in the same relative area for five slices by one observer (medical student) using Velocity AI v.3.0.1 (Atlanta, GA, USA), verified by a physician expert (radiologist), and exported as DICOM-RT Structure Set files. The ROIs were: 1. cerebrospinal fluid inferior (CSF_inf), 2. cerebrospinal fluid middle (CSF_mid), 3. cerebrospinal fluid superior (CSF_sup), 4. cheek fat left (Fat_L), 5. cheek fat right (Fat_R), 6. nape fat inferior (NapeFat_inf), 7. nape fat middle (NapeFat_mid), 8. nape fat superior (NapeFat_sup), 9. neck fat (NeckFat), 10. masseter left (Masseter_L), 11. masseter right (Masseter_R), 12. rectus capitis posterior major (RCPM), 13. skull, and 14. cerebellum.

    All images were retrospectively acquired from the University of Texas MD Anderson Cancer Center clinical databases in agreement with an Institutional Review Board approved protocol designed to collect data from patients with multiple imaging acquisitions (RCR03-0800). The protocol included a waiver of informed consent.

    DICOM data was anonymized using an in-house Python script that implements the RSNA CTP DICOM Anonymizer software. All files have had any DICOM header info and metadata containing PHI removed or replaced with dummy entries.

    Note: a zipped file was uploaded; once unzipped, there should be 2 separate folders corresponding to each cohort, which can then be used as inputs to the GitHub code at: https://github.com/kwahid/MRI_Intensity_Standardization.
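
    A minimal sketch for inspecting the contoured ROIs in one of the DICOM-RT Structure Set files (the file name is hypothetical; pydicom is assumed):

    ```python
    import pydicom

    rtstruct = pydicom.dcmread("RTSTRUCT_example.dcm")  # hypothetical file name
    roi_names = [roi.ROIName for roi in rtstruct.StructureSetROISequence]
    print(roi_names)  # e.g. CSF_inf, Fat_L, Masseter_R, skull, cerebellum, ...
    ```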

  13. Crop Sprayer Drone

    • kaggle.com
    zip
    Updated Jul 21, 2025
    Cite
    Luis (2025). Crop Sprayer Drone [Dataset]. https://www.kaggle.com/datasets/lireyesc/crop-sprayer-drone
    Explore at:
    Available download formats: zip (103826093 bytes)
    Dataset updated
    Jul 21, 2025
    Authors
    Luis
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 200+ images of DJI Agras crop sprayer drones taken by private drone operators throughout South America and the Caribbean. The images are classified by generation and drone model, as shown in the table below.

    Gen | Model | Arms | Rotors | Nozzle Type | Nozzles | Images
    02 | DJI Agras T16 | 6 | 6 | Pressure (Flat fan) | 8 | 15
    02 | DJI Agras T20 | 6 | 6 | Pressure (Flat fan) | 8 | 44
    03 | DJI Agras T10 | 4 | 4 | Pressure (Flat fan) | 4 | 1
    03 | DJI Agras T30 | 6 | 6 | Pressure (Flat fan) | 16 | 75
    04 | DJI Agras T20P | 4 | 4 | Centrifugal | 2 | 8
    04 | DJI Agras T40 | 4 | 8 | Centrifugal | 2 | 67
    05 | DJI Agras T50 | 4 | 8 | Centrifugal | 2/4 | 17

    A couple of technical notes:

    • The tank size in liters is given in the model name after the letter T, e.g., the T16 has a 16-liter tank, the T30 has a 30-liter tank, and so on. An exception to this rule is the T50, which has a standard tank size of 40 liters and the option to install a 50-liter tank.
    • Each rotor is equipped with two propeller blades. Hence, the total number of propeller blades on a drone is twice the number of rotors.

    Purpose

    This dataset is obviously too small to train models from scratch, but it is ideal for testing fine-tuning or few-shot learning methods. Here are a few ideas:

    • Combine this dataset with one containing camera drones, i.e., small drones used for photography and videography (e.g., DJI Phantom, Mavic, Inspire, Matrice; Autel EVO; etc.). Fine-tune a model to distinguish crop sprayer drones from camera drones.
    • Fine-tune a model to classify drones by nozzle type: flat fan pressure nozzles (T16/T20/T10/T30) vs. centrifugal nozzles (T20P/T40/T50).
    • Fine-tune a model to classify by number of arms: 6-arm models (T16/T20/T30) vs. 4-arm models (T10/T20P/T40/T50).

    Data Provenance, Anonymization and ICC Profiles

    The majority of the images in this dataset come from WhatsApp group chat conversations and were taken with various smartphone cameras. A small number of images were taken by me using my own smartphone when I worked as a crop spraying services provider.

    To ensure anonymization, all faces and identifying information (e.g., logos, truck license plates) were blurred using Gaussian kernels.

    Additionally, during metadata cleaning, all Exif metadata (including ICC color profiles) was removed. However, all images were originally captured in sRGB or close-to-sRGB color spaces. As a result, standard image viewers (e.g., Ubuntu's default viewer) render them without visible changes. You can safely assume sRGB when loading the images.

    If you are using Python libraries such as PIL, PyTorch, or Keras, you can ensure consistent color handling by explicitly converting images to RGB mode and treating pixel values as standard 0–255 sRGB values.

    Examples of Safe Image Loading

    Using PIL (standalone)

    ```python
    import numpy as np
    from PIL import Image

    img = Image.open("path/to/image.jpg").convert("RGB")  # Force sRGB interpretation
    img_array = np.array(img) / 255.0  # Normalize if needed
    ```

    Using PyTorch with torchvision

    ```python
    import torch
    from PIL import Image
    from torchvision import transforms

    transform = transforms.Compose([
        transforms.ToTensor(),  # Converts to [0, 1] and permutes (H, W, C) to (C, H, W)
        transforms.ConvertImageDtype(torch.float),
    ])

    img = Image.open("path/to/image.jpg").convert("RGB")
    tensor = transform(img)
    ```

    Using Keras

    ```python
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    # Load image in RGB mode (do not resize unless required)
    img = load_img("path/to/image.jpg", color_mode='rgb')
    img_array = img_to_array(img) / 255.0  # Normalize if required by your model
    ```

    Contact

    For suggestions, questions, or feedback, you can reach me at luis.i.reyes.castro@gmail.com. In case you download this dataset from Kaggle, you can find the original repository here.

  14. Dopek.eu (Polish clear web and dark web message board) messages data

    • data.niaid.nih.gov
    Updated Mar 18, 2024
    Cite
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek (2024). Dopek.eu (Polish clear web and dark web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810554
    Explore at:
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    University of Edinburgh
    Kazimierz Wielki University in Bydgoszcz
    Authors
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Dopek.eu (Polish clear web and dark web message board) messages data.

    2. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    3. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Clear web and dark web message board called dopek.eu (https://dopek.eu/).

    2. Purpose

    This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

    The dopek.eu forum emerges as a pivotal online space on the Polish internet, serving as a hub for trading, discussions, and the exchange of knowledge and experiences concerning the use of the so-called new psychoactive substances (designer drugs). The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.

    3. Collection Method

    The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

    4. Collection Date

    The data was collected in October 2023.

    Data Content

    1. Data Description

    The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.

    2. Data Cleaning, Processing, and Anonymization

    The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.

    3. File Formats and Variables/Fields

    The dataset consists of the following types of files:

    Zipped .txt files (dopek.zip) containing all messages (posts).

    A .csv file that lists all the messages, including file names and the content of each post.

    Accessibility and Usage

    1. Access Conditions

    The data can be accessed without any restrictions.

    2. Related Documentation

    Attached are .txt files detailing the tree of folders for “dopek.zip”.

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was the early-stage hashing of usernames and identifiers from the posts, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.

  15. Dataset for a machine learning tool to improve lymph node staging with...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 16, 2024
    Cite
    Rogasch, Julian M.M. (2024). Dataset for a machine learning tool to improve lymph node staging with FDG-PET/CT [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7094286
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Charité-Universitätsmedizin Berlin
    Authors
    Rogasch, Julian M.M.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload provides Open Data associated with the publication "A machine learning tool to improve prediction of mediastinal lymph node metastases in non-small cell lung cancer using routinely obtainable [18F]FDG-PET/CT parameters" by Rogasch JMM et al. (2022).

    The upload contains the anonymized dataset with 10 features necessary for the final GBM model that was presented in the publication. However, the original full dataset with 40 features was excluded from this Open Data repository because it may not comply with strict rules of data anonymization. The full dataset can be obtained from the corresponding author (julian.rogasch@charite.de) upon reasonable request.

    Besides the dataset, this upload provides the original Python and R scripts that were used, as well as their output.

    A description of all files can be found in "content_description_2022_11_19.txt".

    A user-friendly web tool that implements the final machine learning model can be found here: PET_LN_calculator

  16. College Athlete Training and Performance Data

    • kaggle.com
    zip
    Updated Jul 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Python Developer (2025). College Athlete Training and Performance Data [Dataset]. https://www.kaggle.com/datasets/programmer3/college-athlete-training-and-performance-data/discussion
    Explore at:
    Available download formats: zip (25856 bytes)
    Dataset updated
    Jul 24, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 1,050 records and represents anonymized training data collected from college athletes involved in various sports programs. It includes biometric signals, physical performance metrics, recovery patterns, and personalized training feedback. Data was gathered from wearable devices, fitness logs, and observational assessments during regular training sessions.

  17. COVID-19 Recovery Dataset

    • kaggle.com
    zip
    Updated Oct 4, 2025
    Cite
    Eshaal Malik (2025). COVID-19 Recovery Dataset [Dataset]. https://www.kaggle.com/datasets/eshaalnmalik/covid-19-recovery-dataset
    Explore at:
    Available download formats: zip (1761581 bytes)
    Dataset updated
    Oct 4, 2025
    Authors
    Eshaal Malik
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Overview

    The COVID-19 Patient Recovery Dataset is a synthetic collection of anonymized records for around 70,000 COVID-19 patients. It aims to assist with classification tasks in machine learning and epidemiological research. The dataset includes detailed clinical and demographic information, such as symptoms, existing health issues, vaccination status, COVID-19 variants, treatment details, and outcomes related to recovery or mortality. This dataset is great for predicting patient recovery (recovered), mortality (death), disease severity (severity), or the need for intensive care (icu_admission) using algorithms like Logistic Regression, Random Forest, XGBoost, or Neural Networks. It also allows for exploratory data analysis (EDA), statistical modeling, and time-series studies to find patterns in COVID-19 outcomes.
    The data is synthetic and reflects realistic trends found in public health data, based on sources like WHO reports. It ensures privacy and follows ethical guidelines. Dates are provided in Excel serial format, meaning 44447 corresponds to September 8, 2021, and can be converted to standard dates using Python’s datetime or Excel. With 70,000 records and 28 columns, this dataset serves as a valuable resource for data scientists, researchers, and students interested in health-related machine learning or pandemic trends.
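
    The date conversion mentioned above takes only a couple of lines of Python:

    ```python
    # Convert the Excel serial dates used in this dataset to calendar dates.
    from datetime import datetime, timedelta

    def excel_serial_to_date(serial: int) -> datetime:
        # Excel serial dates count days from the conventional 1899-12-30 origin
        return datetime(1899, 12, 30) + timedelta(days=serial)

    print(excel_serial_to_date(44447))  # 2021-09-08, matching the example above
    ```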

    Data Source and Collection

    Source: Synthetic data based on public health patterns from sources like the World Health Organization (WHO). It includes placeholder URLs.
    Collection Period: Simulated from early 2020 to mid-2022, covering the Alpha, Delta, and Omicron waves.
    Number of Records: 70,000.
    File Format: CSV, which works with Pandas, R, Excel, and more.
    Data Quality Notes:

    About 5% of the values are missing in fields like symptoms_2, symptoms_3, treatment_given_2, and date.
    There are rare inconsistencies, such as between recovery/death flags and dates, which may need some preprocessing.
    Unique, anonymized patient IDs.

    Column Name | Data Type
    patient_id | String
    country | String
    region/state | String
    date_reported | Integer
    age | Integer
    gender | String
    comorbidities | String
    symptoms_1 | String
    symptoms_2 | String
    symptoms_3 | String
    severity | String
    hospitalized | Integer
    icu_admission | Integer
    ventilator_support | Integer
    vaccination_status | String
    variant | String
    treatment_given_1 | String
    treatment_given_2 | String
    days_to_recovery | Integer
    recovered | Integer
    death | Integer
    date_of_recovery | Integer
    date_of_death | Integer
    tests_conducted | Integer
    test_type | String
    hospital_name | String
    doctor_assigned | String
    source_url | String

    Key Column Details

    patient_id: Unique identifier (e.g., P000001).
    country: Reporting country (e.g., India, USA, Brazil, Germany, China, Pakistan, South Africa, UK).
    region/state: Sub-national region (e.g., Sindh, California, São Paulo, Beijing).
    date_reported, date_of_recovery, date_of_death: Excel serial dates (convert using datetime(1899,12,30) + timedelta(days=value)).
    age: Patient age (1–100 years).
    gender: Male or Female.
    comorbidities: Pre-existing conditions (e.g., Diabetes, Hypertension, Cancer, Heart Disease, Asthma, None).
    symptoms_1, symptoms_2, symptoms_3: Reported symptoms (e.g., Cough, Fever, Fatigue, Loss of Smell, Sore Throat, or empty).
    severity: Case severity (Mild, Moderate, Severe, Critical).
    hospitalized, icu_admission, ventilator_support: Binary (1 = Yes, 0 = No).
    vaccination_status: None, Partial, Full, or Booster.
    variant: COVID-19 variant (Omicron, Delta, Alpha).
    treatment_given_1, treatment_given_2: Treatments administered (e.g., Antibiotics, Remdesivir, Oxygen, Steroids, Paracetamol, or empty).
    days_to_recovery: Days from report to recovery (5–30, or empty if not recovered).
    recovered, death: Binary outcomes (1 = Yes, 0 = No; generally mutually exclusive).
    tests_conducted: Number of tests (1–5).
    test_type: PCR or Antigen.
    hospital_name: Fictional hospital (e.g., Aga Khan, Mayo Clinic, NHS Trust).
    doctor_assigned: Fictional doctor name (e.g., Dr. Smith, Dr. Müller).
    source_url: Placeholder.

    Summary Statistics

    Total Patients: 70,000.
    Age: Mean ~50 years, Min 1, Max 100, evenly distributed.
    Gender: ~50% Male, ~50% Female.
    Top Countries: USA (20%), India (18%), Brazil (15%), China (12%), Germany (10%).
    Comorbidities: Diabetes (25%), Hypertension (20%), Cancer (15%), Heart Disease (15%), Asthma (10%), None (15%).
    Severity: Mild (60%), Moderate (25%), Severe (10%), Critical (5%).
    Recovery Rate: ~60% recovered (recovered=1), ~30% deceased (death=1), ~10% unresolved (both 0).
    Vaccination: None (40%), Full (30%), Partial (15%), Booster (15%).
    Variants: Omicron (50%), Delt...
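
    The summary statistics above can be reproduced directly from the CSV; a short sketch, again assuming the covid_synthetic.csv file name:

    import pandas as pd

    df = pd.read_csv("covid_synthetic.csv")  # assumed file name

    print(df["age"].describe())                          # age distribution
    print(df["country"].value_counts(normalize=True))    # country shares
    print(df["severity"].value_counts(normalize=True))   # severity mix
    print(df[["recovered", "death"]].mean())             # outcome rates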

  18. Sepsis Dataset –

    • kaggle.com
    zip
    Updated May 31, 2025
    Cite
    Fatolu Peter (2025). Sepsis Dataset – [Dataset]. https://www.kaggle.com/datasets/olagokeblissman/sepsis-dataset/suggestions
    Explore at:
    zip (21559 bytes)
    Dataset updated
    May 31, 2025
    Authors
    Fatolu Peter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📝 Dataset Overview: This dataset focuses on early warning detection for sepsis, a critical and potentially fatal medical condition. It includes anonymized vital signs, lab results, and clinical indicators of patients admitted to the hospital, structured for real-time monitoring and predictive modeling.

    It’s ideal for clinical data analysts, healthcare data scientists, and AI practitioners aiming to develop decision support tools, early warning dashboards, or predictive health models.

    🔍 Dataset Features:

    Column Name | Description
    Patient_ID | Unique anonymized identifier
    Admission_Date | Patient’s hospital admission date
    Temperature_C | Body temperature in degrees Celsius
    BP_Systolic | Systolic blood pressure (mmHg)
    BP_Diastolic | Diastolic blood pressure (mmHg)
    Heart_Rate | Beats per minute
    WBC_Count | White blood cell count (x10⁹/L)
    Lactate_mmol_L | Lactate level in mmol/L
    Sepsis_Flag | Binary indicator (1 = Suspected Sepsis, 0 = Normal)
    Ward | Hospital ward/unit
    Doctor_On_Duty | Attending physician name (anonymized)

    🎯 Use Cases: Build Power BI dashboards for hospital early warning systems

    Train ML classification models to detect early signs of sepsis (a minimal sketch follows below)

    Create patient monitoring tools with Python or R

    Explore the relationship between vitals & sepsis onset

    Perform feature engineering for risk scoring systems
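
    As a sketch of the classification use case above: a simple baseline using the vitals and lab columns from the feature table. The file name sepsis.csv is an assumption.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sepsis.csv")  # assumed file name

    features = ["Temperature_C", "BP_Systolic", "BP_Diastolic",
                "Heart_Rate", "WBC_Count", "Lactate_mmol_L"]
    X, y = df[features], df["Sepsis_Flag"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # An interpretable baseline before trying heavier models.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))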

    📌 Clinical Relevance: Sepsis is one of the leading causes of in-hospital mortality worldwide. Early detection is crucial to reducing death rates and improving outcomes. This dataset empowers developers and analysts to make a meaningful impact in the healthcare sector.

    👤 Created By: Fatolu Peter (Emperor Analytics) A passionate healthcare analyst leveraging data to drive innovation in public health across Nigeria. This is Project 12 in my data-for-good series.

    ✅ LinkedIn Post: 🚨 New Dataset: Sepsis Early Warning System Data – Now on Kaggle 📊 Clinical vital signs + lab markers + sepsis risk flags 🔗 Explore the dataset here

    This dataset enables healthcare data scientists to: ✅ Build real-time hospital dashboards ✅ Predict sepsis risk with machine learning ✅ Explore vitals like BP, lactate, WBC, and temperature ✅ Support early intervention using data insights

    Whether you're into: 🧠 Predictive modeling 📈 Power BI clinical dashboards 📉 Risk analytics in healthcare This is for you.

    Join me in using data to save lives — one insight at a time. If you build something, tag me. I’ll gladly share it! 💡

    #HealthcareAnalytics #SepsisAwareness #EarlyWarningSystems #KaggleDataset #PowerBI #DataForGood #FatoluPeter #EmperorAnalytics #PublicHealth #Project12 #RealWorldData


  19. twitter-depression-v2

    • kaggle.com
    zip
    Updated Oct 23, 2025
    Cite
    Wisnu Satrio (2025). twitter-depression-v2 [Dataset]. https://www.kaggle.com/datasets/wisnusatrio/twitter-depression-v2
    Explore at:
    zip (2069607776 bytes)
    Dataset updated
    Oct 23, 2025
    Authors
    Wisnu Satrio
    Description

    Twitter Depression Detection Dataset

    Overview

    This dataset contains Twitter posts with both text and images for depression detection research.

    Dataset Statistics

    • Total samples: 32363
    • Depression samples: 2335
    • Non-depression samples: 30028
    • Total images: 32363

    File Structure

    twitter_depression_dataset/
    ├── images/
    │  ├── depresi/     # Images from depression-related posts
    │  └── nondepresi/    # Images from non-depression posts
    ├── metadata/
    │  ├── full_dataset.csv # Complete dataset with all metadata
    │  └── image_labels.csv # Simple image-label mapping
    └── dataset-metadata.json # Kaggle dataset metadata
    

    Data Fields

    • full_text: Original tweet text
    • label_text: Text label (0=non-depression, 1=depression)
    • label_image: Image label (0=non-depression, 1=depression)
    • kaggle_output_path: Path to image file in this dataset
    • created_at: Tweet creation timestamp
    • favorite_count: Number of likes
    • Additional Twitter metadata...

    Usage

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('metadata/full_dataset.csv')
    
    # Load image-label mapping
    labels = pd.read_csv('metadata/image_labels.csv')
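    
    # Hedged extension of the snippet above: assumes kaggle_output_path values
    # are relative to the dataset root and that Pillow is installed.
    from PIL import Image
    
    sample = df.iloc[0]  # first row of the metadata loaded above
    img = Image.open(sample["kaggle_output_path"])
    print(sample["label_text"], sample["full_text"][:80], img.size)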
    

    Citation

    If you use this dataset, please cite appropriately and ensure ethical use for research purposes.

    Ethical Considerations

    This dataset contains social media data. Please ensure:
    • Respectful use for research purposes
    • Proper anonymization in publications
    • Compliance with platform terms of service
    • Consideration of mental health sensitivity

  20. 🧪 Laboratory Test Results – Anonymized Dataset

    • kaggle.com
    zip
    Updated Aug 12, 2025
    Cite
    Pinar Topuz (2025). 🧪 Laboratory Test Results – Anonymized Dataset [Dataset]. https://www.kaggle.com/pinuto/laboratory-test-results-anonymized-dataset
    Explore at:
    zip (2152 bytes)
    Dataset updated
    Aug 12, 2025
    Authors
    Pinar Topuz
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🧪🌈 Laboratory Test Results – Anonymized Dataset

    💎 Where Precision Medicine Meets the Vibrance of Data Science. Unlock insights, drive innovation, and explore healthcare analytics with a colorful, interactive, and thematically bold dataset.

    🌟 Overview

    This dataset delivers fully anonymized laboratory test results with a visually rich and research-ready design. Each element—from clear unit descriptions to color-coded status flags—is crafted for maximum clarity and engagement.

    💡 Ideal For:

    • 📊 Data Analysis – Spot trends, detect anomalies.
    • 🤖 Machine Learning – Build predictive healthcare models.
    • 🎓 Education – Train students in medical data interpretation.
    • 🖥 Dashboards – Create vibrant, widget-based visual analytics.

    📂 Dataset Structure

    Format: CSV – ready to use with Python, R, Excel, Tableau, Power BI, or any BI/ML platform.

    Column | Description
    Date | Test date (YYYY-MM-DD)
    Test_Name | Laboratory test name
    Result | Measured value (numeric or qualitative)
    Unit | Measurement unit abbreviation
    Reference_Range | Official normal range
    Status | Normal / High / Low indicator (⚪🟢🔴)
    Comment | Short medical interpretation
    Min_Reference | Lower bound of reference range
    Max_Reference | Upper bound of reference range
    Unit_Description | Expanded description of the unit
    Recommended_Followup | Suggested monitoring or medical action
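
    As a small illustration of how the Status flag relates to the reference bounds, a sketch that recomputes it from Result, Min_Reference, and Max_Reference. The file name lab_results.csv is an assumption, and qualitative results are skipped.

    import pandas as pd

    df = pd.read_csv("lab_results.csv")  # assumed file name
    result = pd.to_numeric(df["Result"], errors="coerce")  # NaN for qualitative values

    def flag(value, low, high):
        """Reproduce the Normal / High / Low convention described above."""
        if pd.isna(value):
            return "Unknown"
        if value < low:
            return "Low"
        if value > high:
            return "High"
        return "Normal"

    df["Status_check"] = [
        flag(v, lo, hi)
        for v, lo, hi in zip(result, df["Min_Reference"], df["Max_Reference"])
    ]
    print(df[["Test_Name", "Result", "Status", "Status_check"]].head())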

    🧾 Common Test Units & Meanings

    Here’s what some of the common units mean in a medical context:

    • ug/L (Microgram per Liter) – Common for Ferritin; measures very small concentrations in blood.
    • % (Percentage) – Used for HbA1c to express average blood sugar over time.
    • KU/L (Kilo Unit per Liter) – For Total IgE; measures antibodies in blood.
    • mU/L (Milli Unit per Liter) – For Insulin or TSH; measures hormone activity.
    • ng/dL (Nanogram per Deciliter) – For Free T4; measures tiny amounts of thyroid hormone.
    • g/dL (Gram per Deciliter) – Common for Hemoglobin; measures hemoglobin concentration in blood.
    • 10^3/uL (Thousand per Microliter) – Used for White Blood Cell or Platelet count.
    • fL (Femtoliter) – For MCV, RDW; measures cell size.
    • mg/dL (Milligram per Deciliter) – Used for glucose, bilirubin; measures substance concentration in blood/urine.

    These units help clinicians determine how much of a substance is present and compare it with healthy reference ranges.

    🎯 Why It Stands Out

    • 🌈 Color-coded Status Flags – Instantly spot outliers.
    • 📌 Detailed Annotations – Context for every measurement.
    • 📊 Widget & Dashboard Ready – Perfect for embedding in BI tools.
    • 🔒 Privacy Assured – 100% anonymized.
    • 📚 Educational Value – Includes unit definitions and usage.

    ⚠️ Disclaimer

    This dataset is for educational and research purposes only. It is not intended for actual medical diagnosis or treatment.

    📜 License

    CC0 1.0 Public Domain Dedication – Free to use, share, remix, and adapt.

    💡 Inspiration

    Crafted to inspire data-driven healthcare solutions, this dataset empowers researchers, educators, and developers to transform raw lab results into vivid, interactive, and actionable insights.

  21. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
