24 datasets found
  1. Gender Detection & Classification - Face Dataset

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Training Data (2023). Gender Detection & Classification - Face Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/gender-detection-and-classification-image-dataset
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Gender Detection & Classification - face recognition dataset

    The dataset was created on the basis of the Face Mask Detection dataset.

    Dataset Description:

    The dataset comprises a collection of photos of people, organized into folders labeled "women" and "men." Each folder contains a significant number of images to facilitate training and testing of gender detection algorithms or models.

    The dataset contains a variety of images capturing female and male individuals from diverse backgrounds, age groups, and ethnicities.


    This labeled dataset can be utilized as training data for machine learning models, computer vision applications, and gender detection algorithms.

    💴 For Commercial Usage: Full version of the dataset includes 376 000+ photos of people, leave a request on TrainingData to buy the dataset

    Metadata for the full dataset:

    • assignment_id - unique identifier of the media file
    • worker_id - unique identifier of the person
    • age - age of the person
    • true_gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_1_extension, photo_2_extension, photo_3_extension, photo_4_extension - photo extensions in the dataset
    • photo_1_resolution, photo_2_resolution, photo_3_resolution, photo_4_resolution - photo resolution in the dataset


    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset

    Content

    The dataset is split into train and test folders. Each folder includes:

    • folders women and men - folders with images of people with the corresponding gender,
    • .csv file - contains information about the images and people in the dataset

    File with the extension .csv

    • file: link to access the file,
    • gender: gender of a person in the photo (woman/man),
    • split: classification on train and test
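A minimal sketch of how the .csv index described above might be used to separate train and test rows. The column names file, gender, and split come from the description; the sample rows and the helper name rows_by_split are hypothetical, and the real file values may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real .csv ships with the dataset.
SAMPLE_CSV = """file,gender,split
train/women/0.jpg,woman,train
train/men/0.jpg,man,train
test/women/0.jpg,woman,test
"""


def rows_by_split(csv_text):
    """Group (file, gender) pairs by the value of the 'split' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["split"]].append((row["file"], row["gender"]))
    return dict(groups)


splits = rows_by_split(SAMPLE_CSV)
print({name: len(rows) for name, rows in splits.items()})
```

Reading the split from the .csv rather than from folder names keeps the partition explicit even if the images are moved.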

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, gender detection, supervised learning dataset, gender classification dataset, gender recognition dataset

  2. Hype and Diversity - PubMed dataset

    • aws-databank-alb.library.illinois.edu
    • databank.illinois.edu
    Updated May 28, 2025
    Cite
    Apratim Mishra (2025). Hype and Diversity - PubMed dataset [Dataset]. http://doi.org/10.13012/B2IDB-5692759_V1
    Dataset updated
    May 28, 2025
    Authors
    Apratim Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset captures ‘Hype’ and ‘Diversity’, including article-level (pmid) and author-level (auid) data within biomedical abstracts sourced from PubMed. The selection consists of ‘journal articles’ written in English and published between 1991 and 2014, totaling 421,580 (merged_df). The classification of hype relies on the presence of specific candidate ‘hype words’ and their location in the abstract. Therefore, each article (PMID) might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences. Diversity is classified for ethnicity, gender, academic age, and topical expertise for authors based on the Rao-Stirling Diversity index.

    File1: merged_auids.csv (Important columns defined)

    • AUID: a unique ID for each author
    • Genni: gender prediction
    • Ethnea: ethnicity prediction

    File2: merged_df.csv (Important columns defined)

    • pmid: unique paper
    • auid: all unique auids (author-name unique identification)
    • year: Year of paper publication
    • no_authors: Author count
    • journal: Journal name
    • years: first year of publication for every author
    • Country-temporal: Country of affiliation for every author
    • h_index: Journal h-index
    • TimeNovelty: Paper Time novelty
    • nih_funded: Binary variable indicating funding for any author
    • prior_cites_mean: Mean of all authors’ prior citation rate
    • insti_impact: All unique institutions’ citation rate
    • mesh_vals: Top MeSH values for every author of that paper
    • hype_word: Candidate hype word, such as ‘novel’
    • hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location
    • hype_percentile: Abstract relative position of hype word
    • relative_citation_ratio: RCR
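For orientation, the diversity index named in the description can be sketched in its generic form: the sum of p_i * p_j * d_ij over pairs of distinct categories, where p is the share of each category and d is a distance between categories. The description does not state which distance measure or normalization the dataset uses, so the function name rao_stirling, the example shares, and the constant distance below are assumptions. Some formulations sum over ordered pairs, which doubles the value.

```python
from itertools import combinations


def rao_stirling(proportions, distance):
    """Sum p_i * p_j * d_ij over unordered pairs of distinct categories.

    proportions: dict mapping category -> share of the group (shares sum to 1).
    distance: function (category_a, category_b) -> dissimilarity in [0, 1].
    """
    total = 0.0
    for a, b in combinations(sorted(proportions), 2):
        total += proportions[a] * proportions[b] * distance(a, b)
    return total


# Hypothetical author team: two categories, maximally distant from each other.
shares = {"female": 0.5, "male": 0.5}
print(rao_stirling(shares, lambda a, b: 1.0))  # 0.25
```

A group with a single category scores 0.0 because there are no pairs of distinct categories to sum over.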

  3. Intersectional Lens on Leaders_Study_Rawdata.csv

    • psycharchives.org
    Updated Oct 6, 2022
    Cite
    (2022). Intersectional Lens on Leaders_Study_Rawdata.csv [Dataset]. https://www.psycharchives.org/handle/20.500.12034/7527
    Dataset updated
    Oct 6, 2022
    License

    https://doi.org/10.23668/psycharchives.4988

    Description

    Younger men and especially younger women are excluded from leadership roles or obstructed from succeeding in these positions by facing backlash. Our project aims to build a more gender-specific understanding of the backlash that younger individuals in leadership positions face. We predict an interactive backlash for younger women and younger men that is rooted in intersectional stereotypes compared to the stereotypes based on single demographic categories (i.e., age or gender stereotypes). To test our hypotheses, we collect data from a heterogeneous sample (N = 900) of U.S. citizens between 25 and 69 years. We conduct an experimental online study with a between-participant design to examine the backlash against younger women and younger men.

    Dataset for: Daldrop, C., Buengeler, C., & Homan, A. C. (2022). An Intersectional Lens on Leadership: Prescriptive Stereotypes towards Younger Women and Younger Men and their Effect on Leadership Perception. PsychArchives. https://doi.org/10.23668/psycharchives.5404

    Dataset for: Daldrop, C., Buengeler, C., & Homan, A. C. (2023). An intersectional lens on young leaders: bias toward young women and young men in leadership positions. In Frontiers in Psychology (Vol. 14). Frontiers Media SA. https://doi.org/10.3389/fpsyg.2023.120454

    Research has recognized age biases against young leaders, yet understanding of how gender, the most frequently studied demographic leader characteristic, influences this bias remains limited. In this study, we examine the gender-specific age bias toward young female and young male leaders through an intersectional lens. By integrating intersectionality theory with insights on status beliefs associated with age and gender, we test whether young female and male leaders face an interactive rather than an additive form of bias. We conducted two preregistered experimental studies (N1 = 918 and N2 = 985), where participants evaluated leaders based on age, gender, or a combination of both. Our analysis reveals a negative age bias in leader status ascriptions toward young leaders compared to middle-aged and older leaders. This bias persists when gender information is added, as demonstrated in both intersectional categories of young female and young male leaders. This bias pattern does not extend to middle-aged or older female and male leaders, thereby supporting the age bias against young leaders specifically. Interestingly, we also examined whether social dominance orientation strengthens the bias against young (male) leaders, but our results (reported in the SOM) are not as hypothesized. In sum, our results emphasize the importance of young age as a crucial demographic characteristic in leadership perceptions that can even overshadow the role of gender.

    Raw Data File

  4. Smartwatch Purchase Data

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Aayush Chourasiya (2022). Smartwatch Purchase Data [Dataset]. https://www.kaggle.com/datasets/albedo0/smartwatch-purchase-data/versions/2
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aayush Chourasiya
    Description

    Disclaimer: This is artificially generated data, produced by a Python script based on the arbitrary assumptions listed below.

    The data consists of 100,000 examples of training data and 10,000 examples of test data, each representing a user who may or may not buy a smart watch.

    ----- Version 1 -------

    trainingDataV1.csv, testDataV1.csv or trainingData.csv, testData.csv

    The data includes the following features for each user:

    1. age: The age of the user (integer, 18-70)
    2. income: The income of the user (integer, 25,000-200,000)
    3. gender: The gender of the user (string, "male" or "female")
    4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
    5. hour: The hour of the day (integer, 0-23)
    6. weekend: A boolean indicating whether it is the weekend (True or False)

    The data also includes a label for each user indicating whether they are likely to buy a smart watch or not (string, "yes" or "no"). The label is determined based on the following arbitrary conditions, checked in order:

    - If the user is divorced and a random number generated by the script is less than 0.4, the label is "no" (i.e., assuming 40% of divorcees are not likely to buy a smart watch).
    - If it is the weekend and a random number generated by the script is less than 1.3, the label is "yes" (i.e., assuming sales are 30% more likely to occur on weekends). Because random.random() returns values in [0.0, 1.0), this condition always holds once a weekend row reaches this check.
    - If the user is male and under 30 with an income over 75,000, the label is "yes".
    - If the user is female and 30 or over with an income over 100,000, the label is "yes".
    - Otherwise, the label is "no".

    The training data is intended to be used to build and train a classification model, and the test data is intended to be used to evaluate the performance of the trained model.

    The following Python script was used to generate this dataset:

    import random
    import csv
    
    # Set the number of examples to generate
    numExamples = 100000
    
    # Generate the training data
    with open("trainingData.csv", "w", newline="") as csvfile:
      fieldnames = ["age", "income", "gender", "maritalStatus", "hour", "weekend", "buySmartWatch"]
      writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
      writer.writeheader()
    
      for i in range(numExamples):
        age = random.randint(18, 70)
        income = random.randint(25000, 200000)
        gender = random.choice(["male", "female"])
        maritalStatus = random.choice(["single", "married", "divorced"])
        hour = random.randint(0, 23)
        weekend = random.choice([True, False])
    
        # Randomly assign the label based on some arbitrary conditions
        # assuming 40% of divorcees won't buy a smart watch
        if maritalStatus == "divorced" and random.random() < 0.4:
          buySmartWatch = "no"
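        # assuming sales are 30% more likely to occur on weekends.
        # Note: random.random() returns a float in [0.0, 1.0), so "< 1.3" is always
        # true and every weekend row that reaches this check is labeled "yes".
        elif weekend == True and random.random() < 1.3: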
          buySmartWatch = "yes"
        elif gender == "male" and age < 30 and income > 75000:
          buySmartWatch = "yes"
        elif gender == "female" and age >= 30 and income > 100000:
          buySmartWatch = "yes"
        else:
          buySmartWatch = "no"
    
        writer.writerow({
          "age": age,
          "income": income,
          "gender": gender,
          "maritalStatus": maritalStatus,
          "hour": hour,
          "weekend": weekend,
          "buySmartWatch": buySmartWatch
        })
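As a hedged illustration of how the labeling branches in the script above behave, the sketch below restates them as a standalone function and samples weekend rows. The function name label and the seeded random.Random instance are assumptions added for reproducibility; they are not part of the original script.

```python
import random


def label(marital_status, weekend, gender, age, income, rng):
    """Restate the Version 1 labeling branches; rng is a random.Random instance."""
    if marital_status == "divorced" and rng.random() < 0.4:
        return "no"
    if weekend and rng.random() < 1.3:  # always true: rng.random() is below 1.0
        return "yes"
    if gender == "male" and age < 30 and income > 75000:
        return "yes"
    if gender == "female" and age >= 30 and income > 100000:
        return "yes"
    return "no"


rng = random.Random(0)  # seeded only so the sketch is repeatable
weekend_labels = {label("single", True, "female", 25, 30000, rng) for _ in range(1000)}
print(weekend_labels)  # {'yes'}
```

Every non-divorced weekend row comes out as "yes", regardless of age, gender, or income, which is worth keeping in mind when interpreting a model trained on this data.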
    

    ----- Version 2 -------

    trainingDataV2.csv, testDataV2.csv

    The data includes the following features for each user:

    1. age: The age of the user (integer, 18-70)
    2. income: The income of the user (integer, 25,000-200,000)
    3. gender: The gender of the user (string, "male" or "female")
    4. maritalStatus: The marital status of the user (string, "single", "married", or "divorced")
    5. educationLevel: The education level of the user (string, "high school", "associate's degree", "bachelor's degree", "master's degree", or "doctorate")
    6. occupation: The occupation of the user (string, "tech worker", "manager", "executive", "sales", "customer service", "creative", "manual labor", "healthcare", "education", "government", "unemployed", or "student")
    7. familySize: The number of people in the user's family (integer, 1-5)
    8. fitnessInterest: A boolean indicating whether the user is interested in fitness (True or False)
    9. priorSmartwatchOwnership: A boolean indicating whether the user has owned a smartwatch in the past (True or False)
    10. hour: The hour of the day when the user was surveyed (integer, 0-23)
    11. weekend: A boolean indicating whether the user was surveyed on a weekend (True or False)
    12. buySmartWatch: A boolean indicating whether the user purchased a smartwatch (True or False)

    Python script used to generate the data:

    import random
    import csv
    
    # Set the number of examples to generate
    numExamples = 100000
    
    with open("t...
    
  5. Data_Sheet_2_You’re Prettier When You Smile: Construction and Validation of...

    • frontiersin.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Mona Algner; Timo Lorenz (2023). Data_Sheet_2_You’re Prettier When You Smile: Construction and Validation of a Questionnaire to Assess Microaggressions Against Women in the Workplace.CSV [Dataset]. http://doi.org/10.3389/fpsyg.2022.809862.s002
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Mona Algner; Timo Lorenz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gender microaggressions, especially their subtler forms, microinsults and microinvalidations, are by definition hard to discern. We aim to construct and validate a scale reflecting two facets of the microaggression taxonomy: microinsults and microinvalidations toward women in the workplace, the MIMI-16. Two studies were conducted (N1 = 500, N2 = 612). Using a genetic algorithm, a 16-item scale was developed and consequently validated via confirmatory factor analyses (CFA) in three separate validation samples. Correlational analyses with organizational outcome measures were performed. The MIMI-16 exhibits good model fit in all validation samples (CFI = 0.936–0.960, TLI = 0.926–0.954, RMSEA = 0.046–0.062, SRMR = 0.042–0.049). Multigroup-CFA suggested strict measurement invariance between all validation samples. Correlations were as expected and indicate internal and external validity. Scholars on gender microaggressions have mostly used qualitative research. With the newly developed MIMI-16 we provide a reliable and valid quantitative instrument to measure gender microaggressions in the workplace.

  6. Depression prediction dataset based on online medical consultation

    • scidb.cn
    Updated Mar 13, 2023
    Cite
    Nie Hui (2023). Depression prediction dataset based on online medical consultation [Dataset]. http://doi.org/10.57760/sciencedb.07706
    Dataset updated
    Mar 13, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Nie Hui
    Description

    The relevant features of the LIWC psychological dictionary are extracted from the consultation text after preprocessing the depression consultation data collected from the online consultation platform.

    • File name: DepressionLevelPrediction-LIWC-Processed.csv
    • Creation time: 2022-12-20
    • Function: explore the relationship between LIWC-based features and depression
    • Data volume: 3859
    • Data format: utf8

    Field description:

    • ID: consultation record code
    • Depression: degree of depression (3: severe; 2: moderate; 1: mild; 0: undiagnosed)
    • Age: age
    • Gender: gender (1: male; 0: female)
    • Region: region (temporarily unused)
    • Identity: identity (temporarily unused)
    • Socialize: sociality
    • Emotion: emotion
    • Cognition: cognition
    • Perception: perception
    • Physiology: physiology
    • Gains or losses

  7. Replication Data for 'Gender (im)balance in the Russian cinema: on the...

    • search.dataone.org
    Updated Sep 24, 2024
    Cite
    Leontyeva, Xenia (2024). Replication Data for 'Gender (im)balance in the Russian cinema: on the screen and behind the camera' [Dataset]. http://doi.org/10.7910/DVN/ISVTB4
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Leontyeva, Xenia
    Description

    There are two CSV datasets in this publication used initially in the master thesis in sociology of Xenia Leontyeva at HSE University Saint Petersburg, titled "Popularity Factors of Domestic Films: Gender Characteristics and State Support Measures" (2022), and lately for the article by Leontyeva, Xenia, Olessia Koltsova, and Deb Verhoeven, titled "Gender (Im)Balance in Russian Cinema: On the Screen and behind the Camera" (Accepted in January 2024 in The Journal of Cultural Analytics). The first dataset (N=1285) includes all Russian films produced between 2008 and 2019 and theatrically released between December 1, 2008, and December 31, 2019. Distribution statistics cover the territory of the CIS, of which the Russian Federation is the biggest market. Budget information is available for 644 films. The second dataset contains the Bechdel-Wallace test modified by Leontyeva markup for 243 films, 193 of which have budget information. There is also a supplement with a detailed description of all variables and R-code producing tables, plots, and models for the article. The database was collected by Xenia Leontyeva while working at Nevafilm Research (until 2018) and later. In terms of distribution data, it is based on sources such as the open base Russian Cinema Fund Analytics – RCFA (since 2015), the closed base comScore/Rentrak ("International Box Office Essential") serving major Hollywood studios (data from it has been used since 2008 to fill gaps in open databases), Bookers' Bulletin (since 2011), and Russian Film Business Today magazines (since 2004), as well as self-collected by Nevafilm Research employees from film distributors and producers; the rights to use and continue this dataset have been received from Nevafilm company. In terms of production data, the information was taken from the State register of film distribution certificates, Kinopoisk.ru, and from the films' credits.

  8. Data_Sheet_2_Action Sounds Informing Own Body Perception Influence Gender...

    • frontiersin.figshare.com
    txt
    Updated Jun 6, 2023
    Cite
    Sünje Clausen; Ana Tajadura-Jiménez; Christian P. Janssen; Nadia Bianchi-Berthouze (2023). Data_Sheet_2_Action Sounds Informing Own Body Perception Influence Gender Identity and Social Cognition.csv [Dataset]. http://doi.org/10.3389/fnhum.2021.688170.s002
    Available download formats: txt
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Sünje Clausen; Ana Tajadura-Jiménez; Christian P. Janssen; Nadia Bianchi-Berthouze
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sensory information can temporarily affect mental body representations. For example, in Virtual Reality (VR), visually swapping into a body with another sex can temporarily alter perceived gender identity. Outside of VR, real-time auditory changes to walkers’ footstep sounds can affect perceived body weight and masculinity/femininity. Here, we investigate whether altered footstep sounds also impact gender identity and relation to gender groups. In two experiments, cisgender participants (26 females, 26 males) walked with headphones which played altered versions of their own footstep sounds that sounded more typically male or female. Baseline and post-intervention measures quantified gender identity [Implicit Association Test (IAT)], relation to gender groups [Inclusion of the Other-in-the-Self (IOS)], and perceived masculinity/femininity. Results show that females felt more feminine and closer to the group of women (IOS) directly after walking with feminine sounding footsteps. Similarly, males felt more feminine after walking with feminine sounding footsteps and associated themselves relatively stronger with “female” (IAT). The findings suggest that gender identity is temporarily malleable through auditory-induced own body illusions. Furthermore, they provide evidence for a connection between body perception and an abstract representation of the Self, supporting the theory that bodily illusions affect social cognition through changes in the self-concept.

  9. Regression analysis in Galaxy with car purchase price prediction dataset

    • zenodo.org
    • data.niaid.nih.gov
    tsv
    Updated Aug 4, 2022
    Cite
    Kaivan Kamali; Kaivan Kamali (2022). Regression analysis in Galaxy with car purchase price prediction dataset [Dataset]. http://doi.org/10.5281/zenodo.4660497
    Available download formats: tsv
    Dataset updated
    Aug 4, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kaivan Kamali; Kaivan Kamali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source/Credit: Michael Grogan
    https://github.com/MGCodesandStats
    https://github.com/MGCodesandStats/datasets/blob/master/cars.csv

    Sample dataset for regression analysis. Given 5 attributes (age, gender, miles driven per day, debt, and income) predict how much someone will spend on purchasing a car. All 5 of the input attributes have been scaled to be in 0 to 1 range. Training set has 723 training examples. Test set has 242 test examples.
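The description says the inputs were scaled to the 0 to 1 range but does not say how. One common way to do that is min-max scaling, sketched below as an assumption rather than as the method actually used; the function name min_max_scale and the sample ages are hypothetical.

```python
def min_max_scale(values):
    """Linearly map values so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 30, 42, 66]  # hypothetical raw values
print(min_max_scale(ages))  # [0.0, 0.25, 0.5, 1.0]
```

When scaling a test set this way, the minimum and maximum would normally be taken from the training set so that both sets share the same mapping.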

    This dataset will be used in an upcoming Galaxy Training Network tutorial (https://training.galaxyproject.org/training-material/topics/statistics/) on use of feedforward neural networks for regression analysis.

  10. Future medical event

    • kaggle.com
    Updated Jul 24, 2018
    Cite
    Mohit Chaturvedi (2018). Future medical event [Dataset]. https://www.kaggle.com/tango911/future-medical-event/metadata
    Dataset updated
    Jul 24, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohit Chaturvedi
    Description

    Problem Statement

    Insurance Plus++, a premium payer, wants to use predictive modeling on healthcare data to predict the occurrence of future events among their covered patients. They want to use existing data about their patients’ previous medical events to predict future events in their patient journey. Events are recorded in the standardized ICD-9 format (details here). In this challenge, the goal is to predict the next 10 events in 2014 for each patient in order of occurrence.

    Data Description

    There are three files available for download: train.csv, test.csv and sample_submission.csv

    The “train.csv” file contains historical patient information from Jan 2011 to Dec 2013. The “test.csv” file contains a list of Patient IDs for which we aim to predict the next 10 events in the year 2014. Event codes should be considered to be categorical in nature, not continuous.

    Variable descriptions:

    • UID: Unique Patient ID
    • Age: Age of the patient
    • Gender: Gender of the patient
    • Date: Date of Event
    • Event_Code: Event Code (ICD-9 format, the target variable of this challenge)
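Because the task is to predict events in order of occurrence, a natural first step is to turn the rows into one chronologically ordered sequence per patient. The sketch below shows that step only. The column names come from the variable list above, while the sample rows, the date format, and the function name event_sequences are assumptions.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical rows; the real train.csv may format dates differently.
SAMPLE = """UID,Age,Gender,Date,Event_Code
P1,54,M,2013-03-01,4019
P1,54,M,2011-07-15,25000
P2,37,F,2012-01-09,V7231
"""


def event_sequences(csv_text, date_format="%Y-%m-%d"):
    """Return {UID: [Event_Code, ...]} with each patient's events sorted by date."""
    per_patient = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        when = datetime.strptime(row["Date"], date_format)
        # Event codes stay strings because they are categorical, not numeric.
        per_patient[row["UID"]].append((when, row["Event_Code"]))
    return {uid: [code for _, code in sorted(events)]
            for uid, events in per_patient.items()}


print(event_sequences(SAMPLE))
```

Keeping the codes as strings avoids accidentally treating values such as V7231 or 25000 as continuous quantities.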

  11. Heart Attack Dataset

    • data.mendeley.com
    Updated Nov 23, 2022
    Cite
    Tarik A. Rashid (2022). Heart Attack Dataset [Dataset]. http://doi.org/10.17632/wmhctcrt5v.1
    Dataset updated
    Nov 23, 2022
    Authors
    Tarik A. Rashid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The heart attack datasets were collected at Zheen hospital in Erbil, Iraq, from January 2019 to May 2019. The attributes of this dataset are: age, gender, heart rate, systolic blood pressure, diastolic blood pressure, blood sugar, ck-mb and troponin with negative or positive output. According to the provided information, the medical dataset classifies either heart attack or none. The gender column in the data is normalized: the male is set to 1 and the female to 0. The glucose column is set to 1 if it is > 120; otherwise, 0. As for the output, positive is set to 1 and negative to 0.
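The encoding rules stated in the description can be sketched as a small function. The raw input labels ("male"/"female", "positive"/"negative") and the function name encode_record are assumptions about how the unencoded values might look; only the 1/0 mapping and the 120 threshold come from the description.

```python
def encode_record(gender, blood_sugar, outcome):
    """Apply the 1/0 encodings described for gender, glucose, and output."""
    return {
        "gender": 1 if gender == "male" else 0,
        "glucose": 1 if blood_sugar > 120 else 0,  # strictly greater than 120
        "output": 1 if outcome == "positive" else 0,
    }


print(encode_record("male", 160, "negative"))
# {'gender': 1, 'glucose': 1, 'output': 0}
```

A blood sugar value of exactly 120 maps to 0, since the rule is a strict inequality.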

  12. Diabetes Dataset

    • data.mendeley.com
    Updated Jul 18, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ahlam Rashid (2020). Diabetes Dataset [Dataset]. http://doi.org/10.17632/wj9rwkp9c2.1
    Dataset updated
    Jul 18, 2020
    Authors
    Ahlam Rashid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The construction of the diabetes dataset is explained here. The data were collected from the Iraqi society; they were acquired from the laboratory of Medical City Hospital and the Specialized Center for Endocrinology and Diabetes, Al-Kindy Teaching Hospital. Patients' files were taken, and data were extracted from them and entered into the database to construct the diabetes dataset. The data consist of medical information and laboratory analysis. The attributes initially entered into the system are: No. of Patient, Sugar Level Blood, Age, Gender, Creatinine ratio (Cr), Body Mass Index (BMI), Urea, Cholesterol (Chol), Fasting lipid profile (including total, LDL, VLDL, Triglycerides (TG), and HDL Cholesterol), HBA1C, and Class (the patient's diabetes disease class may be Diabetic, Non-Diabetic, or Predict-Diabetic).

  13. Data_Sheet_2_Construction and Verification of a Predictive Model for Risk...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Jun 14, 2023
    Cite
    Yaqiong He; Peng Liu; Leyun Xie; Saizhen Zeng; Huashan Lin; Bing Zhang; Jianbin Liu (2023). Data_Sheet_2_Construction and Verification of a Predictive Model for Risk Factors in Children With Severe Adenoviral Pneumonia.CSV [Dataset]. http://doi.org/10.3389/fped.2022.874822.s002
    Available download formats: txt
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers
    Authors
    Yaqiong He; Peng Liu; Leyun Xie; Saizhen Zeng; Huashan Lin; Bing Zhang; Jianbin Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: To construct and validate a predictive model for risk factors in children with severe adenoviral pneumonia based on chest low-dose CT imaging and clinical features.

    Methods: A total of 177 patients with adenoviral pneumonia who underwent low-dose CT examination were collected between January 2019 and August 2019. According to the assessment criteria for severe pneumonia, patients were divided into a mild group (N = 125) and a severe group (N = 52). All cases were divided into a training cohort (N = 125) and a validation cohort (N = 52). We constructed a prediction model by drawing a nomogram and verified the predictive efficacy of the model through the ROC curve, calibration curve and decision curve analysis.

    Results: The difference was statistically significant (P < 0.05) between the mild adenovirus pneumonia group and the severe adenovirus pneumonia group in gender, age, weight, body temperature, L/N ratio, LDH, ALT, AST, CK-MB, ADV DNA, bronchial inflation sign, emphysema, ground glass sign, bronchial wall thickening, bronchiectasis, pleural effusion, consolidation score, and lobular inflammation score. Multivariate logistic regression analysis showed that gender, LDH value, emphysema, consolidation score, and lobular inflammation score were independent risk factors for severe adenovirus pneumonia in children. Logistic regression was employed to construct a clinical model, an imaging semantic feature model, and a combined model. The AUC values of the training sets of the three models were 0.85 (0.77–0.94), 0.83 (0.75–0.91), and 0.91 (0.85–0.97), respectively. The AUC values of the validation sets were 0.77 (0.64–0.91), 0.83 (0.71–0.94), and 0.85 (0.73–0.96), respectively. The calibration curves of the three models showed good fit. The clinical decision curve analysis demonstrates the clinical application value of the nomogram prediction model.

    Conclusion: The prediction model based on chest low-dose CT image characteristics and clinical characteristics has relatively clear predictive value in distinguishing mild adenovirus pneumonia from severe adenovirus pneumonia in children and might provide a new method for early clinical prediction of the outcome of adenovirus pneumonia in children.

  14. DataSheet1_Putting external validation performance of major bleeding risk...

    • frontiersin.figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    Clair Blacketer; Jenna M. Reps; Lu Wang; Patrick B. Ryan; Zhong Yuan (2023). DataSheet1_Putting external validation performance of major bleeding risk models into context.CSV [Dataset]. http://doi.org/10.3389/fdsfr.2022.1034677.s001
    Available download formats: txt
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Clair Blacketer; Jenna M. Reps; Lu Wang; Patrick B. Ryan; Zhong Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When developing predictive models, model simplicity and performance often need to be balanced. We propose a novel methodology to put the performance of the bleeding risk prediction models ORBIT, ATRIA, HAS-BLED, CHADS2, and CHA2DS2-VASc into perspective. Instead of comparing the existing models’ performance against the 0.5–1 AUROC scale, we suggest estimating a prediction-task-specific AUROC scale, bounded by a lower bound AUROC (lbAUROC) and an upper bound AUROC (ubAUROC), to help assess the balance between model simplicity and performance and determine whether more complex models could significantly improve the ability to predict the outcome. We validate the existing bleeding risk prediction models by applying them to a cohort of new users of warfarin and a cohort of new users of direct oral anticoagulants (DOACs) separately, across a set of four observational databases. Then, we develop the lbAUROC-ubAUROC scale by using the validation data to train regularized logistic regression models. The internal validation AUROC of the model that includes only age and gender variables was used to estimate the lbAUROC. The internal validation AUROC of the model that includes thousands of candidate variables was used to estimate the ubAUROC. The age and gender only models achieved AUROCs between 0.50 and 0.56 (lower bound), and the large-scale models achieved AUROCs between 0.67 and 0.72 and between 0.70 and 0.77 (upper bound) within the target cohorts of warfarin new users and DOAC new users, respectively. The AUROCs of the existing bleeding risk prediction models fall between the lower and upper bounds. Our study showed that this context of the predictability of the outcome is essential when evaluating risk prediction models to be administered in actual practice.
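
    The lbAUROC-ubAUROC idea can be illustrated with a small sketch. The `auroc` function uses the standard pairwise (Mann-Whitney) formulation. The `relative_position` helper is hypothetical and not taken from the paper: it assumes a simple linear placement of a model's AUROC between the two bounds. The bounds used in the usage note come from the ranges quoted above, while the model AUROC of 0.63 is an invented value for illustration only.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_position(model_auroc, lb_auroc, ub_auroc):
    """Hypothetical helper: linear position of a model's AUROC on the
    lbAUROC-ubAUROC scale (0 = lower bound, 1 = upper bound)."""
    if ub_auroc <= lb_auroc:
        raise ValueError("upper bound must exceed lower bound")
    return (model_auroc - lb_auroc) / (ub_auroc - lb_auroc)
```

    For instance, `relative_position(0.63, 0.56, 0.70)` places an assumed model AUROC of 0.63 about halfway between a lower bound of 0.56 and an upper bound of 0.70, which reads quite differently from its position on the generic 0.5–1 scale.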

  15. Janatahack cross-sell prediction

    • kaggle.com
    Updated Sep 13, 2020
    Cite
    Saurav Mishra (2020). Janatahack cross-sell prediction [Dataset]. https://www.kaggle.com/msaurav/janatahack-crosssell-prediction
    Dataset updated
    Sep 13, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saurav Mishra
    Description

    Context

    Your client is an insurance company that has provided Health Insurance to its customers. Now they need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company.

    An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

    For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalized in that year, the insurance provider company will bear the cost of hospitalization, etc. for up to Rs. 200,000. Now if you are wondering how the company can bear such a high hospitalization cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers who would be paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would get hospitalized that year and not everyone. This way everyone shares the risk of everyone else.
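
    The pooling arithmetic in the example above can be checked with a short sketch. The premium, cover, and customer count come from the text. Paying every claim at the full cover is an assumption made here as a worst case, since the cover is only an upper limit on each payout.

```python
PREMIUM = 5_000      # Rs. per customer per year (from the text)
COVER = 200_000      # Rs. maximum payout per hospitalized customer (from the text)
CUSTOMERS = 100      # size of the pool (from the text)


def pool_summary(n_claims, avg_claim):
    """Return (premiums collected, claims paid, surplus) for one year."""
    collected = PREMIUM * CUSTOMERS
    paid = n_claims * avg_claim
    return collected, paid, collected - paid


# Worst-case assumption: each claim is paid at the full cover.
for n_claims in (2, 3):
    collected, paid, surplus = pool_summary(n_claims, COVER)
    print(f"{n_claims} claims: collected={collected}, paid={paid}, surplus={surplus}")

# Average claim size at which 3 claims exactly use up the pooled premiums.
break_even_avg_claim = PREMIUM * CUSTOMERS / 3
print(f"break-even average claim for 3 claims: {break_even_avg_claim:.0f}")
```

    Under these assumptions the pooled premiums of Rs. 500,000 cover two claims at the full limit. Three claims are covered only if the average claim stays below roughly Rs. 166,667, so the illustration depends on most payouts being smaller than the maximum cover.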

    Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider company so that in case of an unfortunate accident involving the vehicle, the insurance provider company will provide compensation (called ‘sum assured’) to the customer.

    Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.

    Now, in order to predict whether the customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (Vehicle Age, Damage), policy (Premium, sourcing channel), etc.

    Content

    Data Description

    • id - Unique ID for the customer
    • Gender - Gender of the customer
    • Age - Age of the customer
    • Driving_License - 0: Customer does not have DL, 1: Customer already has DL
    • Region_Code - Unique code for the region of the customer
    • Previously_Insured - 1: Customer already has Vehicle Insurance, 0: Customer doesn't have Vehicle Insurance
    • Vehicle_Age - Age of the Vehicle
    • Vehicle_Damage - 1: Customer got his/her vehicle damaged in the past, 0: Customer didn't get his/her vehicle damaged in the past
    • Annual_Premium - The amount customer needs to pay as premium in the year
    • Policy_Sales_Channel - Anonymised code for the channel of outreaching to the customer, i.e. Different Agents, Over Mail, Over Phone, In Person, etc.
    • Vintage - Number of days the customer has been associated with the company
    • Response - 1: Customer is interested, 0: Customer is not interested

    Files

    • test.csv - test data
    • train.csv - train data
    • sample_submission_iA3afxn.csv - sample of submission file

    Acknowledgements

    Analytics Vidhya Competition

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  16. Data Sheet 1_Global burden and trends of self-harm from 1990 to 2021, with...

    • frontiersin.figshare.com
    csv
    Updated May 14, 2025
    Cite
    Li Xie; Liangchen Tang; Yixin Liu; Zhenchao Dong; Xiaojun Zhang (2025). Data Sheet 1_Global burden and trends of self-harm from 1990 to 2021, with predictions to 2050.csv [Dataset]. http://doi.org/10.3389/fpubh.2025.1571579.s002
    Available download formats: csv
    Dataset updated
    May 14, 2025
    Dataset provided by
    Frontiers
    Authors
    Li Xie; Liangchen Tang; Yixin Liu; Zhenchao Dong; Xiaojun Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Self-harm has become a major public health problem globally. Data on the burden of self-harm in this study were taken from the GBD 2021. This study aimed to quantify historical trends (1990–2021) in the global burden of self-harm across genders, age groups, and regions, and to project future changes (2022–2050) through Bayesian forecasting models. Methods: Based on the seven GBD super-regions, the burden of self-harm was analyzed by region, age, and gender from 1990 to 2021. A hierarchical statistical approach was used to predict trends in global and regional changes in the burden of self-harm for 2022–2050. Results: In 2021, the global DALYs and death counts from self-harm were 33.5 million (95% UI: 31.3–35.8) and 746.4 thousand (95% UI: 691.8–799.8), respectively. The region with the highest number of DALYs and deaths was South Asia, and the highest age-standardized rates of DALYs and mortality were in central Europe, eastern Europe, and central Asia. Globally, the burden of self-harm was higher for males than for females. DALY rates were highest among adolescents and young adults (20–29 years), whereas mortality rates showed a predominantly age-progressive pattern with the highest burden observed in middle-aged and older populations, albeit with a modest decline in the oldest age groups. Forecasting models showed a sustained decline in the global burden of self-harm from 2022 to 2050. Conclusion: The results highlight the need for policymakers to allocate resources to high-burden regions (e.g., South Asia and Eastern Europe), to implement gender- and age-specific prevention programs, and to strengthen cross-sectoral collaboration to address the underlying social determinants of self-harm. The findings call for strengthened mental health services and targeted interventions to effectively respond to and reduce the devastating impact of self-harm on individuals and the global community.

  17. Measurement invariance test across gender.

    • plos.figshare.com
    xls
    Updated May 21, 2025
    Cite
    Tian-Jiao Song; Hao Zhao (2025). Measurement invariance test across gender. [Dataset]. http://doi.org/10.1371/journal.pone.0323215.t009
    Available download formats: xls
    Dataset updated
    May 21, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Tian-Jiao Song; Hao Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Smartphone addiction among college students is a common problem of concern, especially in China, and is associated with numerous psychological challenges. Nevertheless, a valid instrument to measure smartphone addiction in Chinese college students remains underdeveloped. Objective: To provide a valid research instrument for assessing smartphone addiction among Chinese college students, this study conducted a cross-cultural investigation by evaluating the psychometric properties of the Chinese version of the Smartphone Application-Based Addiction Scale (SABAS) and its measurement invariance across gender among Chinese college students. Methods: The SABAS was translated into Chinese using the forward-backward method, and the Chinese version of the SABAS (SABAS-CV) was completed by 644 Chinese college students. A random selection of 80 college students was made from the total sample, and they were assessed twice with a one-month interval. The reliability of the SABAS-CV was analyzed through internal consistency and test-retest reliability, while the validity was assessed via content validity, structural validity, and convergent validity. Additionally, this study tested the measurement invariance of the SABAS-CV across gender. Results: The SABAS-CV demonstrated strong content validity, high internal consistency (α = 0.828 for sample 1, α = 0.856 for sample 2), and good test-retest reliability (ICC = 0.968, 95% CI: 0.952–0.977). Exploratory factor analysis revealed one component with an eigenvalue (3.440) greater than 1, explaining 57.336% of the variance. Confirmatory factor analysis showed good model fit (χ²/df = 2.462, RMSEA = 0.054, SRMR = 0.029, CFI = 0.968, TLI = 0.956). The factor loadings of the 6 items ranged from 0.549 to 0.853, all exceeding 0.50, with the lower bounds of their confidence intervals also above 0.50. The SABAS-CV had a strong correlation with the Chinese version of the Nomophobia Questionnaire (r = 0.715) and the SAS-CSV (r = 0.826). The measurement invariance test across gender demonstrated that the SABAS-CV was measurement invariant for male and female college students. Conclusion: The SABAS-CV serves as a valid instrument for assessing smartphone addiction in Chinese college students, indicating that the SABAS has high applicability in the Chinese cultural context.
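
    The internal consistency figures reported above are α coefficients, which are presumably Cronbach's alpha. As a minimal sketch of how such a coefficient is computed from item-level responses, the function below applies the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). The function name and the tiny response matrix in the usage note are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances. Assumes at least 2 items, at least 2 respondents,
    and that the total scores are not all identical.
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    if any(len(row) != k for row in responses):
        raise ValueError("all respondents must answer the same number of items")
    items = list(zip(*responses))                      # one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)
```

    For a hypothetical scale whose items always agree, such as `cronbach_alpha([[1, 1, 1], [2, 2, 2], [4, 4, 4]])`, the coefficient reaches its maximum of 1. Real questionnaire data, where items agree only partly, give lower values.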

  18. AmExpert 2021 – Machine Learning Hackathon

    • kaggle.com
    Updated Nov 11, 2021
    Cite
    Aditya Sharma (2021). AmExpert 2021 – Machine Learning Hackathon [Dataset]. https://www.kaggle.com/adityasharma95/amexpert-2021-machine-learning-hackathon/discussion
    Dataset updated
    Nov 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aditya Sharma
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction:

    American Express and Analytics Vidhya present “AmExpert 2021 – Machine Learning Hackathon”, an amazing opportunity to showcase your analytical abilities and talent!

    Get a taste of the kind of challenges we face here at American Express on a day-to-day basis.

    Acknowledgements:

    https://datahack.analyticsvidhya.com/contest/amexpert-2021-machine-learning-hackathon/

    Problem Statement:

    XYZ Bank is a mid-sized private bank that includes a variety of banking products, such as savings accounts, current accounts, investment products, credit products, and home loans.

    The Bank wants to predict the next set of products for a set of customers to optimize their marketing and communication campaigns.

    The data available in this problem contains the following information:
    • User Demographic Details: Gender, Age, Vintage, Customer Category, etc.
    • Current Product Holdings
    • Product Holding in Next 6 Months (only for Train dataset)

    Here, our task is to predict the next set of products (up to 3 products) for a set of customers (Test data) based on their demographics and current product holdings.

    Data Description:

    Train csv -

    Customer_ID - Unique ID for the customer

    Gender - Gender of the Customer

    Age - Age of the Customer (in Years)

    Vintage - Vintage for the Customer (In Months)

    Is_Active - Activity Index, 0 : Less frequent customer, 1 : More frequent customer

    City_Category - Encoded Category of customer's city

    Customer_Category - Encoded Category of the customer

    Product_Holding_B1 - Current Product Holding (Encoded)

    Product_Holding_B2 - Product Holding in next six months (Encoded) - Target Column

    Test csv -

    Customer_ID - Unique ID for the customer

    Gender - Gender of the Customer

    Age - Age of the Customer (in Years)

    Vintage - Vintage for the Customer (In Months)

    Is_Active - Activity Index, 0 : Less frequent customer, 1 : More frequent customer

    City_Category - Encoded Category of customer's city

    Customer_Category - Encoded Category of the customer

    Product_Holding_B1 - Current Product Holding (Encoded)

    Evaluation Metric:

    The evaluation metric is Mean Average Precision (MAP) at K (K = 3). MAP is a well-known metric used to evaluate ranked retrieval results. For example, suppose that for a given customer we recommended 3 products and only the 1st and 3rd products are correct. The result would then look like: 1, 0, 1.

    In this case:
    • The precision at 1 will be: 1*1/1 = 1
    • The precision at 2 will be: 0*1/2 = 0
    • The precision at 3 will be: 1*2/3 = 0.67
    • Average Precision will be: (1 + 0 + 0.67)/3 = 0.556
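
    The worked example above can be reproduced with a short sketch. The function names are illustrative, not part of the competition's tooling. The normalization follows the example in the text, which divides by K. Other definitions of average precision divide by the number of relevant items instead, so treat the choice here as an assumption about this competition's scoring.

```python
def average_precision_at_k(relevance, k=3):
    """Average precision over the top-k recommendations.

    `relevance` is a list of 0/1 flags in ranked order (1 = correct product).
    Precision at position i only contributes when the i-th item is correct,
    and the sum is divided by k, as in the worked example in the text.
    """
    hits = 0
    total = 0.0
    for position, is_correct in enumerate(relevance[:k], start=1):
        if is_correct:
            hits += 1
            total += hits / position
    return total / k


def map_at_k(all_relevance, k=3):
    """Mean of the per-customer average precision values."""
    return sum(average_precision_at_k(r, k) for r in all_relevance) / len(all_relevance)


print(round(average_precision_at_k([1, 0, 1]), 3))  # 0.556, matching the text
```

    Averaging `average_precision_at_k` over all customers with `map_at_k` gives the MAP@3 score for a full set of recommendations.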

  19. JantaHack: Cross sell Prediction

    • kaggle.com
    Updated Sep 12, 2020
    Cite
    Pawan Sharma (2020). JantaHack: Cross sell Prediction [Dataset]. https://www.kaggle.com/pawan2905/jantahack-cross-sell-prediction/code
    Dataset updated
    Sep 12, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pawan Sharma
    Description

    Context

    Jantahack: Cross-sell Prediction

    Cross-selling identifies products or services that satisfy additional, complementary needs that are unfulfilled by the original product that a customer possesses. As an example, a mouse could be cross-sold to a customer purchasing a keyboard. Oftentimes, cross-selling points users to products they would have purchased anyways; by showing them at the right time, a store ensures they make the sale.

    Cross-selling is prevalent in various domains and industries including banks. For example, credit cards are cross-sold to people registering a savings account. In ecommerce, cross-selling is often utilized on product pages, during the checkout process, and in lifecycle campaigns. It is a highly-effective tactic for generating repeat purchases, demonstrating the breadth of a catalog to customers. Cross-selling can alert users to products they didn't previously know you offered, further earning their confidence as the best retailer to satisfy a particular need.

    This weekend we invite you to participate in another Janatahack with the theme of Cross-sell prediction. Stay tuned for the problem statement and datasets this Friday and get a chance to work on a real industry case study along with 250 AV points at stake.

    Content

    Your client is an insurance company that has provided Health Insurance to its customers. Now they need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company.

    An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

    For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider company will bear the cost of hospitalisation etc. for up to Rs. 200,000. Now if you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000/-, that is where the concept of probabilities comes into the picture. For example, like you, there may be 100 customers who would be paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would get hospitalised that year and not everyone. This way everyone shares the risk of everyone else.

    Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider company so that in case of an unfortunate accident involving the vehicle, the insurance provider company will provide compensation (called ‘sum assured’) to the customer.

    Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

    Now, in order to predict whether the customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (Vehicle Age, Damage), policy (Premium, sourcing channel), etc.

    Acknowledgements

    train.csv

    Variable - Definition
    • id - Unique ID for the customer
    • Gender - Gender of the customer
    • Age - Age of the customer
    • Driving_License - 0: Customer does not have DL, 1: Customer already has DL
    • Region_Code - Unique code for the region of the customer
    • Previously_Insured - 1: Customer already has Vehicle Insurance, 0: Customer doesn't have Vehicle Insurance
    • Vehicle_Age - Age of the Vehicle
    • Vehicle_Damage - 1: Customer got his/her vehicle damaged in the past, 0: Customer didn't get his/her vehicle damaged in the past
    • Annual_Premium - The amount customer needs to pay as premium in the year
    • Policy_Sales_Channel - Anonymised code for the channel of outreaching to the customer, i.e. Different Agents, Over Mail, Over Phone, In Person, etc.
    • Vintage - Number of days the customer has been associated with the company
    • Response - 1: Customer is interested, 0: Customer is not interested

    test.csv

    Variable - Definition
    • id - Unique ID for the customer
    • Gender - Gender of the customer
    • Age - Age of the customer
    • Driving_License - 0: Customer does not have DL, 1: Customer already has DL
    • Region_Code - Unique code for the region of the customer
    • Previously_Insured - 1: Customer already has Vehicle Insurance, 0: Customer doesn't have Vehicle Insurance
    • Vehicle_Age - Age of the Vehicle
    • Vehicle_Damage - 1: Customer got his/her vehicle damaged in the past, 0: Customer didn't get his/her vehicle damaged in the past
    • Annual_Premium - The amount customer needs to pay as premium in the year
    • Policy_Sales_Channel - Anonymised Code f...

  20. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    patients.csv

    Contains patient demographic and registration details.

    Column - Description
    • patient_id - Unique ID for each patient
    • first_name - Patient's first name
    • last_name - Patient's last name
    • gender - Gender (M/F)
    • date_of_birth - Date of birth
    • contact_number - Phone number
    • address - Address of the patient
    • registration_date - Date of first registration at the hospital
    • insurance_provider - Insurance company name
    • insurance_number - Policy number
    • email - Email address

    doctors.csv

    Details about the doctors working in the hospital.

    Column - Description
    • doctor_id - Unique ID for each doctor
    • first_name - Doctor's first name
    • last_name - Doctor's last name
    • specialization - Medical field of expertise
    • phone_number - Contact number
    • years_experience - Total years of experience
    • hospital_branch - Branch of hospital where doctor is based
    • email - Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    Column - Description
    • appointment_id - Unique appointment ID
    • patient_id - ID of the patient
    • doctor_id - ID of the attending doctor
    • appointment_date - Date of the appointment
    • appointment_time - Time of the appointment
    • reason_for_visit - Purpose of visit (e.g., checkup)
    • status - Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    Column - Description
    • treatment_id - Unique ID for each treatment
    • appointment_id - Associated appointment ID
    • treatment_type - Type of treatment (e.g., MRI, X-ray)
    • description - Notes or procedure details
    • cost - Cost of treatment
    • treatment_date - Date when treatment was given

    billing.csv

    Billing and payment details for treatments.

    Column - Description
    • bill_id - Unique billing ID
    • patient_id - ID of the billed patient
    • treatment_id - ID of the related treatment
    • bill_date - Date of billing
    • amount - Total amount billed
    • payment_method - Mode of payment (Cash, Card, Insurance)
    • payment_status - Status of payment (Paid, Pending, Failed)
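
    Since the tables above are linked through shared ID columns, a relational join is a natural first exercise. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. Table and column names follow the descriptions above, but the column types, the reduced column set, and all inserted rows are hypothetical placeholders rather than values from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id TEXT PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE treatments (treatment_id TEXT PRIMARY KEY, appointment_id TEXT,
                         treatment_type TEXT, cost REAL);
CREATE TABLE billing (bill_id TEXT PRIMARY KEY, patient_id TEXT, treatment_id TEXT,
                      amount REAL, payment_status TEXT);
-- Hypothetical rows for illustration only.
INSERT INTO patients VALUES ('P1', 'Ann', 'Lee'), ('P2', 'Raj', 'Rao');
INSERT INTO treatments VALUES ('T1', 'A1', 'MRI', 300.0),
                              ('T2', 'A2', 'X-ray', 100.0),
                              ('T3', 'A3', 'X-ray', 120.0);
INSERT INTO billing VALUES ('B1', 'P1', 'T1', 300.0, 'Paid'),
                           ('B2', 'P1', 'T2', 100.0, 'Pending'),
                           ('B3', 'P2', 'T3', 120.0, 'Paid');
""")


def billed_per_patient(connection):
    """Total billed and total paid per patient, joining patients to billing."""
    query = """
        SELECT p.patient_id,
               SUM(b.amount) AS total_billed,
               SUM(CASE WHEN b.payment_status = 'Paid' THEN b.amount ELSE 0 END) AS total_paid
        FROM patients AS p
        JOIN billing AS b ON b.patient_id = p.patient_id
        GROUP BY p.patient_id
        ORDER BY p.patient_id
    """
    return connection.execute(query).fetchall()


for row in billed_per_patient(conn):
    print(row)
```

    With the real files, the same query shape should apply once each CSV is loaded into a table of the same name. Joining `billing` to `treatments` on `treatment_id` would extend it to per-treatment-type totals.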

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please note:

    All data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

Cite
Training Data (2023). Gender Detection & Classification - Face Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/gender-detection-and-classification-image-dataset
Gender Detection & Classification - Face Dataset

Photos of people - face recognition dataset

Dataset updated
Oct 31, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Training Data
License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically

Description

Gender Detection & Classification - face recognition dataset

The dataset was created on the basis of the Face Mask Detection dataset.

Dataset Description:

The dataset comprises a collection of photos of people, organized into folders labeled "women" and "men." Each folder contains a significant number of images to facilitate training and testing of gender detection algorithms or models.

The dataset contains a variety of images capturing female and male individuals from diverse backgrounds, age groups, and ethnicities.

This labeled dataset can be utilized as training data for machine learning models, computer vision applications, and gender detection algorithms.

💴 For Commercial Usage: Full version of the dataset includes 376 000+ photos of people, leave a request on TrainingData to buy the dataset

Metadata for the full dataset:

  • assignment_id - unique identifier of the media file
  • worker_id - unique identifier of the person
  • age - age of the person
  • true_gender - gender of the person
  • country - country of the person
  • ethnicity - ethnicity of the person
  • photo_1_extension, photo_2_extension, photo_3_extension, photo_4_extension - photo extensions in the dataset
  • photo_1_resolution, photo_2_resolution, photo_3_resolution, photo_4_resolution - photo resolutions in the dataset

OTHER BIOMETRIC DATASETS:

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset

Content

The dataset is split into train and test folders. Each folder includes:

  • folders women and men - folders with images of people of the corresponding gender,
  • .csv file - contains information about the images and people in the dataset

File with the extension .csv

  • file: link to access the file,
  • gender: gender of a person in the photo (woman/man),
  • split: whether the image belongs to the train or the test set
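
As a minimal sketch of how the .csv file described above might be used, the snippet below counts images per split and gender with Python's standard `csv` module. The column names `file`, `gender`, and `split` come from the description. The sample rows, including the example.com links, are hypothetical placeholders, and the real file may contain additional columns.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the dataset's .csv file.
SAMPLE_CSV = (
    "file,gender,split\n"
    "https://example.com/img_001.jpg,woman,train\n"
    "https://example.com/img_002.jpg,man,train\n"
    "https://example.com/img_003.jpg,woman,train\n"
    "https://example.com/img_004.jpg,man,test\n"
)


def count_by_split_and_gender(csv_text):
    """Count rows for each (split, gender) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["split"], row["gender"]) for row in reader)


counts = count_by_split_and_gender(SAMPLE_CSV)
for (split, gender), n in sorted(counts.items()):
    print(split, gender, n)
```

A count like this is a quick way to check class balance between the women and men folders before training a classifier.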

TrainingData provides high-quality data annotation tailored to your needs

keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, gender detection, supervised learning dataset, gender classification dataset, gender recognition dataset
