Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source/Credit: Michael Grogan https://github.com/MGCodesandStats https://github.com/MGCodesandStats/datasets/blob/master/cars.csv
Sample dataset for regression analysis. Given five attributes (age, gender, miles driven per day, debt, and income), predict how much someone will spend on purchasing a car. All five input attributes have been scaled to the 0–1 range. The training set has 723 examples and the test set has 242 examples.
This dataset will be used in an upcoming Galaxy Training Network tutorial (https://training.galaxyproject.org/training-material/topics/statistics/) on the use of feedforward neural networks for regression analysis.
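For orientation, here is a minimal sketch of the kind of regression this dataset supports, using scikit-learn's MLPRegressor as a stand-in for the feedforward network covered in the tutorial. The column names and the split file names are assumptions, not taken from the source repository.

```python
# Minimal sketch, not the dataset authors' code. The predictor/target column
# names and the split file names are assumptions; check the header of cars.csv
# before running.
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

train = pd.read_csv("cars_train.csv")   # hypothetical file names for the
test = pd.read_csv("cars_test.csv")     # 723/242 train/test split described above

features = ["age", "gender", "miles", "debt", "income"]  # assumed predictor columns
target = "sales"                                          # assumed target column

model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(train[features], train[target])
print("test R^2:", r2_score(test[target], model.predict(test[features])))
```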
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gender microaggressions, especially their subtler forms, microinsults and microinvalidations, are by definition hard to discern. We aim to construct and validate a scale reflecting two facets of the microaggression taxonomy, microinsults and microinvalidations toward women in the workplace: the MIMI-16. Two studies were conducted (N1 = 500, N2 = 612). Using a genetic algorithm, a 16-item scale was developed and subsequently validated via confirmatory factor analyses (CFA) in three separate validation samples. Correlational analyses with organizational outcome measures were performed. The MIMI-16 exhibits good model fit in all validation samples (CFI = 0.936–0.960, TLI = 0.926–0.954, RMSEA = 0.046–0.062, SRMR = 0.042–0.049). Multigroup CFA suggested strict measurement invariance across all validation samples. Correlations were as expected and indicate internal and external validity. Research on gender microaggressions has mostly been qualitative. With the newly developed MIMI-16 we provide a reliable and valid quantitative instrument for measuring gender microaggressions in the workplace.
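As an illustration only, a two-factor CFA of this kind could be fit in Python with the semopy package; the item names, item-to-factor assignment, and file name below are hypothetical placeholders, not the authors' specification.

```python
# Illustrative only: a two-factor CFA (microinsults, microinvalidations) for a
# 16-item scale, estimated with semopy. Item names mi1..mi8 / mv1..mv8 and the
# file name are hypothetical placeholders.
import pandas as pd
import semopy

data = pd.read_csv("mimi16_validation_sample.csv")  # hypothetical file name

model_desc = """
microinsults      =~ mi1 + mi2 + mi3 + mi4 + mi5 + mi6 + mi7 + mi8
microinvalidation =~ mv1 + mv2 + mv3 + mv4 + mv5 + mv6 + mv7 + mv8
"""

model = semopy.Model(model_desc)
model.fit(data)
stats = semopy.calc_stats(model)   # includes CFI, TLI and RMSEA, among other indices
print(stats[["CFI", "TLI", "RMSEA"]])
```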
There are two CSV datasets in this publication, used initially in the master's thesis in sociology of Xenia Leontyeva at HSE University Saint Petersburg, titled "Popularity Factors of Domestic Films: Gender Characteristics and State Support Measures" (2022), and later for the article by Leontyeva, Xenia, Olessia Koltsova, and Deb Verhoeven, titled "Gender (Im)Balance in Russian Cinema: On the Screen and behind the Camera" (accepted in January 2024 in The Journal of Cultural Analytics). The first dataset (N = 1285) includes all Russian films produced between 2008 and 2019 and theatrically released between December 1, 2008, and December 31, 2019. Distribution statistics cover the territory of the CIS, of which the Russian Federation is the biggest market. Budget information is available for 644 films. The second dataset contains markup for the Bechdel-Wallace test (as modified by Leontyeva) for 243 films, 193 of which have budget information. There is also a supplement with a detailed description of all variables and R code producing the tables, plots, and models for the article.
The database was collected by Xenia Leontyeva while working at Nevafilm Research (until 2018) and later. In terms of distribution data, it is based on sources such as the open Russian Cinema Fund Analytics base – RCFA (since 2015), the closed comScore/Rentrak base ("International Box Office Essential") serving major Hollywood studios (data from it has been used since 2008 to fill gaps in open databases), Bookers' Bulletin (since 2011), and Russian Film Business Today magazines (since 2004), as well as data collected directly by Nevafilm Research employees from film distributors and producers; the rights to use and continue this dataset have been received from the Nevafilm company. In terms of production data, the information was taken from the State register of film distribution certificates, Kinopoisk.ru, and the films' credits.
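A minimal pandas sketch of how the two CSVs might be combined is shown below; the file names, the budget column, and the join key are hypothetical placeholders and must be adapted to the actual variable names documented in the supplement.

```python
# Illustrative sketch only; file and column names are assumptions, not taken
# from the publication. It mirrors the counts described above: films with
# budget data and the Bechdel-Wallace markup subset.
import pandas as pd

films = pd.read_csv("russian_films_2008_2019.csv")       # hypothetical name, N = 1285
bechdel = pd.read_csv("bechdel_markup.csv")              # hypothetical name, N = 243

with_budget = films[films["budget"].notna()]             # expected to yield 644 rows
merged = bechdel.merge(films, on="film_id", how="left")  # "film_id" is a placeholder key
print(len(with_budget), len(merged))
```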
EUCA dataset description
Associated Paper: EUCA: the End-User-Centered Explainable AI Framework
Authors: Weina Jin, Jianyu Fan, Diane Gromala, Philippe Pasquier, Ghassan Hamarneh
Introduction: The EUCA dataset is for modelling personalized or interactive explainable AI. It contains 309 data points of 32 end-users' preferences on 12 forms of explanation (including feature-, example-, and rule-based explanations). The data were collected in a user study with 32 layperson participants in the Greater Vancouver area in 2019-2020. In the user study, the participants (P01-P32) were presented with AI-assisted critical tasks on house price prediction, health status prediction, purchasing a self-driving car, and studying for a biology exam 1. Within each task and for its given explanation goal 2, the participants selected and ranked the explanatory forms 3 that they considered most suitable.
1 EUCA_EndUserXAI_ExplanatoryFormRanking.csv
Column description:
Index - participant number
Case - task-explanation goal combination
accept to use AI? trust it? - participant's response to whether they would use the AI given the task and explanation goal
require explanation? - participant's response to whether they requested an explanation from the AI
1st, 2nd, 3rd, ... - explanatory form card selection and ranking
cards fulfill requirement? - after the card selection, participants were asked whether the selected card combination fulfilled their explainability requirement
2 EUCA_EndUserXAI_demography.csv
It contains the participants' demographics, including their age, gender, educational background, and their knowledge of and attitudes toward AI.
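A minimal sketch for working with the two files is given below; it assumes the column headers match the shorthand used above (e.g. "Index" and "1st"), which should be verified against the CSVs.

```python
# Sketch under assumptions: exact column headers in the CSVs may differ from
# the shorthand used in the description above.
import pandas as pd

ranking = pd.read_csv("EUCA_EndUserXAI_ExplanatoryFormRanking.csv")
demo = pd.read_csv("EUCA_EndUserXAI_demography.csv")

# How often was each explanatory form card chosen as the top-ranked card?
top_card_counts = ranking["1st"].value_counts()       # assumes the column is literally named "1st"
print(top_card_counts)

# Join rankings with participant demographics on the participant number.
joined = ranking.merge(demo, on="Index", how="left")  # assumes "Index" is shared by both files
print(joined.head())
```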
EUCA dataset zip file for download
More Context for EUCA Dataset
1 Critical tasks
There are four tasks. The task labels and their corresponding task titles are:
house - Selling your house
car - Buying an autonomous driving vehicle
health - Personal health decision
bird - Learning bird species
Please refer to the EUCA quantitative data analysis report for the storyboard of the tasks and explanation goals presented in the user study.
2 Explanation goal
End-users may have different goals/purposes when checking an explanation from AI. The EUCA dataset includes the following 11 explanation goals, each listed with its [label] in the dataset, full name, and description (a label-to-name mapping is sketched in code after this list).
[trust] Calibrate trust: trust is key to establishing a human-AI decision-making partnership. Since users can easily distrust or overtrust AI, it is important to calibrate trust to reflect the capabilities of the AI system.
[safe] Ensure safety: users need to ensure the safety of the decision's consequences.
[bias] Detect bias: users need to ensure the decision is impartial and unbiased.
[unexpect] Resolve disagreement with AI: the AI prediction is unexpected and there are disagreements between users and AI.
[expected] Expected: the AI's prediction is expected and aligns with users' expectations.
[differentiate] Differentiate similar instances: due to the consequences of wrong decisions, users sometimes need to discern similar instances or outcomes. For example, a doctor differentiates whether the diagnosis is a benign or malignant tumor.
[learning] Learn: users need to gain knowledge, improve their problem-solving skills, and discover new knowledge.
[control] Improve: users seek causal factors to control and improve the predicted outcome.
[communicate] Communicate with stakeholders: many critical decision-making processes involve multiple stakeholders, and users need to discuss the decision with them.
[report] Generate reports: users need to utilize the explanations to perform particular tasks such as report production. For example, a radiologist generates a medical report on a patient's X-ray image.
[multi] Trade-off multiple objectives: AI may be optimized on an incomplete objective while the users seek to fulfill multiple objectives in real-world applications. For example, a doctor needs to ensure that a treatment plan is effective and also has acceptable patient adherence. Ethical and legal requirements may also be included as objectives.
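Since the Case column combines a task with one of these goal labels, a small label-to-name mapping can make summaries more readable. The mapping below only restates the list above; the assumption that Case is formatted as "task-goal" is illustrative.

```python
# Goal labels (from the list above) mapped to their full names.
# Splitting the Case column as "task-goal" is an assumption about its format.
GOALS = {
    "trust": "Calibrate trust",
    "safe": "Ensure safety",
    "bias": "Detect bias",
    "unexpect": "Resolve disagreement with AI",
    "expected": "Expected",
    "differentiate": "Differentiate similar instances",
    "learning": "Learn",
    "control": "Improve",
    "communicate": "Communicate with stakeholders",
    "report": "Generate reports",
    "multi": "Trade-off multiple objectives",
}

def describe_case(case: str) -> str:
    """Turn a 'task-goal' string such as 'health-bias' into a readable label."""
    task, goal = case.split("-", 1)
    return f"{task}: {GOALS.get(goal, goal)}"
```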
3 Explanatory form
The following 12 explanatory forms are end-user-friendly, i.e., no technical knowledge is required for end-users to interpret the explanation. (Their short codes are collected into a lookup table in the sketch at the end of this section.)
Feature-Based Explanation
Feature Attribution - fa
Note: for tasks that have images as input data, the feature attribution is denoted by the following two cards:
ir: important regions (a.k.a. heat map or saliency map)
irc: important regions with their feature contribution percentage
Feature Shape - fs
Feature Interaction - fi
Example-Based Explanation
Similar Example - se
Typical Example - te
Counterfactual Example - ce
Note: for the counterfactual example, there were two visual variations used in the user study:
cet: counterfactual example with a transition from one example to its counterfactual
ceh: counterfactual example with the contrastive feature highlighted
Rule-Based Explanation
Rule - rt
Decision Tree - dt
Decision Flow - df
Supplementary Information
Input
Output
Performance
Dataset - prior (output prediction with the prior distribution of each class in the training set)
Note: occasionally there is a wild card, which means the participant drew the card by themselves. It is indicated as 'wc'.
For visual examples of each explanatory form card, please refer to the Explanatory_form_labels.pdf document.
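For convenience, the short card codes mentioned in this section can be collected into a single lookup table, as in the sketch below; codes not spelled out in this description (e.g. for the Input, Output, and Performance cards) are omitted and should be taken from Explanatory_form_labels.pdf.

```python
# Short codes for the explanatory form cards, collected from the list above.
# Codes for the supplementary Input/Output/Performance cards are not given in
# this description, so they are omitted here; see Explanatory_form_labels.pdf
# for the authoritative set.
FORM_CODES = {
    "fa": "Feature Attribution",
    "ir": "Important regions (image feature attribution)",
    "irc": "Important regions with feature contribution percentage",
    "fs": "Feature Shape",
    "fi": "Feature Interaction",
    "se": "Similar Example",
    "te": "Typical Example",
    "ce": "Counterfactual Example",
    "cet": "Counterfactual example shown as a transition",
    "ceh": "Counterfactual example with contrastive feature highlighted",
    "rt": "Rule",
    "dt": "Decision Tree",
    "df": "Decision Flow",
    "prior": "Output prediction with class prior distribution",
    "wc": "Wild card (participant-drawn)",
}
```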
Link to the details on users' requirements on different explanatory forms
Code and report for EUCA data quantitative analysis
EUCA data analysis code
EUCA quantitative data analysis report
EUCA data citation
@article{jin2021euca,
  title={EUCA: the End-User-Centered Explainable AI Framework},
  author={Weina Jin and Jianyu Fan and Diane Gromala and Philippe Pasquier and Ghassan Hamarneh},
  year={2021},
  eprint={2102.02437},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Customer log dataset is a 12.5 GB JSON file containing 18 columns and 26,259,199 records. There are 12 string columns and 6 numeric columns, which may also contain null or NaN values. The columns are userId, artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts and userAgent. As the column names suggest, the dataset contains various user-related information, such as user identifiers (userId), demographic details (firstName, lastName, gender), interaction details (artist, song, length, itemInSession, sessionId, registration, ts) and technical details (userAgent, method, page, location, status, level, auth).
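At 12.5 GB, the log is more comfortably processed with Spark than with pandas. A minimal PySpark sketch (the file path is a placeholder) reads the log, prints the schema to check it against the 18 columns listed, and drops records without a usable userId:

```python
# Minimal PySpark sketch; the file path is a placeholder. Reads the event log,
# prints the schema to verify the 18 columns described above, and filters out
# records without a usable userId.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-log").getOrCreate()

df = spark.read.json("customer_log.json")   # hypothetical file name
df.printSchema()                            # expect the 18 columns described above

clean = df.filter(F.col("userId").isNotNull() & (F.col("userId") != ""))
print(df.count(), clean.count())
```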
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: To construct and validate a predictive model for risk factors in children with severe adenoviral pneumonia based on chest low-dose CT imaging and clinical features.
Methods: A total of 177 patients with adenoviral pneumonia who underwent low-dose CT examination between January 2019 and August 2019 were collected. Based on the assessment criteria for severe pneumonia, patients were divided into a mild group (N = 125) and a severe group (N = 52). All cases were divided into a training cohort (N = 125) and a validation cohort (N = 52). We constructed a prediction model by drawing a nomogram and verified the predictive efficacy of the model through the ROC curve, calibration curve and decision curve analysis.
Results: The differences between the mild and severe adenovirus pneumonia groups were statistically significant (P < 0.05) in gender, age, weight, body temperature, L/N ratio, LDH, ALT, AST, CK-MB, ADV DNA, bronchial inflation sign, emphysema, ground-glass sign, bronchial wall thickening, bronchiectasis, pleural effusion, consolidation score, and lobular inflammation score. Multivariate logistic regression analysis showed that gender, LDH value, emphysema, consolidation score, and lobular inflammation score were independent risk factors for severe adenovirus pneumonia in children. Logistic regression was employed to construct a clinical model, an imaging semantic feature model, and a combined model. The AUC values of the training sets of the three models were 0.85 (0.77–0.94), 0.83 (0.75–0.91), and 0.91 (0.85–0.97), and the AUCs of the validation sets were 0.77 (0.64–0.91), 0.83 (0.71–0.94), and 0.85 (0.73–0.96), respectively. The calibration curves of the three models fit well. The clinical decision curve analysis demonstrated the clinical application value of the nomogram prediction model.
Conclusion: The prediction model based on chest low-dose CT image characteristics and clinical characteristics has relatively clear predictive value in distinguishing mild from severe adenovirus pneumonia in children and might provide a new method for early clinical prediction of the outcome of adenovirus pneumonia in children.
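For readers who want to reproduce a comparable baseline, the sketch below fits a logistic regression on the independent risk factors named in the abstract and reports a validation ROC AUC; it is not the authors' code, and the file and column names are placeholders.

```python
# Hedged sketch, not the authors' code: a logistic regression on the
# independent risk factors named above (gender, LDH, emphysema, consolidation
# score, lobular inflammation score), evaluated by ROC AUC as in the abstract.
# File and column names are placeholders; predictors are assumed to be
# numerically coded already.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

features = ["gender", "LDH", "emphysema", "consolidation_score", "lobular_inflammation_score"]

train = pd.read_csv("adv_pneumonia_train.csv")   # hypothetical file, N = 125
valid = pd.read_csv("adv_pneumonia_valid.csv")   # hypothetical file, N = 52

clf = LogisticRegression(max_iter=1000)
clf.fit(train[features], train["severe"])        # "severe" = 1 for the severe group

auc = roc_auc_score(valid["severe"], clf.predict_proba(valid[features])[:, 1])
print("validation AUC:", round(auc, 2))
```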