By Derek Howard [source]
This dataset provides a tool for inferring gender from names alone. It contains the probability that a given first name belongs to a particular gender, based on US Social Security records from the last century. This makes it easy to assign genders in datasets that do not natively include this information. Probabilities were computed only for names with 5 or more people on record, a low threshold that means even fairly uncommon names are covered. With this resource, users can add gender information to name-only data quickly and consistently.
This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.
To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).
In addition to simply assigning genders from these probabilities alone, users of this dataset also have more control over their classifications - they can use it as either a baseline or as an absolute measure of accuracy depending on their exact needs/preferences. Experimentation is highly encouraged here!
Good luck!
Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.
Generate gender neutral names - use this data to generate random names with no gender bias.
Automate record lookup - quickly and accurately assign genders based on the probability associated with their name
If you use this dataset in your research, please credit the original authors.
License
Unknown License - Please check the dataset description for more information.
File: name_gender.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------|
| name        | The name of the person. (String)                                    |
| gender      | The gender of the person. (String)                                  |
| probability | The probability of the gender being assigned to the person. (Float) |
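As a rough illustration of how the three columns above might be used, the sketch below assigns a gender to a name only when the listed probability clears a threshold. It assumes one row per name, and the sample rows, the gender codes `M`/`F`, the helper names and the 0.9 default threshold are all hypothetical choices made for this example, not values taken from the dataset.

```python
import csv
import io

# Hypothetical sample in the layout of name_gender.csv (values are made up).
SAMPLE_CSV = """name,gender,probability
Alex,M,0.62
Maria,F,0.99
"""


def load_name_table(handle):
    """Map each lower-cased name to a (gender, probability) tuple."""
    table = {}
    for row in csv.DictReader(handle):
        key = row["name"].strip().lower()
        table[key] = (row["gender"], float(row["probability"]))
    return table


def assign_gender(name, table, min_probability=0.9):
    """Return the listed gender, or None if the name is unknown
    or its probability is below the chosen threshold."""
    entry = table.get(name.strip().lower())
    if entry is None:
        return None
    gender, probability = entry
    return gender if probability >= min_probability else None


table = load_name_table(io.StringIO(SAMPLE_CSV))
print(assign_gender("Maria", table))      # confident match
print(assign_gender("Alex", table))       # below threshold, left unassigned
print(assign_gender("Alex", table, 0.5))  # looser threshold
```

Returning `None` for low-probability names keeps ambiguous cases visible instead of silently forcing a label; a stricter or looser threshold can be chosen depending on how the result is used.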
If you use this dataset in your research, please credit Derek Howard.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The annual list of first names of newborns is a simple and popular dataset. The data come from the civil status register and contain the following essential fields: sex of the newborn, first name of the newborn, number of occurrences of the first name in the corresponding year, and year of record.

The dataset lists the first names of children born in Nancy since 2016, in CSV format, with the number of occurrences of each given name, broken down by year and sex. First names with fewer than five occurrences are not published, in order to protect personal data. The standardisation of this dataset follows the recommendations of Opendata France, based on its work on the Socle Commun des Données Locales (common base of local data).

Definition of headers

- COLL_NOM: name of the municipality
- COLL_INSEE: Insee code of the municipality where the first names are registered in the civil status of the place of birth. Note that the place of birth may be different from the place of residence of the parents.
- CHILD_SEX: sex corresponding to the first name, M for male or F for female
- CHILD_PRENOM: first name of the newborn(s), as recorded in the civil status documents of the corresponding year
- NUMBER_OCCURENCES: number of occurrences of the first name
- YEAR: year of birth

Total births reported to the City of Nancy

- 2018: 5135 births in total, 2692 girls and 2443 boys
- 2017: 5483 births in total, 2704 girls and 2779 boys
- 2016: 5544 births in total, 2692 girls and 2852 boys
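The yearly totals quoted above can be checked with a few lines of arithmetic. This minimal sketch only uses the figures reported in the description; the variable and function names are illustrative.

```python
# Totals reported in the description: year -> (girls, boys).
births = {
    2018: (2692, 2443),
    2017: (2704, 2779),
    2016: (2692, 2852),
}


def girl_share(girls, boys):
    """Fraction of reported births that are girls."""
    return girls / (girls + boys)


for year in sorted(births):
    girls, boys = births[year]
    print(year, girls + boys, round(girl_share(girls, boys), 3))
```

Because first names with fewer than five occurrences are withheld, summing the occurrence column of the published file for a given year should be expected to fall short of these totals.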
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset lists the full publication and biographical records for 3,979 researchers in 6 distinct fields working at top U.S. institutions.

The file full_researcher_data.csv contains the list of all 3,979 researchers in CSV format with the following fields:

- field: scientific discipline
- total_publications: total number of publications
- short_name: researcher's abbreviated name as [last name], [first and middle name initials]. Used for searching in the authors field in the file full_publication_data.csv.
- full_name: researcher's full name as [Last name], [First Name] [Middle initials]
- phd_year: year of PhD completion
- gender: researcher gender as M (male) or F (female)
- author_id: index of researcher. Matches the author_id field in the file full_publication_data.csv.
- university: institution of current employment

Gender and PhD year are not available for all researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.

The file full_publication_data.csv contains the list of all 417,609 publications in CSV format with the following fields:

- title: publication title
- journal: publication journal
- authors: publication authors as a comma-separated list. Author syntax matches the short_name field in the file full_researcher_data.csv.
- year: publication year
- author_id: index of researcher. Matches the author_id field in the file full_researcher_data.csv.
- abstract: publication abstract
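Since both files share the author_id field, the two tables can be linked on it. The sketch below is one possible way to do that with the standard library; the inline rows are invented placeholders (not records from the dataset) and only a subset of the documented columns is shown.

```python
import csv
import io
from collections import Counter

# Placeholder rows in the documented layout; the values are made up.
RESEARCHERS = """author_id,short_name,gender,phd_year
1,"Doe, J",F,1995
2,"Roe, R",M,
"""

PUBLICATIONS = """author_id,title,year
1,Placeholder title A,2001
1,Placeholder title B,2005
2,Placeholder title C,2003
"""

# Look-up table: author_id -> gender (may be empty, as gender is not
# available for all researchers).
gender_by_author = {
    row["author_id"]: row["gender"]
    for row in csv.DictReader(io.StringIO(RESEARCHERS))
}

# Count publications per researcher gender by joining on author_id.
pubs_per_gender = Counter()
for row in csv.DictReader(io.StringIO(PUBLICATIONS)):
    gender = gender_by_author.get(row["author_id"]) or "unknown"
    pubs_per_gender[gender] += 1

print(dict(pubs_per_gender))
```

Mapping missing or empty gender values to "unknown" keeps those publications in the count rather than dropping them silently.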
The Gender Statistics database is a comprehensive source for the latest sex-disaggregated data and gender statistics covering demography, education, health, access to economic opportunities, public life and decision-making, and agency.
The data is split into several files, with the main one being Data.csv. The Data.csv contains all the variables of interest in this dataset, while the others are lists of references and general nation-by-nation information.
Data.csv contains the following fields:
I couldn't find any metadata for these, and I'm not qualified to guess at what each of the variables mean. I'll list the variables for each file, and if anyone has any suggestions (or, even better, actual knowledge/citations) as to what they mean, please leave a note in the comments and I'll add your info to the data description.
Country-Series.csv
Country.csv
FootNote.csv
Series-Time.csv
Series.csv
This dataset was downloaded from The World Bank's Open Data project. The summary of the Terms of Use of this data is as follows:
You are free to copy, distribute, adapt, display or include the data in other products for commercial and noncommercial purposes at no cost subject to certain limitations summarized below.
You must include attribution for the data you use in the manner indicated in the metadata included with the data.
You must not claim or imply that The World Bank endorses your use of the data, or use The World Bank’s logo(s) or trademark(s) in conjunction with such use.
Other parties may have ownership interests in some of the materials contained on The World Bank Web site. For example, we maintain a list of some specific data within the Datasets that you may not redistribute or reuse without first contacting the original content provider, as well as information regarding how to contact the original content provider. Before incorporating any data in other products, please check the list: Terms of use: Restricted Data.
-- [ed. note: this last is not applicable to the Gender Statistics database]
The World Bank makes no warranties with respect to the data and you agree The World Bank shall not be liable to you in connection with your use of the data.
This is only a summary of the Terms of Use for Datasets Listed in The World Bank Data Catalogue. Please read the actual agreement that controls your use of the Datasets, which is available here: Terms of use for datasets. Also see World Bank Terms and Conditions.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains popular baby names in New York.

Dataset: 1 file (popular-baby-names.csv)

Columns:
- Year of Birth: year of the baby's birth.
- Gender: gender of the baby.
- Ethnicity: ethnic group the baby belongs to.
- Child's First Name: the first name of the child.
- Count: number of babies given that name.
- Ranking: ranking of that name.
This dataset contains two folders with different data:

The dataset in the folder named "Gender" provides the gender distribution of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. For the Spanish edition, data has been collected from the "Artículos buenos" and "Artículos destacados" sections and is displayed in an aggregated format.

The dataset in the folder named "Intersectionality" provides the distribution, by various sociodemographic attributes, of individuals who have been featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. It is structured into four CSV files. Three of these correspond to the English Wikipedia edition: the English 3C CSV, containing data from the sections "Did you know...", "In the news" and "On this day..."; a CSV dedicated to "English Featured Article"; and another to "English Featured Picture". The fourth CSV contains data from the Spanish edition of Wikipedia, extracted from the sections "Artículo Destacado" and "Artículo Bueno". Within each CSV, the data is presented in columns, each dedicated to a sociodemographic attribute.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
- Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst:
  sh> ./export.sh
- Run the authors cleanup script to create authors--clean.csv.zst:
  sh> ./cleanup.sh authors.csv.zst
- Filter out implausible names and create authors--plausible.csv.zst:
  sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
- Run the gender guessing script to create author-fullnames-gender.csv.zst:
  sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
Database creation and data ingestion
- Create the PostgreSQL DB:
  sh> createdb gender-commit
  Note that from now on, whenever a command is shown with the psql> prompt, we assume psql is running on the gender-commit database.
- Import data into the PostgreSQL DB:
  sh> ./import_data.sh
Zone detection
- Extract commits data from the DB and create commits.tab, which is used as input for the world zone detection script:
  sh> psql -f extract_commits.sql gender-commit
- Run the world zone detection script to create commit_zones.tab.zst:
  sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
  Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
- Read zones assignment data from the file into the DB:
  psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
- Run the script that executes the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
  sh> ./extract_data.sh
- Run the script that creates the graphs from all the previously extracted tab files. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
  sh> ./create_charts.sh
Additional graphs
This package also includes some ready-made graphs:
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**

Contains patient demographic and registration details.

| Column | Description |
|:-------|:------------|
| patient_id | Unique ID for each patient |
| first_name | Patient's first name |
| last_name | Patient's last name |
| gender | Gender (M/F) |
| date_of_birth | Date of birth |
| contact_number | Phone number |
| address | Address of the patient |
| registration_date | Date of first registration at the hospital |
| insurance_provider | Insurance company name |
| insurance_number | Policy number |
| email | Email address |
**doctors.csv**

Details about the doctors working in the hospital.

| Column | Description |
|:-------|:------------|
| doctor_id | Unique ID for each doctor |
| first_name | Doctor's first name |
| last_name | Doctor's last name |
| specialization | Medical field of expertise |
| phone_number | Contact number |
| years_experience | Total years of experience |
| hospital_branch | Branch of hospital where doctor is based |
| email | Official email address |
**appointments.csv**

Records of scheduled and completed patient appointments.

| Column | Description |
|:-------|:------------|
| appointment_id | Unique appointment ID |
| patient_id | ID of the patient |
| doctor_id | ID of the attending doctor |
| appointment_date | Date of the appointment |
| appointment_time | Time of the appointment |
| reason_for_visit | Purpose of visit (e.g., checkup) |
| status | Status (Scheduled, Completed, Cancelled) |
**treatments.csv**

Information about the treatments given during appointments.

| Column | Description |
|:-------|:------------|
| treatment_id | Unique ID for each treatment |
| appointment_id | Associated appointment ID |
| treatment_type | Type of treatment (e.g., MRI, X-ray) |
| description | Notes or procedure details |
| cost | Cost of treatment |
| treatment_date | Date when treatment was given |
**billing.csv**

Billing and payment details for treatments.

| Column | Description |
|:-------|:------------|
| bill_id | Unique billing ID |
| patient_id | ID of the billed patient |
| treatment_id | ID of the related treatment |
| bill_date | Date of billing |
| amount | Total amount billed |
| payment_method | Mode of payment (Cash, Card, Insurance) |
| payment_status | Status of payment (Paid, Pending, Failed) |
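The shared ID columns make the files easy to join. As a small sketch of that idea, the example below loads a few invented rows into an in-memory SQLite database and totals billed and paid amounts per patient. The column names follow the descriptions above, but the column types, the ID formats and every value are assumptions made for illustration, and only a subset of columns is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Minimal tables with a subset of the documented columns; types are assumed.
conn.executescript("""
CREATE TABLE patients (
    patient_id TEXT PRIMARY KEY, first_name TEXT, last_name TEXT
);
CREATE TABLE billing (
    bill_id TEXT PRIMARY KEY, patient_id TEXT, treatment_id TEXT,
    amount REAL, payment_status TEXT
);
INSERT INTO patients VALUES
    ('P001', 'Ann', 'Example'),
    ('P002', 'Bob', 'Sample');
INSERT INTO billing VALUES
    ('B001', 'P001', 'T001', 100.0, 'Paid'),
    ('B002', 'P001', 'T002', 50.0, 'Pending'),
    ('B003', 'P002', 'T003', 75.0, 'Paid');
""")

# Join billing to patients on patient_id and aggregate per patient.
rows = conn.execute("""
SELECT p.patient_id,
       SUM(b.amount) AS total_billed,
       SUM(CASE WHEN b.payment_status = 'Paid' THEN b.amount ELSE 0 END)
           AS total_paid
FROM patients AS p
JOIN billing AS b ON b.patient_id = p.patient_id
GROUP BY p.patient_id
ORDER BY p.patient_id
""").fetchall()

for patient_id, total_billed, total_paid in rows:
    print(patient_id, total_billed, total_paid)
```

The same pattern extends to the other files, for example joining treatments to appointments on appointment_id, or appointments to doctors on doctor_id.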
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 on.
Popular Baby Names by Sex and Ethnic Group

Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
https://creativecommons.org/publicdomain/zero/1.0/
The CrowS-Pairs dataset is a collection of 1,508 sentence pairs that cover nine types of bias: race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. The first sentence in each pair can demonstrate or violate a stereotype. The second sentence is a minimal edit of the first: the only words that change between them are those that identify the group. Each example has the following information:

Columns: sent_more, sent_less, stereo_antistereo, bias_type, annotations, anon_writer, anon_annotators, prompt, source
This dataset can be used to measure social biases in masked language models (MLMs), for example by comparing how a model scores the two sentences in each pair. Possible uses include:
- Measuring the ability of MLMs to identify and avoid social biases;
- Developing new methods for reducing social biases in MLMs; and
- Investigating the impact of social biases on downstream tasks such as reading comprehension or question answering
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: crows_pairs_anonymized.csv

| Column name | Description |
|:------------|:------------|
| sent_more | The first sentence in the pair, which can demonstrate or violate a stereotype. (String) |
| sent_less | The second sentence in the pair, which is a minimal edit of the first sentence. The only words that change between them are those that identify the group. (String) |
| stereo_antistereo | Whether the first sentence demonstrates or violates a stereotype. (String) |
| bias_type | The type of bias represented in the sentence pair. (String) |
| annotations | The annotations made by the crowdworkers on the sentence pair. (String) |
| anon_writer | The anonymous writer of the sentence pair. (String) |
| anon_annotators | The anonymous annotators of the sentence pair. (String) |
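A common first step with a file in this layout is to see how the pairs are spread across bias types and directions. The sketch below does this with the standard library; the rows are neutral placeholders rather than sentences from the dataset, and the label strings (`stereo`, `antistereo`, `gender`, `age`) are assumed values used only for illustration.

```python
import csv
import io
from collections import Counter

# Placeholder rows using four of the documented columns; values are assumed.
SAMPLE = """sent_more,sent_less,stereo_antistereo,bias_type
placeholder sentence 1a,placeholder sentence 1b,stereo,gender
placeholder sentence 2a,placeholder sentence 2b,antistereo,gender
placeholder sentence 3a,placeholder sentence 3b,stereo,age
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Number of sentence pairs per bias type.
pairs_per_bias = Counter(row["bias_type"] for row in rows)

# Number of pairs per (bias type, direction) combination.
direction_per_bias = Counter(
    (row["bias_type"], row["stereo_antistereo"]) for row in rows
)

print(dict(pairs_per_bias))
print(dict(direction_per_bias))
```

With the real file, the inline string would be replaced by an open file handle for crows_pairs_anonymized.csv.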
File: prompts.csv | Column name | Descripti...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the author gender dataset (as a comma-delimited .csv file) originally created in association with the paper entitled 'The Impact of Gender on Conference Authorship in Audio Engineering: Analysis Using a New Data Collection Method', but since extended to include conferences up to the end of 2019. The original dataset is available at: https://doi.org/10.5281/zenodo.1249693. Please cite both the paper and the relevant dataset if used. Visualisation is available at: http://tibbakoi.github.io/aesgender.
---
The dataset was produced using a novel method which used self-identified pronouns, therefore allowing for as many groups as necessary to describe the population.
---
The columns in the dataset are as follows:
NB: Some grouping of the data is required as online conference proceedings are not always consistent (Column 10). Some labelling of the data is required to determine which entries to include in certain types of analysis (Columns 11-13).
---
This dataset is distributed under the Creative Commons Attribution 4.0 license in the hope that it will prove useful, but with no warranty, not even the implied warranty of merchantability or fitness for a particular purpose.
---
Dataset curated by: Kat Young and Michael Lovedee-Turner, formerly at the AudioLab, Dept. of Electronic Engineering, University of York.
Contact: kathryn.ae.young@gmail.com
https://creativecommons.org/publicdomain/zero/1.0/
Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.
This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.
The dataset provided here is a simulated example and was generated using the online platform found at Mockaroo. This web-based tool offers a service that enables the creation of customizable Synthetic datasets that closely resemble real data. It is primarily intended for use by developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.
Cover Photo by: Kevin Woblick on Unsplash
Thumbnail by: Airplane icons created by Freepik - Flaticon
This dataset contains records of research articles extracted from the Web of Science (WoS) from 1980 to 2019. In total, it covers 15,642 journals, 28,241,100 articles and 111,980,858 authorships across 153 research areas.
The main dataset (author_address_article_gend_v3.parquet), in Parquet format, contains all the authorships, where an authorship is defined as an (article, author) pair. Each authorship (row) has the following variables:
ut: unique article identifier.
daisng_id: unique author identifier.
author_no: author number, as listed in the article.
country: author country (two-letter ISO code).
date: publication date.
gender: gender of the author ("male" or "female"), as provided by the Genderize.io API.
probability: probability of the gender attribute, as provided by the Genderize.io API.
count: number of entries for the author first name, as provided by the Genderize.io API.
jsc: journal subject category.
field: field of research.
research_area: area of research.
n_aut: number of authors in this publication.
journal: journal name.
alphabetical: whether the author list for this article is in alphabetical order.
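Because gender here is an inferred attribute with an associated probability and count, analyses often restrict themselves to confident assignments. The following sketch shows one way such a filter could look on rows already loaded into Python dictionaries (reading the Parquet file itself would need a third-party library such as pandas or pyarrow). The thresholds, helper names and sample rows are hypothetical and are not taken from the dataset or its authors' method.

```python
# Placeholder authorships using a subset of the variables listed above.
SAMPLE = [
    {"ut": "A1", "daisng_id": 1, "gender": "female", "probability": 0.98, "count": 1200},
    {"ut": "A1", "daisng_id": 2, "gender": "male", "probability": 0.55, "count": 40},
    {"ut": "A2", "daisng_id": 3, "gender": "male", "probability": 0.97, "count": 3},
    {"ut": "A2", "daisng_id": 1, "gender": "female", "probability": 0.98, "count": 1200},
    {"ut": "A3", "daisng_id": 4, "gender": "male", "probability": 0.99, "count": 900},
]


def confident_authorships(rows, min_probability=0.9, min_count=10):
    """Keep rows whose inferred gender meets both illustrative thresholds."""
    return [
        row for row in rows
        if row["gender"] in ("male", "female")
        and row["probability"] >= min_probability
        and row["count"] >= min_count
    ]


def female_share(rows):
    """Fraction of authorships attributed to female authors."""
    if not rows:
        return None
    return sum(row["gender"] == "female" for row in rows) / len(rows)


kept = confident_authorships(SAMPLE)
print(len(kept), female_share(kept))
```

Filtering changes the population being described, so any share computed this way should be reported together with the thresholds used and the number of rows dropped.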
With the previous dataset, a resampler was applied to generate null homophily values for each year. There are 4 datasets in R Data Serialization (RDS) format:
null_field.rds: null homophily values per country, year and field of research.
null_field_comp.rds: null homophily values per year and field of research (only for complete authorships).
null_research.rds: null homophily values per year and area of research.
null_research_comp.rds: null homophily values per year and area of research (only for complete authorships).
All these datasets have the same structure:
country: country (two-letter ISO code).
year: year.
variable: either field or research area name.
m: average homophily.
s: homophily std. error.
Finally, some supplementary files used in the descriptive analysis and methods:
File null_research_l2019.rds is an example of the output from the resampling algorithm for year 2019.
File wos_category_to_field.csv is a mapping from WoS categories to more general fields.
File jcr_if_2020.csv contains the percentiles of the journal impact factor for the JCR 2020.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains data and analysis code associated with the manuscript: L.P. Schwartz, J. Liénard, S. V. David. (2022) "Impact of gender on formation and outcome of formal mentoring relationships in the life sciences." Figures and tables in the manuscript can be produced by running the make_figures.ipynb notebook. Figures have been marked with headings indicating their position in the manuscript (Figure 1, Figure S1, etc.). In addition, the notebook contains code to reproduce regression analyses that are cited in the text but not directly associated with a figure.
Data on mentoring relationships derives from Academic Family Tree (AFT, www.academictree.org) and public data sources on funding, publications, and awards. Inclusion criteria, public data sources, and procedures for linking across sources are described in the manuscript. Personal identifiers for researchers have been anonymized, but remain consistent across all data in the repository. In other words, the personal identifier "1" refers to the same person in all dataframes in the repository. But, that person is *not* the same researcher identified as "1" on the public AFT website.
Installation
Requires Python 3.x and Pandas. To install the required libraries using Anaconda, run:
`conda create --name aft -c conda-forge pandas numpy scipy ipython jupyterlab scikit-learn matplotlib statsmodels seaborn pytables`
Dataframes
Data is stored as a series of Pandas dataframes within HDF5 or CSV files:
* cng_tc: The primary dataset used in the analysis. The name is an acronym for "connections" (i.e. training relationships, "cn"), "gender" ("g"), and "trainee count" ("tc"). Each row contains data on the mentor and trainee in one training relationship. See manuscript for inclusion criteria.
* mentors: Data on mentors. Each row contains data on one mentor. See manuscript for inclusion criteria.
* mentors_grants, mentors_hindex, mentors_locs_ranked: Subset of mentors with data available for funding (mentors_grants), citation (mentors_hindex), and institution rank (mentors_locs_ranked).
* mentors_nobel, mentors_hhmi, mentors_nas: Subsets of mentors that received a Nobel (mentors_nobel), Howard Hughes Medical Institute grants (mentors_hhmi), or membership in the National Academy of Sciences (mentors_nas). See manuscript for details of data sources and linking procedures.
* cn, cng, first_names, gn, gn_all, locs: Partial data (connections only, inferred gender only, connections and gender only, location only, first names and inferred gender only) for more inclusive sets of researchers in AFT. They are generally not used for analysis, but have been included here to calculate statistics on the total amount of data included and to screen for data from U.S. locations.
* nsf_gender_phds, nsf_gender_pds: National Science Foundation survey data on gender and fraction PhDs conferred per year (nsf_gender_phds) or fraction postdocs employed per year (nsf_gender_pds). See manuscript for details of data source.
* photo: Data for validation of gender inference method.
Dataframe columns
* amount: Mentor's total funding
* amount_adj: Mentor's total funding (adjusted to 2020 dollars)
* broad_field: Mentor's general research area (e.g., life sciences, engineering, based on National Science Foundation classifications)
* continue: Whether trainee went on to become a mentor (i.e., has trainees listed in AFT)
* country: Country in which mentor's current institution is located
* firstname: First name of researcher (table of first names is not aligned with tables containing anonymized personal identifiers)
* first_grant_year: Year of mentor's first grant
* funding_rate: Mentor's annual funding rate (since first grant)
* funding_rate_adj: Mentor's annual funding rate (since first grant) adjusted to 2020 dollars
* hhmi: Whether mentor was granted HHMI funding
* hindex: Mentor's h-index
* location: Name of mentor's current institution
* locid: Identifier for mentor's institution
* locid_rank: Position of mentor's institution in 2015 Quacquarelli-Symonds rankings (lower numbers are better)
* locid_rank_rev: Reversed version of "locid_rank" (i.e., higher numbers are better)
* majorarea: Mentor's specific research area (e.g., neuroscience)
* male_mentor, male_trainee: Whether the probability that a researcher's first name is used by a person identifying as a man meets threshold (see manuscript for details on gender inference using first names)
* match_score: Score for string match between institution or name of awardee and researcher
* mentor_career_start: The date at which the mentor's academic career began
* mentor_continue_rate: Fraction of mentor's trainees that become mentors
* mentor_continue_rate_ft: Fraction of mentor's woman trainees that become mentors
* mentor_continue_rate_mt: Fraction of mentor's man trainees that become mentors
* mentor_t_p_male0: Fraction of mentor's trainees that are men
* mentor_t_p_male0_gs: Fraction of mentor's trainees that are men (graduate students only)
* mentor_t_p_male0_pd: Fraction of mentor's trainees that are men (postdocs only)
* mentor_tcount0: Mentor's total number of trainees
* nas: Whether mentor is a member of the National Academy of Sciences
* nobel: Whether mentor is a Nobel laureate
* p_male_mentor, p_male_trainee: Probability that a researcher's first name is used by a person identifying as a man
* pid: Anonymized identifier of researcher
* pid_mentor: Anonymized identifier of mentor in training relationship
* pid_trainee: Anonymized identifier of trainee in training relationship
* pq: "1" if data on training relationship is drawn from ProQuest database and has not been manually edited by a human AFT user
* relation: Type of training relationship (1: graduate student, 2: postdoc)
* scorer1, scorer2, scorer3: Results of photo validation of gender inference for each scorer
* start: Training start year
* stop: Training end year
* trainee_tcount: Total people that the trainee has trained
* triad: Whether trainee has participated in both a graduate-level and postdoctoral training relationship
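As a rough illustration of how a derived column such as mentor_continue_rate relates to the per-relationship rows, the sketch below recomputes it from the `pid_mentor` and `continue` columns described above. The example rows are made up, and the code assumes one row per trainee; the repository's own notebook may handle repeated trainees or exclusions differently.

```python
from collections import defaultdict


def mentor_continue_rates(rows):
    """Return {pid_mentor: fraction of that mentor's rows with continue set}."""
    totals = defaultdict(int)
    continued = defaultdict(int)
    for row in rows:
        totals[row["pid_mentor"]] += 1
        if row["continue"]:
            continued[row["pid_mentor"]] += 1
    return {pid: continued[pid] / totals[pid] for pid in totals}


# Made-up rows for illustration only; they are not taken from the repository.
example_rows = [
    {"pid_mentor": 1, "pid_trainee": 10, "continue": 1},
    {"pid_mentor": 1, "pid_trainee": 11, "continue": 0},
    {"pid_mentor": 2, "pid_trainee": 12, "continue": 1},
]
rates = mentor_continue_rates(example_rows)
```

Restricting `rows` to trainees of one inferred gender before calling the function would give the analogue of mentor_continue_rate_ft or mentor_continue_rate_mt.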
The cn dataframe follows slightly different naming conventions, but is not generally used in the analysis (pid1 = pid_trainee, pid2 = pid_mentor, startdate = start, stopdate = stop).
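The column correspondence above can be written down as a small mapping. This is a minimal sketch on a plain dictionary row with made-up values; the name `CN_TO_STANDARD` is introduced here for illustration and does not appear in the repository.

```python
# Mapping from cn column names to the names used in the other dataframes,
# taken from the correspondence stated above.
CN_TO_STANDARD = {
    "pid1": "pid_trainee",
    "pid2": "pid_mentor",
    "startdate": "start",
    "stopdate": "stop",
}


def standardize_cn_row(row):
    """Rename cn-style keys; keys without a mapping are kept unchanged."""
    return {CN_TO_STANDARD.get(key, key): value for key, value in row.items()}


# Made-up row for illustration only.
example = {"pid1": 10, "pid2": 1, "startdate": 2001, "stopdate": 2006, "relation": 1}
standardized = standardize_cn_row(example)
```

With the data loaded as a pandas dataframe, the same mapping can be passed to `DataFrame.rename(columns=CN_TO_STANDARD)`.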
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.
- Each entry includes demographic, professional, and platform-related information such as:
- Name, gender, age, and country
- Primary skill and years of experience
- Hourly rate (with mixed formatting), client rating, and satisfaction score
- Language spoken (based on country)
- Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)
- Gender-based names using Faker’s male/female name generators
- Realistic age and experience distribution (with missing and noisy values)
- Country-language pairs mapped using actual linguistic data
- Messy formatting: mixed data types, missing values, inconsistent casing
- Generated entirely in Python using the faker library; no real data used
- Practicing data cleaning and preprocessing
- Performing EDA (Exploratory Data Analysis)
- Developing data pipelines: raw → clean → model-ready
- Teaching feature engineering and handling real-world dirty data
- Exercises in data validation, outlier detection, and format standardization
global_freelancers_raw.csv
| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
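A minimal cleaning sketch for three of the messy columns described in the table above (`hourly_rate (USD)`, `is_active`, `client_satisfaction`). The exact raw spellings in the file are not listed here, so the accepted tokens (for example "yes"/"no") and the helper names are assumptions to adjust after inspecting the data.

```python
import re


def parse_hourly_rate(value):
    """Return the first number found in the value as a float, else None.

    Thousands separators are not handled in this sketch.
    """
    if value is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else None


def parse_is_active(value):
    """Map common string/number/boolean spellings to True/False, else None."""
    text = str(value).strip().lower()
    if text in {"1", "true", "yes", "y"}:  # assumed spellings
        return True
    if text in {"0", "false", "no", "n"}:  # assumed spellings
        return False
    return None


def parse_satisfaction(value):
    """Accept values such as "85%" or 85 and return a float, else None."""
    if value is None:
        return None
    text = str(value).strip().rstrip("%")
    try:
        number = float(text)
    except ValueError:
        return None
    # float("nan") parses successfully, so filter NaN explicitly.
    return None if number != number else number
```

Unrecognized values come back as None rather than raising, so they can be counted and reviewed before deciding how to impute or drop them.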
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset name: asppl_dataset_v2.csv
Version: 2.0
Dataset period: 06/07/2018 - 01/14/2022
Dataset Characteristics: Multivalued
Number of Instances: 8118
Number of Attributes: 9
Missing Values: Yes
Area(s): Health and education
Sources:
Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);
Brazilian Occupational Classification (CBO) (Brasil, 2022b);
National Registry of Health Establishments (CNES) (Brasil, 2022c);
Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).
Description: The data contained in the asppl_dataset_v2.csv dataset (see Table 1) originates from participants of the technology-based educational course “Health Care for People Deprived of Freedom.” The course is available on the AVASUS (Brasil, 2022a). This dataset provides elementary data for analyzing the course’s impact and reach and the profile of its participants. In addition, it brings an update of the data presented in work by Valentim et al. (2021).
Table 1: Description of AVASUS dataset features.
| Attributes | Description | Datatype | Value |
| --- | --- | --- | --- |
| gender | Gender of the course participant. | Categorical. | Feminino / Masculino / Não Informado. (In English, Female, Male or Uninformed) |
| course_progress | Percentage of completion of the course. | Numerical. | Range from 0 to 100. |
| course_evaluation | A score given to the course by the participant. | Numerical. | 0, 1, 2, 3, 4, 5 or NaN. |
| evaluation_commentary | Comment made by the participant about the course. | Categorical. | Free text or NaN. |
| region | Brazilian region in which the participant resides. | Categorical. | Brazilian region according to IBGE: Norte, Nordeste, Centro-Oeste, Sudeste or Sul (In English North, Northeast, Midwest, Southeast or South). |
| CNES | The CNES code refers to the health establishment where the participant works. | Numerical. | CNES Code or NaN. |
| health_care_level | Identification of the health care network level for which the course participant works. | Categorical. | “ATENCAO PRIMARIA”, “MEDIA COMPLEXIDADE”, “ALTA COMPLEXIDADE”, and their possible combinations. |
| year_enrollment | Year in which the course participant registered. | Numerical. | Year (YYYY). |
| CBO | Participant occupation. | Categorical. | Text coded according to the Brazilian Classification of Occupations or “Indivíduo sem afiliação formal.” (In English “Individual without formal affiliation.”) |
Dataset name: prison_syphilis_and_population_brazil.csv
Dataset period: 2017 - 2020
Dataset Characteristics: Multivalued
Number of Instances: 6
Number of Attributes: 13
Missing Values: No
Source:
National Penitentiary Department (DEPEN) (Brasil, 2022d);
Description: The data contained in the prison_syphilis_and_population_brazil.csv dataset (see Table 2) originate from the National Penitentiary Department Information System (SISDEPEN) (Brasil, 2022d). This dataset provides data on the population and prevalence of syphilis in the Brazilian prison system. In addition, it brings a rate that represents the normalized data for purposes of comparison between the populations of each region and Brazil.
Table 2: Description of DEPEN dataset Features.
| Attributes | Description | Datatype | Value |
| --- | --- | --- | --- |
| Region | Brazilian region in which the participant resides. In addition, the sum of the regions, which refers to Brazil. | Categorical. | Brazil and Brazilian region according to IBGE: North, Northeast, Midwest, Southeast or South. |
| syphilis_2017 | Number of syphilis cases in the prison system in 2017. | Numerical. | Number of syphilis cases. |
| syphilis_rate_2017 | Normalized rate of syphilis cases in 2017. | Numerical. | Syphilis case rate. |
| syphilis_2018 | Number of syphilis cases in the prison system in 2018. | Numerical. | Number of syphilis cases. |
| syphilis_rate_2018 | Normalized rate of syphilis cases in 2018. | Numerical. | Syphilis case rate. |
| syphilis_2019 | Number of syphilis cases in the prison system in 2019. | Numerical. | Number of syphilis cases. |
| syphilis_rate_2019 | Normalized rate of syphilis cases in 2019. | Numerical. | Syphilis case rate. |
| syphilis_2020 | Number of syphilis cases in the prison system in 2020. | Numerical. | Number of syphilis cases. |
| syphilis_rate_2020 | Normalized rate of syphilis cases in 2020. | Numerical. | Syphilis case rate. |
| pop_2017 | Prison population in 2017. | Numerical. | Population number. |
| pop_2018 | Prison population in 2018. | Numerical. | Population number. |
| pop_2019 | Prison population in 2019. | Numerical. | Population number. |
| pop_2020 | Prison population in 2020. | Numerical. | Population number. |
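The description does not state how the rate columns were scaled. As a hedged sketch, a normalized rate is commonly computed as cases divided by population, multiplied by a fixed scale; the `per=1000` default and the numbers below are assumptions for illustration, not values from the dataset.

```python
def normalized_rate(cases, population, per=1000):
    """Cases per `per` people; the scale factor is an assumption, not from the source."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population


# Made-up numbers for illustration only.
example_rate = normalized_rate(cases=50, population=10_000)
```

Comparing a recomputed value against the published syphilis_rate columns for one region and year is a quick way to find the scale the authors actually used.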
Dataset name: students_cumulative_sum.csv
Dataset period: 2018 - 2020
Dataset Characteristics: Multivalued
Number of Instances: 6
Number of Attributes: 7
Missing Values: No
Source:
Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);
Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).
Description: The data contained in the students_cumulative_sum.csv dataset (see Table 3) originate mainly from AVASUS (Brasil, 2022a). This dataset provides data on the number of students by region and year. In addition, it brings a rate that represents the normalized data for purposes of comparison between the populations of each region and Brazil. We used population data estimated by the IBGE (Brasil, 2022e) to calculate the rate.
Table 3: Description of Students dataset Features.
By Priyanka Dobhal [source]
This dataset collects information on the Academy Award for Best Director winners from 1930 to 2019 and provides insight into gender and racial disparity in filmmaking over time. It includes the winner's name, the award year, race, gender, the nominated or winning film title, and the filmmakers' names. With this data it is possible to identify trends in cinema, such as who dominates awards recognition, and to consider how much progress has been made toward equal opportunity within Hollywood. Examining Oscar-winning directors over time can also show how diversity among winners has changed and what that says about systemic issues in society. To understand the award's significance, it helps to consider all the factors included, from the awarded directors' gender to the kinds of films these awards support each year. The data covers almost nine decades of cinematic history, starting from 1930.
This dataset is a resource for anyone looking to analyze the Academy Award for Best Director winners from 1930 to 2019. It contains the year, gender, race, director(s), film, and nomination/winner status for each entry.
By using this dataset, you can gain insight into trends in Oscar winning directors over time. For example, you can compare the number of nominees between different years or examine differences in representation of gender and race among directors who have won Oscars over time. Additionally, you can use this data to explore the films that have received an Oscar for best director – which films were most successful from a narrative perspective? Or analyze which films used unique filming techniques or visual designs? Finally, this dataset also makes it possible to conduct more targeted analyses by identifying patterns across multiple aspects such as furthering social issues that are depicted in film through positive filmmaking - such as LGBTQ representation.
To start exploring with this dataset:
2) Open your favorite spreadsheet program ('Microsoft Excel', 'Libre Office', etc.)
3) Load the csv file with the 'File -> Open' command
4) Review column headers and values contained within each row
5) Start creating charts and graphs (pie charts, bar plots, etc.) that show trends over time according to your needs
6) Take notes while analyzing datasets
7) Publish your findings online if desired
The possibilities are endless! If you’d like additional guidance or tips on how to effectively use this dataset, please subscribe to our newsletter at oscarwinningdirectorsanalysisgmail.com
- Analyzing gender and racial disparity in the Academy award for Best Director across different years.
- Investigating if the age of directors has an effect on what film they create and how successful it is at winning an Oscar for Best Director.
- Crafting a recommendation system to recommend movies based on a director's previous Oscar-winning work or even pair users with film recommendations that have similar director/genre preference in order to discover new titles they may enjoy watching
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: Oscar Winners - Director.csv
| Column name | Description |
|:----------------------|:--------------------------------------------------------------|
| Year | The year in which the award was given. (Integer) |
| Gender | The gender of the director. (String) |
| Race | The race of the director. (String) |
| Director(s) | The name of the director(s). (String) |
| Film | The title of the film that won the award. (String) |
| Nomination/Winner | Whether the director was nominated or won the award. (String) |
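For readers who prefer a script to a spreadsheet, the sketch below counts winners per decade and gender using the column names from the table above. The sample rows are placeholders, and the values "Winner", "Male", and "Female" are assumed spellings; check the real file before relying on them.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE_CSV = """Year,Gender,Race,Director(s),Film,Nomination/Winner
1931,Male,White,Director A,Film A,Winner
1938,Male,White,Director B,Film B,Winner
2009,Female,White,Director C,Film C,Winner
2009,Male,White,Director D,Film D,Nomination
"""


def winners_by_decade_and_gender(csv_file):
    """Count rows marked as winners, keyed by (decade, gender)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # "winner" is an assumed value for the Nomination/Winner column.
        if row["Nomination/Winner"].strip().lower() != "winner":
            continue
        decade = int(row["Year"]) // 10 * 10
        counts[(decade, row["Gender"].strip())] += 1
    return counts


counts = winners_by_decade_and_gender(io.StringIO(SAMPLE_CSV))
```

To run it on the real file, pass `open("Oscar Winners - Director.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.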
If you use this dataset in your research, please credit Priyanka Dobhal.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Companion dataset to "The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact" by Duch J, Zeng XHT, Sales-Pardo M, Radicchi F, Otis S, Woodruff TK, Amaral LAN (PLoS ONE 7, e51332, 2012) doi: 10.1371/journal.pone.0051332. This dataset lists the total number of publications by 4,394 faculty members from 7 distinct research fields working at top U.S. institutions. The dataset also contains bibliographic information manually gathered from the CVs of those faculty members. The publications data was collected from Thomson Reuters' Web of Science according to the procedures described in the published paper.
The data is a single csv file with the following fields:
* author_name - researcher name as: Last name, Initials
* gender - researcher gender as: M (male) or F (female)
* univ_name - Institution of current employment
* field - scientific discipline
* phd_year - year of PhD completion
* nationality - Country of origin
* background - List of degrees
* affiliations - List of honours and past appointments
* total_pubs - Total number of publications
Some fields are not available for some researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is a new version of the previous upload and includes the following files:
1. dataset_versions/tell_all.csv: The initial dataset of 1,280,927 extracted speeches, before preprocessing and cleaning. The speeches extend chronologically from July 1989 up to July 2020 and were exported from 5,355 parliamentary sitting record files. The file has a total volume of 2.5 GB and includes the following columns:
   * member_name: the name of the individual who spoke during a sitting.
   * sitting_date: the date the sitting took place.
   * parliamentary_period: the name and/or number of the parliamentary period that the speech took place in. A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions.
   * parliamentary_session: the name and/or number of the parliamentary session that the speech took place in. A session is defined as a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution. A session can fall into the following categories: regular, extraordinary or special. In the intervals between the sessions the parliament is in recess. A parliamentary session includes multiple parliamentary sittings.
   * parliamentary_sitting: the name and/or number of the parliamentary sitting that the speech took place in. A sitting is defined as a meeting of parliament members.
   * political_party: the political party of the speaker.
   * government: the government in force when the speech took place.
   * member_region: the electoral district the speaker belonged to.
   * roles: information about the parliamentary roles and/or government position of the speaker.
   * member_gender: the gender of the speaker.
   * speech: the speech that the individual gave during the parliamentary sitting.
2. dataset_versions/tell_all_FILLED.csv: This file is an intermediate version of the dataset that includes improvements in the consistency and completeness of the dataset, with a total volume of 2.5 GB. Specifically, this file is produced by filling the missing names of chairmen of various parliamentary sittings of the "tell_all.csv". It includes the same columns as the "tell_all.csv" file.
3. dataset_versions/tell_all_cleaned.csv: This version of the dataset is the result of further cleaning and preprocessing and is used for our word usage change study. It consists of 1,280,918 speech fragments of Greek parliament members in the order of the conversation that took place, with a total volume of 2.12 GB. It includes the same columns as the aforementioned versions. The preprocessing includes the replacement of all references to political parties with the symbol "@" followed by an abbreviation of the party name, using regular expressions that capture different grammatical cases and variations. It also includes the removal of accents, strings with length less than 2 characters, all punctuation except full stops, and the replacement of stopwords with "@sw".
4. wiki_data: A folder of modern Greek female and male names and surnames and their available grammatical cases crawled from the entries of the Wiktionary Greek names category (https://en.wiktionary.org/wiki/Category:Greek_names). We produced the grammatical cases of the missing grammatical entries according to the rules of the Greek grammar and saved the files in the same folder by adding to their filenames the string "_populated.json".
5. parl_members_activity_1989onwards_with_gender.csv: The Greek Parliament website provides a list of all the elected members of parliament since the fall of the military junta in Greece, in 1974. We collected and cleaned the data, added the gender and kept the elected members from 1989 onwards, matching the available parliament proceeding records. This dataset includes the full names of the members, the date range of their service, the political party they served, the electoral district they belonged to and their gender.
6. formatted_roles_gov_members_data.csv: As government members we refer to individuals in ministerial or other government posts, regardless of whether they were elected in the parliament. This information is available in the website of the Secretariat General for Legal and Parliamentary Affairs. The government members dataset includes the full names of the official individuals, the name of the role they were given, the date range of their service at each specific role and their gender.
7. governments_1989onwards.csv: A dataset of government information including the names of governments since 1989, their start and end dates, and a URL that points to the respective official government web page of each past government. The data is crawled from the website of the Secretariat General for Legal and Parliamentary Affairs.
8. extra_roles_manually_collected.csv: A dataset with manually collected information from Wikipedia about additional government or parliament posts such as Chairman of the Parliament,
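The preprocessing described for tell_all_cleaned.csv (party references replaced by "@" plus an abbreviation, accents removed, short strings and most punctuation dropped, stopwords replaced by "@sw") can be sketched roughly as follows. This is not the authors' code: the party pattern, the stopword list, and the order of the steps are assumptions, and English placeholders stand in for the Greek regular expressions.

```python
import re
import unicodedata

# Hypothetical party pattern and stopword list, for illustration only.
PARTY_PATTERNS = {"xp": re.compile(r"\bexample party\b", re.IGNORECASE)}
STOPWORDS = {"the", "of", "and"}


def strip_accents(text):
    """Decompose characters and drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(speech):
    text = speech
    # 1. Replace party references with "@" followed by an abbreviation.
    for abbreviation, pattern in PARTY_PATTERNS.items():
        text = pattern.sub("@" + abbreviation, text)
    # 2. Remove accents.
    text = strip_accents(text)
    # 3. Drop punctuation except full stops; "@" is kept because it marks
    #    the tokens inserted in step 1.
    text = re.sub(r"[^\w\s.@]", " ", text)
    text = text.replace(".", " . ")
    kept = []
    for token in text.split():
        if token == "." or token.startswith("@"):
            kept.append(token)
        elif token.lower() in STOPWORDS:
            kept.append("@sw")  # 4. Replace stopwords.
        elif len(token) >= 2:   # 5. Drop strings shorter than 2 characters.
            kept.append(token)
    return " ".join(kept)


cleaned = preprocess("The café of the Example Party, a vote. Done!")
```

The accent step works the same way on Greek text: `strip_accents("Ελλάδα")` yields the unaccented form because the tonos decomposes into a separate combining mark.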
By Derek Howard [source]
This dataset provides a tool for generating gender-specific datasets from names alone. It contains the probability of a person's name belonging to a certain gender, based on US Social Security records from the last century. This makes it easy to assign genders to datasets that do not natively include this data. All probability values were drawn from records with 5 or more people associated with each name, so individuals with less common names can still have their genders predicted. With this resource, users can generate gender-aware data quickly, making gender identification in datasets more accurate and easier.
This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.
To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).
In addition to simply assigning genders from these probabilities alone, users of this dataset also have more control over their classifications - they can use it as either a baseline or as an absolute measure of accuracy depending on their exact needs/preferences. Experimentation is highly encouraged here!
Good luck!
Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.
Generate gender neutral names - use this data to generate random names with no gender bias.
Automate record lookup - quickly and accurately assign genders based on the probability associated with their name
If you use this dataset in your research, please credit the original authors.
License
Unknown License - Please check the dataset description for more information.
File: name_gender.csv
| Column name | Description |
|:----------------|:--------------------------------------------------------------------|
| name | The name of the person. (String) |
| gender | The gender of the person. (String) |
| probability | The probability of the gender being assigned to the person. (Float) |
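A minimal lookup sketch using the three columns listed above. It assumes that `probability` refers to the gender given in the same row, that gender is coded as "M"/"F", and that a caller-chosen cutoff (`min_probability`, a hypothetical parameter) decides when a name is too ambiguous to label; the sample rows are made up.

```python
import csv
import io

# Made-up rows for illustration only; the "M"/"F" coding is an assumption.
SAMPLE_CSV = """name,gender,probability
Alex,M,0.62
Maria,F,0.99
"""


def load_name_table(csv_file):
    """Build {lowercased name: (gender, probability)} from the csv rows."""
    table = {}
    for row in csv.DictReader(csv_file):
        table[row["name"].strip().lower()] = (row["gender"], float(row["probability"]))
    return table


def infer_gender(first_name, table, min_probability=0.9):
    """Return the listed gender if its probability meets the cutoff, else None."""
    entry = table.get(first_name.strip().lower())
    if entry is None:
        return None  # name not in the table
    gender, probability = entry
    return gender if probability >= min_probability else None


table = load_name_table(io.StringIO(SAMPLE_CSV))
```

Returning None for unknown or ambiguous names keeps those records visible, so they can be reviewed rather than silently assigned a gender.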
If you use this dataset in your research, please credit Derek Howard.