33 datasets found

Gender by Name (Time-series)
kaggle.com
Updated Dec 5, 2022
Cite
The Devastator (2022). Gender by Name (Time-series) [Dataset]. https://www.kaggle.com/datasets/thedevastator/automated-gender-identification-using-name-proba/data
Dataset updated
Dec 5, 2022
Dataset provided by
Kaggle
Authors
The Devastator
Description
Automated Gender Identification Using Name Probabilities

2019 US Social Security Administration Data

By Derek Howard [source]

About this dataset

This dataset supports assigning a likely gender to records that contain only a first name. It gives the probability that a name belongs to a given gender, based on US Social Security records from the last century, which makes it possible to add a gender field to datasets that do not natively include one. All probability values were derived from records with 5 or more people associated with each name, so many less common names are covered as well.

How to use the dataset

This dataset is useful when you need to infer gender from first names, for example to annotate datasets that contain names but no other information about the person.

To get started, you will need a CSV file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female). Note that the Columns section for name_gender.csv lists a third column, gender, and describes probability as the probability of that gender.

Beyond assigning genders directly from these probabilities, you can choose how to treat them: as a baseline to combine with other signals, or as the sole basis for classification, depending on your needs. Experimentation is encouraged.
Good luck!
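As a rough illustration of the lookup described above, here is a minimal sketch using only the Python standard library. It assumes the three-column layout given in the Columns section (name, gender, probability); the sample rows, the "F"/"M" codes, and the `min_probability` threshold are made up for the example and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample rows in the layout described for name_gender.csv
# (columns: name, gender, probability). The values are invented.
SAMPLE_CSV = """name,gender,probability
Mary,F,0.99
James,M,0.98
Casey,M,0.58
"""


def load_lookup(csv_file):
    """Map lower-cased name -> (gender, probability)."""
    reader = csv.DictReader(csv_file)
    return {
        row["name"].strip().lower(): (row["gender"], float(row["probability"]))
        for row in reader
    }


def assign_gender(first_name, lookup, min_probability=0.9):
    """Return the listed gender, or None if the name is unknown or too ambiguous."""
    entry = lookup.get(first_name.strip().lower())
    if entry is None:
        return None
    gender, probability = entry
    return gender if probability >= min_probability else None


lookup = load_lookup(io.StringIO(SAMPLE_CSV))
labels = {name: assign_gender(name, lookup) for name in ["Mary", "casey", "Zed"]}
```

Returning None for unknown or low-probability names keeps ambiguous cases visible instead of forcing a label; lowering `min_probability` trades coverage against confidence.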

Research Ideas

Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.

Generate gender neutral names - use this data to generate random names with no gender bias.

Automate record lookup - quickly assign genders to records based on the probability associated with each name.

Acknowledgements

If you use this dataset in your research, please credit the original author, Derek Howard.

Data Source

License

Unknown License - Please check the dataset description for more information.

Columns

File: name_gender.csv

| Column name | Description |
|:------------|:------------|
| name | The name of the person. (String) |
| gender | The gender of the person. (String) |
| probability | The probability of the gender being assigned to the person. (Float) |

The annual list of first names of newborns — city of Nancy
gimi9.com
data.europa.eu
Updated Dec 16, 2023
Cite
(2023). The annual list of first names of newborns — city of Nancy [Dataset]. https://gimi9.com/dataset/eu_5d2c2919634f41429aae86ce/
Dataset updated
Dec 16, 2023
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Description
The annual list of first names of newborns is a simple and popular dataset. These data, from the civil status register, contain the following essential fields: sex of the newborn, first name of the newborn, number of occurrences of the first name for the corresponding year, and year of the survey.

The dataset consists of the list of first names of children born in Nancy since 2016, in CSV format, with the number of occurrences of each given name, classified by year and sex. First names with fewer than five occurrences are not published, in order to protect personal data. The standardisation of this dataset follows the recommendations of Opendata France resulting from its work on the Socle Commun des Données Locales (common base of local data).

Definition of headers

COLL_NOM: name of the municipality

COLL_INSEE: Insee code of the municipality where the first names are registered in the civil status records (the place of birth). Note that the place of birth may differ from the parents' place of residence.

CHILD_SEX: sex corresponding to the first name: M for male, F for female

CHILD_PRENOM: first name of the newborn(s), as recorded in the civil status documents of the corresponding year

NUMBER_OCCURENCES: number of occurrences of the first name

YEAR: year of birth

Total births reported to the City of Nancy

2018: 5135 births in total (2692 girls, 2443 boys)

2017: 5483 births in total (2704 girls, 2779 boys)

2016: 5544 births in total (2692 girls, 2852 boys)
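A minimal sketch of summing published occurrences by year and sex, using the header names as they appear in this description. The sample rows, the comma delimiter, and the header spellings in the real file are assumptions; the published CSV may use different (for example French) headers or a different separator.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows using the header names from the description.
SAMPLE_CSV = """COLL_NOM,COLL_INSEE,CHILD_SEX,CHILD_PRENOM,NUMBER_OCCURENCES,YEAR
NANCY,54395,F,EMMA,12,2018
NANCY,54395,M,LOUIS,9,2018
NANCY,54395,F,JADE,7,2017
NANCY,54395,F,EMMA,10,2017
"""


def occurrences_by_year_and_sex(csv_file):
    """Sum NUMBER_OCCURENCES for each (YEAR, CHILD_SEX) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        key = (int(row["YEAR"]), row["CHILD_SEX"])
        totals[key] += int(row["NUMBER_OCCURENCES"])
    return dict(totals)


totals = occurrences_by_year_and_sex(io.StringIO(SAMPLE_CSV))
```

Because first names with fewer than five occurrences are withheld, sums computed this way on the real file would be expected to fall below the total births reported per year.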
Publication records for top US researchers in 6 fields
figshare.com
txt
Updated May 31, 2023
Cite
Amaral Lab (2023). Publication records for top US researchers in 6 fields [Dataset]. http://doi.org/10.6084/m9.figshare.3472616.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.3472616.v2
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Amaral Lab
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset lists the full publication and biographical records for 3,979 researchers in 6 distinct fields working at top U.S. institutions.

The file full_researcher_data.csv contains the list of all 3,979 researchers in csv format with the following fields:

field - scientific discipline

total_publications - total number of publications

short_name - researcher's abbreviated name as: [last name], [first and middle name initials]. Used for searching in the authors field in the file full_publication_data.csv.

full_name - researcher's full name as: [Last name], [First Name] [Middle initials]

phd_year - year of PhD completion

gender - researcher gender as: M (male) or F (female)

author_id - index of researcher. Matches the author_id field in the file full_publication_data.csv

university - institution of current employment

Gender and PhD year are not available for all researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.

The file full_publication_data.csv contains the list of all 417,609 publications in csv format with the following fields:

title - publication title

journal - publication journal

authors - publication authors as a comma-separated list. Author syntax matches the short_name field in the file full_researcher_data.csv.

year - publication year

author_id - index of researcher. Matches the author_id field in the file full_researcher_data.csv

abstract - publication abstract
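The two files are linked through author_id. The sketch below shows one way to join them and count publication rows per field and gender with the standard library. The sample rows are invented, and representing a missing gender as an empty string is an assumption about how the real file encodes it.

```python
import csv
import io
from collections import Counter

# Invented rows; only a subset of the documented columns is shown.
RESEARCHERS_CSV = """author_id,full_name,field,gender
1,"Doe, Jane A",chemistry,F
2,"Roe, John",chemistry,M
3,"Poe, Alex",ecology,
"""

PUBLICATIONS_CSV = """author_id,title,year
1,Paper A,2001
1,Paper B,2004
2,Paper C,2003
3,Paper D,2005
"""


def publications_by_field_and_gender(researchers_file, publications_file):
    """Count publication rows per (field, gender), skipping unknown gender."""
    researchers = {
        row["author_id"]: row for row in csv.DictReader(researchers_file)
    }
    counts = Counter()
    for publication in csv.DictReader(publications_file):
        researcher = researchers.get(publication["author_id"])
        if researcher is None or not researcher["gender"]:
            continue  # no matching researcher, or gender not available
        counts[(researcher["field"], researcher["gender"])] += 1
    return counts


counts = publications_by_field_and_gender(
    io.StringIO(RESEARCHERS_CSV), io.StringIO(PUBLICATIONS_CSV)
)
```

Skipping researchers without a recorded gender avoids silently folding them into either group.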
World Gender Statistics
kaggle.com
zip
Updated Nov 28, 2016
Cite
World Bank (2016). World Gender Statistics [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-gender-statistics/versions/1
Available download formats: zip (0 bytes)
Dataset updated
Nov 28, 2016
Dataset authored and provided by
World Bank (http://topics.nytimes.com/top/reference/timestopics/organizations/w/world_bank/index.html)
Area covered
World
Description
The Gender Statistics database is a comprehensive source for the latest sex-disaggregated data and gender statistics covering demography, education, health, access to economic opportunities, public life and decision-making, and agency.

The Data

The data is split into several files, with the main one being Data.csv. The Data.csv contains all the variables of interest in this dataset, while the others are lists of references and general nation-by-nation information.

Data.csv contains the following fields:

Data.csv

Country.Name: the name of the country

Country.Code: the country's code

Indicator.Name: the name of the variable that this row represents

Indicator.Code: a unique id for the variable

1960 - 2016: one column EACH for the value of the variable in each year it was available
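Since Data.csv stores one column per year, analysis often starts by reshaping it to one row per country, indicator, and year. The following is a minimal standard-library sketch; the sample row is invented, and it assumes the year columns are headed by plain four-digit numbers, which may not match the actual file.

```python
import csv
import io

# Invented sample row in the documented wide layout (only three year columns shown).
SAMPLE_CSV = """Country.Name,Country.Code,Indicator.Name,Indicator.Code,1960,1961,2016
Exampleland,EXL,Example indicator,EX.IND,1.5,,2.5
"""


def wide_to_long(csv_file):
    """Turn one-column-per-year rows into one record per available year value."""
    reader = csv.DictReader(csv_file)
    year_columns = [name for name in reader.fieldnames if name.isdigit()]
    long_rows = []
    for row in reader:
        for year in year_columns:
            value = row[year]
            if value in ("", None):
                continue  # the variable was not available for this year
            long_rows.append({
                "country_code": row["Country.Code"],
                "indicator_code": row["Indicator.Code"],
                "year": int(year),
                "value": float(value),
            })
    return long_rows


long_rows = wide_to_long(io.StringIO(SAMPLE_CSV))
```

Dropping empty cells keeps the long table limited to the years in which a value was actually reported.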

The other files

I couldn't find any metadata for these, and I'm not qualified to guess at what each of the variables means. I'll list the variables for each file, and if anyone has any suggestions (or, even better, actual knowledge/citations) as to what they mean, please leave a note in the comments and I'll add your info to the data description.

Country-Series.csv

CountryCode

SeriesCode

DESCRIPTION

Country.csv

Country.Code

Short.Name

Table.Name

Long.Name

2-alpha.code

Currency.Unit

Special.Notes

Region

Income.Group

WB-2.code

National.accounts.base.year

National.accounts.reference.year

SNA.price.valuation

Lending.category

Other.groups

System.of.National.Accounts

Alternative.conversion.factor

PPP.survey.year

Balance.of.Payments.Manual.in.use

External.debt.Reporting.status

System.of.trade

Government.Accounting.concept

IMF.data.dissemination.standard

Latest.population.census

Latest.household.survey

Source.of.most.recent.Income.and.expenditure.data

Vital.registration.complete

Latest.agricultural.census

Latest.industrial.data

Latest.trade.data

Latest.water.withdrawal.data

FootNote.csv

CountryCode

SeriesCode

Year

DESCRIPTION

Series-Time.csv

SeriesCode

Year

DESCRIPTION

Series.csv

Series.Code

Topic

Indicator.Name

Short.definition

Long.definition

Unit.of.measure

Periodicity

Base.Period

Other.notes

Aggregation.method

Limitations.and.exceptions

Notes.from.original.source

General.comments

Source

Statistical.concept.and.methodology

Development.relevance

Related.source.links

Other.web.links

Related.indicators

License.Type

Acknowledgements

This dataset was downloaded from The World Bank's Open Data project. The summary of the Terms of Use of this data is as follows:

You are free to copy, distribute, adapt, display or include the data in other products for commercial and noncommercial purposes at no cost subject to certain limitations summarized below.

You must include attribution for the data you use in the manner indicated in the metadata included with the data.

You must not claim or imply that The World Bank endorses your use of the data, or use The World Bank’s logo(s) or trademark(s) in conjunction with such use.

Other parties may have ownership interests in some of the materials contained on The World Bank Web site. For example, we maintain a list of some specific data within the Datasets that you may not redistribute or reuse without first contacting the original content provider, as well as information regarding how to contact the original content provider. Before incorporating any data in other products, please check the list: Terms of use: Restricted Data.

-- [ed. note: this last is not applicable to the Gender Statistics database]

The World Bank makes no warranties with respect to the data and you agree The World Bank shall not be liable to you in connection with your use of the data.

This is only a summary of the Terms of Use for Datasets Listed in The World Bank Data Catalogue. Please read the actual agreement that controls your use of the Datasets, which is available here: Terms of use for datasets. Also see World Bank Terms and Conditions.
Nyc popular baby names
kaggle.com
Updated Jun 20, 2022
Cite
Rahul Sarkar (2022). Nyc popular baby names [Dataset]. https://www.kaggle.com/datasets/rahulsarkar221/nyc-popular-baby-names
Dataset updated
Jun 20, 2022
Dataset provided by
Kaggle
Authors
Rahul Sarkar
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
New York
Description
This data contains popular baby names in New York.

Dataset: 1 file (popular-baby-names.csv)

Columns

Year of Birth: Year of the baby's birth.

Gender: Gender of the baby.

Ethnicity: Ethnicity of the baby.

Child's First Name: The first name of the child.

Count: Number of babies given that name.

Ranking: Ranking of that name.
Gender and Intersectional Disparities in Biographies on English and Spanish...
b2find.eudat.eu
Updated Jan 9, 2025
+ more versions
Cite
(2025). Gender and Intersectional Disparities in Biographies on English and Spanish Wikipedia Front Pages (2013-2023) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/0f9014ea-070d-5f25-8503-e46a38cf3c40
Dataset updated
Jan 9, 2025
Description
The dataset contains two folders with different data.

The "Gender" folder provides the gender distribution of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. For the Spanish edition, data was collected from the "Artículos buenos" and "Artículos destacados" sections and is displayed in aggregated form.

The "Intersectionality" folder provides the distribution, by various sociodemographic attributes, of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. It is structured into four CSV files. Three correspond to the English Wikipedia edition: the English 3C CSV, containing data from the sections "Did you know...", "In the news", and "On this day..."; a CSV dedicated to "English Featured Article"; and another dedicated to "English Featured Picture". The fourth CSV contains data from the Spanish edition of Wikipedia, extracted from the sections "Artículo Destacado" and "Artículo Bueno". Within each CSV, the data is presented in columns, each dedicated to a sociodemographic attribute.
Worldwide Gender Differences in Public Code Contributions - Replication...
data.niaid.nih.gov
zenodo.org
Updated Feb 9, 2022
Cite
Davide Rossi (2022). Worldwide Gender Differences in Public Code Contributions - Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6020474
Dataset updated
Feb 9, 2022
Dataset provided by
Stefano Zacchiroli
Davide Rossi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Worldwide Gender Differences in Public Code Contributions - Replication Package

This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011

This document comes with the software needed to mine and analyze the data presented in the paper.

Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility, and various common *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0

Initial data

swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

names.tab - forenames and surnames per country with their frequency

zones.acc.tab - countries/territories, timezones, population and world zones

c_c.tab - ccTLD entities - world zones matches

Data preparation

Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst:
sh> ./export.sh

Run the authors cleanup script to create authors--clean.csv.zst:
sh> ./cleanup.sh authors.csv.zst

Filter out implausible names and create authors--plausible.csv.zst:
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

Gender detection

Run the gender guessing script to create author-fullnames-gender.csv.zst:
sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst

Database creation and data ingestion

Create the PostgreSQL DB:
sh> createdb gender-commit
Note that from now on, a psql> prompt indicates that the command is run in psql on the gender-commit database.

Import data into the PostgreSQL DB:
sh> ./import_data.sh

Zone detection

Extract commits data from the DB and create commits.tab, which is used as input for the world zone detection script:
sh> psql -f extract_commits.sql gender-commit

Run the world zone detection script to create commit_zones.tab.zst:
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

Read zones assignment data from the file into the DB:
psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
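For readers less familiar with the shell filter in the \copy step, the sketch below mirrors what `cut -f1,6 | grep -Ev '\s$'` does to tab-separated lines: keep the first and sixth fields, and drop any result that ends in whitespace (which happens when the sixth field is empty). The function name and sample lines are hypothetical, and skipping rows with fewer than six fields is a simplification that differs from cut's behaviour; it is not part of the replication package.

```python
import re


def select_first_and_sixth(lines):
    """Yield 'field1<TAB>field6' for rows whose result does not end in whitespace."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # simplification: assume well-formed rows have six fields
        selected = f"{fields[0]}\t{fields[5]}"
        if re.search(r"\s$", selected):
            continue  # same effect as grep -Ev '\s$': empty sixth field
        yield selected


# Invented sample lines: a complete row, a row with an empty sixth field,
# and a short row.
sample_lines = [
    "c1\tx\tx\tx\tx\tEurope\n",
    "c2\tx\tx\tx\tx\t\n",
    "c3\tx\n",
]
kept = list(select_first_and_sixth(sample_lines))
```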

Extraction and graphs

Run the script to execute the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
sh> ./extract_data.sh

Run the script to create the graphs from all the previously extracted tab files. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
sh> ./create_charts.sh

Additional graphs

This package also includes some ready-made graphs:

authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period

authors_zones_2.pdf: ditto with at least two commits per period

authors_zones_10.pdf: ditto with at least ten commits per period
Hospital Management Dataset
kaggle.com
Updated May 30, 2025
Cite
Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
Dataset updated
May 30, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kanak Baghel
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

Dataset Overview

This dataset includes five CSV files:

patients.csv – Patient demographics, contact details, registration info, and insurance data

doctors.csv – Doctor profiles with specializations, experience, and contact information

appointments.csv – Appointment dates, times, visit reasons, and statuses

treatments.csv – Treatment types, descriptions, dates, and associated costs

billing.csv – Billing amounts, payment methods, and status linked to treatments

📁 Files & Column Descriptions

patients.csv

Contains patient demographic and registration details.

Column -> Description

patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address

doctors.csv

Details about the doctors working in the hospital.

Column -> Description

doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where doctor is based
email -> Official email address

appointments.csv

Records of scheduled and completed patient appointments.

Column -> Description

appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)

treatments.csv

Information about the treatments given during appointments.

Column -> Description

treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given

billing.csv

Billing and payment details for treatments.

Column -> Description

bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
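The ID columns above suggest how the tables chain together (billing to treatments to appointments to doctors). Below is a minimal sketch of that join using Python's built-in sqlite3 module with an in-memory database. The rows, the TEXT type chosen for IDs, and the reduced column set are assumptions made for the example; loading the real CSV files is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced versions of four tables, populated with invented rows.
conn.executescript("""
CREATE TABLE doctors (doctor_id TEXT PRIMARY KEY, specialization TEXT);
CREATE TABLE appointments (
    appointment_id TEXT PRIMARY KEY, patient_id TEXT, doctor_id TEXT, status TEXT);
CREATE TABLE treatments (
    treatment_id TEXT PRIMARY KEY, appointment_id TEXT, treatment_type TEXT, cost REAL);
CREATE TABLE billing (
    bill_id TEXT PRIMARY KEY, patient_id TEXT, treatment_id TEXT,
    amount REAL, payment_status TEXT);

INSERT INTO doctors VALUES ('D1', 'Cardiology'), ('D2', 'Pediatrics');
INSERT INTO appointments VALUES
    ('A1', 'P1', 'D1', 'Completed'),
    ('A2', 'P2', 'D2', 'Completed'),
    ('A3', 'P1', 'D1', 'Completed');
INSERT INTO treatments VALUES
    ('T1', 'A1', 'MRI', 500.0),
    ('T2', 'A2', 'X-ray', 120.0),
    ('T3', 'A3', 'MRI', 480.0);
INSERT INTO billing VALUES
    ('B1', 'P1', 'T1', 500.0, 'Paid'),
    ('B2', 'P2', 'T2', 120.0, 'Pending'),
    ('B3', 'P1', 'T3', 480.0, 'Paid');
""")

# Total paid amount per doctor specialization, following the ID chain.
query = """
SELECT d.specialization, SUM(b.amount) AS paid_total
FROM billing AS b
JOIN treatments AS t ON t.treatment_id = b.treatment_id
JOIN appointments AS a ON a.appointment_id = t.appointment_id
JOIN doctors AS d ON d.doctor_id = a.doctor_id
WHERE b.payment_status = 'Paid'
GROUP BY d.specialization
ORDER BY d.specialization
"""
paid_by_specialization = conn.execute(query).fetchall()
```

Specializations with no paid bills do not appear in the result because the query uses inner joins and filters on payment_status.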

Possible Use Cases

SQL queries and relational database design

Exploratory data analysis (EDA) and dashboarding

Machine learning projects (e.g., cost prediction, no-show analysis)

Feature engineering and data cleaning practice

End-to-end healthcare analytics workflows

Recommended Tools & Resources

SQL (joins, filters, window functions)

Pandas and Matplotlib/Seaborn for EDA

Scikit-learn for ML models

Pandas Profiling for automated EDA

Plotly for interactive visualizations

Please note:

All data is synthetically generated for educational and project use. No real patient information is included.

If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Baby Names from Social Security Card Applications - National Data
catalog.data.gov
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
Updated Jul 4, 2025
+ more versions
Cite
Social Security Administration (2025). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
Dataset updated
Jul 4, 2025
Dataset provided by
Social Security Administration (http://ssa.gov/)
Description
The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications from 1880 onward.
Popular Baby Names
data.cityofnewyork.us
res1catalogd-o-tdatad-o-tgov.vcapture.xyz
+2 more
application/rdfxml +2
Updated Jun 8, 2025
+ more versions
Cite
Department of Health and Mental Hygiene (DOHMH) (2025). Popular Baby Names [Dataset]. https://data.cityofnewyork.us/Health/Popular-Baby-Names/25th-nujf
Available download formats: application/rdfxml, application/rssxml, xml
Dataset updated
Jun 8, 2025
Dataset authored and provided by
Department of Health and Mental Hygiene (DOHMH)
Description
Popular Baby Names by Sex and Ethnic Group. Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
CrowS-Pairs (Social biases in MLMs)
kaggle.com
Updated Nov 27, 2022
Cite
The Devastator (2022). CrowS-Pairs (Social biases in MLMs) [Dataset]. https://www.kaggle.com/datasets/thedevastator/a-dataset-for-measuring-social-biases-in-mlms
Dataset updated
Nov 27, 2022
Dataset provided by
Kaggle
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
CrowS-Pairs (Social biases in MLMs)

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked LM

By [source]

About this dataset

The CrowS-Pairs dataset is a collection of 1,508 sentence pairs that cover nine types of biases: race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. In each pair, the first sentence can demonstrate or violate a stereotype, and the other sentence is a minimal edit of the first: the only words that change between them are those that identify the group. Each example has the following information:

Columns: sent_more, sent_less, stereo_antistereo, bias_type, annotations, anon_writer, anon_annotators, prompt, source

How to use the dataset

This dataset can be used to measure social biases in MLMs by comparing the scores a model assigns to the two sentences in each pair.
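One possible way to summarise pair-wise comparisons is the share of pairs, per bias type, in which a model scores sent_more above sent_less. The sketch below assumes some per-sentence scoring function is available (for an MLM this might be a pseudo-log-likelihood); `toy_score` is a stand-in based on string length, not a real model, and the sample rows are neutral placeholders rather than dataset content.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows using column names from the dataset description.
SAMPLE_CSV = """sent_more,sent_less,stereo_antistereo,bias_type
placeholder text A1 long,placeholder text B1,stereo,age
short A2,a longer placeholder B2,stereo,age
placeholder text A3 long,text B3,antistereo,religion
"""


def toy_score(sentence):
    """Stand-in for a model score; a real setup would query a language model."""
    return len(sentence)


def preference_rate(rows, score):
    """Fraction of pairs per bias_type where sent_more scores higher than sent_less."""
    preferred = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["bias_type"]] += 1
        if score(row["sent_more"]) > score(row["sent_less"]):
            preferred[row["bias_type"]] += 1
    return {bias: preferred[bias] / total[bias] for bias in total}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = preference_rate(rows, toy_score)
```

Passing the scoring function as a parameter keeps the aggregation independent of whichever model is being examined.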

Research Ideas

Measuring the ability of MLMs to identify and avoid social biases;

Developing new methods for reducing social biases in MLMs; and

Investigating the impact of social biases on downstream tasks such as reading comprehension or question answering

Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: crows_pairs_anonymized.csv

| Column name | Description |
|:------------|:------------|
| sent_more | The first sentence in the pair, which can demonstrate or violate a stereotype. (String) |
| sent_less | The second sentence in the pair, which is a minimal edit of the first sentence. The only words that change between them are those that identify the group. (String) |
| stereo_antistereo | Whether the first sentence demonstrates or violates a stereotype. (String) |
| bias_type | The type of bias represented in the sentence pair. (String) |
| annotations | The annotations made by the crowdworkers on the sentence pair. (String) |
| anon_writer | The anonymous writer of the sentence pair. (String) |
| anon_annotators | The anonymous annotators of the sentence pair. (String) |

File: prompts.csv | Column name | Descripti...
Author Gender Representation at Audio Engineering Conferences - An...
zenodo.org
data.niaid.nih.gov
Updated Feb 12, 2021
Cite
Kat Young; Michael Lovedee-Turner; Jude Brereton; Helena Daffern (2021). Author Gender Representation at Audio Engineering Conferences - An Anonymised Dataset v2 [Dataset]. http://doi.org/10.5281/zenodo.4535610
Unique identifier
https://doi.org/10.5281/zenodo.4535610
Dataset updated
Feb 12, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Kat Young; Michael Lovedee-Turner; Jude Brereton; Helena Daffern
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the author gender dataset (as a comma-delimited .csv file) originally created in association with the paper entitled 'The Impact of Gender on Conference Authorship in Audio Engineering: Analysis Using a New Data Collection Method', but since extended to include conferences up to the end of 2019. The original dataset is available at: https://doi.org/10.5281/zenodo.1249693. Please cite both the paper and the relevant dataset if used. Visualisation is available at: http://tibbakoi.github.io/aesgender.

---

The dataset was produced using a novel method which used self-identified pronouns, therefore allowing for as many groups as necessary to describe the population.

A list of authors was generated from conference proceedings.

An email was sent to each author to acquire their pronoun.

If no email was available/no response was received, a pronoun was acquired from a biography.

If no biography was available, a pronoun was inferred from traditional gender markers and gender presentation.

If no gender marker/photograph was available, the entry was labelled as 'Information Unavailable'. For brevity, the label 'Unknown' is used in the paper.

---

The columns in the dataset are as follows:

ID: unique identifier of entry

Pronoun: pronoun of entry

Position (abs): numerical absolute position within author list for entry

Position (relative): relative position within author list for entry (either First, Last, or Middle)

Single/multi-author: whether the publication for that entry has a single author or has multiple authors (single author publications are excluded from author position analysis)

Conference: Full conference name of entry

Topic: Topic of conference of entry, taken from conference name

Year: Year of conference of entry

Type: Type of publication for that entry as listed on the online conference proceedings

Grouped Type: Grouping of publication types for that entry for easier analysis due to inconsistencies in online conference proceedings (groups are: workshop, poster, paper, panel, keynote, invited speaker, invited paper, demo)

Inc. for author pos?: True/False as to whether to include the entry for analysis over author position (included types are: paper, invited paper, poster (all with multiple authors) as these have meaningful author orders)

Inc. for single/multi-author?: True/False as to whether to include the entry for analysis over single/multi author (included types are: paper, invited paper, poster as these have meaningful author orders)

Invited paper status: Grouping of the types to allow statistical analysis over invited vs non-invited types (invited types are: invited speaker, invited paper, keynote, panel. Non-invited types are: poster, paper, demo, workshop)

NB: Some grouping of the data is required as online conference proceedings are not always consistent (Column 10). Some labelling of the data is required to determine which entries to include in certain types of analysis (Columns 11-13).
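
As a minimal sketch of how the inclusion flags described above might be used, the following Python (pandas) snippet keeps only entries flagged for author-position analysis and tabulates pronouns by relative position. The column names are taken from the list above; the sample rows are invented for illustration, and the assumption that the flag column loads as booleans has not been checked against the published file.

```python
import pandas as pd

# Invented sample rows; the real data would be read from the published .csv file.
sample = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Pronoun": ["She", "He", "He", "They"],
        "Position (relative)": ["First", "Last", "First", "Middle"],
        "Grouped Type": ["paper", "paper", "keynote", "poster"],
        # Assumption: the True/False flag is parsed as a boolean column.
        "Inc. for author pos?": [True, True, False, True],
    }
)


def pronouns_by_position(df: pd.DataFrame) -> pd.DataFrame:
    """Count pronouns per relative author position, using only flagged entries."""
    included = df[df["Inc. for author pos?"]]
    return pd.crosstab(included["Position (relative)"], included["Pronoun"])


table = pronouns_by_position(sample)
print(table)
```

The keynote row is dropped before counting because its flag is False, which mirrors the rule that only papers, invited papers, and posters carry meaningful author orders.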

---

This dataset is distributed under the Creative Commons Attribution 4.0 license in the hope that it will prove useful, but with no warranty, including no implied warranty of merchantability or fitness for a particular purpose.

---

Dataset curated by: Kat Young and Michael Lovedee-Turner, formerly at the AudioLab, Dept. of Electronic Engineering, University of York.
Contact: kathryn.ae.young@gmail.com
Airline Dataset
kaggle.com
Updated Sep 26, 2023
Cite
Sourav Banerjee (2023). Airline Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset
Explore at:
Dataset updated
Sep 26, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sourav Banerjee
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.

Content

This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.

Dataset Glossary (Column-wise)

Passenger ID - Unique identifier for each passenger

First Name - First name of the passenger

Last Name - Last name of the passenger

Gender - Gender of the passenger

Age - Age of the passenger

Nationality - Nationality of the passenger

Airport Name - Name of the airport where the passenger boarded

Airport Country Code - Country code of the airport's location

Country Name - Name of the country the airport is located in

Airport Continent - Continent where the airport is situated

Continents - Continents involved in the flight route

Departure Date - Date when the flight departed

Arrival Airport - Destination airport of the flight

Pilot Name - Name of the pilot operating the flight

Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)

Structure of the Dataset

https://i.imgur.com/cUFuMeU.png

Acknowledgement

The dataset provided here is a simulated example and was generated using the online platform found at Mockaroo. This web-based tool offers a service that enables the creation of customizable Synthetic datasets that closely resemble real data. It is primarily intended for use by developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.

Cover Photo by: Kevin Woblick on Unsplash

Thumbnail by: Airplane icons created by Freepik - Flaticon
Trends in gender homophily in scientific publications (data)
observatorio-investigacion.unavarra.es
data.niaid.nih.gov
+1more
Updated 2024
Cite
Torre, Margarita; Prieto-Alonso, Jesús Álvaro; Ucar, Iñaki (2024). Trends in gender homophily in scientific publications (data) [Dataset]. https://observatorio-investigacion.unavarra.es/documentos/688b600d17bb6239d2d47fa9
Explore at:
Dataset updated
2024
Authors
Torre, Margarita; Prieto-Alonso, Jesús Álvaro; Ucar, Iñaki
Description
This dataset contains records of research articles extracted from the Web of Science (WoS) from 1980 to 2019---in total, 15,642 journals, 28,241,100 articles and 111,980,858 authorships across 153 research areas.

The main dataset (author_address_article_gend_v3.parquet), in Parquet format, contains all the authorships, where an authorship is defined as the tuple article-author. There are 12 variables per authorship (row):

ut: unique article identifier.

daisng_id: unique author identifier.

author_no: author number, as listed in the article.

country: author country (two-letter ISO code).

date: publication date.

gender: gender of the author ("male" or "female"), as provided by the Genderize.io API.

probability: probability of the gender attribute, as provided by the Genderize.io API.

count: number of entries for the author first name, as provided by the Genderize.io API.

jsc: journal subject category.

field: field of research.

research_area: area of research.

n_aut: number of authors in this publication.

journal: journal name.

alphabetical: whether the author list for this article is in alphabetical order.

With the previous dataset, a resampler was applied to generate null homophily values for each year. There are 4 datasets in R Data Serialization (RDS) format:

null_field.rds: null homophily values per country, year and field of research.

null_field_comp.rds: null homophily values per year and field of research (only for complete authorships).

null_research.rds: null homophily values per year and area of research.

null_research_comp.rds: null homophily values per year and area of research (only for complete authorships).

All these datasets have the same structure:

country: country (two-letter ISO code).

year: year.

variable: either field or research area name.

m: average homophily.

s: homophily std. error.

Finally, some supplementary files used in the descriptive analysis and methods:

File null_research_l2019.rds is an example of the output from the resampling algorithm for year 2019.

File wos_category_to_field.csv is a mapping from WoS categories to more general fields.

File jcr_if_2020.csv contains the percentiles of the journal impact factor for the JCR 2020.
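
A minimal sketch of working with the authorship table, assuming only the `date`, `gender`, and `probability` columns described above: it discards low-confidence gender assignments and computes the share of authorships labelled "female" per year. The sample rows and the 0.9 threshold are invented for illustration and are not taken from the authors' analysis; the real table would be loaded with `pd.read_parquet("author_address_article_gend_v3.parquet")`, which needs a Parquet engine such as pyarrow.

```python
import pandas as pd

# Invented rows mimicking a few columns of the authorship table.
authorships = pd.DataFrame(
    {
        "ut": ["A1", "A1", "A2", "A3"],
        "date": pd.to_datetime(
            ["2018-03-01", "2018-03-01", "2019-06-15", "2019-07-01"]
        ),
        "gender": ["female", "male", "female", "male"],
        "probability": [0.98, 0.95, 0.60, 0.99],
    }
)


def female_share_by_year(df: pd.DataFrame, min_probability: float = 0.9) -> pd.Series:
    """Share of authorships labelled 'female' per year, above a confidence threshold."""
    # The threshold is a hypothetical choice, not the one used by the dataset authors.
    confident = df[df["probability"] >= min_probability]
    is_female = confident["gender"].eq("female")
    return is_female.groupby(confident["date"].dt.year).mean()


share = female_share_by_year(authorships)
print(share)
```

Raising or lowering `min_probability` trades coverage against label reliability, so any result should be reported together with the threshold used.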
Supporting material for "Impact of gender on the formation and outcome of...
zenodo.org
explore.openaire.eu
bin, csv +1
Updated Jul 25, 2022
Cite
Leah P. Schwartz; Jean F. Liénard; Stephen V. David (2022). Supporting material for "Impact of gender on the formation and outcome of formal mentoring relationships in the life sciences" [Dataset]. http://doi.org/10.5281/zenodo.6897394
Explore at:
Available download formats: csv, text/x-python, bin
Unique identifier
https://doi.org/10.5281/zenodo.6897394
Dataset updated
Jul 25, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Leah P. Schwartz; Jean F. Liénard; Stephen V. David
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains data and analysis code associated with the manuscript: L.P. Schwartz, J. Liénard, S. V. David. (2022) "Impact of gender on formation and outcome of formal mentoring relationships in the life sciences." Figures and tables in the manuscript can be produced by running the make_figures.ipynb notebook. Figures have been marked with headings indicating their position in the manuscript (Figure 1, Figure S1, etc.). In addition, the notebook contains code to reproduce regression analyses that are cited in the text but not directly associated with a figure.

Data on mentoring relationships derives from Academic Family Tree (AFT, www.academictree.org) and public data sources on funding, publications, and awards. Inclusion criteria, public data sources, and procedures for linking across sources are described in the manuscript. Personal identifiers for researchers have been anonymized, but remain consistent across all data in the repository. In other words, the personal identifier "1" refers to the same person in all dataframes in the repository. But, that person is *not* the same researcher identified as "1" on the public AFT website.

Installation

Requires Python 3.x and Pandas. To load required libraries using Anaconda, run:

`conda create --name aft -c conda-forge pandas numpy scipy ipython jupyterlab scikit-learn matplotlib statsmodels seaborn pytables`

Dataframes

Data is stored as a series of Pandas dataframes within HDF5 or CSV files:

* cng_tc: The primary dataset used in the analysis. The name is an acronym for "connections" (i.e. training relationships, "cn"), "gender" ("g"), and "trainee count" ("tc"). Each row contains data on the mentor and trainee in one training relationship. See manuscript for inclusion criteria.

* mentors: Data on mentors. Each row contains data on one mentor. See manuscript for inclusion criteria.

* mentors_grants, mentors_hindex, mentors_locs_ranked: Subset of mentors with data available for funding (mentors_grants), citation (mentors_hindex), and institution rank (mentors_locs_ranked).

* mentors_nobel, mentors_hhmi, mentors_nas: Subsets of mentors that received a Nobel (mentors_nobel), Howard Hughes Medical Institute grants (mentors_hhmi), or membership in the National Academy of Sciences (mentors_nas). See manuscript for details of data sources and linking procedures.

* cn, cng, first_names, gn, gn_all, locs: Partial data (connections only, inferred gender only, connections and gender only, location only, first names and inferred gender only) for more inclusive sets of researchers in AFT. They are generally not used for analysis, but have been included here to calculate statistics on the total amount of data included and to screen for data from U.S. locations.

* nsf_gender_phds, nsf_gender_pds: National Science Foundation survey data on gender and fraction PhDs conferred per year (nsf_gender_phds) or fraction postdocs employed per year (nsf_gender_pds). See manuscript for details of data source.

* photo: Data for validation of gender inference method.

Dataframe columns

* amount: Mentor's total funding
* amount_adj: Mentor's total funding (adjusted to 2020 dollars)
* broad_field: Mentor's general research area (e.g., life sciences, engineering, based on National Science Foundation classifications)
* continue: Whether trainee went on to become a mentor (i.e., has trainees listed in AFT)
* country: Country in which mentor's current institution is located
* firstname: First name of researcher (table of first names is not aligned with tables containing anonymized personal identifiers)
* first_grant_year: Year of mentor's first grant
* funding_rate: Mentor's annual funding rate (since first grant)
* funding_rate_adj: Mentor's annual funding rate (since first grant) adjusted to 2020 dollars
* hhmi: Whether mentor was granted HHMI funding
* hindex: Mentor's hindex
* location: Name of mentor's current institution
* locid: Identifier for mentor's institution
* locid_rank: Position of mentor's institution in 2015 Quacquarelli-Symonds rankings (lower numbers are better)
* locid_rank_rev: Reversed version of "locid_rank" (i.e., higher numbers are better)
* majorarea: Mentor's specific research area (e.g., neuroscience)
* male_mentor, male_trainee: Whether the probability that a researcher's first name is used by a person identifying as a man meets threshold (see manuscript for details on gender inference using first names)
* match_score: Score for string match between institution or name of awardee and researcher
* mentor_career_start: The date at which the mentor's academic career began
* mentor_continue_rate: Fraction of mentor's trainees that become mentors
* mentor_continue_rate_ft: Fraction of mentor's woman trainees that become mentors
* mentor_continue_rate_mt: Fraction of mentor's man trainees that become mentors
* mentor_t_p_male0: Fraction of mentor's trainees that are men
* mentor_t_p_male0_gs: Fraction of mentor's trainees that are men (graduate students only)
* mentor_t_p_male0_pd: Fraction of mentor's trainees that are men (postdocs only)
* mentor_tcount0: Mentor's total number of trainees
* nas: Whether mentor is a member of the National Academy of Sciences
* nobel: Whether mentor is a Nobel laureate
* p_male_mentor, p_male_trainee: Probability that a researcher's first name is used by a person identifying as a man
* pid: Anonymized identifier of researcher
* pid_mentor: Anonymized identifier of mentor in training relationship
* pid_trainee: Anonymized identifier of trainee in training relationship
* pq: "1" if data on training relationship is drawn from ProQuest database and has not been manually edited by a human AFT user
* relation: Type of training relationship (1: graduate student, 2: postdoc)
* scorer1, scorer2, scorer3: Results of photo validation of gender inference for each scorer
* start: Training start year
* stop: Training end year
* trainee_tcount: Total people that the trainee has trained
* triad: Whether trainee has participated in both a graduate-level and postdoctoral training relationship

The cn dataframe follows slightly different naming conventions, but is not generally used in the analysis (pid1 = pid_trainee, pid2 = pid_mentor, startdate = start, stopdate = stop).
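
The naming difference noted above can be bridged with a simple column rename. This is a minimal sketch: the mapping comes directly from the sentence above, while the example rows and the helper name `standardize_cn` are hypothetical and not part of the repository.

```python
import pandas as pd

# Mapping from the cn dataframe's names to the names used in the other dataframes.
CN_TO_STANDARD = {
    "pid1": "pid_trainee",
    "pid2": "pid_mentor",
    "startdate": "start",
    "stopdate": "stop",
}


def standardize_cn(cn: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of cn with columns renamed to the shared convention."""
    return cn.rename(columns=CN_TO_STANDARD)


# Invented rows for illustration only.
cn_example = pd.DataFrame(
    {"pid1": [10, 11], "pid2": [1, 1], "startdate": [2001, 2005], "stopdate": [2006, 2009]}
)
cn_std = standardize_cn(cn_example)
print(cn_std.columns.tolist())
```

Because `rename` returns a new dataframe, the original `cn` keeps its own column names and can still be used with code that expects them.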
Global Freelancers (Raw) Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
Explore at:
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Urvish Ahir
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
Description :

This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

Each entry includes demographic, professional, and platform-related information such as:

Name, gender, age, and country

Primary skill and years of experience

Hourly rate (with mixed formatting), client rating, and satisfaction score

Language spoken (based on country)

Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

Key Features :

Gender-based names using Faker’s male/female name generators

Realistic age and experience distribution (with missing and noisy values)

Country-language pairs mapped using actual linguistic data

Messy formatting: mixed data types, missing values, inconsistent casing

Generated entirely in Python using the faker library; no real data used

Use Cases :

Practicing data cleaning and preprocessing

Performing EDA (Exploratory Data Analysis)

Developing data pipelines: raw → clean → model-ready

Teaching feature engineering and handling real-world dirty data

Exercises in data validation, outlier detection, and format standardization

File : global_freelancers_raw.csv

| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
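
A minimal cleaning sketch for two of the messy columns described above, assuming the mixed formats look roughly like the examples in the table ("85%" or 85, currency symbols, missing values). The sample values, the helper `to_number`, and the output column names are invented for illustration; the actual file may contain formats this pattern does not cover (negative numbers, for instance).

```python
import pandas as pd

# Invented values imitating the mixed formatting described in the column table.
raw = pd.DataFrame(
    {
        "hourly_rate (USD)": ["$40", "USD 55.5", None, "30"],
        "client_satisfaction": ["85%", 90, None, "72 %"],
    }
)


def to_number(series: pd.Series) -> pd.Series:
    """Extract the first unsigned numeric token from each value; others become NaN."""
    text = series.astype(str)  # None becomes the string "None", which has no digits
    extracted = text.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(extracted, errors="coerce")


clean = pd.DataFrame(
    {
        # Hypothetical output column names.
        "hourly_rate_usd": to_number(raw["hourly_rate (USD)"]),
        "client_satisfaction_pct": to_number(raw["client_satisfaction"]),
    }
)
print(clean)
```

Keeping the raw columns untouched and writing cleaned values to new columns makes it easy to audit which entries could not be parsed.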

Data from: THE RELEVANCY OF MASSIVE HEALTH EDUCATION IN THE BRAZILIAN PRISON...

zenodo.org

csv, pdf

Updated Jul 16, 2024

Cite

Janaína L. R. da S. Valentim; Sara Dias-Trindade; Eloiza da S. G. Oliveira; José A. M. Moreira; Felipe Fernandes; Manoel Honorio Romão; Philippi S. G. de Morais; Alexandre R. Caitano; Aline P. Dias; Carlos A. P. Oliveira; Karilany D. Coutinho; Ricardo B. Ceccim; Ricardo A. de M. Valentim (2024). THE RELEVANCY OF MASSIVE HEALTH EDUCATION IN THE BRAZILIAN PRISON SYSTEM: THE COURSE "HEALTH CARE FOR PEOPLE DEPRIVED OF FREEDOM" AND ITS IMPACTS [Dataset]. http://doi.org/10.5281/zenodo.6499752

Explore at:

Available download formats: csv, pdf

Unique identifier

https://doi.org/10.5281/zenodo.6499752

Dataset updated

Jul 16, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Dataset name: asppl_dataset_v2.csv

Version: 2.0

Dataset period: 06/07/2018 - 01/14/2022

Dataset Characteristics: Multivalued

Number of Instances: 8118

Number of Attributes: 9

Missing Values: Yes

Area(s): Health and education

Sources:

Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);
Brazilian Occupational Classification (CBO) (Brasil, 2022b);
National Registry of Health Establishments (CNES) (Brasil, 2022c);
Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).

Description: The data contained in the asppl_dataset_v2.csv dataset (see Table 1) originates from participants of the technology-based educational course “Health Care for People Deprived of Freedom.” The course is available on the AVASUS (Brasil, 2022a). This dataset provides elementary data for analyzing the course’s impact and reach and the profile of its participants. In addition, it brings an update of the data presented in work by Valentim et al. (2021).

Table 1: Description of AVASUS dataset features.

Attributes	Description	datatype	Value
gender	Gender of the course participant.	Categorical.	Feminino / Masculino / Não Informado. (In English, Female, Male or Uninformed)
course_progress	Percentage of completion of the course.	Numerical.	Range from 0 to 100.
course_evaluation	A score given to the course by the participant.	Numerical.	0, 1, 2, 3, 4, 5 or NaN.
evaluation_commentary	Comment made by the participant about the course.	Categorical.	Free text or NaN.
region	Brazilian region in which the participant resides.	Categorical.	Brazilian region according to IBGE: Norte, Nordeste, Centro-Oeste, Sudeste or Sul (In English North, Northeast, Midwest, Southeast or South).
CNES	The CNES code refers to the health establishment where the participant works.	Numerical.	CNES Code or NaN.
health_care_level	Identification of the health care network level for which the course participant works.	Categorical.	“ATENCAO PRIMARIA”, “MEDIA COMPLEXIDADE”, “ALTA COMPLEXIDADE”, and their possible combinations. (In English "PRIMARY HEALTH CARE", "SECONDARY HEALTH CARE" AND "TERTIARY HEALTH CARE")
year_enrollment	Year in which the course participant registered.	Numerical.	Year (YYYY).
CBO	Participant occupation.	Categorical.	Text coded according to the Brazilian Classification of Occupations or “Indivíduo sem afiliação formal.” (In English “Individual without formal affiliation.”)

Dataset name: prison_syphilis_and_population_brazil.csv

Dataset period: 2017 - 2020

Dataset Characteristics: Multivalued

Number of Instances: 6

Number of Attributes: 13

Missing Values: No

Source:

National Penitentiary Department (DEPEN) (Brasil, 2022d);

Description: The data contained in the prison_syphilis_and_population_brazil.csv dataset (see Table 2) originate from the National Penitentiary Department Information System (SISDEPEN) (Brasil, 2022d). This dataset provides data on the population and prevalence of syphilis in the Brazilian prison system. In addition, it brings a rate that represents the normalized data for purposes of comparison between the populations of each region and Brazil.

Table 2: Description of DEPEN dataset Features.

Attributes	Description	datatype	Value
Region	Brazilian region in which the participant resides. In addition, the sum of the regions, which refers to Brazil.	Categorical.	Brazil and Brazilian region according to IBGE: North, Northeast, Midwest, Southeast or South.
syphilis_2017	Number of syphilis cases in the prison system in 2017.	Numerical.	Number of syphilis cases.
syphilis_rate_2017	Normalized rate of syphilis cases in 2017.	Numerical.	Syphilis case rate.
syphilis_2018	Number of syphilis cases in the prison system in 2018.	Numerical.	Number of syphilis cases.
syphilis_rate_2018	Normalized rate of syphilis cases in 2018.	Numerical.	Syphilis case rate.
syphilis_2019	Number of syphilis cases in the prison system in 2019.	Numerical.	Number of syphilis cases.
syphilis_rate_2019	Normalized rate of syphilis cases in 2019.	Numerical.	Syphilis case rate.
syphilis_2020	Number of syphilis cases in the prison system in 2020.	Numerical.	Number of syphilis cases.
syphilis_rate_2020	Normalized rate of syphilis cases in 2020.	Numerical.	Syphilis case rate.
pop_2017	Prison population in 2017.	Numerical.	Population number.
pop_2018	Prison population in 2018.	Numerical.	Population number.
pop_2019	Prison population in 2019.	Numerical.	Population number.
pop_2020	Prison population in 2020.	Numerical.	Population number.

Dataset name: students_cumulative_sum.csv

Dataset period: 2018 - 2020

Dataset Characteristics: Multivalued

Number of Instances: 6

Number of Attributes: 7

Missing Values: No

Source:

Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);
Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).

Description: The data contained in the students_cumulative_sum.csv dataset (see Table 3) originate mainly from AVASUS (Brasil, 2022a). This dataset provides data on the number of students by region and year. In addition, it brings a rate that represents the normalized data for purposes of comparison between the populations of each region and Brazil. We used population data estimated by the IBGE (Brasil, 2022e) to calculate the rate.

Table 3: Description of Students dataset Features.

Oscar-Winning Directors Analysis
kaggle.com
Updated Jan 21, 2023
Cite
The Devastator (2023). Oscar-Winning Directors Analysis [Dataset]. https://www.kaggle.com/datasets/thedevastator/oscar-winning-directors-analysis
Explore at:
Dataset updated
Jan 21, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
Description
Oscar-Winning Directors Analysis

Examining Gender, Race and Director Trends from 1930-2019

By Priyanka Dobhal [source]

About this dataset

This dataset collects information on the Academy Award for Best Director winners from 1930 to 2019, and provides insight into the gender and racial disparity of filmmaking over time. It includes the winner's name, their respective award year, race, gender, nominated/winning film title, and the filmmakers' names. By looking at this data it is possible to identify emerging trends in cinema, such as who is dominating in terms of awards recognition, and to consider how much progress has been made toward equal opportunity within Hollywood. Examining Oscar-winning directors over time can show how diversity among winners has changed and what that suggests about systemic issues in society. To understand the award's significance, it helps to consider all the factors included, from the awarded directors' gender to the kinds of films these awards support each year. The data spans almost nine decades of cinematic history, starting from 1930.

How to use the dataset

This dataset is the perfect resource for anyone looking to conduct an analysis of the Academy Award for Best Director winners from 1930 to 2019. It contains information on the year, gender, race, director(s), film and nomination/winner of each winner.

By using this dataset, you can gain insight into trends in Oscar winning directors over time. For example, you can compare the number of nominees between different years or examine differences in representation of gender and race among directors who have won Oscars over time. Additionally, you can use this data to explore the films that have received an Oscar for best director – which films were most successful from a narrative perspective? Or analyze which films used unique filming techniques or visual designs? Finally, this dataset also makes it possible to conduct more targeted analyses by identifying patterns across multiple aspects such as furthering social issues that are depicted in film through positive filmmaking - such as LGBTQ representation.

To start exploring with this dataset:

2) Open your favorite spreadsheet program ('Microsoft Excel', 'Libre Office', etc.)

3) Load the CSV file with the 'File -> Open' command

4) Review column headers and values contained within each row

5) Start creating charts and graphs (pie charts, bar plots, etc.) that show trends over time according to your needs

6) Take notes while analyzing datasets

7) Publish your findings online if desired
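
For readers who prefer code to a spreadsheet, the same review-and-chart workflow can be sketched in Python with pandas. The column names `Year` and `Gender` come from the dataset's column listing; the rows below are invented placeholders rather than real award records, and grouping by decade is just one hypothetical way to look at trends over time.

```python
import pandas as pd

# Invented placeholder rows; real data would be read with
# pd.read_csv("Oscar Winners - Director.csv").
oscars = pd.DataFrame(
    {
        "Year": [1930, 1935, 2009, 2018],
        "Gender": ["Male", "Male", "Female", "Male"],
    }
)

# Bucket each award year into its decade (e.g., 1935 -> 1930).
decade = ((oscars["Year"] // 10) * 10).rename("Decade")

# Count entries per decade and gender; missing combinations are filled with 0.
counts = oscars.groupby([decade, "Gender"]).size().unstack(fill_value=0)
print(counts)
```

The resulting table can be passed to `counts.plot(kind="bar")` (which requires matplotlib) to produce the kind of trend chart described in step 5.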

The possibilities are endless! If you’d like additional guidance or tips on how to use this dataset effectively, please subscribe to our newsletter at oscarwinningdirectorsanalysisgmail.com

Research Ideas

Analyzing gender and racial disparity in the Academy award for Best Director across different years.

Investigating if the age of directors has an effect on what film they create and how successful it is at winning an Oscar for Best Director.

Crafting a recommendation system to recommend movies based on a director's previous Oscar-winning work or even pair users with film recommendations that have similar director/genre preference in order to discover new titles they may enjoy watching

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

See the dataset description for more information.

Columns

File: Oscar Winners - Director.csv

| Column name | Description |
|:------------|:------------|
| Year | The year in which the award was given. (Integer) |
| Gender | The gender of the director. (String) |
| Race | The race of the director. (String) |
| Director(s) | The name of the director(s). (String) |
| Film | The title of the film that won the award. (String) |
| Nomination/Winner | Whether the director was nominated or won the award. (String) |

Acknowledgements

If you use this dataset in your research, please credit Priyanka Dobhal.
Disambiguated researchers publication data
figshare.com
txt
Updated Jun 1, 2023
Cite
Amaral Lab (2023). Disambiguated researchers publication data [Dataset]. http://doi.org/10.6084/m9.figshare.1591864.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.1591864.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Amaral Lab
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Companion dataset to "The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact" by Duch J, Zeng XHT, Sales-Pardo M, Radicchi F, Otis S, Woodruff TK, Amaral LAN (PLoS ONE 7, e51332, 2012), doi: 10.1371/journal.pone.0051332. This dataset lists the total number of publications by 4,394 faculty members from 7 distinct research fields working at top U.S. institutions. The dataset also contains bibliographic information manually gathered from the CVs of those faculty members. The publication data was collected from Thomson Reuters' Web of Science according to the procedures described in the published paper.

The data is a single CSV file with the following fields:
author_name - researcher name as: Last name, Initials
gender - researcher gender as: M (male) or F (female)
univ_name - Institution of current employment
field - scientific discipline
phd_year - year of PhD completion
nationality - Country of origin
background - List of degrees
affiliations - List of honours and past appointments
total_pubs - Total number of publications

Some fields are not available for some researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.
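To illustrate how the fields listed above might be used, here is a small sketch that averages `total_pubs` by `gender` while skipping researchers whose count is missing. It assumes the file has a header row with exactly these field names, which the description does not confirm; the sample researchers and institutions are hypothetical.

```python
import csv
import io
from statistics import mean

# Hypothetical rows. author_name contains a comma, so it must be quoted.
SAMPLE = '''author_name,gender,univ_name,field,phd_year,nationality,background,affiliations,total_pubs
"Doe, J",F,University A,chemistry,1995,,,,40
"Roe, R",M,University B,chemistry,1990,,,,60
"Poe, E",F,University A,ecology,,,,,20
"Moe, M",M,University C,ecology,2001,,,,
'''


def mean_pubs_by_gender(handle):
    """Average total_pubs per gender, ignoring rows where it is empty."""
    pubs = {}
    for row in csv.DictReader(handle):
        raw = (row.get("total_pubs") or "").strip()
        if not raw:
            continue  # some fields are not available for some researchers
        pubs.setdefault(row["gender"], []).append(int(raw))
    return {gender: mean(values) for gender, values in pubs.items()}


averages = mean_pubs_by_gender(io.StringIO(SAMPLE))
print(averages)
```

Skipping empty values rather than treating them as zero avoids biasing the averages downward.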
Data from: A Greek Parliament Proceedings Dataset for Computational...
data.europa.eu
unknown
Updated Aug 27, 2022
Cite
Zenodo (2022). A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7005201?locale=cs
Explore at:
Available download formats: unknown (1427754875)
Dataset updated
Aug 27, 2022
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Greece
Description
The dataset is a new version of the previous upload and includes the following files:

1. dataset_versions/tell_all.csv: The initial dataset of 1,280,927 extracted speeches, before preprocessing and cleaning. The speeches extend chronologically from July 1989 up to July 2020 and were exported from 5,355 parliamentary sitting record files. The file has a total volume of 2.5 GB and includes the following columns:
   member_name: the name of the individual who spoke during a sitting.
   sitting_date: the date the sitting took place.
   parliamentary_period: the name and/or number of the parliamentary period that the speech took place in. A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions.
   parliamentary_session: the name and/or number of the parliamentary session that the speech took place in. A session is defined as a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution. A session can fall into the following categories: regular, extraordinary or special. In the intervals between the sessions the parliament is in recess. A parliamentary session includes multiple parliamentary sittings.
   parliamentary_sitting: the name and/or number of the parliamentary sitting that the speech took place in. A sitting is defined as a meeting of parliament members.
   political_party: the political party of the speaker.
   government: the government in force when the speech took place.
   member_region: the electoral district the speaker belonged to.
   roles: information about the parliamentary roles and/or government position of the speaker.
   member_gender: the gender of the speaker.
   speech: the speech that the individual gave during the parliamentary sitting.

2. dataset_versions/tell_all_FILLED.csv: This file is an intermediate version of the dataset that includes improvements in the consistency and completeness of the dataset, with a total volume of 2.5 GB. Specifically, this file is produced by filling the missing names of chairmen of various parliamentary sittings of the "tell_all.csv". It includes the same columns as the "tell_all.csv" file.

3. dataset_versions/tell_all_cleaned.csv: This version of the dataset is the result of further cleaning and preprocessing and is used for our word usage change study. It consists of 1,280,918 speech fragments of Greek parliament members in the order of the conversation that took place, with a total volume of 2.12 GB. It includes the same columns as the aforementioned versions. The preprocessing includes the replacement of all references to political parties with the symbol "@" followed by an abbreviation of the party name, using regular expressions that capture different grammatical cases and variations. It also includes the removal of accents, strings with length less than 2 characters, all punctuation except full stops, and the replacement of stopwords with "@sw".

4. wiki_data: A folder of modern Greek female and male names and surnames and their available grammatical cases crawled from the entries of the Wiktionary Greek names category (https://en.wiktionary.org/wiki/Category:Greek_names). We produced the grammatical cases of the missing grammatical entries according to the rules of the Greek grammar and saved the files in the same folder by adding to their filenames the string "_populated.json".

5. parl_members_activity_1989onwards_with_gender.csv: The Greek Parliament website provides a list of all the elected members of parliament since the fall of the military junta in Greece, in 1974. We collected and cleaned the data, added the gender and kept the elected members from 1989 onwards, matching the available parliament proceeding records. This dataset includes the full names of the members, the date range of their service, the political party they served, the electoral district they belonged to and their gender.

6. formatted_roles_gov_members_data.csv: As government members we refer to individuals in ministerial or other government posts, regardless of whether they were elected in the parliament. This information is available in the website of the Secretariat General for Legal and Parliamentary Affairs. The government members dataset includes the full names of the official individuals, the name of the role they were given, the date range of their service at each specific role and their gender.

7. governments_1989onwards.csv: A dataset of government information including the names of governments since 1989, their start and end dates, and a URL that points to the respective official government web page of each past government. The data is crawled from the website of the Secretariat General for Legal and Parliamentary Affairs.

8. extra_roles_manually_collected.csv: A dataset with manually collected information from Wikipedia about additional government or parliament posts such as Chairman of the Parliament,
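The preprocessing described for tell_all_cleaned.csv can be approximated as follows. This is only a sketch under stated assumptions, not the authors' pipeline: the single party pattern, the abbreviation `@ΝΔ`, the stopword list, and the order in which the steps run are all hypothetical choices made for illustration.

```python
import re
import unicodedata

# Hypothetical party pattern covering two grammatical cases of one party name.
PARTY_PATTERNS = [
    (re.compile(unicodedata.normalize("NFC", r"Νέας? Δημοκρατίας?")), "@ΝΔ"),
]
# Hypothetical, accent-free stopword list.
STOPWORDS = {"και", "το", "να"}


def strip_accents(text):
    """Remove combining marks (accents) while keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def preprocess(text):
    text = unicodedata.normalize("NFC", text)
    # 1. Replace party references with "@" plus an abbreviation.
    for pattern, tag in PARTY_PATTERNS:
        text = pattern.sub(tag, text)
    # 2. Remove accents.
    text = strip_accents(text)
    # 3. Remove punctuation except full stops (and the "@" marker).
    text = re.sub(r"[^\w\s.@]", "", text)
    # 4. Drop strings shorter than 2 characters, then mask stopwords.
    tokens = [tok for tok in text.split() if len(tok) >= 2]
    return " ".join("@sw" if tok.lower() in STOPWORDS else tok for tok in tokens)


cleaned = preprocess("Η Νέα Δημοκρατία και το κόμμα, ψήφισαν!")
print(cleaned)
```

Party replacement runs before accent removal here because the hypothetical pattern is written with accents; a pipeline that strips accents first would need accent-free patterns instead.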

Cite

The Devastator (2022). Gender by Name (Time-series) [Dataset]. https://www.kaggle.com/datasets/thedevastator/automated-gender-identification-using-name-proba/data

Gender by Name (Time-series)

Probability of given names being M/F based on US names from 1930-Present

Explore at:

Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Dec 5, 2022

Dataset provided by

Kaggle

Authors

The Devastator

Description

Automated Gender Identification Using Name Probabilities

2019 US Social Security Administration Data

By Derek Howard [source]

About this dataset

This dataset provides a tool for generating gender-specific datasets from names alone. It contains the probability that a given name belongs to a certain gender, based on US Social Security records from the last century, which makes it easy to assign genders to datasets that do not natively include this information. Probabilities are included only for names with 5 or more people on record, so many less common names are still covered. With this resource, users can generate gender-aware data quickly and make gender identification in datasets easier and more accurate.

How to use the dataset

This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.

To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).

Beyond assigning genders directly from these probabilities, you can decide how strictly to apply them: treat the probabilities as a baseline to combine with other evidence, or as the final classification, depending on your needs. Experimentation is highly encouraged!
Good luck!
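The lookup described above might be sketched like this. The sketch assumes the name_gender.csv layout listed under Columns (`name`, `gender`, `probability`); the `M`/`F` codes, the sample probabilities, and the 0.9 threshold are hypothetical choices, not values taken from the dataset.

```python
import csv
import io

# Hypothetical rows in the name_gender.csv layout.
SAMPLE = """name,gender,probability
Alex,M,0.62
Maria,F,0.99
John,M,0.99
"""


def load_lookup(handle):
    """Map each lower-cased name to its (gender, probability) pair."""
    lookup = {}
    for row in csv.DictReader(handle):
        key = row["name"].strip().lower()
        lookup[key] = (row["gender"], float(row["probability"]))
    return lookup


def assign_gender(name, lookup, threshold=0.9):
    """Return the listed gender only when its probability meets the threshold."""
    entry = lookup.get(name.strip().lower())
    if entry is None:
        return "unknown"      # name not present in the table
    gender, probability = entry
    return gender if probability >= threshold else "uncertain"


lookup = load_lookup(io.StringIO(SAMPLE))
for first_name in ["Maria", "alex", "Zed"]:
    print(first_name, assign_gender(first_name, lookup))
```

Lowering the threshold treats the table more like an absolute classifier; raising it leaves more names as "uncertain" for other evidence to resolve.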

Research Ideas

Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.

Generate gender neutral names - use this data to generate random names with no gender bias.

Automate record lookup - quickly and accurately assign genders to records based on the probability associated with each name.
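For the gender-neutral names idea above, one possible approach is to keep names whose listed probability is close to 0.5. This sketch assumes the `name`, `gender`, `probability` columns and a cutoff of 0.6, both of which are illustrative assumptions; the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical rows in the name_gender.csv layout.
SAMPLE = """name,gender,probability
Riley,F,0.55
Jordan,M,0.58
Maria,F,0.99
"""


def ambiguous_names(handle, max_probability=0.6):
    """Return names whose gender probability is at or below the cutoff."""
    names = {
        row["name"].strip()
        for row in csv.DictReader(handle)
        if float(row["probability"]) <= max_probability
    }
    return sorted(names)


neutral = ambiguous_names(io.StringIO(SAMPLE))
print(neutral)
```

A set is used so that a name appearing in more than one row is reported once.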

Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

License

Unknown License - Please check the dataset description for more information.

Columns

File: name_gender.csv

| Column name | Description                                                         |
|:------------|:--------------------------------------------------------------------|
| name        | The name of the person. (String)                                    |
| gender      | The gender of the person. (String)                                  |
| probability | The probability of the gender being assigned to the person. (Float) |

Acknowledgements

If you use this dataset in your research, please credit the original authors and Derek Howard.
