33 datasets found
  1. Gender by Name (Time-series)

    • kaggle.com
    Updated Dec 5, 2022
    Cite
    The Devastator (2022). Gender by Name (Time-series) [Dataset]. https://www.kaggle.com/datasets/thedevastator/automated-gender-identification-using-name-proba/data
    Dataset updated
    Dec 5, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    Automated Gender Identification Using Name Probabilities

    2019 US Social Security Administration Data

    By Derek Howard [source]

    About this dataset

    This dataset supports assigning a likely gender to records from first names alone. It gives the probability that a name belongs to a particular gender, based on US Social Security records from the last century, which makes it straightforward to add a gender field to datasets that do not natively include one. All probability values were derived from records with 5 or more people associated with each name, so many less common names are covered as well.

    How to use the dataset

    This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.

    To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).

    In addition to simply assigning genders from these probabilities alone, users of this dataset also have more control over their classifications - they can use it as either a baseline or as an absolute measure of accuracy depending on their exact needs/preferences. Experimentation is highly encouraged here!
    Good luck!

    Research Ideas

    • Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.

    • Generate gender neutral names - use this data to generate random names with no gender bias.

    • Automate record lookup - quickly and accurately assign genders based on the probability associated with their name

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: name_gender.csv

    | Column name | Description                                                         |
    |:------------|:--------------------------------------------------------------------|
    | name        | The name of the person. (String)                                    |
    | gender      | The gender of the person. (String)                                  |
    | probability | The probability of the gender being assigned to the person. (Float) |
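A minimal sketch of how these three columns might be used to label a list of first names. It follows the column table (name, gender, probability) rather than the two-column layout mentioned under "How to use the dataset". The sample rows, the "F"/"M" codes, the `min_probability` cutoff, and the helper names `load_lookup` and `assign_gender` are illustrative assumptions, not part of the dataset.

```python
import csv
import io

# Hypothetical sample in the documented name,gender,probability layout.
SAMPLE = """name,gender,probability
Mary,F,0.99
James,M,0.99
Casey,M,0.59
"""

def load_lookup(handle):
    """Map lower-cased name -> (gender, probability)."""
    return {
        row["name"].lower(): (row["gender"], float(row["probability"]))
        for row in csv.DictReader(handle)
    }

def assign_gender(name, lookup, min_probability=0.8):
    """Return the listed gender, or None if the name is unknown or too ambiguous."""
    gender, probability = lookup.get(name.lower(), (None, 0.0))
    return gender if probability >= min_probability else None

lookup = load_lookup(io.StringIO(SAMPLE))
labels = {n: assign_gender(n, lookup) for n in ["Mary", "casey", "Zed"]}
print(labels)
```

Returning `None` below the cutoff keeps ambiguous or unseen names out of downstream counts instead of forcing a label; the cutoff itself would need tuning for a given use.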

    Acknowledgements

    If you use this dataset in your research, please credit the original author, Derek Howard.

  2. The annual list of first names of newborns — city of Nancy

    • gimi9.com
    • data.europa.eu
    Updated Dec 16, 2023
    Cite
    (2023). The annual list of first names of newborns — city of Nancy [Dataset]. https://gimi9.com/dataset/eu_5d2c2919634f41429aae86ce/
    Dataset updated
    Dec 16, 2023
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Description

    The annual list of first names of newborns is a simple and popular dataset. The data come from the civil status register and contain the following essential fields: sex of the newborn, first name of the newborn, number of occurrences of the first name for the corresponding year, and year of the survey.

    The dataset lists the first names of children born in Nancy since 2016, in CSV format, with the number of occurrences of each given name, classified by year and sex. First names with fewer than five occurrences are not published, in order to protect personal data. The standardisation of this dataset follows the recommendations of Opendata France, based on the work on the Socle Commun des Données Locales.

    Definition of headers

    • COLL_NOM: name of the municipality
    • COLL_INSEE: Insee code of the municipality where the first names are registered in the civil status records of the place of birth. Note that the place of birth may differ from the parents' place of residence.
    • CHILD_SEX: sex corresponding to the first name, M for male or F for female
    • CHILD_PRENOM: first name of the newborn(s) as recorded in the civil status documents of the corresponding year
    • NUMBER_OCCURENCES: number of occurrences of the first name
    • YEAR: year of birth

    Total births reported to the City of Nancy

    | Year | Total births | Girls | Boys |
    |:-----|-------------:|------:|-----:|
    | 2018 |         5135 |  2692 | 2443 |
    | 2017 |         5483 |  2704 | 2779 |
    | 2016 |         5544 |  2692 | 2852 |
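As a quick consistency check on the reported totals, the sketch below confirms that girls plus boys equals the total for each year and prints the share of girls. The dictionary layout and the helper name `girls_share` are illustrative; only the numbers come from the description.

```python
# Birth totals reported by the City of Nancy, copied from the description.
totals = {
    2018: {"total": 5135, "girls": 2692, "boys": 2443},
    2017: {"total": 5483, "girls": 2704, "boys": 2779},
    2016: {"total": 5544, "girls": 2692, "boys": 2852},
}

def girls_share(year):
    """Fraction of reported births that were girls in the given year."""
    row = totals[year]
    return row["girls"] / row["total"]

for year, row in sorted(totals.items()):
    # The girl and boy counts should add up to the reported total.
    assert row["girls"] + row["boys"] == row["total"]
    print(year, f"{girls_share(year):.1%}")
```

Because names with fewer than five occurrences are withheld, summing NUMBER_OCCURENCES in the published CSV would be expected to fall short of these totals.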

  3. Publication records for top US researchers in 6 fields

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Amaral Lab (2023). Publication records for top US researchers in 6 fields [Dataset]. http://doi.org/10.6084/m9.figshare.3472616.v2
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Amaral Lab
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset lists the full publication and biographical records for 3,979 researchers in 6 distinct fields working at top U.S. institutions.

    The file full_researcher_data.csv contains the list of all 3,979 researchers in csv format with the following fields:

    • field - scientific discipline
    • total_publications - total number of publications
    • short_name - researcher's abbreviated name as: [last name], [first and middle name initials]. Used for searching in the authors field in the file full_publication_data.csv.
    • full_name - researcher's full name as: [Last name], [First Name] [Middle initials]
    • phd_year - year of PhD completion
    • gender - researcher gender as: M (male) or F (female)
    • author_id - index of researcher. Matches the author_id field in the file full_publication_data.csv.
    • university - institution of current employment

    Gender and PhD year are not available for all researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.

    The file full_publication_data.csv contains the list of all 417,609 publications in csv format with the following fields:

    • title - publication title
    • journal - publication journal
    • authors - publication authors as a comma-separated list. Author syntax matches the short_name field in the file full_researcher_data.csv.
    • year - publication year
    • author_id - index of researcher. Matches the author_id field in the file full_researcher_data.csv.
    • abstract - publication abstract
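Since both files share the author_id field, they can be joined on it. The following is a minimal sketch that counts publication rows per researcher; the sample rows and the helper name `publications_per_researcher` are hypothetical, and only the column names come from the description.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only the column names come from the description.
RESEARCHERS = """author_id,full_name,field,gender
1,"Doe, Jane A",chemistry,F
2,"Roe, John",ecology,M
"""
PUBLICATIONS = """author_id,title,year
1,Paper A,2001
1,Paper B,2005
2,Paper C,2003
"""

def publications_per_researcher(researchers_csv, publications_csv):
    """Count publication rows per researcher by joining on author_id."""
    counts = Counter(row["author_id"] for row in csv.DictReader(publications_csv))
    return {
        row["full_name"]: counts[row["author_id"]]
        for row in csv.DictReader(researchers_csv)
    }

result = publications_per_researcher(
    io.StringIO(RESEARCHERS), io.StringIO(PUBLICATIONS)
)
print(result)
```

With the real files, the two `io.StringIO` arguments would be replaced by open file handles for full_researcher_data.csv and full_publication_data.csv.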

  4. World Gender Statistics

    • kaggle.com
    zip
    Updated Nov 28, 2016
    Cite
    World Bank (2016). World Gender Statistics [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-gender-statistics/versions/1
    Available download formats: zip (0 bytes)
    Dataset updated
    Nov 28, 2016
    Dataset authored and provided by
    World Bank (http://topics.nytimes.com/top/reference/timestopics/organizations/w/world_bank/index.html)
    Area covered
    World
    Description

    The Gender Statistics database is a comprehensive source for the latest sex-disaggregated data and gender statistics covering demography, education, health, access to economic opportunities, public life and decision-making, and agency.

    The Data

    The data is split into several files, the main one being Data.csv. Data.csv contains all the variables of interest in this dataset, while the others are lists of references and general nation-by-nation information.

    Data.csv contains the following fields:

    Data.csv

    • Country.Name: the name of the country
    • Country.Code: the country's code
    • Indicator.Name: the name of the variable that this row represents
    • Indicator.Code: a unique id for the variable
    • 1960 - 2016: one column EACH for the value of the variable in each year it was available
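Because Data.csv stores one column per year, analysis often starts by reshaping it into one record per country, indicator, and year. Below is a minimal sketch of that reshaping; the sample row, its values, and the helper name `wide_to_long` are made up for illustration, and only the column names come from the field list.

```python
import csv
import io

# Hypothetical excerpt in the Data.csv layout; the values are made up.
SAMPLE = """Country.Name,Country.Code,Indicator.Name,Indicator.Code,1960,1961
Exampleland,EXL,Some indicator,XX.YY.ZZ,1.5,
"""

def wide_to_long(handle):
    """Yield one record per (country, indicator, year) with a non-empty value."""
    for row in csv.DictReader(handle):
        for column, value in row.items():
            # Keep only year columns (all digits) that actually hold a value.
            if not (column and column.isdigit()) or not value:
                continue
            yield {
                "country": row["Country.Code"],
                "indicator": row["Indicator.Code"],
                "year": int(column),
                "value": float(value),
            }

records = list(wide_to_long(io.StringIO(SAMPLE)))
print(records)
```

Skipping empty cells reflects the note that a year column only carries a value when the variable was available for that year.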

    The other files

    I couldn't find any metadata for these, and I'm not qualified to guess at what each of the variables means. I'll list the variables for each file, and if anyone has any suggestions (or, even better, actual knowledge/citations) as to what they mean, please leave a note in the comments and I'll add your info to the data description.

    Country-Series.csv

    • CountryCode
    • SeriesCode
    • DESCRIPTION

    Country.csv

    • Country.Code
    • Short.Name
    • Table.Name
    • Long.Name
    • 2-alpha.code
    • Currency.Unit
    • Special.Notes
    • Region
    • Income.Group
    • WB-2.code
    • National.accounts.base.year
    • National.accounts.reference.year
    • SNA.price.valuation
    • Lending.category
    • Other.groups
    • System.of.National.Accounts
    • Alternative.conversion.factor
    • PPP.survey.year
    • Balance.of.Payments.Manual.in.use
    • External.debt.Reporting.status
    • System.of.trade
    • Government.Accounting.concept
    • IMF.data.dissemination.standard
    • Latest.population.census
    • Latest.household.survey
    • Source.of.most.recent.Income.and.expenditure.data
    • Vital.registration.complete
    • Latest.agricultural.census
    • Latest.industrial.data
    • Latest.trade.data
    • Latest.water.withdrawal.data

    FootNote.csv

    • CountryCode
    • SeriesCode
    • Year
    • DESCRIPTION

    Series-Time.csv

    • SeriesCode
    • Year
    • DESCRIPTION

    Series.csv

    • Series.Code
    • Topic
    • Indicator.Name
    • Short.definition
    • Long.definition
    • Unit.of.measure
    • Periodicity
    • Base.Period
    • Other.notes
    • Aggregation.method
    • Limitations.and.exceptions
    • Notes.from.original.source
    • General.comments
    • Source
    • Statistical.concept.and.methodology
    • Development.relevance
    • Related.source.links
    • Other.web.links
    • Related.indicators
    • License.Type

    Acknowledgements

    This dataset was downloaded from The World Bank's Open Data project. The summary of the Terms of Use of this data is as follows:

    • You are free to copy, distribute, adapt, display or include the data in other products for commercial and noncommercial purposes at no cost subject to certain limitations summarized below.

    • You must include attribution for the data you use in the manner indicated in the metadata included with the data.

    • You must not claim or imply that The World Bank endorses your use of the data, or use The World Bank’s logo(s) or trademark(s) in conjunction with such use.

    • Other parties may have ownership interests in some of the materials contained on The World Bank Web site. For example, we maintain a list of some specific data within the Datasets that you may not redistribute or reuse without first contacting the original content provider, as well as information regarding how to contact the original content provider. Before incorporating any data in other products, please check the list: Terms of use: Restricted Data.

    -- [ed. note: this last is not applicable to the Gender Statistics database]

    • The World Bank makes no warranties with respect to the data and you agree The World Bank shall not be liable to you in connection with your use of the data.

    • This is only a summary of the Terms of Use for Datasets Listed in The World Bank Data Catalogue. Please read the actual agreement that controls your use of the Datasets, which is available here: Terms of use for datasets. Also see World Bank Terms and Conditions.

  5. Nyc popular baby names

    • kaggle.com
    Updated Jun 20, 2022
    Cite
    Rahul Sarkar (2022). Nyc popular baby names [Dataset]. https://www.kaggle.com/datasets/rahulsarkar221/nyc-popular-baby-names
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Kaggle
    Authors
    Rahul Sarkar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York
    Description

    This dataset contains popular baby names in New York.

    Dataset: 1 file (popular-baby-names.csv)

    Columns

    • Year of Birth: year of the baby's birth
    • Gender: gender of the baby
    • Ethnicity: ethnic group the baby belongs to
    • Child's First Name: first name of the child
    • Count: number of babies given that name
    • Ranking: ranking of that name
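A minimal sketch of reading these columns and picking the highest-count name per year, gender, and ethnicity. The sample rows, their values, and the helper name `top_names` are hypothetical; only the column names come from the description.

```python
import csv
import io

# Hypothetical rows using the column names listed in the description.
SAMPLE = """Year of Birth,Gender,Ethnicity,Child's First Name,Count,Ranking
2016,FEMALE,HISPANIC,Isabella,276,1
2016,FEMALE,HISPANIC,Mia,261,2
2016,MALE,HISPANIC,Liam,387,1
"""

def top_names(handle):
    """Return the highest-count name for each (year, gender, ethnicity) group."""
    best = {}
    for row in csv.DictReader(handle):
        key = (row["Year of Birth"], row["Gender"], row["Ethnicity"])
        count = int(row["Count"])
        if key not in best or count > best[key][1]:
            best[key] = (row["Child's First Name"], count)
    return best

best = top_names(io.StringIO(SAMPLE))
print(best)
```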

  6. Gender and Intersectional Disparities in Biographies on English and Spanish...

    • b2find.eudat.eu
    Updated Jan 9, 2025
    + more versions
    Cite
    (2025). Gender and Intersectional Disparities in Biographies on English and Spanish Wikipedia Front Pages (2013-2023) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/0f9014ea-070d-5f25-8503-e46a38cf3c40
    Dataset updated
    Jan 9, 2025
    Description

    This dataset contains two folders with different data:

    • The "Gender" folder provides the gender distribution of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. For the Spanish edition, data were collected from the "Artículos buenos" and "Artículos destacados" sections and are shown in aggregated form.

    • The "Intersectionality" folder provides the distribution by various sociodemographic attributes of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. It is structured into four CSV files. Three correspond to the English Wikipedia edition: the English 3C CSV, containing data from the sections "Did you know...", "In the news", and "On this day..."; a CSV dedicated to "English Featured Article"; and another to "English Featured Picture". The fourth CSV contains data from the Spanish edition of Wikipedia, extracted from the sections "Artículo Destacado" and "Artículo Bueno". Within each CSV, the data are presented in columns, each dedicated to a sociodemographic attribute.

  7. Worldwide Gender Differences in Public Code Contributions - Replication...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 9, 2022
    Cite
    Davide Rossi (2022). Worldwide Gender Differences in Public Code Contributions - Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6020474
    Dataset updated
    Feb 9, 2022
    Dataset provided by
    Stefano Zacchiroli
    Davide Rossi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Worldwide Gender Differences in Public Code Contributions - Replication Package

    This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011

    This document comes with the software needed to mine and analyze the data presented in the paper.

    Prerequisites

    These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

    • click==8.0.3
    • cycler==0.10.0
    • gender-guesser==0.4.0
    • kiwisolver==1.3.2
    • matplotlib==3.4.3
    • numpy==1.21.3
    • pandas==1.3.4
    • patsy==0.5.2
    • Pillow==8.4.0
    • pyparsing==2.4.7
    • python-dateutil==2.8.2
    • pytz==2021.3
    • scipy==1.7.1
    • six==1.16.0
    • statsmodels==0.13.0
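Before running the pipeline, it may help to confirm that the active environment matches the pinned versions. The sketch below is not part of the replication package; it uses the standard-library `importlib.metadata` module, and the helper names `parse_pin` and `check_pins` are illustrative. Only three of the pins are shown.

```python
from importlib import metadata

# A few of the pins listed in the prerequisites; extend with the rest as needed.
PINS = ["click==8.0.3", "gender-guesser==0.4.0", "pandas==1.3.4"]

def parse_pin(pin):
    """Split 'name==version' into a (name, version) tuple."""
    name, _, version = pin.partition("==")
    return name.strip(), version.strip()

def check_pins(pins):
    """Map each package to 'ok', 'missing', or the mismatching installed version."""
    report = {}
    for name, wanted in map(parse_pin, pins):
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else installed
    return report

print(check_pins(PINS))
```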

    Initial data

    swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

    names.tab - forenames and surnames per country with their frequency

    zones.acc.tab - countries/territories, timezones, population and world zones

    c_c.tab - ccTLD entities - world zones matches

    Data preparation

    Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst:
      sh> ./export.sh

    Run the authors cleanup script to create authors--clean.csv.zst:
      sh> ./cleanup.sh authors.csv.zst

    Filter out implausible names and create authors--plausible.csv.zst:
      sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

    Gender detection

    Run the gender guessing script to create author-fullnames-gender.csv.zst:
      sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst

    Database creation and data ingestion

    Create the PostgreSQL DB:
      sh> createdb gender-commit
    From now on, commands prefixed with the psql> prompt are assumed to be run in psql on the gender-commit database.

    Import data into the PostgreSQL DB:
      sh> ./import_data.sh

    Zone detection

    Extract commits data from the DB and create commits.tab, which is used as input for the gender detection script:
      sh> psql -f extract_commits.sql gender-commit

    Run the world zone detection script to create commit_zones.tab.zst:
      sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
    Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

    Read zones assignment data from the file into the DB:
      psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

    Extraction and graphs

    Run the script that executes the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
      sh> ./extract_data.sh

    Run the script to create the graphs from all the previously extracted tab files. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
      sh> ./create_charts.sh

    Additional graphs

    This package also includes some already-made graphs

    authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period

    authors_zones_2.pdf: ditto with at least two commits per period

    authors_zones_10.pdf: ditto with at least ten commits per period

  8. Hospital Management Dataset

    • kaggle.com
    Updated May 30, 2025
    Cite
    Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset
    Dataset updated
    May 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanak Baghel
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

    Dataset Overview

    This dataset includes five CSV files:

    1. patients.csv – Patient demographics, contact details, registration info, and insurance data

    2. doctors.csv – Doctor profiles with specializations, experience, and contact information

    3. appointments.csv – Appointment dates, times, visit reasons, and statuses

    4. treatments.csv – Treatment types, descriptions, dates, and associated costs

    5. billing.csv – Billing amounts, payment methods, and status linked to treatments

    📁 Files & Column Descriptions

    patients.csv

    Contains patient demographic and registration details.

    Column -> Description

    patient_id -> Unique ID for each patient
    first_name -> Patient's first name
    last_name -> Patient's last name
    gender -> Gender (M/F)
    date_of_birth -> Date of birth
    contact_number -> Phone number
    address -> Address of the patient
    registration_date -> Date of first registration at the hospital
    insurance_provider -> Insurance company name
    insurance_number -> Policy number
    email -> Email address

    doctors.csv

    Details about the doctors working in the hospital.

    Column -> Description

    doctor_id -> Unique ID for each doctor
    first_name -> Doctor's first name
    last_name -> Doctor's last name
    specialization -> Medical field of expertise
    phone_number -> Contact number
    years_experience -> Total years of experience
    hospital_branch -> Branch of hospital where doctor is based
    email -> Official email address

    appointments.csv

    Records of scheduled and completed patient appointments.

    Column -> Description

    appointment_id -> Unique appointment ID
    patient_id -> ID of the patient
    doctor_id -> ID of the attending doctor
    appointment_date -> Date of the appointment
    appointment_time -> Time of the appointment
    reason_for_visit -> Purpose of visit (e.g., checkup)
    status -> Status (Scheduled, Completed, Cancelled)

    treatments.csv

    Information about the treatments given during appointments.

    Column -> Description

    treatment_id -> Unique ID for each treatment
    appointment_id -> Associated appointment ID
    treatment_type -> Type of treatment (e.g., MRI, X-ray)
    description -> Notes or procedure details
    cost -> Cost of treatment
    treatment_date -> Date when treatment was given

    billing.csv

    Billing and payment details for treatments.

    Column -> Description

    bill_id -> Unique billing ID
    patient_id -> ID of the billed patient
    treatment_id -> ID of the related treatment
    bill_date -> Date of billing
    amount -> Total amount billed
    payment_method -> Mode of payment (Cash, Card, Insurance)
    payment_status -> Status of payment (Paid, Pending, Failed)
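The tables link through their ID columns (for example, billing.csv refers to patients.csv via patient_id). A minimal sketch of one such join, summing paid bill amounts per patient, is shown below. The sample rows, the ID formats, and the helper name `paid_total_by_patient` are hypothetical; only the column names and the "Paid" status value come from the descriptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; column names follow the descriptions.
PATIENTS = """patient_id,first_name,last_name
P001,Ana,Lopez
P002,Ben,Okoro
"""
BILLING = """bill_id,patient_id,treatment_id,amount,payment_status
B001,P001,T001,120.50,Paid
B002,P001,T002,80.00,Pending
B003,P002,T003,200.00,Paid
"""

def paid_total_by_patient(patients_csv, billing_csv):
    """Sum paid bill amounts per patient, joining billing to patients on patient_id."""
    names = {
        row["patient_id"]: f'{row["first_name"]} {row["last_name"]}'
        for row in csv.DictReader(patients_csv)
    }
    totals = defaultdict(float)
    for row in csv.DictReader(billing_csv):
        if row["payment_status"] == "Paid":
            totals[names[row["patient_id"]]] += float(row["amount"])
    return dict(totals)

totals = paid_total_by_patient(io.StringIO(PATIENTS), io.StringIO(BILLING))
print(totals)
```

The same pattern extends to the other links, such as treatments.csv to appointments.csv via appointment_id.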

    Possible Use Cases

    SQL queries and relational database design

    Exploratory data analysis (EDA) and dashboarding

    Machine learning projects (e.g., cost prediction, no-show analysis)

    Feature engineering and data cleaning practice

    End-to-end healthcare analytics workflows

    Recommended Tools & Resources

    SQL (joins, filters, window functions)

    Pandas and Matplotlib/Seaborn for EDA

    Scikit-learn for ML models

    Pandas Profiling for automated EDA

    Plotly for interactive visualizations

    Please note: all data is synthetically generated for educational and project use. No real patient information is included.

    If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.

  9. Baby Names from Social Security Card Applications - National Data

    • catalog.data.gov
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    Updated Jul 4, 2025
    + more versions
    Cite
    Social Security Administration (2025). Baby Names from Social Security Card Applications - National Data [Dataset]. https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Social Security Administration (http://ssa.gov/)
    Description

    The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications from 1880 onward.

  10. Popular Baby Names

    • data.cityofnewyork.us
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • +2more
    application/rdfxml +2
    Updated Jun 8, 2025
    + more versions
    Cite
    Department of Health and Mental Hygiene (DOHMH) (2025). Popular Baby Names [Dataset]. https://data.cityofnewyork.us/Health/Popular-Baby-Names/25th-nujf
    Available download formats: application/rdfxml, application/rssxml, xml
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    Department of Health and Mental Hygiene (DOHMH)
    Description

    Popular Baby Names by Sex and Ethnic Group

    Data were collected through civil birth registration. Each record represents the ranking of a baby name in the order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary year to year.
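One way to act on this caution is to flag names whose count sits near 10 before comparing ranks across years. The sketch below is illustrative only: the description does not say how close is "close", so the `margin` of 3 is an assumption, as are the sample pairs and the helper name `is_rank_unstable`.

```python
def is_rank_unstable(count, threshold=10, margin=3):
    """Flag a name whose frequency count is within `margin` of `threshold`."""
    return abs(count - threshold) <= margin

# Hypothetical (name, count) pairs for illustration.
names = [("Ava", 120), ("Zuri", 11), ("Noor", 13), ("Lior", 14)]
flagged = [name for name, count in names if is_rank_unstable(count)]
print(flagged)
```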

  11. CrowS-Pairs (Social biases in MLMs)

    • kaggle.com
    Updated Nov 27, 2022
    Cite
    The Devastator (2022). CrowS-Pairs (Social biases in MLMs) [Dataset]. https://www.kaggle.com/datasets/thedevastator/a-dataset-for-measuring-social-biases-in-mlms
    Dataset updated
    Nov 27, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CrowS-Pairs (Social biases in MLMs)

    CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked LM

    About this dataset

    The CrowS-Pairs dataset is a collection of 1,508 sentence pairs that cover nine types of bias: race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. In each pair, the first sentence can demonstrate or violate a stereotype. The other sentence is a minimal edit of the first: the only words that change between them are those that identify the group. Each example has the following information:

    Columns: sent_more, sent_less, stereo_antistereo, bias_type, annotations, anon_writer, anon_annotators, prompt, source

    How to use the dataset

    This dataset can be used to measure social biases in MLMs by evaluating, for each pair, which of the two sentences a model scores as more likely.

    Research Ideas

    • Measuring the ability of MLMs to identify and avoid social biases;
    • Developing new methods for reducing social biases in MLMs; and
    • Investigating the impact of social biases on downstream tasks such as reading comprehension or question answering

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: crows_pairs_anonymized.csv

    | Column name | Description |
    |:------------|:------------|
    | sent_more | The first sentence in the pair, which can demonstrate or violate a stereotype. (String) |
    | sent_less | The second sentence in the pair, which is a minimal edit of the first sentence. The only words that change between them are those that identify the group. (String) |
    | stereo_antistereo | Whether the first sentence demonstrates or violates a stereotype. (String) |
    | bias_type | The type of bias represented in the sentence pair. (String) |
    | annotations | The annotations made by the crowdworkers on the sentence pair. (String) |
    | anon_writer | The anonymous writer of the sentence pair. (String) |
    | anon_annotators | The anonymous annotators of the sentence pair. (String) |
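    The column layout above can be explored with a few lines of standard-library Python. The following is a minimal sketch, not the dataset authors' tooling: the sample rows are invented placeholders, and the label values `stereo`, `antistereo`, `age`, and `religion` are assumptions about how the columns are filled. The `changed_words` helper assumes both sentences have the same number of whitespace-separated words.

```python
import csv
import io
from collections import Counter

# Hypothetical rows shaped like crows_pairs_anonymized.csv; the sentences and
# label values are placeholders, not taken from the dataset.
SAMPLE = """sent_more,sent_less,stereo_antistereo,bias_type
"Placeholder sentence about group A.","Placeholder sentence about group B.",stereo,age
"Placeholder sentence about group C.","Placeholder sentence about group D.",antistereo,age
"Placeholder sentence about group E.","Placeholder sentence about group F.",stereo,religion
"""


def count_pairs(csv_text):
    """Count sentence pairs per (bias_type, stereo_antistereo) combination."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["bias_type"], row["stereo_antistereo"]) for row in reader)


def changed_words(sent_more, sent_less):
    """Return the word pairs that differ between the two sentences of a pair.

    Assumes the minimal edit keeps the word count unchanged.
    """
    more, less = sent_more.split(), sent_less.split()
    return [(a, b) for a, b in zip(more, less) if a != b]


counts = count_pairs(SAMPLE)
diff = changed_words("Placeholder sentence about group A.",
                     "Placeholder sentence about group B.")
```

    To run this on the real file, the same `csv.DictReader` call can be pointed at an opened `crows_pairs_anonymized.csv` instead of the in-memory sample.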

    File: prompts.csv | Column name | Descripti...

  12. Author Gender Representation at Audio Engineering Conferences - An...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Feb 12, 2021
    Cite
    Kat Young; Michael Lovedee-Turner; Jude Brereton; Helena Daffern (2021). Author Gender Representation at Audio Engineering Conferences - An Anonymised Dataset v2 [Dataset]. http://doi.org/10.5281/zenodo.4535610
    Explore at:
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kat Young; Michael Lovedee-Turner; Jude Brereton; Helena Daffern
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the author gender dataset (as a comma-delimited .csv file) originally created in association with the paper entitled 'The Impact of Gender on Conference Authorship in Audio Engineering: Analysis Using a New Data Collection Method', but since extended to include conferences up to the end of 2019. The original dataset is available at: https://doi.org/10.5281/zenodo.1249693. Please cite both the paper and the relevant dataset if used. Visualisation is available at: http://tibbakoi.github.io/aesgender.

    ---

    The dataset was produced using a novel method which used self-identified pronouns, therefore allowing for as many groups as necessary to describe the population.

    1. A list of authors was generated from conference proceedings.
    2. An email was sent to each author to acquire their pronoun.
    3. If no email was available/no response was received, a pronoun was acquired from a biography.
    4. If no biography was available, a pronoun was inferred from traditional gender markers and gender presentation.
    5. If no gender marker/photograph was available, the entry was labelled as 'Information Unavailable'. For brevity, the label 'Unknown' is used in the paper.

    ---

    The columns in the dataset are as follows:

    1. ID: unique identifier of entry
    2. Pronoun: pronoun of entry
    3. Position (abs): numerical absolute position within author list for entry
    4. Position (relative): relative position within author list for entry (either First, Last, or Middle)
    5. Single/multi-author: whether the publication for that entry has a single author or has multiple authors (single author publications are excluded from author position analysis)
    6. Conference: Full conference name of entry
    7. Topic: Topic of conference of entry, taken from conference name
    8. Year: Year of conference of entry
    9. Type: Type of publication for that entry as listed on the online conference proceedings
    10. Grouped Type: Grouping of publication types for that entry for easier analysis due to inconsistencies in online conference proceedings (groups are: workshop, poster, paper, panel, keynote, invited speaker, invited paper, demo)
    11. Inc. for author pos?: True/False as to whether to include the entry for analysis over author position (included types are: paper, invited paper, poster (all with multiple authors) as these have meaningful author orders)
    12. Inc. for single/multi-author?: True/False as to whether to include the entry for analysis over single/multi author (included types are: paper, invited paper, poster as these have meaningful author orders)
    13. Invited paper status: Grouping of the types to allow statistical analysis over invited vs non-invited types (invited types are: invited speaker, invited paper, keynote, panel. Non-invited types are: poster, paper, demo, workshop)

    NB: Some grouping of the data is required as online conference proceedings are not always consistent (Column 10). Some labelling of the data is required to determine which entries to include in certain types of analysis (Columns 11-13).
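    As an illustration of how the inclusion flags in Columns 11-13 might be used, here is a minimal sketch that tallies pronouns by relative author position, keeping only entries flagged for author-position analysis. It assumes the CSV headers match the column names listed above and that the flag is stored as the text `True`/`False`; the sample rows are invented.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical rows; only the columns needed for this tally are shown.
SAMPLE = """ID,Pronoun,Position (relative),Inc. for author pos?
1,She,First,True
2,He,Last,True
3,He,First,True
4,They,Middle,False
"""


def pronouns_by_position(csv_text):
    """Tally pronouns per relative position for entries flagged for inclusion."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip entries excluded from author-position analysis (Column 11).
        if row["Inc. for author pos?"].strip().lower() == "true":
            counts[row["Position (relative)"]][row["Pronoun"]] += 1
    return counts


by_position = pronouns_by_position(SAMPLE)
```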

    ---

    This dataset is distributed under the Creative Commons Attribution 4.0 license in the hope that it will prove useful, but with no warranty, including no implied warranty of merchantability or fitness for a particular purpose.

    ---

    Dataset curated by: Kat Young and Michael Lovedee-Turner, formerly at the AudioLab, Dept. of Electronic Engineering, University of York.
    Contact: kathryn.ae.young@gmail.com

  13. Airline Dataset

    • kaggle.com
    Updated Sep 26, 2023
    Cite
    Sourav Banerjee (2023). Airline Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sourav Banerjee
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.

    Content

    This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.

    Dataset Glossary (Column-wise)

    • Passenger ID - Unique identifier for each passenger
    • First Name - First name of the passenger
    • Last Name - Last name of the passenger
    • Gender - Gender of the passenger
    • Age - Age of the passenger
    • Nationality - Nationality of the passenger
    • Airport Name - Name of the airport where the passenger boarded
    • Airport Country Code - Country code of the airport's location
    • Country Name - Name of the country the airport is located in
    • Airport Continent - Continent where the airport is situated
    • Continents - Continents involved in the flight route
    • Departure Date - Date when the flight departed
    • Arrival Airport - Destination airport of the flight
    • Pilot Name - Name of the pilot operating the flight
    • Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)

    Structure of the Dataset

    https://i.imgur.com/cUFuMeU.png

    Acknowledgement

    The dataset provided here is a simulated example generated with the online platform Mockaroo, a web-based tool for creating customizable synthetic datasets that closely resemble real data. It is primarily intended for developers, testers, and data experts who need sample data for uses such as testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.

    Cover Photo by: Kevin Woblick on Unsplash

    Thumbnail by: Airplane icons created by Freepik - Flaticon

  14. Trends in gender homophily in scientific publications (data)

    • observatorio-investigacion.unavarra.es
    • data.niaid.nih.gov
    • +1more
    Updated 2024
    Cite
    Torre, Margarita; Prieto-Alonso, Jesús Álvaro; Ucar, Iñaki (2024). Trends in gender homophily in scientific publications (data) [Dataset]. https://observatorio-investigacion.unavarra.es/documentos/688b600d17bb6239d2d47fa9
    Explore at:
    Dataset updated
    2024
    Authors
    Torre, Margarita; Prieto-Alonso, Jesús Álvaro; Ucar, Iñaki
    Description

    This dataset contains records of research articles extracted from the Web of Science (WoS) from 1980 to 2019: in total, 15,642 journals, 28,241,100 articles and 111,980,858 authorships across 153 research areas.

    The main dataset (author_address_article_gend_v3.parquet), in Parquet format, contains all the authorships, where an authorship is defined as the tuple article-author. There are 12 variables per authorship (row):

    ut: unique article identifier.

    daisng_id: unique author identifier.

    author_no: author number, as listed in the article.

    country: author country (two-letter ISO code).

    date: publication date.

    gender: gender of the author ("male" or "female"), as provided by the Genderize.io API.

    probability: probability of the gender attribute, as provided by the Genderize.io API.

    count: number of entries for the author first name, as provided by the Genderize.io API.

    jsc: journal subject category.

    field: field of research.

    research_area: area of research.

    n_aut: number of authors in this publication.

    journal: journal name.

    alphabetical: whether the author list for this article is in alphabetical order.
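    Because `gender`, `probability` and `count` come from name-based inference, an analysis may want to keep only authorships whose inference looks reliable. The sketch below shows one way to do that on in-memory records. The thresholds (`min_probability=0.9`, `min_count=10`) and the sample values are assumptions for illustration, not the criteria used by the dataset authors.

```python
# Hypothetical authorship records using a subset of the variables listed above;
# the values are invented for illustration.
AUTHORSHIPS = [
    {"ut": "A1", "author_no": 1, "gender": "female", "probability": 0.98, "count": 1200},
    {"ut": "A1", "author_no": 2, "gender": "male", "probability": 0.55, "count": 40},
    {"ut": "A2", "author_no": 1, "gender": "male", "probability": 0.99, "count": 5},
    {"ut": "A2", "author_no": 2, "gender": "male", "probability": 0.97, "count": 900},
]


def confident_authorships(records, min_probability=0.9, min_count=10):
    """Keep authorships whose inferred gender clears both (assumed) thresholds."""
    return [
        r for r in records
        if r["probability"] >= min_probability and r["count"] >= min_count
    ]


def female_share(records):
    """Fraction of the given authorships labelled "female"; None if empty."""
    if not records:
        return None
    return sum(r["gender"] == "female" for r in records) / len(records)


kept = confident_authorships(AUTHORSHIPS)
share = female_share(kept)
```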

    With the previous dataset, a resampler was applied to generate null homophily values for each year. There are 4 datasets in R Data Serialization (RDS) format:

    null_field.rds: null homophily values per country, year and field of research.

    null_field_comp.rds: null homophily values per year and field of research (only for complete authorships).

    null_research.rds: null homophily values per year and area of research.

    null_research_comp.rds: null homophily values per year and area of research (only for complete authorships).

    All these datasets have the same structure:

    country: country (two-letter ISO code).

    year: year.

    variable: either field or research area name.

    m: average homophily.

    s: homophily std. error.

    Finally, some supplementary files used in the descriptive analysis and methods:

    File null_research_l2019.rds is an example of the output from the resampling algorithm for year 2019.

    File wos_category_to_field.csv is a mapping from WoS categories to more general fields.

    File jcr_if_2020.csv contains the percentiles of the journal impact factor for the JCR 2020.

  15. Supporting material for "Impact of gender on the formation and outcome of...

    • zenodo.org
    • explore.openaire.eu
    bin, csv +1
    Updated Jul 25, 2022
    Cite
    Leah P. Schwartz; Jean F. Liénard; Stephen V. David (2022). Supporting material for "Impact of gender on the formation and outcome of formal mentoring relationships in the life sciences" [Dataset]. http://doi.org/10.5281/zenodo.6897394
    Explore at:
    Available download formats: csv, text/x-python, bin
    Dataset updated
    Jul 25, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Leah P. Schwartz; Jean F. Liénard; Stephen V. David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains data and analysis code associated with the manuscript: L.P. Schwartz, J. Liénard, S. V. David. (2022) "Impact of gender on formation and outcome of formal mentoring relationships in the life sciences." Figures and tables in the manuscript can be produced by running the make_figures.ipynb notebook. Figures have been marked with headings indicating their position in the manuscript (Figure 1, Figure S1, etc.). In addition, the notebook contains code to reproduce regression analyses that are cited in the text but not directly associated with a figure.

    Data on mentoring relationships derives from Academic Family Tree (AFT, www.academictree.org) and public data sources on funding, publications, and awards. Inclusion criteria, public data sources, and procedures for linking across sources are described in the manuscript. Personal identifiers for researchers have been anonymized, but remain consistent across all data in the repository. In other words, the personal identifier "1" refers to the same person in all dataframes in the repository. But, that person is *not* the same researcher identified as "1" on the public AFT website.

    Installation

    Requires Python 3.x and Pandas. To load the required libraries using Anaconda, run:

    `conda create --name aft -c conda-forge pandas numpy scipy ipython jupyterlab scikit-learn matplotlib statsmodels seaborn pytables`

    Dataframes

    Data is stored as a series of Pandas dataframes within HDF5 or CSV files:

    * cng_tc: The primary dataset used in the analysis. The name is an acronym for "connections" (i.e. training relationships, "cn"), "gender" ("g"), and "trainee count" ("tc"). Each row contains data on the mentor and trainee in one training relationship. See manuscript for inclusion criteria.

    * mentors: Data on mentors. Each row contains data on one mentor. See manuscript for inclusion criteria.

    * mentors_grants, mentors_hindex, mentors_locs_ranked: Subset of mentors with data available for funding (mentors_grants), citation (mentors_hindex), and institution rank (mentors_locs_ranked).

    * mentors_nobel, mentors_hhmi, mentors_nas: Subsets of mentors that received a Nobel (mentors_nobel), Howard Hughes Medical Institute grants (mentors_hhmi), or membership in the National Academy of Sciences (mentors_nas). See manuscript for details of data sources and linking procedures.

    * cn, cng, first_names, gn, gn_all, locs: Partial data (connections only, inferred gender only, connections and gender only, location only, first names and inferred gender only) for more inclusive sets of researchers in AFT. They are generally not used for analysis, but have been included here to calculate statistics on the total amount of data included and to screen for data from U.S. locations.

    * nsf_gender_phds, nsf_gender_pds: National Science Foundation survey data on gender and fraction PhDs conferred per year (nsf_gender_phds) or fraction postdocs employed per year (nsf_gender_pds). See manuscript for details of data source.

    * photo: Data for validation of gender inference method.

    Dataframe columns

    * amount: Mentor's total funding
    * amount_adj: Mentor's total funding (adjusted to 2020 dollars)
    * broad_field: Mentor's general research area (e.g., life sciences, engineering, based on National Science Foundation classifications)
    * continue: Whether trainee went on to become a mentor (i.e., has trainees listed in AFT)
    * country: Country in which mentor's current institution is located
    * firstname: First name of researcher (table of first names is not aligned with tables containing anonymized personal identifiers)
    * first_grant_year: Year of mentor's first grant
    * funding_rate: Mentor's annual funding rate (since first grant)
    * funding_rate_adj: Mentor's annual funding rate (since first grant) adjusted to 2020 dollars
    * hhmi: Whether mentor was granted HHMI funding
    * hindex: Mentor's h-index
    * location: Name of mentor's current institution
    * locid: Identifier for mentor's institution
    * locid_rank: Position of mentor's institution in 2015 Quacquarelli-Symonds rankings (lower numbers are better)
    * locid_rank_rev: Reversed version of "locid_rank" (i.e., higher numbers are better)
    * majorarea: Mentor's specific research area (e.g., neuroscience)
    * male_mentor, male_trainee: Whether the probability that a researcher's first name is used by a person identifying as a man meets threshold (see manuscript for details on gender inference using first names)
    * match_score: Score for string match between institution or name of awardee and researcher
    * mentor_career_start: The date at which the mentor's academic career began
    * mentor_continue_rate: Fraction of mentor's trainees that become mentors
    * mentor_continue_rate_ft: Fraction of mentor's woman trainees that become mentors
    * mentor_continue_rate_mt: Fraction of mentor's man trainees that become mentors
    * mentor_t_p_male0: Fraction of mentor's trainees that are men
    * mentor_t_p_male0_gs: Fraction of mentor's trainees that are men (graduate students only)
    * mentor_t_p_male0_pd: Fraction of mentor's trainees that are men (postdocs only)
    * mentor_tcount0: Mentor's total number of trainees
    * nas: Whether mentor is a member of the National Academy of Sciences
    * nobel: Whether mentor is a Nobel laureate
    * p_male_mentor, p_male_trainee: Probability that a researcher's first name is used by a person identifying as a man
    * pid: Anonymized identifier of researcher
    * pid_mentor: Anonymized identifier of mentor in training relationship
    * pid_trainee: Anonymized identifier of trainee in training relationship
    * pq: "1" if data on training relationship is drawn from ProQuest database and has not been manually edited by a human AFT user
    * relation: Type of training relationship (1: graduate student, 2: postdoc)
    * scorer1, scorer2, scorer3: Results of photo validation of gender inference for each scorer
    * start: Training start year
    * stop: Training end year
    * trainee_tcount: Total people that the trainee has trained
    * triad: Whether trainee has participated in both a graduate-level and postdoctoral training relationship

    The cn dataframe follows slightly different naming conventions, but is not generally used in the analysis (pid1 = pid_trainee, pid2 = pid_mentor, startdate = start, stopdate = stop).
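    The naming difference noted above for the cn dataframe can be reconciled with a simple key mapping. The following is a minimal sketch on plain dictionaries rather than the repository's own code: the sample row is invented, and it assumes `startdate` and `stopdate` hold years, as `start` and `stop` do.

```python
# Mapping taken from the note above: cn column name -> standard column name.
CN_TO_STANDARD = {
    "pid1": "pid_trainee",
    "pid2": "pid_mentor",
    "startdate": "start",
    "stopdate": "stop",
}


def standardize_cn_row(row):
    """Rename cn-style keys to the conventions used by the other dataframes."""
    return {CN_TO_STANDARD.get(key, key): value for key, value in row.items()}


def training_years(row):
    """Length of a training relationship in years, or None if a year is missing."""
    if row.get("start") is None or row.get("stop") is None:
        return None
    return row["stop"] - row["start"]


# Hypothetical cn row (relation 1 = graduate student).
cn_row = {"pid1": 17, "pid2": 4, "relation": 1, "startdate": 2005, "stopdate": 2010}
standard_row = standardize_cn_row(cn_row)
duration = training_years(standard_row)
```

    With the data loaded as a Pandas dataframe, `DataFrame.rename(columns=CN_TO_STANDARD)` applies the same mapping to the whole table.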

  16. Global Freelancers (Raw) Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Urvish Ahir
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description :

    This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

    • Each entry includes demographic, professional, and platform-related information such as:
    • Name, gender, age, and country
    • Primary skill and years of experience
    • Hourly rate (with mixed formatting), client rating, and satisfaction score
    • Language spoken (based on country)
    • Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

    Key Features :

    • Gender-based names using Faker’s male/female name generators
    • Realistic age and experience distribution (with missing and noisy values)
    • Country-language pairs mapped using actual linguistic data
    • Messy formatting: mixed data types, missing values, inconsistent casing
    • Generated entirely in Python using the faker library; no real data used

    Use Cases :

    • Practicing data cleaning and preprocessing
    • Performing EDA (Exploratory Data Analysis)
    • Developing data pipelines: raw → clean → model-ready
    • Teaching feature engineering and handling real-world dirty data
    • Exercises in data validation, outlier detection, and format standardization

    File : global_freelancers_raw.csv

    | Column Name      | Description                               |
    | --------------------- | ------------------------------------------------------------------------ |
    | `freelancer_ID`    | Unique ID starting with `FL` (e.g., FL250001)              |
    | `name`        | Full name of freelancer (based on gender)                |
    | `gender`       | Gender (messy values and case inconsistency)               |
    | `age`         | Age of the freelancer (20–60, with occasional nulls/outliers)      |
    | `country`       | Country name (with random formatting/casing)               |
    | `language`      | Language spoken (mapped from country)                  |
    | `primary_skill`    | Key freelance domain (e.g., Web Dev, AI, Cybersecurity)         |
    | `years_of_experience` | Work experience in years (some missing values or odd values included)  |
    | `hourly_rate (USD)`  | Hourly rate with currency symbols or missing data            |
    | `rating`       | Rating between 1.0–5.0 (some zeros and nulls included)          |
    | `is_active`      | Active status (inconsistently represented as strings, numbers, booleans) |
    | `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs)      |
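    The kinds of mess described in the table (currency symbols, percent signs, inconsistent casing, mixed boolean encodings) can be handled with small normalizing helpers. This is a minimal sketch under assumptions: the exact messy values in the file are not documented here, so the alias sets and sample inputs below are hypothetical.

```python
import re

# Assumed spellings; extend these after inspecting the actual column values.
GENDER_ALIASES = {"m": "male", "male": "male", "f": "female", "female": "female"}
TRUE_VALUES = {"1", "true", "yes", "y"}
FALSE_VALUES = {"0", "false", "no", "n"}


def parse_number(value):
    """Extract the first number from a messy field such as "$45" or "85%"."""
    if value is None:
        return None
    match = re.search(r"-?\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else None


def normalize_gender(value):
    """Map case and whitespace variants to "male"/"female"; None if unrecognized."""
    if value is None:
        return None
    return GENDER_ALIASES.get(str(value).strip().lower())


def parse_active(value):
    """Interpret strings, numbers and booleans as True/False; None if unclear."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None


rate = parse_number("$45")
satisfaction = parse_number("85%")
gender = normalize_gender("  FEMALE ")
active = parse_active("Yes")
```

    Returning `None` for anything unrecognized keeps the cleaning step from silently guessing, which makes the remaining dirty values easy to count and review.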
    
  17. Data from: THE RELEVANCY OF MASSIVE HEALTH EDUCATION IN THE BRAZILIAN PRISON...

    • zenodo.org
    csv, pdf
    Updated Jul 16, 2024
    Cite
    Janaína L. R. da S. Valentim; Sara Dias-Trindade; Eloiza da S. G. Oliveira; José A. M. Moreira; Felipe Fernandes; Manoel Honorio Romão; Philippi S. G. de Morais; Alexandre R. Caitano; Aline P. Dias; Carlos A. P. Oliveira; Karilany D. Coutinho; Ricardo B. Ceccim; Ricardo A. de M. Valentim (2024). THE RELEVANCY OF MASSIVE HEALTH EDUCATION IN THE BRAZILIAN PRISON SYSTEM: THE COURSE "HEALTH CARE FOR PEOPLE DEPRIVED OF FREEDOM" AND ITS IMPACTS [Dataset]. http://doi.org/10.5281/zenodo.6499752
    Explore at:
    Available download formats: csv, pdf
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Janaína L. R. da S. Valentim; Sara Dias-Trindade; Eloiza da S. G. Oliveira; José A. M. Moreira; Felipe Fernandes; Manoel Honorio Romão; Philippi S. G. de Morais; Alexandre R. Caitano; Aline P. Dias; Carlos A. P. Oliveira; Karilany D. Coutinho; Ricardo B. Ceccim; Ricardo A. de M. Valentim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset name: asppl_dataset_v2.csv

    Version: 2.0

    Dataset period: 06/07/2018 - 01/14/2022

    Dataset Characteristics: Multivalued

    Number of Instances: 8118

    Number of Attributes: 9

    Missing Values: Yes

    Area(s): Health and education

    Sources:

    • Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);

    • Brazilian Occupational Classification (CBO) (Brasil, 2022b);

    • National Registry of Health Establishments (CNES) (Brasil, 2022c);

    • Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).

    Description: The data contained in the asppl_dataset_v2.csv dataset (see Table 1) originates from participants of the technology-based educational course “Health Care for People Deprived of Freedom.” The course is available on the AVASUS (Brasil, 2022a). This dataset provides elementary data for analyzing the course’s impact and reach and the profile of its participants. It also updates the data presented in the work by Valentim et al. (2021).

    Table 1: Description of AVASUS dataset features.

    | Attributes | Description | datatype | Value |
    | --- | --- | --- | --- |
    | gender | Gender of the course participant. | Categorical. | Feminino / Masculino / Não Informado. (In English, Female, Male or Uninformed) |
    | course_progress | Percentage of completion of the course. | Numerical. | Range from 0 to 100. |
    | course_evaluation | A score given to the course by the participant. | Numerical. | 0, 1, 2, 3, 4, 5 or NaN. |
    | evaluation_commentary | Comment made by the participant about the course. | Categorical. | Free text or NaN. |
    | region | Brazilian region in which the participant resides. | Categorical. | Brazilian region according to IBGE: Norte, Nordeste, Centro-Oeste, Sudeste or Sul (In English North, Northeast, Midwest, Southeast or South). |
    | CNES | The CNES code refers to the health establishment where the participant works. | Numerical. | CNES Code or NaN. |
    | health_care_level | Identification of the health care network level for which the course participant works. | Categorical. | “ATENCAO PRIMARIA”, “MEDIA COMPLEXIDADE”, “ALTA COMPLEXIDADE”, and their possible combinations. (In English "PRIMARY HEALTH CARE", "SECONDARY HEALTH CARE" AND "TERTIARY HEALTH CARE") |
    | year_enrollment | Year in which the course participant registered. | Numerical. | Year (YYYY). |
    | CBO | Participant occupation. | Categorical. | Text coded according to the Brazilian Classification of Occupations or “Indivíduo sem afiliação formal.” (In English “Individual without formal affiliation.”) |
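    Since several categorical attributes in Table 1 are stored in Portuguese, an English-language analysis may start by translating them. The sketch below uses only the label pairs given in the table; the sample row is invented, and values not covered by a mapping are left unchanged.

```python
# Label translations as stated in Table 1.
GENDER_EN = {
    "Feminino": "Female",
    "Masculino": "Male",
    "Não Informado": "Uninformed",
}
REGION_EN = {
    "Norte": "North",
    "Nordeste": "Northeast",
    "Centro-Oeste": "Midwest",
    "Sudeste": "Southeast",
    "Sul": "South",
}


def translate_row(row):
    """Return a copy of a row with gender and region labels in English."""
    translated = dict(row)
    translated["gender"] = GENDER_EN.get(row["gender"], row["gender"])
    translated["region"] = REGION_EN.get(row["region"], row["region"])
    return translated


# Hypothetical participant record with a subset of the attributes.
sample_row = {"gender": "Feminino", "region": "Nordeste", "course_progress": 100}
english_row = translate_row(sample_row)
```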

    Dataset name: prison_syphilis_and_population_brazil.csv

    Dataset period: 2017 - 2020

    Dataset Characteristics: Multivalued

    Number of Instances: 6

    Number of Attributes: 13

    Missing Values: No

    Source:

    • National Penitentiary Department (DEPEN) (Brasil, 2022d);

    Description: The data contained in the prison_syphilis_and_population_brazil.csv dataset (see Table 2) originate from the National Penitentiary Department Information System (SISDEPEN) (Brasil, 2022d). This dataset provides data on the population and prevalence of syphilis in the Brazilian prison system. It also includes a rate that normalizes the data so that the populations of each region and of Brazil can be compared.

    Table 2: Description of DEPEN dataset Features.

    | Attributes | Description | datatype | Value |
    | --- | --- | --- | --- |
    | Region | Brazilian region in which the participant resides. In addition, the sum of the regions, which refers to Brazil. | Categorical. | Brazil and Brazilian region according to IBGE: North, Northeast, Midwest, Southeast or South. |
    | syphilis_2017 | Number of syphilis cases in the prison system in 2017. | Numerical. | Number of syphilis cases. |
    | syphilis_rate_2017 | Normalized rate of syphilis cases in 2017. | Numerical. | Syphilis case rate. |
    | syphilis_2018 | Number of syphilis cases in the prison system in 2018. | Numerical. | Number of syphilis cases. |
    | syphilis_rate_2018 | Normalized rate of syphilis cases in 2018. | Numerical. | Syphilis case rate. |
    | syphilis_2019 | Number of syphilis cases in the prison system in 2019. | Numerical. | Number of syphilis cases. |
    | syphilis_rate_2019 | Normalized rate of syphilis cases in 2019. | Numerical. | Syphilis case rate. |
    | syphilis_2020 | Number of syphilis cases in the prison system in 2020. | Numerical. | Number of syphilis cases. |
    | syphilis_rate_2020 | Normalized rate of syphilis cases in 2020. | Numerical. | Syphilis case rate. |
    | pop_2017 | Prison population in 2017. | Numerical. | Population number. |
    | pop_2018 | Prison population in 2018. | Numerical. | Population number. |
    | pop_2019 | Prison population in 2019. | Numerical. | Population number. |
    | pop_2020 | Prison population in 2020. | Numerical. | Population number. |

    Dataset name: students_cumulative_sum.csv

    Dataset period: 2018 - 2020

    Dataset Characteristics: Multivalued

    Number of Instances: 6

    Number of Attributes: 7

    Missing Values: No

    Source:

    • Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);

    • Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).

    Description: The data contained in the students_cumulative_sum.csv dataset (see Table 3) originate mainly from AVASUS (Brasil, 2022a). This dataset provides data on the number of students by region and year. It also includes a rate that normalizes the data so that the populations of each region and of Brazil can be compared. We used population data estimated by the IBGE (Brasil, 2022e) to calculate the rate.

    Table 3: Description of Students dataset Features.

  18. Oscar-Winning Directors Analysis

    • kaggle.com
    Updated Jan 21, 2023
    Cite
    The Devastator (2023). Oscar-Winning Directors Analysis [Dataset]. https://www.kaggle.com/datasets/thedevastator/oscar-winning-directors-analysis
    Dataset updated
    Jan 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Oscar-Winning Directors Analysis

    Examining Gender, Race and Director Trends from 1930-2019

    By Priyanka Dobhal [source]

    About this dataset

    This dataset collects information on the Academy Award for Best Director winners from 1930 to 2019 and provides insight into gender and racial disparity in filmmaking over time. It includes the winner's name, their respective award year, race, gender, the nominated/winning film title, and the filmmakers' names. With this data it is possible to identify trends in cinema, such as who dominates awards recognition, and to consider how much progress has been made toward equal opportunity within Hollywood. Examining Oscar-winning directors over time can show how the award reflects systemic issues in society and whether diversity among winners is increasing. To understand the award's significance, it helps to consider all the factors included, from the gender of the awarded directors to the kinds of films the award supports each year. The dataset covers almost nine decades of cinematic history, starting in 1930.

    How to use the dataset

    This dataset is the perfect resource for anyone looking to conduct an analysis of the Academy Award for Best Director winners from 1930 to 2019. It contains information on the year, gender, race, director(s), film and nomination/winner of each winner.

    By using this dataset, you can gain insight into trends among Oscar-winning directors over time. For example, you can compare the number of nominees between different years or examine differences in the representation of gender and race among directors who have won Oscars. You can also use the data as a starting point for exploring the films that received the Oscar for Best Director: which films were most successful from a narrative perspective, and which used unique filming techniques or visual designs? Finally, the dataset supports more targeted analyses, such as identifying patterns in how the winning films depict social issues, for example LGBTQ representation.

    To start exploring with this dataset:

    2) Open your favorite spreadsheet program ('Microsoft Excel', 'Libre Office', etc.)

    3) Load the CSV file with the 'File -> Open' command

    4) Review column headers and values contained within each row

    5) Start creating charts and graphs (pie charts, bar plots, etc.) that show trends over time according to your needs

    6) Take notes while analyzing datasets

    7) Publish your findings online if desired

    The possibilities are endless! If you’d like additional guidance or tips on how to use this dataset effectively, please subscribe to our newsletter at oscarwinningdirectorsanalysisgmail.com

    Research Ideas

    • Analyzing gender and racial disparity in the Academy award for Best Director across different years.
    • Investigating if the age of directors has an effect on what film they create and how successful it is at winning an Oscar for Best Director.
    • Crafting a recommendation system to recommend movies based on a director's previous Oscar-winning work or even pair users with film recommendations that have similar director/genre preference in order to discover new titles they may enjoy watching

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: Oscar Winners - Director.csv

    | Column name       | Description                                                   |
    |:------------------|:--------------------------------------------------------------|
    | Year              | The year in which the award was given. (Integer)              |
    | Gender            | The gender of the director. (String)                          |
    | Race              | The race of the director. (String)                            |
    | Director(s)       | The name of the director(s). (String)                         |
    | Film              | The title of the film that won the award. (String)            |
    | Nomination/Winner | Whether the director was nominated or won the award. (String) |
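
    As an alternative to a spreadsheet, the trend analysis suggested above can be sketched in a few lines of Python. The example below counts rows per decade and gender using only the standard library. The sample rows are made-up placeholders that merely mimic the documented column names, and the "Male"/"Female" value spellings are an assumption; the real file may encode gender differently.

```python
import csv
import io
from collections import Counter

# Made-up placeholder rows; only the header matches the documented columns.
SAMPLE_CSV = """Year,Gender,Race,Director(s),Film,Nomination/Winner
1931,Male,White,Director A,Film A,Winner
1938,Male,White,Director B,Film B,Winner
2010,Female,White,Director C,Film C,Winner
2013,Male,Asian,Director D,Film D,Winner
"""


def winners_by_decade_and_gender(csv_file):
    """Count rows for each (decade, gender) pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        decade = int(row["Year"]) // 10 * 10  # e.g. 1938 -> 1930
        counts[(decade, row["Gender"])] += 1
    return counts


counts = winners_by_decade_and_gender(io.StringIO(SAMPLE_CSV))
for (decade, gender), n in sorted(counts.items()):
    print(f"{decade}s  {gender}: {n}")
```

    To run this against the downloaded file, the `io.StringIO(...)` argument could be replaced with `open("Oscar Winners - Director.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded and comma-delimited.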

    Acknowledgements

    If you use this dataset in your research, please credit Priyanka Dobhal.

  19. Disambiguated researchers publication data

    • figshare.com
    Updated Jun 1, 2023
    Cite
    Amaral Lab (2023). Disambiguated researchers publication data [Dataset]. http://doi.org/10.6084/m9.figshare.1591864.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Amaral Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Companion dataset to "The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact" by Duch J, Zeng XHT, Sales-Pardo M, Radicchi F, Otis S, Woodruff TK, Amaral LAN (PLoS ONE 7, e51332, 2012), doi: 10.1371/journal.pone.0051332. This dataset lists the total number of publications by 4,394 faculty members from 7 distinct research fields working at top U.S. institutions. The dataset also contains bibliographic information manually gathered from the CVs of those faculty members. The publications data was collected from Thomson Reuters' Web of Science according to the procedures described in the published paper.

    The data is a single CSV file with the following fields:

    • author_name - researcher name as: Last name, Initials
    • gender - researcher gender as: M (male) or F (female)
    • univ_name - institution of current employment
    • field - scientific discipline
    • phd_year - year of PhD completion
    • nationality - country of origin
    • background - list of degrees
    • affiliations - list of honours and past appointments
    • total_pubs - total number of publications

    Some fields are not available for some researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.
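
    Because some fields are missing for some researchers, any aggregate over this file has to skip empty values. The sketch below averages total_pubs by gender under several assumptions: the file has a header row using the field names listed above, it is comma-delimited, and author names are quoted because they contain a comma. The sample rows are made-up placeholders, not records from the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows using a subset of the documented fields.
SAMPLE_CSV = """author_name,gender,field,total_pubs
"Doe, J",F,chemistry,40
"Roe, A",M,chemistry,60
"Poe, K",F,ecology,
"Moe, L",M,ecology,30
"""


def mean_pubs_by_gender(csv_file):
    """Average total_pubs per gender, ignoring rows where it is missing."""
    totals = defaultdict(lambda: [0, 0])  # gender -> [sum, count]
    for row in csv.DictReader(csv_file):
        value = (row.get("total_pubs") or "").strip()
        if not value:
            continue  # field not available for this researcher
        totals[row["gender"]][0] += int(value)
        totals[row["gender"]][1] += 1
    return {gender: s / n for gender, (s, n) in totals.items()}


means = mean_pubs_by_gender(io.StringIO(SAMPLE_CSV))
print(means)
```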

  20. Data from: A Greek Parliament Proceedings Dataset for Computational...

    • data.europa.eu
    Updated Aug 27, 2022
    Cite
    Zenodo (2022). A Greek Parliament Proceedings Dataset for Computational Linguistics and Political Analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7005201?locale=cs
    Available download formats: unknown (1427754875)
    Dataset updated
    Aug 27, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Greece
    Description

    The dataset is a new version of the previous upload and includes the following files:

    1. dataset_versions/tell_all.csv: The initial dataset of 1,280,927 extracted speeches, before preprocessing and cleaning. The speeches extend chronologically from July 1989 up to July 2020 and were exported from 5,355 parliamentary sitting record files. The file has a total volume of 2.5 GB and includes the following columns:

    • member_name: the name of the individual who spoke during a sitting.
    • sitting_date: the date the sitting took place.
    • parliamentary_period: the name and/or number of the parliamentary period that the speech took place in. A parliamentary period is defined as the time span between one general election and the next. A parliamentary period includes multiple parliamentary sessions.
    • parliamentary_session: the name and/or number of the parliamentary session that the speech took place in. A session is defined as a time span of usually 10 months within a parliamentary period during which the parliament can convene and function as stipulated by the constitution. A session can fall into the following categories: regular, extraordinary or special. In the intervals between the sessions the parliament is in recess. A parliamentary session includes multiple parliamentary sittings.
    • parliamentary_sitting: the name and/or number of the parliamentary sitting that the speech took place in. A sitting is defined as a meeting of parliament members.
    • political_party: the political party of the speaker.
    • government: the government in force when the speech took place.
    • member_region: the electoral district the speaker belonged to.
    • roles: information about the parliamentary roles and/or government position of the speaker.
    • member_gender: the gender of the speaker.
    • speech: the speech that the individual gave during the parliamentary sitting.

    2. dataset_versions/tell_all_FILLED.csv: This file is an intermediate version of the dataset that includes improvements in the consistency and completeness of the dataset, with a total volume of 2.5 GB. Specifically, this file is produced by filling the missing names of chairmen of various parliamentary sittings of the "tell_all.csv". It includes the same columns as the "tell_all.csv" file.

    3. dataset_versions/tell_all_cleaned.csv: This version of the dataset is the result of further cleaning and preprocessing and is used for our word usage change study. It consists of 1,280,918 speech fragments of Greek parliament members in the order of the conversation that took place, with a total volume of 2.12 GB. It includes the same columns as the aforementioned versions. The preprocessing includes the replacement of all references to political parties with the symbol "@" followed by an abbreviation of the party name, using regular expressions that capture different grammatical cases and variations. It also includes the removal of accents, strings with length less than 2 characters, all punctuation except full stops, and the replacement of stopwords with "@sw".

    4. wiki_data: A folder of modern Greek female and male names and surnames and their available grammatical cases crawled from the entries of the Wiktionary Greek names category (https://en.wiktionary.org/wiki/Category:Greek_names). We produced the grammatical cases of the missing grammatical entries according to the rules of the Greek grammar and saved the files in the same folder by adding to their filenames the string "_populated.json".

    5. parl_members_activity_1989onwards_with_gender.csv: The Greek Parliament website provides a list of all the elected members of parliament since the fall of the military junta in Greece, in 1974. We collected and cleaned the data, added the gender and kept the elected members from 1989 onwards, matching the available parliament proceeding records. This dataset includes the full names of the members, the date range of their service, the political party they served, the electoral district they belonged to and their gender.

    6. formatted_roles_gov_members_data.csv: As government members we refer to individuals in ministerial or other government posts, regardless of whether they were elected in the parliament. This information is available in the website of the Secretariat General for Legal and Parliamentary Affairs. The government members dataset includes the full names of the official individuals, the name of the role they were given, the date range of their service at each specific role and their gender.

    7. governments_1989onwards.csv: A dataset of government information including the names of governments since 1989, their start and end dates, and a URL that points to the respective official government web page of each past government. The data is crawled from the website of the Secretariat General for Legal and Parliamentary Affairs.

    8. extra_roles_manually_collected.csv: A dataset with manually collected information from Wikipedia about additional government or parliament posts such as Chairman of the Parliament,

Cite
The Devastator (2022). Gender by Name (Time-series) [Dataset]. https://www.kaggle.com/datasets/thedevastator/automated-gender-identification-using-name-proba/data

Gender by Name (Time-series)

Probability of given names being M/F based on US names from 1930-Present

Dataset updated
Dec 5, 2022
Dataset provided by
Kaggle
Authors
The Devastator
Description

Automated Gender Identification Using Name Probabilities

2019 US Social Security Administration Data

By Derek Howard [source]

About this dataset

This dataset provides a tool for generating gender-specific datasets from names alone. It contains the probability that a given name belongs to a certain gender, based on US Social Security records from the last century. This makes it easy to assign genders to datasets that do not natively include this information. All probability values were derived from records with 5 or more people associated with each name, so even less common names can still be assigned a gender. With this resource, users can generate gender-aware data quickly, making gender identification in datasets easier and more accurate.

How to use the dataset

This dataset provides a helpful resource when you need to accurately identify gender from names. With this dataset, you’ll be able to quickly and accurately assign genders to datasets that contain names but no other information about the person.

To get started, you will need a csv file with two columns: name and probability. The name column should contain the first names of the people in your dataset. The probability column should contain numbers between 0 and 1 indicating the likelihood that each name is associated with one specific gender (0 for male, 1 for female).

In addition to simply assigning genders from these probabilities alone, users of this dataset also have more control over their classifications - they can use it as either a baseline or as an absolute measure of accuracy depending on their exact needs/preferences. Experimentation is highly encouraged here!
Good luck!

Research Ideas

  • Create gender-specific applications - tailor different apps to different genders based on the probability of a particular name belonging to a certain gender.

  • Generate gender neutral names - use this data to generate random names with no gender bias.

  • Automate record lookup - quickly and accurately assign genders based on the probability associated with their name

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

Unknown License - Please check the dataset description for more information.

Columns

File: name_gender.csv

| Column name | Description                                                         |
|:------------|:--------------------------------------------------------------------|
| name        | The name of the person. (String)                                    |
| gender      | The gender of the person. (String)                                  |
| probability | The probability of the gender being assigned to the person. (Float) |
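
The lookup described in "How to use the dataset" can be sketched with the three columns listed above. This is a minimal illustration under stated assumptions, not the dataset author's method: it assumes one row per name, that the gender column holds codes such as "M" and "F", and that probability refers to the gender given in the same row. The sample rows, the function names, and the 0.9 threshold are all made up for the example.

```python
import csv
import io

# Made-up rows that mimic the documented columns; "M"/"F" codes are assumed.
SAMPLE_CSV = """name,gender,probability
Alex,M,0.62
Maria,F,0.99
John,M,0.98
"""


def load_lookup(csv_file):
    """Map each lower-cased name to its (gender, probability) pair."""
    return {
        row["name"].strip().lower(): (row["gender"], float(row["probability"]))
        for row in csv.DictReader(csv_file)
    }


def assign_gender(name, lookup, min_probability=0.9):
    """Return the gender label, or None if the name is unknown or too uncertain."""
    entry = lookup.get(name.strip().lower())
    if entry is None:
        return None
    gender, probability = entry
    return gender if probability >= min_probability else None


lookup = load_lookup(io.StringIO(SAMPLE_CSV))
labels = {name: assign_gender(name, lookup) for name in ["Maria", "alex", "Zed"]}
print(labels)
```

Returning None for low-probability names keeps ambiguous cases explicit rather than forcing a label; lowering `min_probability` trades coverage against accuracy.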

Acknowledgements

If you use this dataset in your research, please credit Derek Howard.
