The data (name, year of birth, sex, and number) are from a 100 percent sample of Social Security card applications for 1880 onward.
The following dataset contains two folders with different data.
The dataset in the "Gender" folder provides the gender distribution of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. For the Spanish edition, data was collected from the "Artículos buenos" and "Artículos destacados" sections and is shown in aggregated form.
The dataset in the "Intersectionality" folder provides the distribution, by various sociodemographic attributes, of individuals featured on the front pages of the English and Spanish versions of Wikipedia from 2013 to 2023. It is structured into four CSV files. Three of them correspond to the English Wikipedia edition: the English 3C CSV, containing data from the sections "Did you know...", "In the news" and "On this day..."; a CSV dedicated to "English Featured Article"; and another dedicated to "English Featured Picture". The fourth CSV contains data from the Spanish edition of Wikipedia, extracted from the sections "Artículo Destacado" and "Artículo Bueno". Within each CSV, the data is presented in columns, each dedicated to a sociodemographic attribute.
Popular Baby Names by Sex and Ethnic Group
Data were collected through civil birth registration. Each record represents the ranking of a baby name in order of frequency. Data can be used to represent the popularity of a name. Caution should be used when assessing the rank of a baby name if the frequency count is close to 10; the ranking may vary from year to year.
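The caution about low counts can be made concrete with a small sketch. The records, the dense-ranking rule and the threshold of 15 are assumptions made for illustration; the source does not say how ties are ranked or how close to 10 is "close".

```python
from collections import defaultdict

# Hypothetical records: (year, sex, name, count). Values are made up.
RECORDS = [
    (2015, "F", "Olivia", 172),
    (2015, "F", "Emma", 160),
    (2015, "F", "Zoe", 11),
    (2015, "F", "Ada", 10),
]

LOW_COUNT_THRESHOLD = 15  # assumption: what counts as "close to 10"


def rank_names(rows, low_count=LOW_COUNT_THRESHOLD):
    """Rank names within each (year, sex) group by descending count.

    Names with equal counts share a rank (dense ranking, an assumption).
    Each result is flagged "unstable" when its count is at or below the
    low-count threshold, where a handful of births can reorder the ranks.
    """
    groups = defaultdict(list)
    for year, sex, name, count in rows:
        groups[(year, sex)].append((name, count))

    ranked = []
    for (year, sex), items in sorted(groups.items()):
        # Sort by count (largest first), then by name for a stable order.
        items.sort(key=lambda pair: (-pair[1], pair[0]))
        rank, previous = 0, None
        for name, count in items:
            if count != previous:
                rank += 1
                previous = count
            ranked.append({
                "year": year,
                "sex": sex,
                "name": name,
                "count": count,
                "rank": rank,
                "unstable": count <= low_count,
            })
    return ranked


for row in rank_names(RECORDS):
    print(row)
```

Flagging rather than dropping low-count names keeps the full ranking available while marking the part of it that should not be compared across years.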
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs.
It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0
Initial data
- swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
- names.tab - forenames and surnames per country with their frequency
- zones.acc.tab - countries/territories, timezones, population and world zones
- c_c.tab - ccTLD entities - world zones matches
Data preparation
- Use the swh-replica database to create commits.csv.zst and authors.csv.zst:
  sh> ./export.sh
- Create authors--clean.csv.zst:
  sh> ./cleanup.sh authors.csv.zst
- Create authors--plausible.csv.zst:
  sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
- Create author-fullnames-gender.csv.zst:
  sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
Database creation and data ingestion
- Create the PostgreSQL DB:
  sh> createdb gender-commit
  Notice that from now on, when a command is prefixed with the psql> prompt, we assume that psql is executed on the gender-commit database.
- Import data into the PostgreSQL DB:
  sh> ./import_data.sh
Zone detection
- Create commits.tab, which is used as input for the gender detection script:
  sh> psql -f extract_commits.sql gender-commit
- Create commit_zones.tab.zst:
  sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
  Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
- Load the result into the commit_culture table:
  psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
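For readers less familiar with the shell pipeline in the \copy step, the following is a minimal Python sketch of what `cut -f1,6 | grep -Ev '\s$'` does to each decompressed line (the doubled single quotes in the original command are only SQL quote escaping inside the program string). The function name and the sample rows are hypothetical, and the meaning of the individual columns is not stated in this document.

```python
import re

TRAILING_WHITESPACE = re.compile(r"\s$")


def cut_and_filter(lines, fields=(1, 6)):
    """Keep the given 1-based tab-separated fields, then drop rows ending in whitespace."""
    for line in lines:
        line = line.rstrip("\n")
        parts = line.split("\t")
        if len(parts) == 1:
            # Like cut without -s: a line with no delimiter passes through whole.
            selected = line
        else:
            selected = "\t".join(parts[i - 1] for i in fields if i <= len(parts))
        # Same effect as grep -Ev with the pattern \s$ : skip rows whose last
        # character is whitespace, which happens when the last kept field is empty.
        if TRAILING_WHITESPACE.search(selected):
            continue
        yield selected


# Hypothetical rows: six tab-separated columns, the second with an empty sixth field.
sample = [
    "c1\tx\tx\tx\tx\tzoneA\n",
    "c2\tx\tx\tx\tx\t\n",
]
for row in cut_and_filter(sample):
    print(row)
```

Only the first sample row survives, because the second one has nothing in its sixth field and therefore ends with a tab after the cut.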
Extraction and graphs
- Create commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...):
  sh> ./extract_data.sh
- Create commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf:
  sh> ./create_charts.sh
Additional graphs
This package also includes some already-made graphs:
- authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
- authors_zones_2.pdf: ditto with at least two commits per period
- authors_zones_10.pdf: ditto with at least ten commits per period
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the author gender dataset (as a comma-delimited .csv file) originally created in association with the paper entitled 'The Impact of Gender on Conference Authorship in Audio Engineering: Analysis Using a New Data Collection Method', but since extended to include conferences up to the end of 2019. The original dataset is available at: https://doi.org/10.5281/zenodo.1249693. Please cite both the paper and the relevant dataset if used. Visualisation is available at: http://tibbakoi.github.io/aesgender.
---
The dataset was produced using a novel method which used self-identified pronouns, therefore allowing for as many groups as necessary to describe the population.
---
The columns in the dataset are as follows:
NB: Some grouping of the data is required as online conference proceedings are not always consistent (Column 10). Some labelling of the data is required to determine which entries to include in certain types of analysis (Columns 11-13).
---
This dataset is distributed under the Creative Commons Attribution 4.0 license in the hope that it will prove useful, but with no warranty, including any implied warranty of merchantability or fitness for a particular purpose.
---
Dataset curated by: Kat Young and Michael Lovedee-Turner, formerly at the AudioLab, Dept. of Electronic Engineering, University of York.
Contact: kathryn.ae.young@gmail.com
This dataset contains records of research articles extracted from the Web of Science (WoS) from 1980 to 2019: in total, 15,642 journals, 28,241,100 articles and 111,980,858 authorships across 153 research areas.
The main dataset (author_address_article_gend_v2.parquet), in Parquet format, contains all the authorships, where an authorship is defined as the tuple article-author. There are 12 variables per authorship (row):
With the previous dataset, a resampler was applied to generate null homophily values for each year. There are 4 datasets in R Data Serialization (RDS) format:
All these datasets have the same structure:
Finally, some supplementary files used in the descriptive analysis and methods:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study reproduces the results of the article Relationship of gender differences in preferences to economic development and gender equality (DOI: 10.1126/science.aas9899) and partially its supplementary material.
The code for the analysis can be found at the following GitHub page: https://github.com/scerioli/Global-Preferences-Survey
The data used in Falk & Hermle (2018) is not fully available, for two reasons:
Data paywall: Part of the data is not available for free; access requires paying a fee to Gallup. This is the case for the additional data set used in the article, for instance the one that contains the education level and the household income quintile. Check the website of the briq - Institute on Behavior & Inequality for more information.
Data used in the study is not available online: This is the case for the LogGDP p/c calculated in 2005 US dollars, which is not directly available online. We decided to calculate the LogGDP p/c in 2010 US dollars because it was easily available; this should not change the main findings of the article.
This data is protected by copyright and cannot be given to third parties.
To download the GPS data set, go to the website of the Global Preferences Survey, section "downloads". There, choose the "Dataset" form; after filling it in, you can download the data set.
Hint: The organisation can also be "private".
The following two papers must also be cited in all publications that make use of or refer in any way to the GPS data set:
Falk, A., Becker, A., Dohmen, T., Enke, B., Huffman, D., & Sunde, U. (2018). Global evidence on economic preferences. Quarterly Journal of Economics, 133 (4), 1645–1692.
Falk, A., Becker, A., Dohmen, T. J., Huffman, D., & Sunde, U. (2016). The preference survey module: A validated instrument for measuring risk, time, and social preferences. IZA Discussion Paper No. 9674.
GDP per capita data for a range of years is available from the World Bank website. We took the GDP per capita (constant 2010 US$), averaged the data from 2003 to 2012 for all available countries, and matched the country names with those in the GPS data set.
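The averaging step can be sketched as follows. This is a minimal illustration, not the code of the study: the figures and the name mapping are made up, and taking the natural logarithm of the multi-year average (rather than averaging yearly logs) is an assumption.

```python
import math
from statistics import mean

# Hypothetical yearly GDP per capita values (constant 2010 US$); not real figures.
GDP_PER_CAPITA = {
    "Country A": {2003: 1000.0, 2004: 1100.0, 2012: 1500.0, 2013: 9999.0},
    "Country B, Rep.": {2003: None, 2004: 20000.0, 2005: 22000.0},
}

# Hypothetical mapping from World Bank country names to GPS country names.
NAME_FIXES = {"Country B, Rep.": "Country B"}


def average_log_gdp(series, start=2003, end=2012):
    """Return log(mean GDP per capita) over [start, end], skipping missing years."""
    values = [
        value
        for year, value in series.items()
        if start <= year <= end and value is not None
    ]
    if not values:
        return None  # no data available in the window
    return math.log(mean(values))


log_gdp = {
    NAME_FIXES.get(country, country): average_log_gdp(series)
    for country, series in GDP_PER_CAPITA.items()
}
print(log_gdp)
```

Skipping missing years instead of treating them as zero keeps countries with partial coverage in the sample without biasing their average downwards.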
The Gender Equality Index is composed of four main data sets.
Time since women’s suffrage: Taken from the Inter-Parliamentary Union website. We prepared the data as follows. For several countries, more than one date was provided (for example, the right to be elected and the right to vote). We use the last date when both the right to vote and the right to stand for election were granted, with no other restrictions noted. Some countries were colonies or part of a union of countries (for instance, Kazakhstan in the Soviet Union). For these countries, the rights to vote and to be elected may technically have been granted twice: once within the union and once as an independent state. In this case we kept the first date. South Africa was difficult to decide on because its history of racism is closely entangled with women's rights; we kept the latest date, when Black women could also vote. For Nigeria, given the distinction between North and South, we decided to keep only the North data because, again, it marked coverage of the whole country and it was the last date. Note: The USA data does not take into account that Black women could not vote until 1964 (in general, Black people could not vote until that year). We did not keep this date because it was not explicitly mentioned in the original data set. This is in contrast with the other choices made, but it is important to reproduce the results of the publication exactly, and the USA is often easy to spot on the plots.
UN Gender Inequality Index: Taken from the Human Development Report 2015. We kept only the table called "Gender Inequality Index".
WEF Global Gender Gap: WEF Global Gender Gap Index taken from the World Economic Forum Global Gender Gap Report 2015. For countries where data was missing, data was added from the World Economic Forum Global Gender Gap Report 2006. We modified some of the country names directly in the csv file, which is why we provide it as an input file.
Ratio of female and male labour force participation: Average International Labour Organization estimates from 2003 to 2012 taken from the World Bank database (http://data.worldbank.org/indicator/SL.TLF.CACT.FM.ZS). Values were inverted to create an index of equality. We took the average for the period between 2004 and 2013.
In our extended analysis, we also involved the following index:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset name: asppl_dataset_v2.csv
Version: 2.0
Dataset period: 06/07/2018 - 01/14/2022
Dataset Characteristics: Multivalued
Number of Instances: 8118
Number of Attributes: 9
Missing Values: Yes
Area(s): Health and education
Sources:
Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);
Brazilian Occupational Classification (CBO) (Brasil, 2022b);
National Registry of Health Establishments (CNES) (Brasil, 2022c);
Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).
Description: The data contained in the asppl_dataset_v2.csv dataset (see Table 1) originates from participants of the technology-based educational course “Health Care for People Deprived of Freedom.” The course is available on the AVASUS (Brasil, 2022a). This dataset provides elementary data for analyzing the course’s impact and reach and the profile of its participants. In addition, it updates the data presented in the work by Valentim et al. (2021).
Table 1: Description of AVASUS dataset features.
Attribute | Description | Datatype | Value |
---|---|---|---|
gender | Gender of the course participant. | Categorical. | Feminino / Masculino / Não Informado. (In English, Female, Male or Uninformed) |
course_progress | Percentage of completion of the course. | Numerical. | Range from 0 to 100. |
course_evaluation | A score given to the course by the participant. | Numerical. | 0, 1, 2, 3, 4, 5 or NaN. |
evaluation_commentary | Comment made by the participant about the course. | Categorical. | Free text or NaN. |
region | Brazilian region in which the participant resides. | Categorical. | Brazilian region according to IBGE: Norte, Nordeste, Centro-Oeste, Sudeste or Sul (In English North, Northeast, Midwest, Southeast or South). |
CNES | The CNES code refers to the health establishment where the participant works. | Numerical. | CNES Code or NaN. |
health_care_level | Identification of the health care network level for which the course participant works. | Categorical. | “ATENCAO PRIMARIA”, “MEDIA COMPLEXIDADE”, “ALTA COMPLEXIDADE”, and their possible combinations. |
year_enrollment | Year in which the course participant registered. | Numerical. | Year (YYYY). |
CBO | Participant occupation. | Categorical. | Text coded according to the Brazilian Classification of Occupations or “Indivíduo sem afiliação formal.” (In English “Individual without formal affiliation.”) |
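Since course_evaluation may be NaN, any summary of the scores has to skip missing values explicitly. The sketch below uses the column names from Table 1 on a few made-up rows; how missing values are actually encoded in asppl_dataset_v2.csv (the literal text NaN, an empty cell, or both) is an assumption.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical rows using the column names listed in Table 1.
SAMPLE = """gender,course_progress,course_evaluation
Feminino,100,5
Masculino,40,NaN
Feminino,80,4
Não Informado,0,
"""


def mean_evaluation_by_gender(handle):
    """Average course_evaluation per gender, ignoring NaN and empty cells."""
    totals = defaultdict(lambda: [0.0, 0])  # gender -> [sum of scores, count]
    for row in csv.DictReader(handle):
        raw = (row.get("course_evaluation") or "").strip()
        if not raw:
            continue  # empty cell: treated as missing (assumption)
        score = float(raw)  # float() also parses the literal text "NaN"
        if math.isnan(score):
            continue
        entry = totals[row["gender"]]
        entry[0] += score
        entry[1] += 1
    return {gender: total / count for gender, (total, count) in totals.items()}


print(mean_evaluation_by_gender(io.StringIO(SAMPLE)))
```

Groups with no valid score are left out of the result rather than reported as zero, so a missing evaluation is never mistaken for a low one.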
Dataset name: prison_syphilis_and_population_brazil.csv
Dataset period: 2017 - 2020
Dataset Characteristics: Multivalued
Number of Instances: 6
Number of Attributes: 13
Missing Values: No
Source:
National Penitentiary Department (DEPEN) (Brasil, 2022d);
Description: The data contained in the prison_syphilis_and_population_brazil.csv dataset (see Table 2) originate from the National Penitentiary Department Information System (SISDEPEN) (Brasil, 2022d). This dataset provides data on the population and prevalence of syphilis in the Brazilian prison system. In addition, it brings a rate that represents the normalized data for purposes of comparison between the populations of each region and Brazil.
Table 2: Description of DEPEN dataset Features.
Attribute | Description | Datatype | Value |
---|---|---|---|
Region | Brazilian region in which the participant resides. In addition, the sum of the regions, which refers to Brazil. | Categorical. | Brazil and Brazilian region according to IBGE: North, Northeast, Midwest, Southeast or South. |
syphilis_2017 | Number of syphilis cases in the prison system in 2017. | Numerical. | Number of syphilis cases. |
syphilis_rate_2017 | Normalized rate of syphilis cases in 2017. | Numerical. | Syphilis case rate. |
syphilis_2018 | Number of syphilis cases in the prison system in 2018. | Numerical. | Number of syphilis cases. |
syphilis_rate_2018 | Normalized rate of syphilis cases in 2018. | Numerical. | Syphilis case rate. |
syphilis_2019 | Number of syphilis cases in the prison system in 2019. | Numerical. | Number of syphilis cases. |
syphilis_rate_2019 | Normalized rate of syphilis cases in 2019. | Numerical. | Syphilis case rate. |
syphilis_2020 | Number of syphilis cases in the prison system in 2020. | Numerical. | Number of syphilis cases. |
syphilis_rate_2020 | Normalized rate of syphilis cases in 2020. | Numerical. | Syphilis case rate. |
pop_2017 | Prison population in 2017. | Numerical. | Population number. |
pop_2018 | Prison population in 2018. | Numerical. | Population number. |
pop_2019 | Prison population in 2019. | Numerical. | Population number. |
pop_2020 | Prison population in 2020. | Numerical. | Population number. |
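One plausible reading of the normalized rate is cases divided by prison population, scaled to a fixed base. The sketch below follows that reading; the scale of 1,000 and the sample figures are assumptions, since the description does not state how the published syphilis_rate columns were computed.

```python
YEARS = (2017, 2018, 2019, 2020)


def rate_per(cases, population, scale=1000):
    """Cases per `scale` people; the scale is an assumption, not the dataset's."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * scale / population


def add_rates(row, years=YEARS, scale=1000):
    """Return a copy of row with a computed_rate_<year> entry for each year present."""
    result = dict(row)
    for year in years:
        cases_key, pop_key = f"syphilis_{year}", f"pop_{year}"
        if cases_key in row and pop_key in row:
            result[f"computed_rate_{year}"] = rate_per(row[cases_key], row[pop_key], scale)
    return result


# Hypothetical figures for one region; not taken from the dataset.
example = {"Region": "North", "syphilis_2017": 50, "pop_2017": 10000}
print(add_rates(example))
```

Writing the result to new computed_rate columns, instead of overwriting syphilis_rate, makes it easy to compare this reading against the published values.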
Dataset name: students_cumulative_sum.csv
Dataset period: 2018 - 2020
Dataset Characteristics: Multivalued
Number of Instances: 6
Number of Attributes: 7
Missing Values: No
Source:
Virtual Learning Environment of the Brazilian Health System (AVASUS) (Brasil, 2022a);
Brazilian Institute of Geography and Statistics (IBGE) (Brasil, 2022e).
Description: The data contained in the students_cumulative_sum.csv dataset (see Table 3) originate mainly from AVASUS (Brasil, 2022a). This dataset provides data on the number of students by region and year. In addition, it brings a rate that represents the normalized data for purposes of comparison between the populations of each region and Brazil. We used population data estimated by the IBGE (Brasil, 2022e) to calculate the rate.
Table 3: Description of Students dataset Features.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Companion dataset to "The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact" by Duch J, Zeng XHT, Sales-Pardo M, Radicchi F, Otis S, Woodruff TK, Amaral LAN (PLoS ONE 7, e51332, 2012), doi: 10.1371/journal.pone.0051332. This dataset lists the total number of publications by 4,394 faculty members from 7 distinct research fields working at top U.S. institutions. The dataset also contains bibliographic information manually gathered from the CVs of those faculty members. The publication data was collected from Thomson Reuters' Web of Science according to the procedures described in the published paper.
The data is a single csv file with the following fields:
- author_name - researcher name as: Last name, Initials
- gender - researcher gender as: M (male) or F (female)
- univ_name - Institution of current employment
- field - scientific discipline
- phd_year - year of phd completion
- nationality - Country of origin
- background - List of degrees
- affiliations - List of honours and past appointments
- total_pubs - Total number of publications
Some fields are not available for some researchers. Current employments are accurate as of June 2010. The total_pubs field shows the total number of publications published by the end of 2010.
https://creativecommons.org/publicdomain/zero/1.0/
Worldwide Bureaucracy Indicators (WWBI) dataset from the World Bank.
The Worldwide Bureaucracy Indicators (WWBI) database is a unique cross-national dataset on public sector employment and wages that aims to fill an information gap, thereby helping researchers, development practitioners, and policymakers gain a better understanding of the personnel dimensions of state capability, the footprint of the public sector within the overall labor market, and the fiscal implications of the public sector wage bill. The dataset is derived from administrative data and household surveys, thereby complementing existing, expert perception-based approaches.
The World Bank introduced the dataset with a series of four blogs:
Can you replicate the figures in the blogs? Can you display any of the data more clearly than in the blogs?
wwbi_data.csv
variable | class | description |
---|---|---|
country_code | character | 3-letter ISO_3166-1 code |
indicator_code | character | code identifying the indicator of bureaucracy |
year | numeric | year of the data |
value | numeric | numeric value of the data |
wwbi_series.csv
variable | class | description |
---|---|---|
indicator_code | character | code identifying the indicator of bureaucracy |
indicator_name | character | name of the indicator |
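Because wwbi_data.csv stores only indicator codes, reading the values usually means looking up each code in wwbi_series.csv. The following is a minimal sketch of that lookup using the column names from the two tables; the country codes, indicator codes, names and values are made up for illustration.

```python
import csv
import io

# Hypothetical excerpts using the documented column names.
DATA_CSV = """country_code,indicator_code,year,value
AAA,IND.ONE,2010,0.25
AAA,IND.TWO,2010,1.5
BBB,IND.ONE,2011,0.40
"""

SERIES_CSV = """indicator_code,indicator_name
IND.ONE,Hypothetical indicator one
IND.TWO,Hypothetical indicator two
"""


def label_observations(data_handle, series_handle):
    """Attach indicator_name to each observation via indicator_code."""
    names = {
        row["indicator_code"]: row["indicator_name"]
        for row in csv.DictReader(series_handle)
    }
    labelled = []
    for row in csv.DictReader(data_handle):
        labelled.append({
            "country_code": row["country_code"],
            # Codes missing from the series file get None instead of raising.
            "indicator_name": names.get(row["indicator_code"]),
            "year": int(row["year"]),
            "value": float(row["value"]),
        })
    return labelled


for observation in label_observations(io.StringIO(DATA_CSV), io.StringIO(SERIES_CSV)):
    print(observation)
```

The same pattern would apply to country_code and wwbi_country.csv. The real files may encode missing values in ways this sketch does not handle.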
wwbi_country.csv
variable | class | description |
---|---|---|
country_code | character | 3-letter ISO_3166-1 code |
short_name | character | short or common name for the country |
table_name | character | more alphabetically sortable name of the country |
long_name | character | full name of the country |
x2_alpha_code | character | 2-letter ISO_3166-1 code |
currency_unit | character | currency unit |
special_notes | character | special notes |
region | character | region |
income_group | character | low, lower middle, upper middle, or high income |
wb_2_code | character | alternate 2-letter code |
national_accounts_base_year | integer | national accounts base year |
national_accounts_reference_year | integer | national accounts reference year |
sna_price_valuation | character | UN system of national accounts price valuation |
lending_category | character | International Development Association (IDA), International Bank of Reconstruction and Development (IBRD), a blend or neither |
other_groups | character | Heavily Indebted Poor Countries initiative (HIPC), or countries classified as the "Euro area" |
system_of_national_accounts | integer | which System of National Accounts methodology the country uses (1968, 1993, or 2008 version) |
balance_of_payments_manual_in_use | character | the version of the Balance of Payments Manual used by the country |
external_debt_reporting_status | character | estimate, preliminary, or actual |
system_of_trade | character | Under the general system imports include goods imported for domestic consumption and imports into bonded warehouses and free trade zones. Under the special system imports comprise goods imported for domestic consumption (including transformation and repair) and withdrawals for domestic consumption from bonded warehouses and free trade zones. Goods transported through a country en route to another are excluded. |
government_accounting_concept | character | government accounting concept |
imf_data_dissemination_standard | character | International Monetary Fund data-dissemination standard: Special Data Dissemination Standard (SDDS, 1996, created for countries that have or seek to have access to international markets), SDDS Plus (2012, the highest tier of data standards, intended for systemically important economies), enhanced GDDS (e-GDDS, 2015, encouraging participants to emphasize data publication) |
latest_household_survey | character | which household survey was most recently administered |
source_of_most_recent_income_and_expenditure_data | character | which survey serves as the basis for income and expenditure data |
vital_registration_complete | logical | whether the vital registration is complete |
latest_agricultural_census | integer | year of latest agricultural census |
latest_industrial_data | integer | year of latest industrial data |
latest_trade_data | in... |
Working groups are recognized as a highly effective method for synthesizing science. It is less clear if participating in working groups benefits individual researchers, or if benefits differ between men and women. This is a critical question, for the working group method is not sustainable if the benefit to science comes at a cost to academic careers or gender equity. Here, we analyze the publications of Canadian university faculty specialized in ecology and evolution (N=1244), a field that has embraced the working group method. Researchers were more likely to have participated in a working group as their academic age and prior H-index increased, but controlling for these factors there was no effect of gender. Using a longitudinal analysis, we find that researcher H-indices accrue 14% faster following their first working group publication, regardless of gender. Part of this acceleration may be the 3- to 5-fold higher citation rate of working group synthesis publications. In a survey (N...
We compiled information on 1,244 faculty members at Canadian universities who were funded by a NSERC Discovery grant (Evolution and Ecology subcommittee) between 1991 and 2019. This information included assumed binary gender from first names and institutional website use of pronouns and photographs (coded men, women); we acknowledge that we may have mis-assigned gender or failed to notice non-binary, transitional or fluid gender identities. We also collected information on the researcher’s year of PhD and all institutions they were affiliated with during their research career. This information was obtained from public curriculum vitae, institutional websites, personally-maintained researcher websites, academic networking platforms (LinkedIn, Research Gate), Google Scholar, and other public sources such as obituaries. For each researcher, we reconstructed their H-index through time using (1) a compiled list of their peer-reviewed publications and (2) the citations for each publication, f...
# Data from: "Working groups, gender and publication impact of Canada’s ecology and evolution faculty"
This readme file describes the (1) scripts and (2) datafiles included in this repository. Missing data in data files are indicated as NA. All data files are in .csv format, meaning that “,” is used as the separator.
This do file is written using Stata 18.0. This script analyses whether working group (WG) experience has a significant impact on researchers' H-index progression and whether this benefit or WG participation is gendered.
Input: researcher_database.csv Output: Table 1, Table 2, Figure 1(a) (b) (c), Figure 3(a) (b)
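The H-index reconstruction mentioned in the dataset description (a publication list plus the citations for each publication) can be sketched as follows. This is an illustrative Python version, not the Stata or R code of the repository; the record layout and the sample numbers are hypothetical.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for position, citations in enumerate(counts, start=1):
        if citations >= position:
            h = position
        else:
            break
    return h


def h_index_by_year(publications, years):
    """H-index at the end of each year, from per-year citation counts.

    Each publication is a dict with a publication "year" and a "citations"
    mapping of calendar year -> citations received in that year
    (a hypothetical layout chosen for this sketch).
    """
    result = {}
    for year in years:
        cumulative = [
            sum(n for cited_in, n in pub["citations"].items() if cited_in <= year)
            for pub in publications
            if pub["year"] <= year
        ]
        result[year] = h_index(cumulative)
    return result


# Hypothetical publication record for one researcher.
PUBLICATIONS = [
    {"year": 2000, "citations": {2000: 1, 2001: 2}},
    {"year": 2001, "citations": {2001: 2, 2002: 1}},
]
print(h_index_by_year(PUBLICATIONS, range(2000, 2003)))
```

Recomputing from cumulative citations for every year keeps the series monotonic and makes it straightforward to compare the rate of increase before and after a given event, such as a first working group publication.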
This script is written in the programming language R. The script analyses the effect of research type and research method on the citations of publications using generalized linear models. The script also plots this data.
Input: syn_sc_socio_public_...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PAAL ADL Accelerometry dataset (v2.0) has been acquired with a high-quality wearable multisensor device, the Empatica E4. In this dataset, among the signals collected by the sensors embedded in the Empatica E4, only the acceleration has been extracted to monitor the users performing different activities of daily living (ADLs). To promote the real-life acquisition procedure, subjects acted in their natural environment, with no instructions about how and for how long to perform each activity (other than a minimum time). The device was worn on the dominant hand.
The dataset includes 24 different ADLs performed using real objects. Each activity was repeated between 3 and 5 times (on average) by 52 healthy subjects, characterized by a gender balance (26 women and 26 men), and a large age range (between 18 and 77 years, mean = 44.08 years and standard deviation = 17.06 years).
The PAAL ADL Accelerometry dataset (v2.0) is composed of three files: