By Amber Thomas [source]
This dataset contains all of the data used in the Pudding essay When Women Make Headlines, published in January 2022. It was created to analyze gendered language, bias, and language themes in news headlines from across the world. It contains headlines from the top 50 news publications and news agencies in four major countries (USA, UK, India, and South Africa), as listed by SimilarWeb (as of 2021-06-06).
To collect this data we used RapidAPI's Google News API to query headlines containing one or more keywords selected on the basis of existing research by Huimin Xu & team and The Swaddle team. We analyzed the words used in headlines by manually curating two dictionaries: gendered words about women (words that are explicitly gendered) and words that denote societal or behavioral stereotypes about women. To calculate bias scores, we used technology developed through Yasmeen Hitti & team’s research on gender bias text analysis. To categorize words into themes (violence/crime, empowerment, race/ethnicity/identity, etc.), we manually curated four dictionaries, using Natural Language Processing packages for Python such as spacy and nltk for our analysis. In addition, inverting polarity scores with the vaderSentiment algorithm helped us shed light on differences in polarity between women-centered and non-women-centered headlines, as well as differences between the global polarity baselines of each country's most visited publications and news agencies according to SimilarWeb 2020 statistics.
This dataset gives journalists, researchers, and educators studying gender equity in media outlets around the world further insight into potential disparities with just a few lines of code. Discoveries made with this data can provide valuable support for evidence-based argumentation. Let us advocate for greater awareness of female representation and better-quality coverage!
This dataset provides a comprehensive look at the portrayal of women in headlines from 2010-2020. Using this dataset, researchers and data scientists can explore a range of topics including language used to describe women, bias associated with different topics or publications, and temporal patterns in headlines about women over time.
To use this dataset effectively, it is helpful to understand the structure of the data. The columns include headline_no_site (the text of the headline without any information about which publication it is from), time (the date and time that the article was published), country (the country where it was published), bias score (calculated using Gender Bias Taxonomy V1.0) and year (the year that the article was published).
By exploring these columns individually or grouping them, for example by publication or by topic, there are many ways to make meaningful discoveries with this dataset. For example, one could explore whether certain news outlets use more gender-biased language than others when writing about female subjects, or investigate whether female-centric stories have higher or lower bias scores than average for a particular topic across multiple countries over time. This type of analysis helps researchers gain insight into how media coverage of women has evolved worldwide over recent years.
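As a minimal sketch of the grouping described above, the snippet below averages bias scores by country and year using only Python's standard library. The header `bias` for the bias score column is an assumption (the exact column name is not shown here), and the three sample rows are made up purely to keep the example runnable; in practice you would read `headlines_reduced_temporal.csv` instead.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows for illustration only. For real use, replace the
# StringIO below with:
#   open("headlines_reduced_temporal.csv", newline="", encoding="utf-8")
# The "bias" header is an assumption; check the real file's column names.
SAMPLE_CSV = """headline_no_site,time,country,bias,year
Example headline A,2015-03-01 10:00:00,India,0.25,2015
Example headline B,2015-07-12 08:30:00,India,0.75,2015
Example headline C,2016-01-20 14:45:00,UK,0.5,2016
"""


def mean_bias_by_country_year(rows, bias_column="bias"):
    """Return {(country, year): mean bias score} for the given rows."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["country"], int(row["year"]))
        groups[key].append(float(row[bias_column]))
    return {key: mean(values) for key, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = mean_bias_by_country_year(rows)
for (country, year), score in sorted(summary.items()):
    print(country, year, round(score, 3))
```

The same function can be pointed at a different score column through `bias_column`, which keeps the sketch usable if the real header turns out to differ.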
- A comparative, cross-country study of the usage of gendered language and the prevalence of gender bias in headlines to better understand regional differences.
- Creating an interactive visualization showing the evolution of headline bias scores over time with respect to a certain topic or population group (such as women).
- Analyzing how different themes, such as crime or violence versus empowerment or race and ethnicity, are covered in headlines featuring women compared with those that do not, to see whether the media portrays them differently.
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: headlines_reduced_temporal.csv | Column name | Description | |:---------------------|:-------------------------------------------------------------------------------------...
By Bhavna Chawla [source]
This dataset provides an in-depth look at crime against children throughout India. The data, collected from states and union territories across the country, tracks arrests made for a variety of crimes, including infanticide, murder of children, rape of children, kidnapping and abduction of children, foeticide, abetment of suicide, and exposure and abandonment. It also covers procuration of minor girls and the buying or selling of minors for prostitution, as well as arrests made under the Prohibition of Child Marriage Act (PCMA).
The dataset paints an unfortunately dark picture across India, with numbers rising each year, reflecting the suffering these minors have faced over time. With it we can better understand which states and union territories have the highest crime rates, and uncover patterns in which crime categories account for particularly high numbers of arrests. Examining the data more closely may point to solutions that could ultimately help protect children from harm and distress.
This dataset is suitable for researchers interested in learning more about crime against children as well as government planners who may want to analyze which states have higher rates of various types of crimes and identify strategies for managing them.
To use this dataset, start by examining the main columns: STATE/UT (the state or union territory name), CRIME HEAD (the type of crime committed), and the year columns 2001-2012. Then use visual comparisons to evaluate trends across the listed years: looking at totals or changes over time helps reveal how arrests vary among categories or within a particular year, and identifies areas with particularly high numbers that need more attention from policy makers. These visualizations can also be compared with statistics on population density or socio-economic characteristics, such as literacy rate or poverty levels, to characterize patterns for targeted interventions that could reduce crime against vulnerable communities.
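The totals and changes over time mentioned above could be computed along the following lines. This is only a sketch: the sample rows use placeholder state names and made-up numbers, and they include just three year columns for brevity, whereas the real file is described as covering 2001-2012. If the real file contains aggregate rows (for example per-state totals), those would need to be filtered out first to avoid double counting.

```python
import csv
import io
from collections import defaultdict

# Made-up placeholder rows for illustration only; the real file is
# "Crime head-wise persons arrested under crime against children
# during 2001-2012.csv" with year columns 2001 through 2012.
SAMPLE_CSV = """STATE/UT,CRIME HEAD,2001,2002,2003
State A,KIDNAPPING AND ABDUCTION OF CHILDREN,10,12,15
State A,EXPOSURE AND ABANDONMENT,1,0,2
State B,KIDNAPPING AND ABDUCTION OF CHILDREN,4,4,3
"""


def arrests_by_state_and_year(rows):
    """Sum arrests over all crime heads, per state and per year column."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for column, value in row.items():
            if column.isdigit():  # treat purely numeric headers as years
                totals[row["STATE/UT"]][int(column)] += int(value)
    return {state: dict(years) for state, years in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
totals = arrests_by_state_and_year(rows)
for state, years in sorted(totals.items()):
    first, last = min(years), max(years)
    print(state, years, "change:", years[last] - years[first])
```

The resulting per-state, per-year dictionary is a convenient input for a line chart or for joining with external indicators such as literacy rates.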
Additionally, you could combine this dataset with external sources or variables (for example, governance measures taken against certain categories of crime) to build predictive models that identify relationships between risk factors and higher rates of specific types of crime among certain age groups. Such approaches could contribute to evidence-informed public safety interventions, public health initiatives, and the strengthening of legal systems, targeting the districts where rates are highest so that people, especially women and girls, are protected from physical abuse and harassment. Left unchecked, these crimes threaten living conditions and livelihood opportunities and may eventually affect national development.
- This dataset could be used to identify the states with the highest crime rates against children, and explore any potential correlations between crime statistics and social or economic factors in those states.
- This dataset can also be used to analyze state-wise trends over time to assess whether government initiatives aimed at curbing crimes against children have been effective or not.
- The dataset can also help researchers examine which types of crime are most prevalent in each state/UT and come up with ways to reduce them through policy measures, public outreach programs, and similar efforts.
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: Crime head-wise persons arrested under crime against children during 2001-2012.csv | Column name | Description | |:---------------|:----------------------------------------------------------------| | STATE/UT | The state or union territory in India. (String) | | CRIME HEAD | The type of crime against chi...
As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.
Instagram’s Global Audience
As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.
As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.
Who is winning over the generations?
Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.
The National Family & Health Survey (NFHS) is a survey in India that collects information on health conditions, nutrition, family planning, domestic violence, and a host of other factors by surveying a random ("representative") sample of Indian households in all states. The fifth NFHS was conducted during 2019-21, and the reports were released to the public in 2021 and can be found at this link. The original data was released as PDFs; this Kaggle dataset was created by extracting the tabular data from the PDFs into JSON files.
Here's a non-comprehensive list of some indicators collected by this survey:
Major news outlets in India analysed the results of the study too. Here are some interesting articles that show what sorts of "stories" or insights you can look for in this data:
Note: I used a Python script to parse the data automatically. I tried my best to make sure the data was parsed correctly, but some data in the JSON files might not be 100% accurate. There is no way I could have manually verified all 704 PDF files and their outputs, so I randomly sampled and verified a couple of files, all of which looked okay. If you want to see the scripts used to parse these PDFs, please visit my GitHub repo.
Dataset cover photo by Naveed Ahmed on Unsplash.com
This statistic depicts the age distribution of India from 2013 to 2023. In 2023, about 25.06 percent of the Indian population fell into the 0-14 year category, 68.02 percent into the 15-64 age group and 6.92 percent were over 65 years of age.
Age distribution in India
India is one of the largest countries in the world and its population is constantly increasing. India’s society is categorized into a hierarchically organized caste system, encompassing certain rights and values for each caste. Indians are born into a caste, and those belonging to a lower echelon often face discrimination and hardship.
The median age (the age at which one half of the population is younger and the other half is older) of India’s population has been increasing constantly after a slump in the 1970s, and is expected to increase further over the next few years. However, in international comparison, it is fairly low; in other countries the average inhabitant is about 20 years older.
But India seems to be on the rise: not only is it a member of the BRIC states (an association of emerging economies, the other members being Brazil, Russia and China), but the life expectancy of Indians has also increased significantly over the past decade, which is an indicator of access to better health care and nutrition.
Gender equality is still non-existent in India, even though most Indians believe that the quality of life is about equal for men and women in their country. India is patriarchal and women still often face forced marriages, domestic violence, dowry killings or rape. As of late, India has come to be considered one of the least safe places for women worldwide. Additionally, infanticide and selective abortion of female fetuses contribute to the inequality of women in India. It is believed that this has led to the fact that the vast majority of Indian children aged 0 to 6 years are male.
https://spdx.org/licenses/CC0-1.0.html
A number of studies have documented the evolution of female resistance to mate harm in response to altered intersexual conflict in populations. However, the life-history consequences of such evolution are still a subject of debate. In the present study, we subjected replicate populations of Drosophila melanogaster to different levels of sexual conflict (generated by altering the operational sex ratio) for over 45 generations. Our results suggest that females from populations experiencing a higher level of intersexual conflict evolved increased resistance to mate harm, in terms of both longevity and progeny production. Females from the populations with low conflict were significantly heavier at eclosion and were more susceptible to mate harm in terms of progeny production under continuous exposure to males. However, these females produced more progeny upon single mating and had significantly higher longevity in the absence of any male exposure, which is potential evidence of trade-offs between resistance-related traits and other life-history traits, such as fecundity and longevity. We also report tentative evidence suggesting an increased male cost of interacting with more resistant females.
Different countries have different health outcomes that are in part due to the way respective health systems perform. Regardless of the type of health system, individuals will have health and non-health expectations in terms of how the institution responds to their needs. In many countries, however, health systems do not perform effectively and this is in part due to lack of information on health system performance, and on the different service providers.
The aim of the WHO World Health Survey is to provide empirical data to the national health information systems so that there is a better monitoring of health of the people, responsiveness of health systems and measurement of health-related parameters.
The overall aim of the survey is to examine the way populations report their health, understand how people value health states, measure the performance of health systems in relation to responsiveness, and gather information on modes and extents of payment for health encounters through a nationally representative, population-based community survey. In addition, specific additional modules address areas such as health care expenditures, adult mortality, birth history, various risk factors, assessment of main chronic health conditions and the coverage of health interventions.
The objectives of the survey programme are to:
1. develop a means of providing valid, reliable and comparable information, at low cost, to supplement the information provided by routine health information systems.
2. build the evidence base necessary for policy-makers to monitor if health systems are achieving the desired goals, and to assess if additional investment in health is achieving the desired outcomes.
3. provide policy-makers with the evidence they need to adjust their policies, strategies and programmes as necessary.
The survey sampling frame must cover 100% of the country's eligible population, meaning that the entire national territory must be included. This does not mean that every province or territory need be represented in the survey sample but, rather, that all must have a chance (known probability) of being included in the survey sample.
There may be exceptional circumstances that preclude 100% national coverage. Certain areas in certain countries may be impossible to include for reasons such as accessibility or conflict. All such exceptions must be discussed with WHO sampling experts. If any region must be excluded, it must constitute a coherent area, such as a particular province or region. For example, if ¾ of region D in country X is not accessible due to war, the entire region D will be excluded from analysis.
Households and individuals
The WHS will include all male and female adults (18 years of age and older) who are not out of the country during the survey period. It should be noted that this includes the population who may be institutionalized for health reasons at the time of the survey: all persons who would have fit the definition of household member at the time of their institutionalisation are included in the eligible population.
If the randomly selected individual is institutionalized short-term (e.g. a 3-day stay at a hospital) the interviewer must return to the household when the individual will have come back to interview him/her. If the randomly selected individual is institutionalized long term (e.g. has been in a nursing home the last 8 years), the interviewer must travel to that institution to interview him/her.
The target population includes any adult, male or female age 18 or over living in private households. Populations in group quarters, on military reservations, or in other non-household living arrangements will not be eligible for the study. People who are in an institution due to a health condition (such as a hospital, hospice, nursing home, home for the aged, etc.) at the time of the visit to the household are interviewed either in the institution or upon their return to their household if this is within a period of two weeks from the first visit to the household.
Sample survey data [ssd]
SAMPLING GUIDELINES FOR WHS
Surveys in the WHS program must employ a probability sampling design. This means that every single individual in the sampling frame has a known and non-zero chance of being selected into the survey sample. While a Single Stage Random Sample is ideal if feasible, it is recognized that most sites will carry out Multi-stage Cluster Sampling.
The WHS sampling frame should cover 100% of the eligible population in the surveyed country. This means that every eligible person in the country has a chance of being included in the survey sample. It also means that particular ethnic groups or geographical areas may not be excluded from the sampling frame.
The sample size of the WHS in each country is 5000 persons (exceptions considered on a by-country basis). An adequate number of persons must be drawn from the sampling frame to account for an estimated amount of non-response (refusal to participate, empty houses etc.). The highest estimate of potential non-response and empty households should be used to ensure that the desired sample size is reached at the end of the survey period. This is very important because if, at the end of data collection, the required sample size of 5000 has not been reached additional persons must be selected randomly into the survey sample from the sampling frame. This is both costly and technically complicated (if this situation is to occur, consult WHO sampling experts for assistance), and best avoided by proper planning before data collection begins.
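The planning step described above, drawing enough persons to still reach 5000 after non-response, can be illustrated with a small calculation. This is a sketch under a simple assumption that is not stated in the guidelines: the number to draw is the target divided by the expected response rate, rounded up. The 20% and 30% rates are hypothetical, not WHS figures.

```python
def persons_to_draw(target_sample, nonresponse_percent):
    """Persons to select so that roughly `target_sample` end up responding.

    Uses integer arithmetic to round up, which avoids floating-point
    surprises: ceil(target * 100 / (100 - nonresponse_percent)).
    """
    if not 0 <= nonresponse_percent < 100:
        raise ValueError("nonresponse_percent must be in [0, 100)")
    responding_percent = 100 - nonresponse_percent
    return (target_sample * 100 + responding_percent - 1) // responding_percent


# Hypothetical non-response estimates; use the highest plausible one.
for percent in (20, 30):
    print(percent, persons_to_draw(5000, percent))
```

Planning with the higher of the two estimates follows the guideline's advice to use the highest estimate of potential non-response, so that the target is reached without a second round of selection.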
All steps of sampling, including justification for stratification, cluster sizes, probabilities of selection, weights at each stage of selection, and the computer program used for randomization, must be communicated to WHO.
STRATIFICATION
Stratification is the process by which the population is divided into subgroups. Sampling will then be conducted separately in each subgroup. Strata or subgroups are chosen because evidence is available that they are related to the outcome (e.g. health, responsiveness, mortality, coverage etc.). The strata chosen will vary by country and reflect local conditions. Some examples of factors that can be stratified on are geography (e.g. North, Central, South), level of urbanization (e.g. urban, rural), socio-economic zones, provinces (especially if health administration is primarily under the jurisdiction of provincial authorities), or presence of health facility in area. Strata to be used must be identified by each country and the reasons for selection explicitly justified.
Stratification is strongly recommended at the first stage of sampling. Once the strata have been chosen and justified, all stages of selection will be conducted separately in each stratum. We recommend stratifying on 3-5 factors. It is optimum to have half as many strata (note the difference between stratifying variables, which may be such variables as gender, socio-economic status, province/region etc. and strata, which are the combination of variable categories, for example Male, High socio-economic status, Xingtao Province would be a stratum).
Strata should be as homogenous as possible within and as heterogeneous as possible between. This means that strata should be formulated in such a way that individuals belonging to a stratum should be as similar to each other with respect to key variables as possible and as different as possible from individuals belonging to a different stratum. This maximises the efficiency of stratification in reducing sampling variance.
MULTI-STAGE CLUSTER SELECTION
A cluster is a naturally occurring unit or grouping within the population (e.g. enumeration areas, cities, universities, provinces, hospitals etc.); it is a unit for which the administrative level has clear, nonoverlapping boundaries. Cluster sampling is useful because it avoids having to compile exhaustive lists of every single person in the population. Clusters should be as heterogeneous as possible within and as homogenous as possible between (note that this is the opposite of the criterion for strata). Clusters should be as small as possible (i.e. large administrative units such as Provinces or States are not good clusters) but not so small as to be homogenous.
In cluster sampling, a number of clusters are randomly selected from a list of clusters. Then, either all members of the chosen cluster or a random selection from among them are included in the sample. Multistage sampling is an extension of cluster sampling where a hierarchy of clusters are chosen going from larger to smaller.
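The two-step selection just described can be sketched as follows. This is an illustrative example rather than the WHS procedure: it uses simple random selection at both stages, and the frame of six enumeration areas with twenty households each is hypothetical. Real designs often select clusters with probability proportional to size, which this sketch does not do.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Pick clusters at random, then pick households at random within each.

    `clusters` maps a cluster name to the list of households it contains.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters (sorted for reproducibility).
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        households = clusters[name]
        # Stage 2: simple random sample of households within the cluster.
        k = min(n_per_cluster, len(households))
        sample[name] = rng.sample(households, k)
    return sample


# Hypothetical frame: 6 enumeration areas ("EA") with 20 households each.
frame = {f"EA{i}": [f"EA{i}-HH{j}" for j in range(20)] for i in range(1, 7)}
picked = two_stage_sample(frame, n_clusters=3, n_per_cluster=5, seed=42)
for area, households in picked.items():
    print(area, households)
```

Because only the selected clusters need a full household list, the sketch mirrors the practical advantage of cluster sampling: no exhaustive list of the whole population is required.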
In order to carry out multi-stage sampling, one needs to know only the population sizes of the sampling units. For the smallest sampling unit above the elementary unit however, a complete list of all elementary units (households) is needed; in order to be able to randomly select among all households in the TSU, a list of all those households is required. This information may be available from the most recent population census. If the last census was >3 years ago or the information furnished by it was of poor quality or unreliable, the survey staff will have the task of enumerating all households in the smallest randomly selected sampling unit. It is very important to budget for this step if it is necessary and ensure that all households are properly enumerated in order that a representative sample is obtained.
It is always best to have as many clusters in the PSU as possible. The reason for this is that the fewer the number of respondents in each PSU, the lower will be the clustering effect which