Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
The accurate measurement of educational attainment is of great importance for population research. Past studies measuring average years of schooling rely on strong assumptions to incorporate binned data. These assumptions, which we refer to as the standard duration method, have not been previously evaluated for bias or accuracy.
Methods
We assembled a database of 1,680 survey and census datasets, representing both binned and single-year education data. We developed two models that split bins of education into single-year values. We evaluate our models, and compare them to the standard duration method, using out-of-sample predictive validity.
Results
Our results indicate that typical methods used to split bins of educational attainment introduce substantial error and bias into estimates of average years of schooling, as compared to new approaches. Globally, the standard duration method underestimates average years of schooling, with a median error of -0.47 years. This effect is especially pronounced in datasets with a smaller number of bins or higher true average attainment, leading to irregular error patterns between geographies and time periods. Both models we developed resulted in unbiased predictions of average years of schooling, with smaller average error than previous methods. We find that one approach, which uses a metric of distance in space and time to identify training data, had the best performance, with a root mean squared error of mean attainment of 0.26 years, compared to 0.92 years for the standard duration algorithm.
Conclusions
Education is a key social indicator and its accurate estimation should be a population research priority. The use of a space-time distance bin-splitting model drastically improved the estimation of average years of schooling from binned education data. We provide a detailed description of how to use the method and recommend that future studies estimating educational attainment across time or geographies use a similar approach.
This study provides an update on measures of educational attainment for a broad cross section of countries. In our previous work (Barro and Lee, 1993), we constructed estimates of educational attainment by sex for persons aged 25 and over. The values applied to 129 countries at five-year intervals from 1960 to 1985.
The present study adds census information for 1985 and 1990 and updates the estimates of educational attainment to 1990. We also have been able to add a few countries, notably China, which were previously omitted because of missing data.
Dataset:
Educational attainment at various levels for the male and female population. The data set includes estimates of educational attainment for the population by age - over age 15 and over age 25 - for 126 countries in the world. (See Barro, Robert and J.W. Lee, "International Measures of Schooling Years and Schooling Quality," AER, Papers and Proceedings, 86(2), pp. 218-223, and also see "International Data on Education", manuscript.) Data are presented quinquennially for the years 1960-1990.
Educational quality across countries. Table 1 presents data on measures of schooling inputs at five-year intervals from 1960 to 1990. Table 2 contains the data on average test scores for the students of the different age groups for the various subjects. Please see Jong-Wha Lee and Robert J. Barro, "Schooling Quality in a Cross-Section of Countries," (NBER Working Paper No. w6198, September 1997) for a more detailed explanation and sources of data.
The data set covers the following countries: - Afghanistan - Albania - Algeria - Angola - Argentina - Australia - Austria - Bahamas, The - Bahrain - Bangladesh - Barbados - Belgium - Benin - Bolivia - Botswana - Brazil - Bulgaria - Burkina Faso - Burundi - Cameroon - Canada - Cape Verde - Central African Rep. - Chad - Chile - China - Colombia - Comoros - Congo - Costa Rica - Cote d'Ivoire - Cuba - Cyprus - Czechoslovakia - Denmark - Dominica - Dominican Rep. - Ecuador - Egypt - El Salvador - Ethiopia - Fiji - Finland - France - Gabon - Gambia - Germany, East - Germany, West - Ghana - Greece - Grenada - Guatemala - Guinea - Guinea-Bissau - Guyana - Haiti - Honduras - Hong Kong - Hungary - Iceland - India - Indonesia - Iran, I.R. of - Iraq - Ireland - Israel - Italy - Jamaica - Japan - Jordan - Kenya - Korea - Kuwait - Lesotho - Liberia - Luxembourg - Madagascar - Malawi - Malaysia - Mali - Malta - Mauritania - Mauritius - Mexico - Morocco - Mozambique - Myanmar (Burma) - Nepal - Netherlands - New Zealand - Nicaragua - Niger - Nigeria - Norway - Oman - Pakistan - Panama - Papua New Guinea - Paraguay - Peru - Philippines - Poland - Portugal - Romania - Rwanda - Saudi Arabia - Senegal - Seychelles - Sierra Leone - Singapore - Solomon Islands - Somalia - South Africa - Spain - Sri Lanka - St.Lucia - St.Vincent & Grens. - Sudan - Suriname - Swaziland - Sweden - Switzerland - Syria - Taiwan - Tanzania - Thailand - Togo - Tonga - Trinidad & Tobago - Tunisia - Turkey - U.S.S.R. - Uganda - United Arab Emirates - United Kingdom - United States - Uruguay - Vanuatu - Venezuela - Western Samoa - Yemen, N.Arab - Yugoslavia - Zaire - Zambia - Zimbabwe
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
CN: Average Number of Years of Education: New Labor Force data was reported at 14.000 Year in 2023. This stayed constant from the previous number of 14.000 Year for 2022. CN: Average Number of Years of Education: New Labor Force data is updated yearly, averaging 14.000 Year from Dec 2021 (Median) to 2023, with 3 observations. The data reached an all-time high of 14.000 Year in 2023 and a record low of 13.800 Year in 2021. CN: Average Number of Years of Education: New Labor Force data remains in active status in CEIC and is reported by the Ministry of Education. The data is categorized under China Premium Database’s Socio-Demographic – Table CN.GD: Average Number of Years of Education.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
This table contains 173 series, with data for years 1996 - 1996 (not all combinations necessarily have data for all years), and is no longer being released. This table contains data described by the following dimensions (Not all combinations are available): Geography (173 items: Canada; Newfoundland and Labrador; Health and Community Services St. John's Region, Newfoundland and Labrador; Health and Community Services Eastern Region, Newfoundland and Labrador; ...).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
CN: Average Number of Years of Education: Age 15 and Above data was reported at 9.910 Year in 2020. This records an increase from the previous number of 9.080 Year for 2010. CN: Average Number of Years of Education: Age 15 and Above data is updated yearly, averaging 9.080 Year from Dec 1982 (Median) to 2020, with 3 observations. The data reached an all-time high of 9.910 Year in 2020 and a record low of 5.300 Year in 1982. CN: Average Number of Years of Education: Age 15 and Above data remains in active status in CEIC and is reported by the Ministry of Education. The data is categorized under China Premium Database’s Socio-Demographic – Table CN.GD: Average Number of Years of Education.
https://creativecommons.org/publicdomain/zero/1.0/
You will find three datasets containing heights of the high school students.
All heights are in inches.
The data is simulated. The heights are generated from a normal distribution with different sets of mean and standard deviation for boys and girls.
| Height Statistics (inches) | Boys | Girls |
|---|---|---|
| Mean | 67 | 62 |
| Standard Deviation | 2.9 | 2.2 |
There are 500 measurements for each gender.
Here are the datasets:
hs_heights.csv: contains a single column with heights for all boys and girls. There's no way to tell which of the values are for boys and which ones are for girls.
hs_heights_pair.csv: has two columns. The first column has boys' heights. The second column contains girls' heights.
hs_heights_flag.csv: has two columns. The first column has the flag is_girl. The second column contains a girl's height if the flag is 1. Otherwise, it contains a boy's height.
To see how I generated this dataset, check this out: https://github.com/ysk125103/datascience101/tree/main/datasets/high_school_heights
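Independently of the author's script linked above, the three file layouts can be illustrated with a minimal NumPy sketch. The means and standard deviations come from the table above; the seed, the variable names, and the shuffling of the combined column are assumptions made for illustration, not details taken from the original generator.

```python
import numpy as np

# Parameters from the description (inches); the seed is an arbitrary choice.
rng = np.random.default_rng(seed=0)
N = 500  # measurements per gender

boys = rng.normal(loc=67.0, scale=2.9, size=N)
girls = rng.normal(loc=62.0, scale=2.2, size=N)

# Layout of hs_heights.csv: one unlabeled column with all 1,000 heights.
# Shuffling is an assumption; the description only says labels are absent.
combined = rng.permutation(np.concatenate([boys, girls]))

# Layout of hs_heights_pair.csv: column 0 = boys, column 1 = girls.
paired = np.column_stack([boys, girls])

# Layout of hs_heights_flag.csv: column 0 = is_girl flag, column 1 = height.
is_girl = np.concatenate([np.zeros(N), np.ones(N)])
flagged = np.column_stack([is_girl, np.concatenate([boys, girls])])

print(combined.shape, paired.shape, flagged.shape)
```

The combined layout is the interesting one for exercises such as fitting a two-component mixture, because the gender labels have to be inferred from the heights alone.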
Image by Gillian Callison from Pixabay
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The model estimated in this document uses a set of variables that are available for a wide range of countries with different levels of development, resulting in a sample of 91 countries for the period 1970-2010. The file titled “Database PLS-PM” contains the data with which it is possible to estimate the human capital index (ich) calculated in the paper. The variables used and their notation are as follows:
FR = Fertility Rates
VAAS = value-added contributed by the agricultural sector to GDP
GNI = Gross National Income per capita
LE = Life Expectancy
MR = Mortality rate for children under five years
AYE = Average Years of Education
SPR = Student-Professor Ratio
EC = Energy Consumption per capita
PP = patent applications by residents per capita
Because the database is not complete for all countries or for all years, the missing data were completed through interpolation. All variables were transformed by means of logarithms, except GNI. In the case of EC and PP, the block of returns on human capital, the manifest variables are transformed such that they may be retrieved in levels at a later stage.
2. Data to estimate the economic growth regressions
Cross-section: The file titled “Database – Cross-Section” contains the data with which it is possible to estimate the results shown in tables 1-5 of the manuscript.
The variables used and their notation are the following:
grow = GDP per capita, rate of change
log(gdp75) = lag of GDP in 1975, logarithm
demo = a binary variable measuring the level of democracy in the countries
contes = indicators by principal component analysis to approximate the degree of contestation
inclu = indicators by principal component analysis to approximate the degree of inclusiveness
lnihc = human capital index estimated through PLS-PM, logarithm
lnaye = average years of education developed by Barro and Lee (2013), logarithm
lninves = investment in physical capital, measured as the average share of real investment in GDP, logarithm
lngov = average government consumption as a percentage of GDP, logarithm
lninfla = inflation measured by consumer prices, logarithm
lnpop = population growth rate, logarithm
lnich70, lnich75, lnape70, lnape75, lninves70, lninves75, lnpop70, lnpop75 = lags of lnich, lnaye, lninves and lnpop
dafri = dummy for African countries
Panel data: The file titled “Database – Panel data” contains the data with which it is possible to estimate the results shown in tables 6-9 of the manuscript. All variables are averages for the underlying period.
The variables used and their notation are the following:
grow = GDP per capita, rate of change
lngdp75 = initial GDP in 1975, logarithm
demo = a binary variable measuring the level of democracy in the countries
contes = indicators by principal component analysis to approximate the degree of contestation
inclu = indicators by principal component analysis to approximate the degree of inclusiveness
lnihc = human capital index estimated through PLS-PM, logarithm
lnaye = average years of education developed by Barro and Lee (2013), logarithm
lninves = investment in physical capital, measured as the average share of real investment in GDP, logarithm
lngov = average government consumption as a percentage of GDP, logarithm
lninfla = inflation measured by consumer prices, logarithm
lnpop = population growth rate, logarithm
dafri = dummy for African countries
This layer is a part of Esri GeoInquiries at http://www.esri.com/geoinquiries
The HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The HDI can also be used to question national policy choices, asking how two countries with the same level of GNI per capita can end up with different human development outcomes. These contrasts can stimulate debate about government policy priorities.
The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions. The health dimension is assessed by life expectancy at birth. The education dimension is measured by mean years of schooling for adults aged 25 years and older and expected years of schooling for children of school-entering age. The standard of living dimension is measured by gross national income per capita. The HDI uses the logarithm of income to reflect the diminishing importance of income with increasing GNI. The scores for the three HDI dimension indices are then aggregated into a composite index using the geometric mean. Refer to Technical notes for more details. [source, 2020]
This dataset includes the fields:
HDI_Rank_2019
HDI_2019
Life_expectancy_at_birth_inYear
Expected_years_of_schooling
Mean_years_of_schooling_2019
GNI_per_capita_2019
Data sources:
UN Development Program http://hdr.undp.org/en/content/2019-human-development-index-ranking
Historic HDI data source: http://hdr.undp.org/en/data#
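The aggregation described above can be sketched in a few lines of Python. The description does not state the goalposts used to normalize each indicator or how the two schooling indicators are combined, so the minimum and maximum values below and the arithmetic mean of the two education sub-indices are assumptions that should be checked against the UNDP technical notes; the function names are hypothetical.

```python
import math

# ASSUMED goalposts (minimum, maximum) for normalization; verify against the
# UNDP technical notes before relying on them.
LIFE_EXPECTANCY = (20.0, 85.0)      # years
EXPECTED_SCHOOLING = (0.0, 18.0)    # years
MEAN_SCHOOLING = (0.0, 15.0)        # years
GNI_PER_CAPITA = (100.0, 75000.0)   # PPP dollars


def dimension_index(value: float, minimum: float, maximum: float) -> float:
    """Rescale a raw indicator to the 0-1 range between its goalposts."""
    return (value - minimum) / (maximum - minimum)


def hdi(life_expectancy: float, expected_schooling: float,
        mean_schooling: float, gni_per_capita: float) -> float:
    """Geometric mean of the health, education, and income indices."""
    health = dimension_index(life_expectancy, *LIFE_EXPECTANCY)
    # Assumption: education is the arithmetic mean of the two sub-indices.
    education = 0.5 * (
        dimension_index(expected_schooling, *EXPECTED_SCHOOLING)
        + dimension_index(mean_schooling, *MEAN_SCHOOLING)
    )
    # Income enters in logarithms to reflect its diminishing importance.
    income = dimension_index(
        math.log(gni_per_capita),
        math.log(GNI_PER_CAPITA[0]),
        math.log(GNI_PER_CAPITA[1]),
    )
    return (health * education * income) ** (1.0 / 3.0)


# Hypothetical inputs, not values taken from the dataset.
example = hdi(life_expectancy=72.0, expected_schooling=12.5,
              mean_schooling=8.0, gni_per_capita=12000.0)
print(round(example, 3))
```

Because the geometric mean multiplies the three indices, a low score in one dimension cannot be fully offset by a high score in another, which is the property the description alludes to.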
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
A dataset of school apparent retention rates (ARR) for all school sectors in Victoria, for census years 2012 to 2023.
This dataset is prepared and based on data collected from schools as part of the February School Census conducted on the last school day of February each year. It presents information for all government and non-government schools and student enrolments in Victoria, in particular secondary school years. The majority of the statistical data in this publication is drawn from school administration systems. The dataset includes analysis by school sector and sex, Koorie status, as well as on government schools by region.
Apparent retention rates (ARR) are calculated based on aggregate enrolment data and provide an indicative measurement of student engagement in secondary education. The Department of Education and Training (DET) computes and publishes ARR data at a state-wide and DET region level only.
The term "apparent" retention rate reflects that retention rates are influenced by factors not taken into account by this measure, such as: students repeating year levels; interstate and overseas migration; transfer of students between education sectors or schools; and students who have left school previously returning to continue their school education.
The ARR for Year 7 to 12 (ARR 7-12) refers to the Year 12 enrolment expressed as a proportion of the Year 7 enrolment five years earlier. The ARR for Year 10 to 12 (ARR 10-12) refers to the Year 12 enrolment expressed as a proportion of the Year 10 enrolment two years earlier.
Please note that the ABS calculates apparent retention using the number of full-time school students only, whereas the DET uses the number of full-time equivalent school enrolments. Data reported in the ABS Schools, Australia collection is based on enrolment data collected in August by all jurisdictions.
The Department has found that computing ARR at geographical areas smaller than DET regions (e.g. LGA, Postcode) can produce erratic and misleading results that are difficult to interpret or make use of. In small populations, relatively small changes in student numbers can create large movements in apparent retention rates. These populations might include smaller jurisdictions, Aboriginal and Torres Strait Islander students, and subcategories of the non-government affiliation. There are a number of reasons why apparent rates may generate results that differ from actual rates.
Apparent retention rates provide an indicative measure of the number of full-time school students who have stayed in school, as at a designated year and grade of education. The rate is expressed as a percentage of the respective cohort group that those students would be expected to have come from, assuming an expected rate of progression of one grade per year.
The ARR provided is the result of a calculation over the whole census and is NOT to be re-calculated by average or sum.
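The ARR definition and the warning against averaging can be illustrated with a short Python sketch. The enrolment counts and region names below are hypothetical, and the function name is an assumption for illustration; this is not the Department's implementation.

```python
def apparent_retention_rate(year12_enrolment: float, base_enrolment: float) -> float:
    """ARR as a percentage: Year 12 enrolment over the base-year enrolment."""
    if base_enrolment <= 0:
        raise ValueError("base enrolment must be positive")
    return 100.0 * year12_enrolment / base_enrolment


# Hypothetical full-time equivalent enrolments (not real data):
# region -> (Year 12 enrolment in 2023, Year 7 enrolment in 2018)
regions = {
    "Region A": (900.0, 1000.0),
    "Region B": (10.0, 20.0),
}

# ARR 7-12 for each region separately.
region_arr = {
    name: apparent_retention_rate(year12, year7)
    for name, (year12, year7) in regions.items()
}

# ARR for the combined population, computed from pooled enrolments.
pooled_arr = apparent_retention_rate(
    sum(year12 for year12, _ in regions.values()),
    sum(year7 for _, year7 in regions.values()),
)

# Averaging the regional rates ignores how different the cohort sizes are.
naive_average = sum(region_arr.values()) / len(region_arr)

print(region_arr, round(pooled_arr, 2), naive_average)
```

In this made-up case the small region pulls the simple average far below the pooled rate, which shows why published rates should not be averaged or summed, and why rates for small populations move sharply when only a few students change.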
In 2022, about 37.7 percent of the U.S. population who were aged 25 and above had graduated from college or another higher education institution, a slight decline from 37.9 percent the previous year. However, this is a significant increase from 1960, when only 7.7 percent of the U.S. population had graduated from college.
Demographics
Educational attainment varies by gender, location, race, and age throughout the United States. Asian Americans and Pacific Islanders had the highest level of education, on average, while Massachusetts and the District of Columbia are home to the highest rates of residents with a bachelor’s degree or higher. However, education levels are correlated with wealth. While public education is free up until the 12th grade, the cost of university is out of reach for many Americans, making social mobility increasingly difficult.
Earnings
White Americans with a professional degree earned the most money on average, compared to other educational levels and races. However, regardless of educational attainment, males typically earned far more on average compared to females. Despite the decreasing wage gap over the years in the country, it remains an issue to this day. Not only is there a large wage gap between males and females, but there is also a large income gap linked to race.
https://data.gov.tw/license
(1) The Human Development Index (HDI) is compiled by the United Nations Development Programme (UNDP) to measure a country's comprehensive development in the areas of health, education, and economy according to the UNDP's calculation formula.
(2) Explanation:
  (1) The HDI value ranges from 0 to 1, with higher values being better.
  (2) Due to our country's non-membership in the United Nations and its special international situation, the index is calculated by our department according to the UNDP formula using our country's data. The calculation of the comprehensive index for each year is mainly based on the data of various indicators adopted by the UNDP.
  (3) In order to have the same baseline for international comparison, the comprehensive index and rankings are not retroactively adjusted after being published.
(3) Notes:
  (1) The old indicators included life expectancy at birth, adult literacy rate, gross enrollment ratio, and average annual income per person calculated by purchasing power parity.
  (2) The indicators were updated to include life expectancy at birth, mean years of schooling, expected years of schooling, and nominal gross national income (GNI) calculated by purchasing power parity. Starting in 2011, the GNI per capita was adjusted from nominal value to real value to exclude the impact of price changes. Additionally, the HDI calculation method has changed from arithmetic mean to geometric mean.
  (3) The calculation method for indicators in the education domain changed from geometric mean to simple average due to retrospective adjustments in the 2014 Human Development Report for the years 2005, 2008, and 2010-2012. Since 2016, the education domain has adopted data compiled by the Ministry of Education according to definitions from the United Nations Educational, Scientific and Cultural Organization (UNESCO) and the Organization for Economic Co-operation and Development (OECD).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Description
This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.
Key Features
Country: Name of the country.
Density (P/Km2): Population density measured in persons per square kilometer.
Abbreviation: Abbreviation or code representing the country.
Agricultural Land (%): Percentage of land area used for agricultural purposes.
Land Area (Km2): Total land area of the country in square kilometers.
Armed Forces Size: Size of the armed forces in the country.
Birth Rate: Number of births per 1,000 population per year.
Calling Code: International calling code for the country.
Capital/Major City: Name of the capital or major city.
CO2 Emissions: Carbon dioxide emissions in tons.
CPI: Consumer Price Index, a measure of inflation and purchasing power.
CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
Currency_Code: Currency code used in the country.
Fertility Rate: Average number of children born to a woman during her lifetime.
Forested Area (%): Percentage of land area covered by forests.
Gasoline_Price: Price of gasoline per liter in local currency.
GDP: Gross Domestic Product, the total value of goods and services produced in the country.
Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
Largest City: Name of the country's largest city.
Life Expectancy: Average number of years a newborn is expected to live.
Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
Minimum Wage: Minimum wage level in local currency.
Official Language: Official language(s) spoken in the country.
Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
Physicians per Thousand: Number of physicians per thousand people.
Population: Total population of the country.
Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
Tax Revenue (%): Tax revenue as a percentage of GDP.
Total Tax Rate: Overall tax burden as a percentage of commercial profits.
Unemployment Rate: Percentage of the labor force that is unemployed.
Urban Population: Percentage of the population living in urban areas.
Latitude: Latitude coordinate of the country's location.
Longitude: Longitude coordinate of the country's location.
Potential Use Cases
Analyze population density and land area to study spatial distribution patterns.
Investigate the relationship between agricultural land and food security.
Examine carbon dioxide emissions and their impact on climate change.
Explore correlations between economic indicators such as GDP and various socio-economic factors.
Investigate educational enrollment rates and their implications for human capital development.
Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
Study labor market dynamics through indicators such as labor force participation and unemployment rates.
Investigate the role of taxation and its impact on economic development.
Explore urbanization trends and their social and environmental consequences.
MIT License https://opensource.org/licenses/MIT
A simple time-series table for school probability and statistics. The goal is to learn how to investigate data: a value over time. The statistics we compute:
- Mean (average): the sum of all values divided by the number of values.
- Median: the middle number when the values are in order.
- Mode: the most common number.
- Range: the largest number minus the smallest number.
- Standard deviation: a measure of how dispersed the data is in relation to the mean.
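The statistics listed above can be computed with Python's standard `statistics` module. The values below are hypothetical stand-ins for the table's value column, and the sketch uses the sample standard deviation (`statistics.stdev`); `statistics.pstdev` would give the population version instead.

```python
import statistics

# Hypothetical observations ordered by time (not taken from the dataset).
values = [3, 5, 5, 8, 9]

mean = statistics.mean(values)        # sum of values / number of values
median = statistics.median(values)    # middle value once sorted
mode = statistics.mode(values)        # most common value
value_range = max(values) - min(values)
stdev = statistics.stdev(values)      # sample standard deviation (n - 1)

print(mean, median, mode, value_range, round(stdev, 3))
```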
https://dataful.in/terms-and-conditions
The Comprehensive Annual Modular Survey (CAMS) was conducted by the National Sample Survey Office (NSSO) during July, 2022 - June, 2023 as a part of NSS 79th round.
This dataset contains State, Age-Group and Gender-wise data on the topics below.
1) Persons able to read and write short simple statements in their everyday life with understanding
2) Persons able to read and write short simple statements in their everyday life with understanding and also able to perform simple arithmetic calculations
3) Mean years of schooling in formal education
4) Persons of age 6 to 10 years currently enrolled in primary education (Class I to Class V)
5) Persons with some secondary education
6) Distribution of persons of age 6 to 18 years who never enrolled in formal education by major reasons at the time of survey
7) Persons graduated in science and technology among all graduates
8) Youth in formal and non-formal education and training in the previous 12 months
9) Youth not in education, employment, or training
10) Average medical expenditure (Rs.) per household and per person on hospitalised treatment (including institutional delivery) during last 365 days and on non-hospitalised treatment during the last 30 days
11) Average out-of-pocket medical expenditure (OOPME) per household and per person for treatment on hospitalisation (including institutional delivery) during last 365 days and non-hospitalisation during last 30 days
12) Persons who have an account individually or jointly in any bank/other financial institution/mobile money service provider
13) Number of borrowers per 1,00,000 persons
14) Persons able to use mobile (including smartphone)
15) Persons who used mobile telephone during the last 3 months preceding the date of survey
16) Persons able to use internet
17) Persons who used internet during the last 3 months preceding the date of survey
18) Population covered by 4G or above mobile technology
19) Persons who send messages (e.g., e-mail, messaging service, SMS) with attached files (e.g., documents, pictures, video) using mobile or computer-like devices during last three months preceding the date of survey
20) Persons who used copy and paste tools to duplicate or move data, information, documents, etc. using mobile or computer-like devices during last three months preceding the date of survey
21) Persons who can search internet for information and Persons who can send or receive emails and Persons who can perform online banking transactions
22) Households possessing different assets as on date of survey
23) Urban population having convenient access to transport facilities and percentage of rural population with all-weather roads within a distance of 2 km from the place of living
24) Persons of age less than 5 years who have ever been registered with civil authority for the birth certificate
25) Households using clean fuel for cooking
26) Households having access to improved principal sources of drinking water and percentage of households having access to improved latrine (among households with access to latrine) for each State/UT
27) Number of First-Stage Units (FSUs), households and persons surveyed
28) Estimated number of households and persons
Local Law 102 enacted in 2015 requires the Department of Education of the New York City School District to submit to the Council an annual report concerning physical education for the prior school year. This report provides information about average frequency and average total minutes per week of physical education as defined in Local Law 102 as reported through the 2017-18 STARS database. It is important to note that schools self-report their scheduling information in STARS. The report also includes information regarding the number and ratio of certified physical education instructors and designated physical education instructional space.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
A dataset of average primary school class sizes in Government schools, showing the average for Prep to Year 2, Year 3 to Year 6, and all classes overall, for census years 2001 to 2023.
Primary class size is collected from all Government schools as part of the February School Census conducted on the last school day of February each year. The size of a class is defined as the number of students that exist in a class grouping. Where schools have varying class sizes, the following definition is used:
- count the class as that which exists for the majority of the time, and
- which includes the time spent teaching literacy and numeracy, and
- which the school community regards as a class grouping.
Decisions on the class size profile for a school are made at the school level and take into account a number of factors such as the number of teachers available, the balance of enrolments across year levels, and classroom availability.
Average class size is NOT the same as the student/teacher ratio.
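A small Python sketch shows why average class size and the student/teacher ratio differ. The school below is hypothetical: four class groupings taught by four class teachers, plus one specialist teacher who does not have a class grouping of their own.

```python
# Hypothetical school (not real data): students in each class grouping.
class_sizes = [24, 26, 25, 25]

# Assumption: four class teachers plus one specialist (full-time equivalent).
teachers_fte = 5.0

total_students = sum(class_sizes)
average_class_size = total_students / len(class_sizes)
student_teacher_ratio = total_students / teachers_fte

print(average_class_size, student_teacher_ratio)
```

The specialist raises the teacher count without adding a class grouping, so the ratio comes out lower than the average class size even though both are computed from the same students.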
Average Roberta Willis Scholarship need-based grant and need-merit scholarship by expected family contribution and sex. Averaged across school years 2016/2017 through 2020/2021.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
This dataset consists of academic and demographic information about 300 students from a university, which can be used for predicting academic outcomes, such as probation status. The dataset was simulated to represent a variety of student attributes across multiple categories like personal data, academic history, and other related information. The primary goal of this dataset is to analyze factors contributing to academic performance and identify students at risk of probation.
Column Descriptions
Student No.: (Numeric) A unique identifier for each student. In this dataset, each student has a different ID number, making it a 100% unique column.
Cohort: (Numeric) The year a student enrolled in the university. No missing values and consistent across the dataset.
College: (Nominal) The name of the college the student belongs to. Examples include "Engineering," "Science," etc. No missing values.
College Code: (Nominal) A numerical or alphanumerical code representing the college. This is an alternative representation of the "College" column.
Major: (Nominal) The major field of study of the student. Some missing values (23%) represent students who haven’t declared a major or are in an undeclared status.
Major Code: (Nominal) A code representing the major subject. Similar to the "Major" column, this has 23% missing values due to undeclared majors.
Minor: (Nominal) The minor subject, if any, chosen by the student. This column has a high percentage of missing data (91%) since most students do not have minors.
Spec: (Nominal) Specialization within the major field of study. Like the "Minor" column, this has 93% missing data as most students do not declare a specialization.
Degree: (Numeric) The type of degree the student is pursuing (e.g., Bachelor's). In this dataset, all students are pursuing the same degree, so there are no missing values.
Status: (Nominal) The current academic standing of the student (e.g., "Active," "Inactive"). No missing values.
Load Status: (Nominal) The academic load status (e.g., "Full-time," "Part-time"). This column has very few missing values (1%).
Gender: (Nominal) The gender of the student (e.g., "Male," "Female"). No missing values.
Country: (Nominal) The country of origin of the student. Only 2 missing values, making it nearly complete.
Governorate: (Nominal) The administrative region (governorate) the student comes from. This column has a small percentage of missing values (1%).
Wellayah: (Nominal) The district or locality within the governorate. Around 1% of the data is missing. CGPA: (Numeric) The cumulative grade point average (CGPA) of the student. This field has 145 missing values, representing students without available CGPA records. Estimated Graduation Year: (Numeric) The expected year in which the student will graduate. No missing values. From HEAC: (Nominal) Indicates whether the student was admitted through the Higher Education Admission Center (HEAC). This column has 4% missing values. Admission Category: (Nominal) The category of admission (e.g., scholarship, self-funded). This column has a significant amount of missing data (98%), indicating that admission category data is either unavailable or irrelevant for most students. Birth Date: (Nominal) The birth date of the student. The dataset includes very few missing values (0%) and has been replaced by the derived feature "Age." Actual Graduation Date: (Nominal) The actual date on which a student graduates. More than half of the values are missing (54%), representing students who haven’t graduated yet. Withdrawal: (Nominal) Indicates whether the student has withdrawn from the university. This column has 89% missing data since the majority of students haven’t withdrawn. Marital Status: (Nominal) The marital status of the student (e.g., "Single," "Married"). No missing values. SQU Hostel: (Nominal) Indicates whether the student lives in the university hostel. No missing values. Percentage (Secondary School Score): (Nominal) The student’s percentage score from secondary school. No missing values. Probation Student: (Nominal) Indicates whether the student is under academic probation. This is the target variable for classification, with no missing values.
Record Details
Total Records: 300
Total Attributes: 26
Missing Values: Some columns have a significant proportion of missing data (e.g., Minor, Spec, Major Code), while others have very few or no missing values (e.g., Gender, Cohort, College). Missing values were handled using a placeholder for clarity in certain columns.
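The per-column missing-value percentages quoted above can be reproduced with a short script. The following is a minimal sketch, not the dataset author's procedure: the sample rows are invented, the function name `missing_share` is hypothetical, and the set of placeholder strings treated as missing is an assumption, since the description does not say which placeholder was used.

```python
import csv
import io

# Strings treated as "missing". This set is an assumption; the dataset
# description does not name the placeholder it uses.
MISSING_TOKENS = {"", "NA", "N/A", "null", "None"}


def missing_share(rows, columns):
    """Return {column: percentage of rows whose value is missing}."""
    total = len(rows)
    shares = {}
    for col in columns:
        missing = sum(
            1 for row in rows if (row.get(col) or "").strip() in MISSING_TOKENS
        )
        shares[col] = 100.0 * missing / total if total else 0.0
    return shares


# Tiny invented sample that mimics a few of the described columns.
SAMPLE = """Student No.,Major,Minor,CGPA
1,Physics,,3.1
2,,,
3,Biology,Math,2.4
4,Chemistry,,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
shares = missing_share(rows, reader.fieldnames)
for col, pct in shares.items():
    print(f"{col}: {pct:.0f}% missing")
```

To apply the same check to the real file, replace the `io.StringIO(SAMPLE)` source with an opened CSV file and adjust `MISSING_TOKENS` to match the placeholder actually present in the data.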
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset is longitudinal in nature, comprising data from four school years (2007/2008-2010/2011) following students from grade 1 to grade 4. Measures were chosen to provide a wide array of both reading and writing measures, encompassing reading and writing skills at the word, sentence, and larger passage or text levels. Participants were tested on all measures once a year, approximately one year apart. Participants were first grade students in the fall of 2007 whose parents consented to participate in the longitudinal study. Participants attended six different schools in a metropolitan school district in Tallahassee, Florida. Data were gathered by trained testers during thirty- to sixty-minute sessions in a quiet room designated for testing at the schools. The test battery was scored in a lab by two or more raters, and discrepancies in the scoring were resolved by an additional rater.
Reading Measures
Decoding Measures. The Woodcock Reading Mastery Tests-Revised (WRMT-R; Woodcock, 1987): Word Attack subtest was used to assess accuracy for decoding non-words. The Test of Word Reading Efficiency (TOWRE; Torgesen, Wagner, & Rashotte, 1999): Phonetic Decoding Efficiency (PDE) subtest was also used to assess pseudo-word reading fluency and accuracy. Both subtests were used to form a word level decoding latent factor. The WRMT-R Word Attack subtest consists of a list of non-words that are read out loud by the participant. The list starts with letters and becomes increasingly difficult, progressing to complex non-words. Testing is discontinued after six consecutive incorrect items. The median reliability is reported to be .87 for Word Attack (Woodcock, McGrew, & Mather, 2001). The TOWRE PDE requires accurately reading as many non-words as possible in 45 seconds. The TOWRE test manual reports test-retest reliability to be .90 for the PDE subtest.
Sentence Reading Measures. Two forms of the Test of Silent Reading Efficiency and Comprehension (TOSREC, forms A and D; Wagner et al., 2010) were used as measures of silent reading fluency. Students were required to read brief statements (e.g., “a cow is an animal”) and verify the truthfulness of the statement by circling yes or no. Students are given three minutes to read and answer as many sentences as possible. The mean alternate forms reliability for the TOSREC ranges from .86 to .95.
Reading Comprehension Measures. The Woodcock-Johnson-III (WJ-III) Passage Comprehension subtest (Woodcock et al., 2001) and the Woodcock Reading Mastery Tests-Revised Passage Comprehension subtest (WRMT-R; Woodcock, 1987) were used to provide two indicators of reading comprehension. For both of the passage comprehension subtests, students read brief passages to identify missing words. Testing is discontinued when the ceiling is reached (six consecutive wrong answers) or when the last page is reached. According to the test manuals, test-retest reliability is reported to be above .90 for WRMT-R, and the median reliability coefficient for WJ-III is reported to be .92.
Spelling Measures. The Spelling subtest from the Wide Range Achievement Test-3 (WRAT-3; Wilkinson, 1993) and the Spelling subtest from the Wechsler Individual Achievement Test-II (WIAT-II; The Psychological Corporation, 2002) were used to form a spelling factor. Both spelling subtests required students to spell words of increasing difficulty from dictation. The ceiling for the WRAT-3 Spelling subtest is misspelling ten consecutive words. If the first five words are not spelled correctly, the student is required to write his or her name and a series of letters and then continue spelling until he or she has missed ten consecutive items. The ceiling for WIAT-II is misspelling six consecutive words. The reliability of the WRAT-3 Spelling subtest is reported to be .96 and the reliability of the WIAT-II Spelling subtest is reported to be .94.
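Several of the measures above share a discontinue (ceiling) rule: testing stops after a fixed number of consecutive incorrect items. The sketch below illustrates that rule only. It assumes a raw score equal to the number of correct items administered before discontinuation, which is a simplification; the function name `score_with_ceiling` is hypothetical, and the published test manuals define additional basal and scoring rules that are not modeled here.

```python
def score_with_ceiling(responses, ceiling):
    """Count correct items until `ceiling` consecutive errors occur.

    responses: iterable of booleans (True = item answered correctly),
               in administration order.
    ceiling:   number of consecutive errors that ends testing
               (e.g., 6 for WIAT-II Spelling, 10 for WRAT-3 Spelling).
    Returns (number_correct, items_administered).
    """
    correct = 0
    consecutive_errors = 0
    administered = 0
    for is_correct in responses:
        administered += 1
        if is_correct:
            correct += 1
            consecutive_errors = 0  # a correct answer resets the error run
        else:
            consecutive_errors += 1
            if consecutive_errors == ceiling:
                break  # ceiling reached: discontinue testing
    return correct, administered


# Invented response pattern: three correct items, then six misses in a row.
responses = [True, True, False, True] + [False] * 6 + [True, True]
score, given = score_with_ceiling(responses, ceiling=6)
print(score, given)
```

With a ceiling of ten, the same response pattern never triggers discontinuation, so every item is administered.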
Written Expression Measures. The Written Expression subtest from the Wechsler Individual Achievement Test-II (WIAT-II; The Psychological Corporation, 2002) was administered. The Written Expression score is based on a composite of Word Fluency and Combining Sentences in first and second grades and a composite of Word Fluency, Combining Sentences, and Paragraph tasks in third grade. In this study, the Combining Sentences task was used as an indicator of writing ability at the sentence level. For this task, students are asked to combine various sentences into one meaningful sentence. According to the manual, the test-retest reliability coefficient for the Written Expression subtest is .86.
Writing Prompts. A writing composition task was also administered. Participants were asked to write a passage on a topic provided by the tester. Students were instructed to scratch out any mistakes and were not allowed to use erasers. The task was administered in groups and lasted 10 minutes. The passages for years 1 and 2 required expository writing and the passage for year 3 required narrative writing. The topics were as follows: choosing a pet for the classroom (year 1), favorite subject (year 2), a day off from school (year 3). The writing samples were transcribed into a computer database by two trained coders. In order to submit the samples to Coh-Metrix (described below), the coders also corrected the samples. Samples were corrected once for spelling and punctuation using a hard criterion (i.e., words were corrected individually for spelling errors regardless of the context, and run-on sentences were broken down into separate sentences). In addition, the samples were completely corrected using the soft criterion: corrections were made for spelling based on context (e.g., correcting there for their), punctuation, grammar, usage, and syntax (see Appendix A for examples of original and corrected transcripts). The samples that were corrected only for spelling and punctuation using the hard criterion were used for several reasons: (a) developing readers make many spelling errors which make their original samples illegible, and (b) the samples that were completely corrected do not stay true to the child’s writing ability. Accuracy of writing was not reflected in the corrected samples because of the elimination of spelling errors. However, as mentioned above, spelling ability was measured separately. Data on compositional fluency and complexity were obtained from Coh-Metrix. Compositional fluency refers to how much writing was done and complexity refers to the density of writing and length of sentences (Berninger et al., 2002; Wagner et al., 2010).
Coh-Metrix Measures. The transcribed samples were analyzed using Coh-Metrix (McNamara et al., 2005; Graesser et al., 2004). Coh-Metrix is a computer scoring system that analyzes over 50 measures of coherence, cohesion, language, and readability of texts. Appendix B contains the list of variables provided by Coh-Metrix. In the present study, the variables were broadly grouped into the following categories: a) syntactic, b) semantic, c) compositional fluency, d) frequency, e) readability and f) situation model. Syntactic measures provide information on pronouns, noun phrases, verb and noun constituents, connectives, type-token ratio, and number of words before the main verb. Connectives are words such as so and because that are used to connect clauses. Causal, logical, additive and temporal connectives indicate cohesion and logical ordering of ideas. Type-token ratio is the ratio of unique words to the number of times each word is used. Semantic measures provide information on nouns, word stems, anaphors, content word overlap, Latent Semantic Analysis (LSA), concreteness, and hypernyms. Anaphors are words (such as pronouns) used to avoid repetition (e.g., she refers to a person that was previously described in the text). LSA refers to how conceptually similar each sentence is to every other sentence in the text. Concreteness refers to the level of imaginability of a word, or the extent to which words are not abstract. Concrete words have more distinctive features and can be easily pictured in the mind. Hypernym is also a measure of concreteness and refers to the conceptual taxonomic level of a word (for example, chair has 7 hypernym levels: seat -> furniture -> furnishings -> instrumentality -> artifact -> object -> entity). Compositional fluency measures include the number of paragraphs, sentences and words, as well as their average length and the frequencies of content words. 
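As an illustration of one of the syntactic indices named above, the following sketch computes a type-token ratio using the common definition: the number of unique word forms (types) divided by the total number of words (tokens). This is an assumption about the definition, and the tokenization (lowercased runs of letters and apostrophes) is a simplification; Coh-Metrix's own implementation may differ. The function name `type_token_ratio` and the sample text are invented for the example.

```python
import re


def type_token_ratio(text):
    """Unique word forms (types) divided by total words (tokens).

    Tokenization is a simplifying assumption: lowercase runs of letters
    and apostrophes count as words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample: 11 tokens, 5 distinct word forms.
sample = "the dog saw the cat and the cat saw the dog"
ttr = type_token_ratio(sample)
print(round(ttr, 3))
```

A lower ratio indicates more repetition of the same words; because longer texts tend to repeat words, the ratio is sensitive to text length.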
Frequency indices provide information on the frequency of content words, including several transformations of the raw frequency score. Content words are nouns, adverbs, adjectives, main verbs, and other categories with rich conceptual content. Readability indices are related to fluency and include two traditional indices used to assess difficulty of text: Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Finally, situation model indices describe what the text is about, including causality of events and actions, intentionality of performing actions, tenses of actions and spatial information. Because Coh-Metrix hasn’t been widely used to study the development of writing in primary grade children (Puranik et al., 2010), the variables used in the present study were determined in an exploratory manner described below. Out of the 56 variables, 3 were used in the present study: total number of words, total number of sentences, and average sentence length (or average number of words per sentence). Nelson and Van Meter (2007) report that total word productivity is a robust measure of developmental growth in writing. Therefore, indicators for a paragraph level factor included total number of words and total number of sentences. Average words per sentence was used as an indicator for a latent sentence level factor, along with the WIAT-II Combining Sentences task.
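The three retained variables and the two readability indices can be sketched as follows. This is a rough illustration, not Coh-Metrix's implementation: sentence splitting on terminal punctuation and the word pattern are simplifying assumptions, the function names are hypothetical, and syllable counts are passed in directly because automatic syllable counting is itself heuristic. The two readability functions use the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas.

```python
import re


def composition_counts(text):
    """Return (total words, total sentences, average words per sentence).

    Assumption: sentences end at '.', '!' or '?', and words are runs of
    letters and apostrophes.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    average = len(words) / len(sentences) if sentences else 0.0
    return len(words), len(sentences), average


def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula (higher = easier text)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented sample in the spirit of the year 1 prompt (choosing a class pet).
sample = "I want a dog. A dog can play with us. It is fun!"
n_words, n_sentences, avg_len = composition_counts(sample)
print(n_words, n_sentences, round(avg_len, 2))
```

Both readability formulas depend only on words per sentence and syllables per word, which is why the source describes them as related to fluency measures.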
Following the Sunshine State Standards, students are required to take the Florida
In 2023, the mean income of women with a doctorate degree in the United States stood at 139,100 U.S. dollars. For men with the same degree, mean earnings stood at 175,500 U.S. dollars. On average in 2023, American men earned 91,590 U.S. dollars, while American women earned 65,987 U.S. dollars.