https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a dataset of higher education institutions in the United States of America. The dataset was compiled for a cybersecurity study of American higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].
The data includes the following fields for each institution:
The dataset was obtained from the Integrated Postsecondary Education Data System (IPEDS) website [3], which is administered by the National Center for Education Statistics (NCES). NCES serves as the primary federal entity for collecting and analyzing education-related data in the United States. The data was collected on February 2, 2023.
The initial list of institutions was derived from the IPEDS database using the following criteria: (1) US institutions only, (2) degree-granting institutions, primarily bachelor's or higher, and (3) sector of institution, which includes: public 4-year or above; private not-for-profit 4-year or above; private for-profit 4-year or above; public 2-year; private not-for-profit 2-year; private for-profit 2-year; public less-than-2-year; private not-for-profit less-than-2-year; and private for-profit less-than-2-year.
The following variables have been added to the list of institutions: Control of the institution, state abbreviation, degree-granting status, Status of the institution, and Institution's internet website address. This resulted in a report with 1,979 institutions.
The institution's status was labeled with the following values: A (Active), N (New), R (Restored), M (Closed in the current year), C (Combined with another institution), D (Deleted, out of business), I (Inactive due to hurricane-related issues), O (Outside IPEDS scope), P (Potential new/add institution), Q (Potential institution reestablishment), W (Potential addition outside IPEDS scope), X (Potential restoration outside the scope of IPEDS) and G (Perfect Children's Campus).
A filter was applied to the report to retain only institutions with an A, N, or R status, resulting in 1,978 institutions. Finally, a data cleaning process was applied, which involved removing the whitespace at the beginning and end of cell content and duplicate whitespace. The final data were compiled into the dataset included in this repository.
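The status filter and whitespace cleanup described above could be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the column names (`name`, `status`, `website`) and the sample rows are assumptions made up for the example.

```python
import re

KEEP_STATUSES = {"A", "N", "R"}  # Active, New, Restored


def clean_cell(value: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def filter_and_clean(rows):
    """Clean every cell, then keep only rows with status A, N, or R."""
    cleaned = []
    for row in rows:
        tidy = {key: clean_cell(val) for key, val in row.items()}
        if tidy["status"] in KEEP_STATUSES:
            cleaned.append(tidy)
    return cleaned


# Hypothetical sample rows; the real column names in the dataset may differ.
sample = [
    {"name": "  Example   State University ", "status": "A", "website": " www.example.edu "},
    {"name": "Closed College", "status": "M", "website": "www.closed.example"},
]
result = filter_and_clean(sample)
```

Cleaning before filtering means a status value with stray whitespace (for example `" A "`) is still matched.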
This data is available under the Creative Commons Zero (CC0) license and can be used for any purpose, including academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].
If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI: 10.5281/zenodo.7614862
If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.
We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Instituto Politécnico de Viana do Castelo, Portugal.
https://creativecommons.org/publicdomain/zero/1.0/
By FiveThirtyEight [source]
This repository contains a selection of the data and processing scripts behind the articles, graphics, and interactive experiences published by FiveThirtyEight. With this dataset, you can explore college programs and their graduates and create stories of your own. Whether you use it to check our work or to build your own visuals, we would love to know if you found it useful. The data and code are available under the Creative Commons Attribution 4.0 International License and the MIT License, respectively. Let us know how our resources turned out at
- Create an interactive comparison tool for researching college majors and their earning potential, so that prospective students can make informed decisions about what to study.
- Analyze the proportions of male and female graduates across different majors to uncover gender disparities in higher education.
- Explore the correlations between major categories, average salaries earned by graduates from specific major categories, unemployment rates for those with specific majors, and more, to identify trends in job opportunities for certain specialties of study.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: majors-list.csv
| Column name | Description |
|:---|:---|
| FOD1P | First-level division of the field of study (String) |
| Major | The specific major of the field of study (String) |
| Major_Category | The broader category of the field of study (String) |
File: recent-grads.csv
| Column name | Description |
|:---|:---|
| Major | The specific major of the field of study (String) |
| Rank | The rank of the major in terms of popularity (Integer) |
| Major_code | The code associated with the major (Integer) |
| Major_category | The category of the major (String) |
| Total | The total number of students in the major (Integer) |
| Sample_size | The sample size of the major (Integer) |
| Men | The number of male students in the major (Integer) |
| Women | The number of female students in the major (Integer) |
| ShareWomen | The percentage of female students in the major (Float) |
| Employed | The number of employed graduates from the major (Integer) |
| Full_time | The number of full-time employed graduates from the major (Integer) |
| Part_time | The number of part-time employed graduates from the major (Integer) |
| Full_time_year_round | The number of full-time year-round employed graduates from the major (Integer) |
| Unemployed | The number of unemployed graduates from the major (Integer) |
| Unemployment_rate | The unemployment rate of graduates from the major (Float) |
| Median | The median salary of graduates from the major (Integer) |
| P25th | The 25th percentile salary of graduates from the major (Integer) |
| P75th | The 75th percentile salary of graduates from the major (Integer) |
| College_jobs | The number of college jobs held by graduates from the major... |
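As a rough illustration of how the count columns relate to the two rate columns, the sketch below recomputes a share of women and an unemployment rate from `Men`, `Women`, `Employed`, and `Unemployed`. The formulas are assumptions (the listing does not say how `ShareWomen` or `Unemployment_rate` were derived), and the sample rows are invented for the example.

```python
import csv
import io

# Tiny hypothetical extract using column names from recent-grads.csv;
# the major names and numbers are made up for illustration.
SAMPLE_CSV = """Major,Men,Women,Employed,Unemployed
EXAMPLE MAJOR A,300,100,320,80
EXAMPLE MAJOR B,50,150,180,20
"""


def derive_rates(csv_text):
    """Return {major: (share_women, unemployment_rate)} computed from counts.

    Assumed formulas (not confirmed by the dataset description):
      share_women       = Women / (Men + Women)
      unemployment_rate = Unemployed / (Employed + Unemployed)
    """
    rates = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        men, women = int(row["Men"]), int(row["Women"])
        employed, unemployed = int(row["Employed"]), int(row["Unemployed"])
        share_women = women / (men + women)
        unemployment_rate = unemployed / (employed + unemployed)
        rates[row["Major"]] = (share_women, unemployment_rate)
    return rates


rates = derive_rates(SAMPLE_CSV)
```

Comparing values derived this way against the published `ShareWomen` and `Unemployment_rate` columns is one way to check whether the assumed formulas match the file.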
The National Survey of College Graduates is a repeated cross-sectional biennial survey that provides data on the nation's college graduates, with a focus on those in the science and engineering workforce. This survey is a unique source for examining the relationship of degree field and occupation in addition to other characteristics of college-educated individuals, including work activities, salary, and demographic information.
In 2022, about 37.7 percent of the U.S. population aged 25 and above had graduated from college or another higher education institution, a slight decline from 37.9 percent the previous year. However, this is a significant increase from 1960, when only 7.7 percent of the U.S. population had graduated from college.
Demographics
Educational attainment varies by gender, location, race, and age throughout the United States. Asian-American and Pacific Islanders had the highest level of education, on average, while Massachusetts and the District of Columbia are home to the highest rates of residents with a bachelor’s degree or higher. However, education levels are correlated with wealth. While public education is free up until the 12th grade, the cost of university is out of reach for many Americans, making social mobility increasingly difficult.
Earnings
White Americans with a professional degree earned the most money on average, compared to other educational levels and races. However, regardless of educational attainment, males typically earned far more on average than females. Although the wage gap has narrowed over the years, it remains an issue to this day. In addition to the large wage gap between males and females, there is also a large income gap linked to race.
NOTE: Data in this table represent the 50 states and the District of Columbia. Data through 1995 are for institutions of higher education, while later data are for degree-granting institutions. Degree-granting institutions grant associate’s or higher degrees and participate in Title IV federal financial aid programs. The degree-granting classification is very similar to the earlier higher education classification, but it includes more 2-year colleges and excludes a few higher education institutions that did not grant degrees. Projections in this table were calculated after the onset of the coronavirus pandemic and take into account the expected impacts of the pandemic. Some data have been revised from previously published figures.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Overall educational attainment measures the highest level of education attained by a given individual: for example, an individual counted in the percentage of the measured population with a master’s or professional degree can be assumed to also have a bachelor’s degree and a high school diploma, but they are not counted in the population percentages for those two categories. Overall educational attainment is the broadest education indicator available, providing information about the measured county population as a whole.
Only members of the population aged 25 and older are included in these educational attainment estimates, sourced from the U.S. Census Bureau American Community Survey (ACS).
Champaign County has high educational attainment: over 48 percent of the county's population aged 25 or older has a bachelor's degree or graduate or professional degree as their highest level of education. In comparison, the percentage of the population aged 25 or older in the United States and Illinois with a bachelor's degree in 2023 was 21.8% (+/-0.1) and 22.8% (+/-0.2), respectively. The population aged 25 or older in the U.S. and Illinois with a graduate or professional degree in 2022, respectively, was 14.3% (+/-0.1) and 15.5% (+/-0.2).
Educational attainment data was sourced from the U.S. Census Bureau’s American Community Survey 1-Year Estimates, which are released annually.
As with any datasets that are estimates rather than exact counts, it is important to take into account the margins of error (listed in the column beside each figure) when drawing conclusions from the data.
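To make the margin-of-error advice concrete, here is a small sketch of two common checks: combining the margins of two estimates that are added together, and testing whether two estimates differ by more than their combined margin. It uses the root-sum-of-squares approximation commonly applied to ACS estimates. The function names are illustrative, and the Illinois and U.S. figures quoted above serve only as sample inputs.

```python
import math


def combine_estimates(parts):
    """Sum (estimate, margin_of_error) pairs.

    The combined margin is the square root of the sum of squared margins,
    an approximation that treats the component estimates as independent.
    """
    total = sum(est for est, _ in parts)
    moe = math.sqrt(sum(m ** 2 for _, m in parts))
    return total, moe


def differ_significantly(a, b):
    """True when the gap between two estimates exceeds their combined margin."""
    (est_a, moe_a), (est_b, moe_b) = a, b
    return abs(est_a - est_b) > math.sqrt(moe_a ** 2 + moe_b ** 2)


# Illinois shares quoted above, used only as sample inputs.
bachelors = (22.8, 0.2)
graduate = (15.5, 0.2)
total, moe = combine_estimates([bachelors, graduate])

# U.S. versus Illinois bachelor's share, again as sample inputs.
us_vs_il = differ_significantly((21.8, 0.1), (22.8, 0.2))
```

Because the approximation ignores any correlation between the estimates, the result is best read as a rough guide rather than an exact interval.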
Due to the impact of the COVID-19 pandemic, instead of providing the standard 1-year data products, the Census Bureau released experimental estimates from the 1-year data in 2020. This includes a limited number of data tables for the nation, states, and the District of Columbia. The Census Bureau states that the 2020 ACS 1-year experimental tables use an experimental estimation methodology and should not be compared with other ACS data. For these reasons, and because data is not available for Champaign County, no data for 2020 is included in this Indicator.
For interested data users, the 2020 ACS 1-Year Experimental data release includes a dataset on Educational Attainment for the Population 25 Years and Over.
Sources:
U.S. Census Bureau; American Community Survey, 2023 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using data.census.gov; (16 October 2024).
U.S. Census Bureau; American Community Survey, 2022 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using data.census.gov; (29 September 2023).
U.S. Census Bureau; American Community Survey, 2021 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using data.census.gov; (6 October 2022).
U.S. Census Bureau; American Community Survey, 2019 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using data.census.gov; (4 June 2021).
U.S. Census Bureau; American Community Survey, 2018 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using data.census.gov; (4 June 2021).
U.S. Census Bureau; American Community Survey, 2017 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (13 September 2018).
U.S. Census Bureau; American Community Survey, 2016 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (13 September 2018).
U.S. Census Bureau; American Community Survey, 2015 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (19 September 2016).
U.S. Census Bureau; American Community Survey, 2014 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2013 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2012 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2011 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2010 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2009 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2008 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2007 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2006 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
U.S. Census Bureau; American Community Survey, 2005 American Community Survey 1-Year Estimates, Table S1501; generated by CCRPC staff; using American FactFinder; (16 March 2016).
The College Scorecard dataset is provided by the U.S. Department of Education and contains information on nearly every college and university in the United States. The dataset includes data on student loan repayment rates, graduation rates, affordability, earnings after graduation, and more. The goal of this dataset is to help students make informed decisions about their college choice by providing them with clear and concise information about each school's performance.
This dataset can help users understand the cost of attending college in the United States, as well as the average debt load for students. It can also be used to compare different schools in terms of their graduation rates and repayment rates.
This data was originally collected by the US Department of Education and made available on their website. Thank you to the US Department of Education for making this data available!
There were approximately 18.58 million college students in the U.S. in 2022, with around 13.49 million enrolled in public colleges and a further 5.09 million students enrolled in private colleges. The figures are projected to remain relatively constant over the next few years.
What is the most expensive college in the U.S.?
The overall number of higher education institutions in the U.S. totals around 4,000, and California is the state with the most. One important factor that students, and their parents, must consider before choosing a college is cost. With annual expenses totaling almost 78,000 U.S. dollars, Harvey Mudd College in California was the most expensive college for the 2021-2022 academic year. There are three major costs of college: tuition, room, and board. The difference in on-campus and off-campus accommodation costs is often negligible, but they can change greatly depending on the college town.
The differences between public and private colleges
Public colleges, also called state colleges, are mostly funded by state governments. Private colleges, on the other hand, are not funded by the government but by private donors and endowments. Typically, private institutions are much more expensive. Public colleges tend to offer different tuition fees for students based on whether they live in-state or out-of-state, while private colleges have the same tuition cost for every student.
https://catalog.dvrpc.org/dvrpc_data_license.html
As part of the American Community Survey (ACS), the U.S. Census Bureau collects information regarding respondents' educational attainment. Educational attainment refers to the highest level of education that all individuals age 25 and older have completed. Response categories include no schooling completed; nursery school, grades 1 through 11; 12th grade but no diploma; regular high school diploma; GED or alternative credential; some college credit, but less than one year of college; one or more years of college credit, no degree; associate's degree; bachelor's degree; master's degree, professional degree beyond bachelor's degree; and doctorate degree. Data from the 2000 Decennial Census is also summarized.
DESCRIPTION
College completion data from 3,800 degree-granting institutions in the United States
SUMMARY
Source of the data
These data were pulled from the College Completion microsite produced by The Chronicle of Higher Education with support from the Bill & Melinda Gates Foundation. Its goal is to share data on completion rates in American higher education in a visually stimulating way. [Their] hope is that ... you will find your own stories in the statistics and use the tools [they] provide to download data files; share charts through your own presentations; and comment, start conversations, or provide tips about this important topic.
Note: This text was adapted from the College Completion website's About page copy. Please visit http://collegecompletion.chronicle.com/about/ for more info
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains complete education details for members of the 116th United States Congress, including universities, degrees, and political affiliations. I hope that this dataset is helpful for anyone who may wish to further investigate the correlation of education and academic credentials with policy decisions made by members of our government.
Data was gathered by reviewing Wikipedia pages for members of the U.S. 116th Congress via: https://en.wikipedia.org/wiki/116th_United_States_Congress
Thank you Wikipedia. You are a modern miracle.
This dataset can be used to answer questions like:
- Is political affiliation correlated with education?
- What is the most common degree type for U.S. Senators?
- What percentage of U.S. Congressmen dropped out of college?
- Which college has the most representation in the House of Representatives?
- What percentage of Congressmen are scientists?
With some creative co-mingling, this dataset can be used to supplement research questions like:
- Are policy decisions correlated with education?
- Are there relationships associated with college affiliations and voting in Congress?
- Can we find a relationship between hot button topics and education?
- How is public sentiment influenced by education level?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
School enrollment data are used to assess the socioeconomic condition of school-age children. Government agencies also require these data for funding allocations and program planning and implementation.
Data on school enrollment and grade or level attending were derived from answers to Question 10 in the 2015 American Community Survey (ACS). People were classified as enrolled in school if they were attending a public or private school or college at any time during the 3 months prior to the time of interview. The question included instructions to “include only nursery or preschool, kindergarten, elementary school, home school, and schooling which leads to a high school diploma, or a college degree.” Respondents who did not answer the enrollment question were assigned the enrollment status and type of school of a person with the same age, sex, race, and Hispanic or Latino origin whose residence was in the same or nearby area.
School enrollment is only recorded if the schooling advances a person toward an elementary school certificate, a high school diploma, or a college, university, or professional school (such as law or medicine) degree. Tutoring or correspondence schools are included if credit can be obtained from a public or private school or college. People enrolled in “vocational, technical, or business school” such as post secondary vocational, trade, hospital school, and on job training were not reported as enrolled in school. Field interviewers were instructed to classify individuals who were home schooled as enrolled in private school. The guide sent out with the mail questionnaire includes instructions for how to classify home schoolers.
Enrolled in Public and Private School – Includes people who attended school in the reference period and indicated they were enrolled by marking one of the questionnaire categories for “public school, public college,” or “private school, private college, home school.” The instruction guide defines a public school as “any school or college controlled and supported primarily by a local, county, state, or federal government.” Private schools are defined as schools supported and controlled primarily by religious organizations or other private groups. Home schools are defined as “parental-guided education outside of public or private school for grades 1-12.” Respondents who marked both the “public” and “private” boxes are edited to the first entry, “public.”
Grade in Which Enrolled – From 1999-2007, in the ACS, people reported to be enrolled in “public school, public college” or “private school, private college” were classified by grade or level according to responses to Question 10b, “What grade or level was this person attending?” Seven levels were identified: “nursery school, preschool;” “kindergarten;” elementary “grade 1 to grade 4” or “grade 5 to grade 8;” high school “grade 9 to grade 12;” “college undergraduate years (freshman to senior);” and “graduate or professional school (for example: medical, dental, or law school).”
In 2008, the school enrollment questions had several changes. “Home school” was explicitly included in the “private school, private college” category. For question 10b the categories changed to the following “Nursery school, preschool,” “Kindergarten,” “Grade 1 through grade 12,” “College undergraduate years (freshman to senior),” “Graduate or professional school beyond a bachelor’s degree (for example: MA or PhD program, or medical or law school).” The survey question allowed a write-in for the grades enrolled from 1-12.
Question/Concept History – Since 1999, the ACS enrollment status question (Question 10a) refers to “regular school or college,” while the 1996-1998 ACS did not restrict reporting to “regular” school, and contained an additional category for the “vocational, technical or business school.” The 1996-1998 ACS used the educational attainment question to estimate level of enrollment for those reported to be enrolled in school, and had a single year write-in for the attainment of grades 1 through 11. Grade levels estimated using the attainment question were not consistent with other estimates, so a new question specifically asking grade or level of enrollment was added starting with the 1999 ACS questionnaire.
Limitation of the Data – Beginning in 2006, the population universe in the ACS includes people living in group quarters. Data users may see slight differences in levels of school enrollment in any given geographic area due to the inclusion of this population. The extent of this difference, if any, depends on the type of group quarters present and whether the group quarters population makes up a large proportion of the total population. For example, in areas that are home to several colleges and universities, the percent of individuals 18 to 24 who were enrolled in college or graduate school would increase, as people living in college dormitories are now included in the universe.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Cost of International Education dataset compiles detailed financial information for students pursuing higher education abroad. It covers multiple countries, cities, and universities around the world, capturing the full tuition and living expenses spectrum alongside key ancillary costs. With standardized fields such as tuition in USD, living-cost indices, rent, visa fees, insurance, and up-to-date exchange rates, it enables comparative analysis across programs, degree levels, and geographies. Whether you’re a prospective international student mapping out budgets, an educational consultant advising on affordability, or a researcher studying global education economics, this dataset offers a comprehensive foundation for data-driven insights.
| Column | Type | Description |
|---|---|---|
| Country | string | ISO country name where the university is located (e.g., “Germany”, “Australia”). |
| City | string | City in which the institution sits (e.g., “Munich”, “Melbourne”). |
| University | string | Official name of the higher-education institution (e.g., “Technical University of Munich”). |
| Program | string | Specific course or major (e.g., “Master of Computer Science”, “MBA”). |
| Level | string | Degree level of the program: “Undergraduate”, “Master’s”, “PhD”, or other certifications. |
| Duration_Years | integer | Length of the program in years (e.g., 2 for a typical Master’s). |
| Tuition_USD | numeric | Total program tuition cost, converted into U.S. dollars for ease of comparison. |
| Living_Cost_Index | numeric | A normalized index (often based on global city indices) reflecting relative day-to-day living expenses (food, transport, utilities). |
| Rent_USD | numeric | Average monthly student accommodation rent in U.S. dollars. |
| Visa_Fee_USD | numeric | One-time visa application fee payable by international students, in U.S. dollars. |
| Insurance_USD | numeric | Annual health or student insurance cost in U.S. dollars, as required by many host countries. |
| Exchange_Rate | numeric | Local currency units per U.S. dollar at the time of data collection—vital for currency conversion and trend analysis if rates fluctuate. |
Feel free to explore, visualize, and extend this dataset for deeper insights into the true cost of studying abroad!
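One way to combine the cost columns into a single budget figure is sketched below. It follows the column descriptions as given (total program tuition, monthly rent, one-time visa fee, annual insurance). The `Program` class, the function name, and the sample figures are hypothetical. `Living_Cost_Index` is left out because it is a relative index rather than a dollar amount.

```python
from dataclasses import dataclass


@dataclass
class Program:
    """One row of the dataset, limited to the cost-related columns."""
    duration_years: int
    tuition_usd: float    # total program tuition, per the column description
    rent_usd: float       # average monthly rent
    visa_fee_usd: float   # one-time application fee
    insurance_usd: float  # annual insurance cost


def estimated_total_cost(p: Program) -> float:
    """Tuition + rent for every month + one visa fee + insurance for every year."""
    rent_total = p.rent_usd * 12 * p.duration_years
    insurance_total = p.insurance_usd * p.duration_years
    return p.tuition_usd + rent_total + p.visa_fee_usd + insurance_total


# Hypothetical figures, not taken from the dataset.
example = Program(duration_years=2, tuition_usd=20000, rent_usd=1000,
                  visa_fee_usd=100, insurance_usd=500)
total = estimated_total_cost(example)
```

If `Tuition_USD` turns out to be an annual figure in the actual file, it would need to be multiplied by `duration_years` as well.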
A broad and generalized selection of 2011-2015 US Census Bureau 2015 5-year American Community Survey education data estimates, obtained via Census API and joined to the appropriate geometry (in this case, New Mexico counties). The selection is not comprehensive, but allows a first-level characterization of educational attainment by grade level and sex (for all persons 25 years and older), plus enrollment estimates at key educational levels (for the universe of all persons 3+ years old). The determination of which estimates to include was based upon level of interest and providing a manageable dataset for users. The U.S. Census Bureau's American Community Survey (ACS) is a nationwide, continuous survey designed to provide communities with reliable and timely demographic, housing, social, and economic data every year. The ACS collects long-form-type information throughout the decade rather than only once every 10 years. As in the decennial census, strict confidentiality laws protect all information that could be used to identify individuals or households. The ACS combines population or housing data from multiple years to produce reliable numbers for small counties, neighborhoods, and other local areas. To provide information for communities each year, the ACS provides 1-, 3-, and 5-year estimates. ACS 5-year estimates (multiyear estimates) are “period” estimates that represent data collected over a 60-month period of time (as opposed to “point-in-time” estimates, such as the decennial census, that approximate the characteristics of an area on a specific date). ACS data are released in the year immediately following the year in which they are collected. ACS estimates based on data collected from 2009–2014 should not be called “2009” or “2014” estimates. Multiyear estimates should be labeled to indicate clearly the full period of time.
The primary advantage of using multiyear estimates is the increased statistical reliability of the data for less populated areas and small population subgroups. Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. While each full Data Profile contains margin of error (MOE) information, this dataset does not. Those individuals requiring more complete data are directed to download the more detailed datasets from the ACS American FactFinder website. This dataset is organized by New Mexico county boundaries.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Overview
This dataset provides a detailed and realistic simulation of Indian international students studying in the United States, their educational paths, job outcomes after graduation, university information, and visa approval statistics.
It can be used for:
📂 Files Included
| File Name | Description |
|---|---|
| indian_international_students_us.csv | Profiles of 10,000 Indian international students, including university, major, degree level, and study status. |
| job_outcomes_indian_students_us.csv | Job outcomes for students who graduated, including job title, company, salary, visa status, and time to first job. |
| universities_info_us.csv | Information about major US universities, including acceptance rates, GRE/TOEFL averages, and international student percentages. |
| visa_approval_stats.csv | Yearly visa approval and denial rates for F1, OPT, and H1B visa types from 2015 to 2023. |
✨ Potential Project Ideas 1. Predict job offer chances based on major, degree, and university. 2. Analyze salary distributions by major, company, and visa status. 3. Visualize visa approval trends over time. 4. Build a career advisory tool for international students.
✨ Potential SQL Projects
1. List all students studying in the "Computer Science" major.
2. Count how many students are currently enrolled vs. graduated.
3. Find the top 5 universities with the highest number of students.
4. Get the list of all students whose degree level is "Masters".
5. Find the average salary of students who received a job offer.
6. List all companies that hired at least one student.
Find visa approval rate for each visa type (F1, OPT, H1B) over the years.
Build a report showing: University Name Number of students Number of students who got jobs Average salary Job offer rate (%)
Identify majors with the highest average salaries after graduation.
Compare visa approval trends: How have F1, OPT, and H1B approval rates changed from 2015 to 2023?
Create a view showing: Students with highest probability of getting a job based on major, university, and degree level.
Predict (with SQL logic): If a new student graduates from [University X] with [Major Y] and [Degree Level Z], what is their expected salary range?
Cohort Analysis: Analyze students who graduated in a particular year, how many got jobs within 6 months.
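As one way to approach the university report idea above, here is a minimal sketch that runs the query against an in-memory SQLite database from Python. The table names, the column names (`student_id`, `university_name`, `salary_usd`, and so on), and the four sample rows are assumptions made for illustration; the real CSV headers may differ and should be checked before reuse.

```python
import sqlite3

# Hypothetical schema and rows; the actual CSV column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    university_name TEXT, major TEXT, degree_level TEXT
);
CREATE TABLE job_outcomes (student_id INTEGER, job_title TEXT, salary_usd REAL);
""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", [
    (1, "University A", "Computer Science", "Masters"),
    (2, "University A", "Data Science", "Masters"),
    (3, "University B", "Computer Science", "Bachelors"),
    (4, "University B", "Mechanical Engineering", "Masters"),
])
conn.executemany("INSERT INTO job_outcomes VALUES (?, ?, ?)", [
    (1, "Software Engineer", 100000.0),
    (2, "Data Analyst", 80000.0),
    (3, "Software Engineer", 90000.0),
])

# LEFT JOIN keeps students without a job row, so COUNT(j.student_id)
# counts only students who got a job. Assumes at most one job row per student.
REPORT_SQL = """
SELECT s.university_name,
       COUNT(s.student_id) AS n_students,
       COUNT(j.student_id) AS n_with_job,
       AVG(j.salary_usd)   AS avg_salary,
       ROUND(100.0 * COUNT(j.student_id) / COUNT(s.student_id), 1) AS job_offer_rate_pct
FROM students AS s
LEFT JOIN job_outcomes AS j ON j.student_id = s.student_id
GROUP BY s.university_name
ORDER BY s.university_name
"""
report = conn.execute(REPORT_SQL).fetchall()
for row in report:
    print(row)
```

The same join pattern extends to the major-level and cohort questions by changing the `GROUP BY` column or adding a filter on graduation year.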
⚡ Important Note
This dataset is synthetic but designed to be realistic based on trends among Indian students studying abroad. No real personal information is included. Great for educational, research, and portfolio purposes.
🔖 Acknowledgment
Generated by Pushkar Joshi using simulated data sources. Inspired by real-world patterns and publicly available educational statistics.
A broad and generalized selection of 2013-2017 US Census Bureau 2017 5-year American Community Survey education data estimates, obtained via Census API and joined to the appropriate geometry (in this case, New Mexico counties). The selection is not comprehensive, but allows a first-level characterization of educational attainment by grade level and sex (for all persons 25 years and older), plus enrollment estimates at key educational levels (for the universe of all persons 3+ years old). The determination of which estimates to include was based upon level of interest and providing a manageable dataset for users. The U.S. Census Bureau's American Community Survey (ACS) is a nationwide, continuous survey designed to provide communities with reliable and timely demographic, housing, social, and economic data every year. The ACS collects long-form-type information throughout the decade rather than only once every 10 years. As in the decennial census, strict confidentiality laws protect all information that could be used to identify individuals or households. The ACS combines population or housing data from multiple years to produce reliable numbers for small counties, neighborhoods, and other local areas. To provide information for communities each year, the ACS provides 1-, 3-, and 5-year estimates. ACS 5-year estimates (multiyear estimates) are “period” estimates that represent data collected over a 60-month period of time (as opposed to “point-in-time” estimates, such as the decennial census, that approximate the characteristics of an area on a specific date). ACS data are released in the year immediately following the year in which they are collected. ACS estimates based on data collected from 2009–2014 should not be called “2009” or “2014” estimates. Multiyear estimates should be labeled to indicate clearly the full period of time.
The primary advantage of using multiyear estimates is the increased statistical reliability of the data for less populated areas and small population subgroups. Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. While each full Data Profile contains margin of error (MOE) information, this dataset does not. Those individuals requiring more complete data are directed to download the more detailed datasets from the ACS American FactFinder website. This dataset is organized by New Mexico county boundaries.
A broad and generalized selection of 2014-2018 US Census Bureau 2018 5-year American Community Survey education data estimates, obtained via Census API and joined to the appropriate geometry (in this case, New Mexico counties). The selection is not comprehensive, but allows a first-level characterization of educational attainment by grade level and sex (for all persons 25 years and older), plus enrollment estimates at key educational levels (for the universe of all persons 3+ years old). The determination of which estimates to include was based upon level of interest and providing a manageable dataset for users. The U.S. Census Bureau's American Community Survey (ACS) is a nationwide, continuous survey designed to provide communities with reliable and timely demographic, housing, social, and economic data every year. The ACS collects long-form-type information throughout the decade rather than only once every 10 years. As in the decennial census, strict confidentiality laws protect all information that could be used to identify individuals or households. The ACS combines population or housing data from multiple years to produce reliable numbers for small counties, neighborhoods, and other local areas. To provide information for communities each year, the ACS provides 1-, 3-, and 5-year estimates. ACS 5-year estimates (multiyear estimates) are “period” estimates that represent data collected over a 60-month period of time (as opposed to “point-in-time” estimates, such as the decennial census, that approximate the characteristics of an area on a specific date). ACS data are released in the year immediately following the year in which they are collected. ACS estimates based on data collected from 2009–2014 should not be called “2009” or “2014” estimates. Multiyear estimates should be labeled to indicate clearly the full period of time.
The primary advantage of using multiyear estimates is the increased statistical reliability of the data for less populated areas and small population subgroups. Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. While each full Data Profile contains margin of error (MOE) information, this dataset does not. Those individuals requiring more complete data are directed to download the more detailed datasets from the ACS American FactFinder website. This dataset is organized by New Mexico county boundaries.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This comprehensive dataset provides detailed educational attainment and demographic analysis across all 50 US states from 2021-2023, specifically designed for tech companies planning strategic market entry and product launch decisions.
| Column Name | Data Type | Description | Example Value |
|---|---|---|---|
| NAME | String | Full US state name | "Massachusetts" |
| total_population_25plus | Integer | Total population aged 25 and above | 4,975,152 |
| bachelors_degree | Integer | Number of individuals with bachelor's degrees | 1,261,847 |
| masters_degree | Integer | Number of individuals with master's degrees | 788,243 |
| professional_degree | Integer | Number of individuals with professional degrees (JD, MD, etc.) | 157,762 |
| doctoral_degree | Integer | Number of individuals with doctoral degrees (PhD, EdD, etc.) | 169,357 |
| median_household_income | Integer | Median household income in USD | $99,858 |
| total_households | Float | Total number of households (in millions) | 2.41 |
| state | Integer | Numeric state identifier (1-50) | 25 |
| year | Integer | Data collection year | 2023 |
| college_graduates | Integer | Total college graduates (bachelor's + advanced degrees) | 2,377,209 |
| college_graduate_percentage | Float | Percentage of population with college degrees | 47.78% |
| graduate_degree_holders | Integer | Total with master's, professional, or doctoral degrees | 1,115,362 |
| graduate_degree_percentage | Float | Percentage with graduate-level degrees | 22.42% |
| advanced_degree_percentage | Float | Percentage with professional or doctoral degrees | 3.40% |
| education_score | Float | Composite education ranking score | 28.76 |
| education_rank | Integer | State ranking based on education score (1-50, 1=highest) | 1 |
The dataset reveals that Massachusetts consistently ranks #1 in education metrics with:
- 47.78% college graduation rate (2023)
- 22.42% graduate degree holders
- $99,858 median household income
- Education score of 28.76
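The derived columns in the table can be cross-checked from the raw counts. The sketch below uses the Massachusetts example values from the table; the formulas are inferred from the column descriptions and are an assumption, not a documented method from the dataset author. The helper name `derive` and the key `doctoral_share_percentage` are hypothetical.

```python
# Example values copied from the column table (Massachusetts, 2023).
row = {
    "total_population_25plus": 4_975_152,
    "bachelors_degree": 1_261_847,
    "masters_degree": 788_243,
    "professional_degree": 157_762,
    "doctoral_degree": 169_357,
}


def derive(row):
    """Recompute the derived columns; formulas inferred from the descriptions."""
    pop = row["total_population_25plus"]
    graduate = (row["masters_degree"] + row["professional_degree"]
                + row["doctoral_degree"])
    college = row["bachelors_degree"] + graduate
    return {
        "college_graduates": college,
        "college_graduate_percentage": round(100 * college / pop, 2),
        "graduate_degree_holders": graduate,
        "graduate_degree_percentage": round(100 * graduate / pop, 2),
        # Hypothetical extra field: share of doctoral degrees alone.
        "doctoral_share_percentage": round(100 * row["doctoral_degree"] / pop, 2),
    }


derived = derive(row)
print(derived)
```

With these inputs the recomputed counts and percentages agree with the example values for `college_graduates`, `college_graduate_percentage`, `graduate_degree_holders`, and `graduate_degree_percentage`. The example value of 3.40% for `advanced_degree_percentage` matches the doctoral share alone; professional plus doctoral degrees would come to roughly 6.6%, so that column's description is worth verifying against the data. The formula behind `education_score` is not stated and is not reproduced here.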
Perfect for identifying premium tech markets and highly-educated consumer bases for sophisticated technology products.
This dataset is ideal for data scientists, market researchers, business analysts, and tech companies looking to make data-driven decisions about market entry, customer targeting, and regional strategy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Several works apply Natural Language Processing to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al. [2] created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, little NLP research has been devoted to studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study of a comparatively large dataset of newspaper reports on COVID-19, and it contributes to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports on COVID-19 from the United States of America (USA). The newspapers include The Washington Post (USA) and StarTribune (USA). We named this dataset “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named the dataset “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in their respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferred over automation to ensure the news was highly relevant to the subject. The newspapers' online sites had dynamic content with advertisements in no particular order, so there was a high chance of online scrapers collecting inaccurate news reports. One of the challenges while collecting the data was the subscription requirement: each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data-collectors, were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above mentioned newspapers.
To collect these data, we used a Google Form for USA and BD. Two human editors went through each entry to check for spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
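The first three pre-processing steps listed above can be sketched with the Python standard library alone. This is an illustration and not the authors' script: the stop-word list is a small hypothetical subset, "non-English alphanumeric characters" is interpreted here as anything other than ASCII letters, digits, and whitespace, and lemmatization is left out because it needs an external resource such as NLTK's `WordNetLemmatizer`.

```python
import re

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")  # assumed reading of the second step


def preprocess(text):
    """Remove hyperlinks, non-alphanumeric characters, and stop words."""
    text = URL_RE.sub(" ", text)        # step 1: drop hyperlinks
    text = NON_ALNUM_RE.sub(" ", text)  # step 2: keep ASCII letters and digits
    # step 3: drop stop words (case-insensitive match, original case kept)
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


tokens = preprocess(
    "The spread of COVID-19 in Minnesota: see https://example.com/news for details!"
)
print(tokens)
```

Note that stripping punctuation splits "COVID-19" into two tokens; a production script might protect such terms before cleaning.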
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
| Statistic | Range |
|---|---|
| No of words per headline | 7 to 20 |
| No of words per body content | 150 to 2100 |
Table 2: Covid-News-BD-NNK data statistics
| Statistic | Range |
|---|---|
| No of words per headline | 10 to 20 |
| No of words per body content | 100 to 1500 |
2.3 Dataset Repository
We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and machine translation as used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms designed solely for Natural Language Processing tasks. The two most prominent ones are BERT [14], which uses a bidirectional Transformer encoder architecture and performs very well on classification and word-prediction tasks, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are typically used as pre-trained models, since training them carries a huge computation cost.
Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction in texts could be identifying named entities and locations or essential data.
Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].
Keyword extraction is a process of information extraction and sub-task of NLP to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and pick the words with more weight to it.
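To make the graph-based idea concrete, the following is a simplified TextRank-style sketch, not the implementation from [18] and not the authors' code. It assumes the text is already tokenized, links words that appear next to each other, and scores them with a PageRank-style iteration; edge weights and part-of-speech filtering from the original method are omitted. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Build an undirected graph: words within `window` tokens are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update; each word shares its score among neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


sample = "virus economy virus masks virus election".split()
print(textrank_keywords(sample, top_k=1))
```

In this toy input, "virus" is linked to every other word, so it receives the highest weight, which mirrors how frequently connected terms surface as keywords.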
Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.
4 Our experiments and Result analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can point out the following:
In February, both newspapers discussed China and the source of the outbreak.
StarTribune emphasized Minnesota as the state of greatest concern. In April, this concern appeared to grow.
Both newspapers discussed the virus's impact on the economy, i.e., banks, elections, administrations, and markets.
Washington Post discussed global issues more than StarTribune.
StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread in the United States is given more weight from March through May, displaying the critical impact caused by the virus.
We used a script to extract all numbers related to certain keywords such as 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', and 'Diagnosed' from the news reports and built a series of case counts for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as the count rose gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.
We used Vader Sentiment Analysis to extract the sentiment of the headlines and the body. On average, the sentiments ranged from -0.5 to -0.9. The Vader Sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us separate the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.
We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Adam Shl
Released under Database: Open Database, Contents: Database Contents
https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a dataset of higher education institutions in the United States of America. This dataset was compiled for a cybersecurity study of American higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].
The data includes the following fields for each institution:
The dataset was obtained from the Integrated Postsecondary Education Data System (IPEDS) website [3], which is administered by the National Center for Education Statistics (NCES). NCES serves as the primary federal entity for collecting and analyzing education-related data in the United States. The data was collected on February 2, 2023.
The initial list of institutions was derived from the IPEDS database using the following criteria: (1) US institutions only, (2) degree-granting institutions, primarily bachelor's or higher, and (3) sector classification, which includes: public 4 years or above, private not-for-profit 4 years or above, private for-profit 4 years or above, public 2 years, private not-for-profit 2 years, private for-profit 2 years, public less than 2 years, private not-for-profit less than 2 years, and private for-profit less than 2 years.
The following variables were added to the list of institutions: control of the institution, state abbreviation, degree-granting status, status of the institution, and the institution's website address. This resulted in a report with 1,979 institutions.
The institution's status was labeled with the following values: A (Active), N (New), R (Restored), M (Closed in the current year), C (Combined with another institution), D (Deleted out of business), I (Inactive due to hurricane-related issues), O (Outside IPEDS scope), P (Potential new/add institution), Q (Potential institution reestablishment), W (Potential addition outside IPEDS scope), X (Potential restoration outside the scope of IPEDS), and G (Perfect Children's Campus).
A filter was applied to the report to retain only institutions with an A, N, or R status, resulting in 1,978 institutions. Finally, a data cleaning process was applied, which involved removing the whitespace at the beginning and end of cell content and duplicate whitespace. The final data were compiled into the dataset included in this repository.
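The filtering and cleaning steps described above can be sketched in a few lines of Python. The field names (`name`, `status`) and the sample rows are hypothetical; the actual column headers in the IPEDS report may differ. This sketch cleans cells before filtering so that a padded status code still matches, which is a small deviation from the order in which the steps are described.

```python
import re

KEEP_STATUSES = {"A", "N", "R"}  # Active, New, Restored


def clean_cell(value):
    """Trim leading/trailing whitespace and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value).strip()


def filter_and_clean(rows, status_field="status"):
    """Clean every cell, then keep only rows whose status is A, N, or R."""
    cleaned = [{key: clean_cell(val) for key, val in row.items()} for row in rows]
    return [row for row in cleaned if row[status_field] in KEEP_STATUSES]


# Hypothetical rows for illustration only.
rows = [
    {"name": "  Example   University ", "status": "A"},
    {"name": "Closed College", "status": "D"},
    {"name": "New Institute", "status": " N "},
]
result = filter_and_clean(rows)
print(result)
```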
This data is available under the Creative Commons Zero (CC0) license and can be used for any purpose, including academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].
If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI: 10.5281/zenodo.7614862
If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.
We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Instituto Politécnico de Viana do Castelo, Portugal.