This dataset includes FAQ data and their categories for training a chatbot specialized for the e-learning system used at Tokyo Metropolitan University. We report the chatbot's accuracy in the following papers.
Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "Supporting Creation of FAQ Dataset for E-learning Chatbot", Intelligent Decision Technologies, Smart Innovation, IDT'19, Springer, 2019, to appear.
Yasunobu Sumikawa, Masaaki Fujiyoshi, Hisashi Hatakeyama, and Masahiro Nagai "An FAQ Dataset for E-learning System Used on a Japanese University", Data in Brief, Elsevier, in press.
This dataset is based on real Q&A data about how to use the e-learning system, asked by the students and teachers who use it in their classes. The Q&A data were collected from April 2015 to July 2018.
We attach an English version of the dataset, translated from the Japanese original, to make its contents easier to understand. Note that we did not perform any evaluations on the English version; there are no results on how accurately chatbots respond to its questions.
File contents:
Results of statistical analyses of the dataset. We used the Calinski-Harabasz method, mutual information, the Jaccard index, TF-IDF+KL divergence, and TF-IDF+JS divergence to measure the quality of the dataset. In these analyses, we regard each answer as a cluster of questions. We also perform the same analyses for categories by regarding them as clusters of answers.
Grants: JSPS KAKENHI Grant Number 18H01057
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Title: Survey of International Students
Creator: Fatema Islam Meem, Imran Hussain Mahdy
Subject: International Students; Academic and Social Integration; University of Idaho
Description: This dataset contains responses from a survey conducted to understand the factors affecting the academic success and social integration of international students at the University of Idaho.
Contributor: University of Idaho
Date: February 10, 2024
Type: Dataset
Format: CSV
Source: Google Form
Language: English
Coverage: University of Idaho, [2014-2024]
Sources: The primary data source was a Google Form survey designed to capture international students' perspectives on their integration into the academic and social fabric of the university. Questions were developed to explore academic challenges, social integration, support systems, and overall satisfaction with their university experience.
Collection Methodology: The survey was distributed with the assistance of the International Programs Office (IPO) at the University of Idaho to ensure a broad reach among the target demographic. Efforts were made to design the survey questions to be clear, concise, and sensitive to the cultural diversity of the respondents. The collection process faced challenges, particularly in achieving a high response rate. Despite these obstacles, approximately 48 responses were obtained, providing valuable insights into the experiences of international students. The survey data were anonymized to protect respondents' privacy and maintain data integrity.
Terms of use: https://www.icpsr.umich.edu/web/ICPSR/studies/33321/terms
The University of Washington - Beyond High School (UW-BHS) project surveyed students in Washington State to examine factors impacting educational attainment and the transition to adulthood among high school seniors. The project began in 1999 in an effort to assess the impact of I-200 (the referendum that ended Affirmative Action) on minority enrollment in higher education in Washington. The research objectives of the project were: (1) to describe and explain differences in the transition from high school to college by race and ethnicity, socioeconomic origins, and other characteristics, (2) to evaluate the impact of the Washington State Achievers Program, and (3) to explore the implications of multiple race and ethnic identities. Following a successful pilot survey in the spring of 2000, the project eventually included baseline and one-year follow-up surveys (conducted in 2002, 2003, 2004, and 2005) of almost 10,000 high school seniors in five cohorts across several Washington school districts. The high school senior surveys included questions that explored students' educational aspirations and future career plans, as well as questions on family background, home life, perceptions of school and home environments, self-esteem, and participation in school related and non-school related activities. To supplement the 2000, 2002, and 2003 student surveys, parents of high school seniors were also queried to determine their expectations and aspirations for their child's education, as well as their own educational backgrounds and fields of employment. Parents were also asked to report any financial measures undertaken to prepare for their child's continued education, and whether the household received any form of financial assistance. In 2010, a ten-year follow-up with the 2000 senior cohort was conducted to assess educational, career, and familial outcomes. 
The ten year follow-up surveys collected information on educational attainment, early employment experiences, family and partnership, civic engagement, and health status. The baseline, parent, and follow-up surveys also collected detailed demographic information, including age, sex, ethnicity, language, religion, education level, employment, income, marital status, and parental status.
130 million Chinese question-text records from primary school to college: 17.63 million K12 questions (including 14.36 million with parses) and 117 million university and vocational questions. The data fields, educational stages, and subject categories are detailed below.
Data content
K-12 test question data + university and vocational test question data
Data size
Total of 17.63 million K12 test questions (including 14.36 million with explanations); total of 117 million university and vocational test questions
Data fields
K12 test questions include fields such as quality, type, grade_band, difficulty, grade, course, answer, and solution; university and vocational test questions include fields such as title and answer
Major categories
For K12 test questions, the educational stages are primary school, junior high school, and senior high school, with subjects including Chinese, mathematics, English, history, geography, politics, biology, physics, chemistry, and science; for university and vocational test questions, the fields include public security, civil service, medicine, foreign languages, academic qualifications, engineering, education, law, economics, vocational, computer science, professional qualifications, and finance.
Question type categories
Multiple-choice, Single-choice, True or false, Fill-in-the-blank, etc.
Storage format
JSONL
Language
Chinese
Data processing
All fields were analyzed, and the content was cleaned
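Since the storage format is JSONL, each record is a single JSON object on its own line. A minimal sketch of reading one such record, using the K12 field names listed above; the values and exact schema here are invented for illustration:

```python
import json

# One hypothetical K12 record using the field names listed above;
# the values are assumptions for illustration only.
line = json.dumps({
    "quality": "high",
    "type": "single-choice",
    "grade_band": "junior high school",
    "difficulty": 2,
    "grade": 8,
    "course": "physics",
    "answer": "B",
    "solution": "Light travels in straight lines, so ...",
}, ensure_ascii=False)

# A JSONL file holds one such object per line; parse each line independently.
record = json.loads(line)
print(record["course"], record["type"])
```

In practice one would iterate over the file line by line, calling `json.loads` on each line rather than loading the whole file at once.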
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation, or a single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j,k), we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all.
For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students who took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.
Two attachments:
· Word file with variable descriptions
· Rdata file with the data set (for the R language)
Appendix 1. The SET questionnaire used for this paper.
Evaluation survey of the teaching staff of [university name]
Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
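The aggregation described above can be sketched in a few lines of pandas; the column names and toy Likert answers below are assumptions for illustration, not the schema of the attached Rdata file:

```python
import pandas as pd

# Toy long-format responses: one row per student answer (illustrative only).
responses = pd.DataFrame({
    "teacher_id":  ["John Smith"] * 4,
    "course_id":   ["Calculus"] * 4,
    "question_no": [2, 2, 2, 2],
    "likert":      [5, 4, 4, 3],   # answers coded 1-5 as in the questionnaire
})

# SET_score_avg(j, k, n): mean Likert answer per (teacher, course, question).
set_score_avg = (
    responses
    .groupby(["teacher_id", "course_id", "question_no"])["likert"]
    .mean()
    .reset_index(name="SET_score_avg")
)
print(set_score_avg)  # one row per observed (j, k, n) triplet
```

With the full response data, this groupby would yield the 8,015 observation rows described above, one per (j, k, n) triplet with at least one student answer.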
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
The data collection process commenced with web scraping of a selected higher education institution's website, collecting any data related to the topic of higher education admissions, during the period from July to September 2023. This resulted in a raw dataset centered on admission-related content. Subsequently, data cleaning and organization procedures were implemented to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. Following this, a comprehensive annotation process was conducted to enrich the dataset with specific admission-related information, transforming it into secondary data. Both primary and secondary data predominantly remained in Indonesian. To enhance data quality, we added filters to remove: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This curation culminated in a finalized dataset, now readily available for research and analysis in the domain of higher education admission.
Computer Science, Education, Marketing, Natural Language Processing
Emny Yossy,Derwin Suhartono,Agung Trisetyarso,Widodo Budiharto
Data Source: Mendeley Data
This dataset contains 10.44 million English-language test questions parsed and structured for large-scale educational AI and NLP applications. Each question record includes the title, answer, explanation (parse), subject, grade level, and question type. Covering a full range of academic stages from primary and middle school to high school and university, the dataset spans core subjects such as English, mathematics, biology, and accounting. The content follows the Anglo-American education system and supports tasks such as question answering, subject knowledge enhancement, educational chatbot training, and intelligent tutoring systems. All data are formatted for efficient machine learning use and comply with data protection regulations including GDPR, CCPA, and PIPL.
32 million science-subject question text parsing and processing records, covering mathematics, physics, chemistry, and biology at primary, middle, and high school and university levels.
6.9 million Chinese multi-disciplinary question text parsing and processing records, covering multiple disciplines in primary school, middle school, high school, and university. Each question contains title, answer, parse, type, subject, and grade. The dataset can be used for large-model subject knowledge enhancement tasks.
Content Multi-disciplinary questions;
Data Size About 6.9 million;
Data Fields Contains title, answer, parse, subject, grade, question type;
Subject categories The subjects for primary school, middle school, high school and university;
Format JSONL;
Language Chinese;
Data processing Subject, questions, parse and answers were analyzed, formula conversion and table format conversion were done, and content was also cleaned
Terms of use: https://www.icpsr.umich.edu/web/ICPSR/studies/2221/terms
The Fall Enrollment survey is conducted annually by the National Center for Education Statistics (NCES) as part of the Integrated Postsecondary Education Data System (IPEDS). The survey collects data that describe the status of student participation in various types of postsecondary institutions. The data are collected by sex for six racial/ethnic categories as defined by the Office for Civil Rights (OCR). There are two parts included in this survey. Part A, Enrollment Summary by Racial/Ethnic Status, provides enrollment data by race/ethnicity and sex and by level and year of study of the student. Part C, Clarifying Questions, supplies information on students enrolled in remedial courses, extension divisions, and branches of schools, as well as numbers of transfer students from in-state, out of state, and other countries.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
This dataset contains a collection of questions and answers for the SAT Subject Tests in World History and US History. Each question is accompanied by its answer options and the correct response.
The dataset includes questions from various topics, time periods, and regions on both World History and US History.
For each question, we extracted:
· id: number of the question
· subject: SAT subject (World History or US History)
· prompt: text of the question
· A: answer A
· B: answer B
· C: answer C
· D: answer D
· E: answer E
· answer: letter of the correct answer to the question
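A short sketch of consuming a record in this field layout; the sample row and CSV framing below are assumptions for illustration:

```python
import csv
import io

# A hypothetical row in the field layout described above (values invented).
sample = io.StringIO(
    "id,subject,prompt,A,B,C,D,E,answer\n"
    '1,World History,"Which empire built Machu Picchu?",'
    "Aztec,Inca,Maya,Olmec,Persian,B\n"
)

reader = csv.DictReader(sample)
row = next(reader)
# The `answer` field stores a letter; use it to look up the option text.
correct_text = row[row["answer"]]
print(row["subject"], "->", correct_text)
```

Resolving the answer letter to its option text in this way is useful when converting the records into prompt/completion pairs for language-model training.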
keywords: answer questions, sat, gpa, university, school, exam, college, web scraping, parsing, online database, text dataset, sentiment analysis, llm dataset, language modeling, large language models, text classification, text mining dataset, natural language texts, nlp, nlp open-source dataset, text data, machine learning
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Webis-Follow-Up-Questions-24 dataset comprises 18,980 simulated continuations of conversational search sessions and 3,883 human judgments of these continuations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The article celebrates the ten years of the question-and-answer website developed at the Reference Center for Teaching Physics (CREF) of the Federal University of Rio Grande do Sul. These are questions and doubts, many of them intriguing and some complex in nature, whose answers are hardly found in the physics literature. The description, evolution, scope, and impact of the site among the Portuguese-speaking community of teachers and students are discussed. Also presented are the ten most accessed questions on the site and, as examples, six complete posts.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions and a composite indicator (c-score). Data are shown separately for career-long and for single recent year impact. Metrics with and without self-citations and ratio of citations to citing papers are given and data on retracted papers (based on Retraction Watch database) as well as citations to/from retracted papers have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023 and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to end of citation year 2023. This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US.
They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases. The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a
10.4 million English test-question text parsing and processing records. Each question contains title, answer, parse, subject, grade, and question type; the educational stages cover primary, middle, and high school and university; subjects include mathematics, biology, accounting, etc. The data are question texts under the Anglo-American education system and can be used to enhance the subject knowledge of large models.
Content
English questions text data under Anglo-American system;
Data Size
About 10.44 million;
Data Fields
Contains title, answer, parse, subject, grade, question type;
Subject categories
Subjects across primary, middle, high school, and university;
Question type categories
Multiple choice, single choice, true/false, fill-in-the-blank, etc.;
Format
JSONL;
Language
English.
Data processing
Subject, questions, parse and answers were analyzed, formula conversion and table format conversion were done, and content was also cleaned
License: https://data.gov.tw/license
Entrance exam questions from the Central Police University's admissions examinations.
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of Student Responses for the survey about Ideal Student Life and key factors that are important in student life.
The dataset can be used to answer key questions such as:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset of the norms associated with 1,364 general knowledge questions answered by 385 Spanish university students.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 23-Jun-2023 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman Institute.
The dataset forms part of three outputs from a project titled ‘Fostering cultures of open qualitative research’ which ran from January 2023 to June 2023:
· Fostering cultures of open qualitative research: Dataset 1 – Survey Responses · Fostering cultures of open qualitative research: Dataset 2 – Interview Transcripts · Fostering cultures of open qualitative research: Dataset 3 – Coding Book
The project was funded with £13,913.85 Research England monies held internally by the University of Sheffield - as part of their ‘Enhancing Research Cultures’ scheme 2022-2023.
The dataset aligns with ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee (ref: 051118) on 23-Jan-2021. This includes due concern for participant anonymity and data management.
ORDA has full permission to store this dataset and to make it open access for public re-use on the basis that no commercial gain will be made from reuse. It has been deposited under a CC-BY-NC license.
This dataset comprises one spreadsheet with N=91 anonymised survey responses in .xlsx format. It includes all responses to the project survey, which was run on Google Forms between 06-Feb-2023 and 30-May-2023. The spreadsheet can be opened with Microsoft Excel, Google Sheets, or open-source equivalents.
The survey responses include a random sample of researchers worldwide undertaking qualitative, mixed-methods, or multi-modal research.
The recruitment of respondents was initially purposive, aiming to gather responses from qualitative researchers at research-intensive (targeted Russell Group) universities. This involved speculative emails and a call for participants on the University of Sheffield 'Qualitative Open Research Network' mailing list. As a result, the responses also include a snowball sample of scholars from elsewhere.
The spreadsheet has two tabs/sheets: one labelled ‘SurveyResponses’ contains the anonymised and tidied set of survey responses; the other, labelled ‘VariableMapping’, sets out each field/column in the ‘SurveyResponses’ tab/sheet against the original survey questions and responses it relates to.
The survey responses tab/sheet includes a field/column labelled ‘RespondentID’ (using randomly generated 16-digit alphanumeric keys) which can be used to connect survey responses to interview participants in the accompanying ‘Fostering cultures of open qualitative research: Dataset 2 – Interview transcripts’ files.
A set of survey questions gathering eligibility criteria and consent responses are not listed within this dataset; they are reproduced below. All responses in the dataset gave a 'Yes' answer to each of the questions below (with the exception of one question, marked with an asterisk (*)):
· I am aged 18 or over.
· I have read the information and consent statement above.
· I understand how to ask questions and/or raise a query or concern about the survey.
· I agree to take part in the research and for my responses to be part of an open access dataset. These will be anonymised unless I specifically ask to be named.
· I understand that my participation does not create a legally binding agreement or employment relationship with the University of Sheffield.
· I understand that I can withdraw from the research at any time.
· I assign the copyright I hold in materials generated as part of this project to The University of Sheffield.
· * I am happy to be contacted after the survey to take part in an interview.
The project was undertaken by two staff:
· Co-investigator: Dr. Itzel San Roman Pineda (ORCiD: 0000-0002-3785-8057; i.sanromanpineda@sheffield.ac.uk), Postdoctoral Research Assistant.
· Principal Investigator (corresponding dataset author): Dr. Matthew Hanchard (ORCiD: 0000-0003-2460-8638; m.s.hanchard@sheffield.ac.uk), Research Associate, iHuman Institute, Social Research Institutes, Faculty of Social Science.