The ReAding Comprehension dataset from Examinations (RACE) is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 questions from English exams for Chinese students aged 12-18. RACE consists of two subsets, RACE-M and RACE-H, drawn from middle school and high school exams, respectively. RACE-M has 28,293 questions and RACE-H has 69,574. Each question has 4 candidate answers, one of which is correct. The data generation process of RACE differs from that of most machine reading comprehension datasets: instead of being generated by heuristics or crowd-sourcing, questions in RACE are written by domain experts specifically to test human reading skills.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper constructs a dataset for Tibetan machine reading comprehension. The data come from the Yunzang website and cover 12 fields: nature, culture, education, geography, history, life, society, art, technology, people, science and sports. The questions and answers were manually entered and annotated by 20 Tibetan professionals. The dataset contains 631 articles, 903 paragraphs, and 2,000 question-and-answer pairs constructed from the paragraphs. Data items mainly include article ID, title, paragraph, question and answer. The publication of this dataset is of great value for promoting the development of Tibetan information processing.
This dataset contains replication materials for the Journal of Educational Psychology paper entitled "Improving Reading Comprehension, Science Domain Knowledge, and Reading Engagement through a First-Grade Content Literacy Intervention." Materials include the dataset and programs to replicate the analyses.
Read Philippines, or Basa Pilipinas, was a four-year early grade reading project that operated from January 2013 to December 2016 and supported the Philippine Department of Education’s national reading program. Basa assisted the implementation of transformative literacy practices in selected divisions of Regions 1 and 7 by providing teacher and student materials, training teachers and school heads, and providing post-training support for Grade 1, 2 and 3 teachers, as well as providing Early Language, Literacy and Numeracy training to kindergarten teachers. The Basa Pilipinas activity used a quasi-experimental cross-sectional design to evaluate the impact of the treatment in improving reading and comprehension skills. Sampling was conducted at three levels: school, classroom, and student. The school sample was drawn randomly from the activity’s five provinces. Within each school, one grade 2 classroom was selected randomly for baseline and midline, with an additional grade 3 classroom selected during the endline. Within each classroom, students were randomly selected to be administered the assessment. A total of 469 students were sampled from 40 schools in two provinces at the baseline (comparison), 1,216 students were sampled from 80 schools in five provinces at the midline (intervention 1), and 1,658 students were sampled from five provinces at the endline (intervention 2). The disparity in the number of provinces sampled is due to the expansion of the intervention from two provinces to five provinces starting at the midline, to provide a more complete picture of the Basa outcomes. To enable the computation of estimates of literacy skills among students in all schools affected by the Basa intervention, design weights were applied to the analyses of EGRA data. Design weights were applied to compensate for differences in provincial sampling and to ensure an appropriate representation of learners in all provinces in the sample.
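The description above does not state how the Basa design weights were computed. As a minimal sketch of the general idea only, the Python below assumes a simple inverse-probability weight (learners in a province divided by learners sampled there) and uses it in a weighted mean. The function names `design_weight` and `weighted_mean` and all numbers are hypothetical illustrations, not values from the Basa data.

```python
from typing import Sequence


def design_weight(population_size: int, sample_size: int) -> float:
    """Inverse-probability weight: how many learners each sampled learner represents.

    Assumption: simple random sampling within a province. The actual Basa
    weighting scheme is not described in the source.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return population_size / sample_size


def weighted_mean(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Return sum(w_i * y_i) / sum(w_i)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * y for w, y in zip(weights, scores)) / total_weight


# Hypothetical example: two provinces with equal sample sizes but different
# numbers of learners, so the larger province is under-represented in the sample.
w_small = design_weight(population_size=200, sample_size=2)  # 100.0
w_large = design_weight(population_size=800, sample_size=2)  # 400.0

scores = [40, 60, 20, 30]  # first two from the small province, last two from the large one
weights = [w_small, w_small, w_large, w_large]

print(sum(scores) / len(scores))       # unweighted mean: 37.5
print(weighted_mean(scores, weights))  # weighted mean: 30.0
```

In this toy case the weighted estimate moves toward the scores of the larger province, which is the kind of correction for unequal provincial sampling that the text describes.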
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
César E. Corona-González, Claudia Rebeca De Stefano-Ramos, Juan Pablo Rosado-Aíza, Fabiola R Gómez-Velázquez, David I. Ibarra-Zarate, Luz María Alonso-Valerdi
César E. Corona-González
https://orcid.org/0000-0002-7680-2953
a00833959@tec.mx
Psychophysiological data from Mexican children with learning difficulties who strengthen reading and math skills by assistive technology
2023
The current dataset consists of psychometric and electrophysiological data from children with reading or math learning difficulties. These data were collected to evaluate improvements in reading or math skills resulting from using an online learning method called Smartick.
The psychometric evaluations of children with reading difficulties comprised spelling tests, in which 1) orthographic and 2) phonological errors were counted, 3) reading speed, expressed in words read per minute, and 4) reading comprehension, assessed with multiple-choice questions. The last two parameters were determined according to the standards of the Ministry of Public Education (Secretaría de Educación Pública in Spanish) in Mexico. The assessments for the math difficulties group comprised 1) an assessment of general mathematical knowledge, 2) the percentage of hits, and 3) the reaction time in an arithmetic task. Additionally, selective attention and intelligence quotient (IQ) were evaluated.
Then, individuals underwent an EEG experimental paradigm in which two conditions were recorded: 1) a 3-minute eyes-open resting state and 2) performing either reading or mathematical activities. EEG recordings from the reading experiment consisted of reading a text aloud and then answering questions about the text. EEG recordings from the math experiment involved solving two blocks of 20 arithmetic operations (addition and subtraction). Subsequently, each child was randomly assigned to either 1) the experimental group, who were asked to engage with Smartick for three months, or 2) the control group, who did not take part in the intervention. Once the 3-month period was over, every child was reassessed as described above.
The dataset contains a total of 76 subjects (sub-) from two study groups: 1) reading difficulties (R) and 2) math difficulties (M). Each individual was further assigned to the experimental subgroup (e), in which children committed to engaging with Smartick, or the control subgroup (c), in which they did not take part in any intervention.
Every subject was followed for three months. During this period, each subject underwent two EEG sessions, representing the PRE-intervention (ses-1) and the POST-intervention (ses-2).
The EEG recordings from the reading difficulties group were collected during a resting-state condition (run-1) and while the children performed active reading and reading comprehension activities (run-2). EEG data from the math difficulties group were collected during a resting-state condition (run-1) and while the children solved two blocks of 20 arithmetic operations (run-2 and run-3). All EEG files were stored in .set format. The nomenclature and description of the filenames are shown below:
Nomenclature | Description |
---|---|
sub- | Subject |
M | Math group |
R | Reading group |
c | Control subgroup |
e | Experimental subgroup |
ses-1 | PRE-intervention |
ses-2 | POST-intervention |
run-1 | EEG for baseline |
run-2 | EEG for reading activity, or the first block of math |
run-3 | EEG for the second block of math |
Example: the file sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set corresponds to:
- The 11th subject from the reading difficulties group, control subgroup (sub-Rc11).
- EEG recording from the PRE-intervention (ses-1) while performing the reading activity (run-2).
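The nomenclature above can be decoded programmatically. The following Python sketch parses a filename into its components; the regular expression is inferred from the nomenclature table and the single example filename, so it is an assumption that every file in the dataset follows exactly this pattern. The names `parse_eeg_filename` and `EEGFile` are hypothetical helpers, not part of the dataset.

```python
import re
from typing import NamedTuple

# Pattern inferred from the nomenclature table and the example filename.
# Assumption: all EEG files follow this layout.
FILENAME_RE = re.compile(
    r"^sub-(?P<group>[MR])(?P<subgroup>[ce])(?P<number>\d+)"
    r"_ses-(?P<session>[12])"
    r"_task-(?P<task>[A-Za-z0-9]+)"
    r"_run-(?P<run>[123])_eeg\.set$"
)

GROUPS = {"M": "math", "R": "reading"}
SUBGROUPS = {"c": "control", "e": "experimental"}
SESSIONS = {"1": "PRE-intervention", "2": "POST-intervention"}


class EEGFile(NamedTuple):
    group: str      # "math" or "reading"
    subgroup: str   # "control" or "experimental"
    number: int     # subject number within the group/subgroup label
    session: str    # "PRE-intervention" or "POST-intervention"
    run: int        # 1 = baseline, 2 = reading or first math block, 3 = second math block


def parse_eeg_filename(name: str) -> EEGFile:
    """Decode a filename such as sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return EEGFile(
        group=GROUPS[match["group"]],
        subgroup=SUBGROUPS[match["subgroup"]],
        number=int(match["number"]),
        session=SESSIONS[match["session"]],
        run=int(match["run"]),
    )


info = parse_eeg_filename("sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set")
print(info.group, info.subgroup, info.number, info.session, info.run)
```

A helper like this makes it straightforward to group recordings by study group, subgroup, session, or run before loading the .set files.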
Psychometric data from the reading difficulties group:
Psychometric data from the math difficulties group:
Psychometric data can be found in the 01_Psychometric_Data.xlsx file
Engagement percentage can be found in the 05_SessionEngagement.xlsx file
Seventy-six Mexican children between 7 and 13 years old were enrolled in this study.
The sample was recruited through non-profit foundations that support learning and foster care programs.
g.USBamp RESEARCH amplifier
The stimuli nested folder contains all stimuli employed in the EEG experiments.
Level 1
- Math: Images used in the math experiment.
- Reading: Images used in the reading experiment.
Level 2
- Math
* POST_Operations: arithmetic operations from the POST-intervention.
* PRE_Operations: arithmetic operations from the PRE-intervention.
- Reading
* POST_Reading1: text 1 and text-related comprehension questions from the POST-intervention.
* POST_Reading2: text 2 and text-related comprehension questions from the POST-intervention.
* POST_Reading3: text 3 and text-related comprehension questions from the POST-intervention.
* PRE_Reading1: text 1 and text-related comprehension questions from the PRE-intervention.
* PRE_Reading2: text 2 and text-related comprehension questions from the PRE-intervention.
* PRE_Reading3: text 3 and text-related comprehension questions from the PRE-intervention.
Level 3
- Math
* Operation01.jpg to Operation20.jpg: arithmetical operations solved during the first block of the math
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study examined the role of science domain knowledge, reading motivation, and decoding skills in reading comprehension achievement in third-grade students who are English learners (ELs) and students who are monolingual, using a nationally representative data set. Multigroup probit regression analyses showed that third-grade science domain knowledge and motivation for reading, decoding skills, and early attainment of decoding skills were significantly associated with third-grade reading comprehension in both language groups. Also, using Wald chi-square tests, the study showed that the association between third-grade science domain knowledge and reading comprehension was stronger in students who were ELs than in students who were monolingual. These findings suggest that cultivating science domain knowledge is very important to supporting reading comprehension development in third grade, particularly for students who are ELs.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Transitioning from learning to read to reading to learn is a central aim of primary school education (Chall, 1983). However, while this aim is shared across borders, countries differ with regard to the conditions that facilitate or hinder it. These conditions encompass extracurricular learning environments and resources, socio-economic backgrounds, home languages, migratory backgrounds and the parents’ reading appreciation (e.g. El-Khechen et al., 2016; Kieffer, 2012; Kigel et al., 2015). Those determinants are also relevant in explaining students’ school entry reading skills, which, in turn, predict students’ reading literacy in primary school (Cameron et al., 2023; Claessens et al., 2009; Duncan et al., 2007). However, it is unclear at which point in a student’s life these effects are most impactful, which must be known in order to recommend interventions in a timely and effective manner.
Therefore, in this study we attempt to investigate to what extent factors that are primarily time-invariant for the students affect both the students’ reading skills at school entry and their reading literacy in fourth grade. At the core, we investigate to what extent differences due to time-invariant variables affect students’ rate of learning to read throughout primary school, or whether pre-school differences in reading competence accumulate during primary school. Put differently, we investigate whether differences in reading competence start early and stay the same (accumulate) or whether they start early and then widen (affecting the learning rate).
Building upon Carroll’s (1963) concept of time-on-task within the model of school learning, we investigate the effects of students’ language at home (El-Khechen et al., 2016). According to this theory, language skills improve with the time invested in learning a language, for example when students can practice the language at home. Since students whose home language is different from the test language have less time-on-task in learning the test language, we assume that the home language affects the reading competence at school entry. Furthermore, because the home language persists throughout primary school and thus the time-on-task effects persist, we also assume that the home language affects the reading literacy in fourth grade.
We consider the parents’ reading appreciation within a social learning theory (SLT; Bandura, 1977) context. SLT posits that students learn by observing and imitating the behavior of adults. Children of parents that highly appreciate reading are more likely to observe their parents while reading and attempt to imitate that behavior. Hence, parents’ reading appreciation is a persistent factor for students, both before and during primary school, and may thus affect both the reading competence at school entry and the reading literacy in fourth grade.
In addition, we investigate the effects of household possessions (see Avvisati, 2020) as a persistent environmental factor under the framework of resource deprivation. Students from family backgrounds with few household possessions may lack the cultural resources for a home learning environment that is conducive to learning to read before entering primary school, and the economic resources to provide support to struggling learners during primary school (see Kieffer, 2012).
We use structural equation modelling on secondary data from the Progress in International Reading Literacy Study (PIRLS) with N = 177,386 fourth grade students from 17 European countries, gathered in 2016 and 2021, to investigate whether effects differ between countries and if they are stable even in changing educational circumstances, such as the COVID-19 pandemic (see Gee et al., 2023; Werner & Woessmann, 2023).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Learning Agency Lab’s data science competition, “The Quest for Quality Questions: Improving Reading Comprehension through Automated Question Generation,” was designed to build AI algorithms that can automatically generate questions that test young learners’ reading comprehension.
As many educators and researchers know, questions are key in teaching and evaluating narrative comprehension skills in young learners. However, generating high-quality reading comprehension queries is time consuming, which limits the number of texts that young readers can engage with in this way. Datasets can help by informing quality question automation.
The Quest challenge dataset can be accessed on this page and was aided by foundational data from the Lab’s FairytaleQA dataset of 10,580 questions. Those queries were created to address gaps in similar datasets, which often overlooked fine reading skills that showcased an understanding of varying narrative elements.
The Quest was made possible by The Learning Agency Lab, Mark Warschauer at UC Irvine, and Ying Xu at The University of Michigan School of Education. More can be found about the creators here.
Quest dataset © 2024 by The Learning Agency Lab is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
Competition - https://www.thequestchallenge.org/
Publications - Xu, Y., Wang, D., Yu, M., Ritchie, D., Yao, B., Wu, T., ... & Warschauer, M. (2022). Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension. arXiv preprint arXiv:2203.13947.
Longitudinal data of language and cognitive skills of Chinese children followed from Grade 1 to Grade 3
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Will all children be able to read by 2030? The ability to read with comprehension is a foundational skill that every education system around the world strives to impart by late in primary school—generally by age 10. Moreover, attaining the ambitious Sustainable Development Goals (SDGs) in education requires first achieving this basic building block, and so does improving countries’ Human Capital Index scores. Yet past evidence from many low- and middle-income countries has shown that many children are not learning to read with comprehension in primary school. To understand the global picture better, we have worked with the UNESCO Institute for Statistics (UIS) to assemble a new dataset with the most comprehensive measures of this foundational skill yet developed, by linking together data from credible cross-national and national assessments of reading. This dataset covers 115 countries, accounting for 81% of children worldwide and 79% of children in low- and middle-income countries. The new data allow us to estimate the reading proficiency of late-primary-age children, and we also provide what are among the first estimates (and the most comprehensive, for low- and middle-income countries) of the historical rate of progress in improving reading proficiency globally (for the 2000-17 period). The results show that 53% of all children in low- and middle-income countries cannot read age-appropriate material by age 10, and that at current rates of improvement, this “learning poverty” rate will have fallen only to 43% by 2030. Indeed, we find that the goal of all children reading by 2030 will be attainable only with historically unprecedented progress. The high rate of “learning poverty” and slow progress in low- and middle-income countries is an early warning that all the ambitious SDG targets in education (and likely of social progress) are at risk. 
Based on this evidence, we suggest a new medium-term target to guide the World Bank’s work in low- and middle-income countries: cut learning poverty by at least half by 2030. This target, together with improved measurement of learning, can serve as an evidence-based tool to accelerate progress toward getting all children reading by age 10.
For further details, please refer to https://thedocs.worldbank.org/en/doc/e52f55322528903b27f1b7e61238e416-0200022022/original/Learning-poverty-report-2022-06-21-final-V7-0-conferenceEdition.pdf
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This article aims at profiling and comparing good readers (GR) and readers with poor comprehension (PRC). Through a word reading task and a reading comprehension task, 49 good readers and 37 poor readers were identified among 336 8th-grade students in public schools in southern Brazil; the intermediate group was not included in the analysis. The profiles were investigated using a self-completion written questionnaire. The results showed a difference in the groups’ reading experience and a positive correlation between reading comprehension performance and the number of books read by the students in a year. The study shows that research on reading habits might help to explain the differences between good and poor readers’ profiles. Future research may improve the instrument and expand it to guide not only theoretical studies but also practical studies of clinical and pedagogical intervention.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Datasets for the thesis titled "Monitoring and Regulation in Reading Comprehension among Chinese children: Exploring the Role of Comprehension Monitoring, Lexical Ambiguity Resolution and Reading Strategy". The thesis comprised three studies and aimed to provide a more comprehensive view of monitoring and regulation in reading and to examine their contributions to reading comprehension. The dataset for Study 1 contains cross-sectional and longitudinal data from junior primary school children on their general cognitive abilities, reading comprehension performance and cognitive-linguistic skills. The datasets for Studies 2 and 3 contain eye-movement data and data on the general cognitive abilities and reading-related cognitive-linguistic skills of grade 3 children.
Please refer to the README file for details of the datasets.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes the key data analysed and discussed in the paper titled 'The Effects of Choice on the Reading Comprehension and Enjoyment of Children with Severe Inattention and no Attentional Difficulties' published in the Research on Child and Adolescent Psychopathology journal. Data were collected as part of a larger PhD study of Myrofora Kakoulidou, under the supervision of Professor Jane Hurry, Dr Frances Le Cornu Knight and Dr Roberto Filippi.
Variables included
- Codenumber = Participant ID
- Sex = Biological sex
- SEN status = Whether children had (or did not have) an SEN statement or an Education and Health Care plan
- Reading Motivation Questions = MRQ items (38 items)
- MRQ_NoMissing = Total scores on MRQ after imputation
- SentenceCompletion_NGRT = Children's answers to the Sentence Completion items of the NGRT (20 items)
- TextQuestions_NGRT = Children's answers to the Reading Comprehension questions across the three passages (28 items)
- NewGroupReadingScores = Total scores across the 48 NGRT items
- NGRT_Standardised = Raw scores converted to standardised scores by age
- Readingscores_Choice = Children's reading scores in the Choice condition
- Readingscores_NoChoice = Children's reading scores in the No Choice condition
- Readingdifference_Final = Difference scores for Reading Comprehension (Reading comprehension scores in the No Choice condition subtracted by reading comprehension scores in the Choice condition)
- Readingdenjoyment_Final = Difference scores for Reading Enjoyment (Reading Enjoyment scores in the No Choice condition subtracted by reading enjoyment scores in the Choice condition)
- Enjoyment_NoChoice = Children's enjoyment scores in the No Choice condition
- Enjoyment_Choice = Children's enjoyment scores in the Choice condition
- TeacherConnersItems = Teachers' ratings on the Conners 3 scale (39 items in total, short version)
- TeacherRatedInattention = Teachers' ratings of children's inattention
- OmissionErrors = Raw scores on Omission errors in AULA
- RTV (Reaction Time Variability) = Raw scores on RTV in AULA
- TeacherRatedInattention_Trichotomised = Trichotomised scores on Teacher-rated Inattention
- OmissionErrors_Trichotomised = Trichotomised scores on Omission errors
- RTV_Trichotomised = Trichotomised scores on RTV
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note. Ns are for individuals.*p
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The present study examined the reading comprehension performance of children with Autism Spectrum Disorder (ASD) and compared it with that of their typically developing (TD) counterparts to identify variables that may relate to their reading comprehension performance. A series of six tests, namely a tone identification test, the Embedded Figures Test (EFT), a homograph reading test, a homophone test, a music test and a reading test, was conducted with native Cantonese ASD participants with and without musical training and their corresponding TD groups. The results are recorded in the file.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the NarrativeQA dataset. It includes the list of documents with Wikipedia summaries, links to full stories, and questions and answers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of this study was to examine the differences in foreign language reading comprehension among students with high, middle, and low ambiguity tolerance.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We propose a Chinese machine reading comprehension dataset for the security field (SecMRC), which addresses the lack of professional data to support machine reading comprehension research in this field. The dataset contains 2100 anti-terrorism and security domain news articles, 7300 extractive question-answer pairs, 2100 generative Q&A pairs, and a total of 4796264 characters. Tests were conducted on SecMRC using advanced reading comprehension models. The results show that the F1 of the extraction task reaches 72.5% and the average ROUGE-L of the generative task is 37.8%, both of which are significantly below the human level. SecMRC emphasizes domain knowledge and is difficult and challenging. It can effectively support research on machine reading comprehension technology in this field. The dataset construction method is general and can be extended to other professional fields.
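The description reports an F1 score for the extraction task but does not say how it was computed. As a hedged illustration only, the Python sketch below shows a character-overlap F1 of the kind commonly used for extractive question answering on Chinese text; the function name `char_f1` is hypothetical, and whether SecMRC uses exactly this definition is an assumption.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 between a predicted answer span and a reference answer.

    Assumption: characters are the comparison unit (a common choice for Chinese);
    the SecMRC evaluation may tokenize or normalize differently.
    """
    pred_chars = list(prediction)
    ref_chars = list(reference)
    if not pred_chars or not ref_chars:
        # Both empty counts as a perfect match; exactly one empty scores zero.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts each shared character at most min(count) times.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


print(char_f1("abc", "abc"))   # identical answers
print(char_f1("ab", "abcd"))   # prediction covers half of the reference
print(char_f1("xy", "ab"))     # no overlap
```

A dataset-level score would typically average this value over all questions, taking the maximum over reference answers when several are provided.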
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The description of each dataset is given below. Study01 assessed bilingual children in Grades 1 to 3 to identify the cognitive and linguistic skills that predict early reading comprehension. The data include all the participants' scores on different cognitive and linguistic measures. Study02 identified different subgroups of early bilingual readers using their word reading, oral language, and reading fluency skills. The data include the participants' scores on different linguistic measures in their first and second languages. Study03 determined the efficacy of a multi-component oral language intervention for bilingual children in Grades 1 to 3. The data include scores on pretest and posttest measures for the control and intervention groups.
The Chinese judicial reading comprehension (CJRC) dataset contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents and the questions are annotated by law experts.