7 datasets found
  1. Sigma Dolphin Filtered and Cleaned

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Cite
    Ryan Mutiga (2024). Sigma Dolphin Filtered and Cleaned [Dataset]. https://www.kaggle.com/datasets/ryanmutiga/sigma-dolphin-filtered-and-cleaned
    Explore at:
    zip (60569 bytes)
    Dataset updated
    Jun 25, 2024
    Authors
    Ryan Mutiga
    Description

    Dataset Description for Filtered Sigma Dolphin Dataset

    Overview

    This dataset is a cleaned and filtered version of the Sigma Dolphin dataset (https://www.kaggle.com/datasets/saurabhshahane/sigmadolphin), designed to aid in solving maths word problems using AI techniques. This was used as an effort towards taking part in the AI Mathematical Olympiad - Progress Prize 1 (https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/overview). The dataset was processed using TF-IDF vectorisation and K-means clustering, specifically targeting questions relevant to the AIME (American Invitational Mathematics Examination) and AMC 12 (American Mathematics Competitions).

    Context

    The Sigma Dolphin dataset is a project initiated by Microsoft Research Asia, aimed at building an intelligent system with natural language understanding and reasoning capacities to automatically solve maths word problems written in natural language. This project began in early 2013, and the dataset includes maths word problems from various sources, including community question-answering sites like Yahoo! Answers.

    Source and Original Dataset Details

    Content

    The filtered dataset includes problems that are relevant for preparing for maths competitions such as AIME and AMC. The data is structured to facilitate the training and evaluation of AI models aimed at solving these types of problems.

    Datasets:

    There are several filtered versions of the dataset based on different similarity thresholds (0.3 and 0.5). These thresholds were used to determine the relevance of problems from the original Sigma Dolphin dataset to the AIME and AMC problems.

    1. Number Word Problems Filtered at 0.3 Threshold:

      • File: number_word_test_filtered_0.3_Threshold.csv
      • Description: Contains problems filtered with a similarity threshold of 0.3, ensuring moderate relevance to AIME and AMC 12 problems.
    2. Number Word Problems Filtered at 0.5 Threshold:

      • File: number_word_std.test_filtered_0.5_Threshold.csv
      • Description: Contains problems filtered with a higher similarity threshold of 0.5, ensuring higher relevance to AIME and AMC 12 problems.
    3. Filtered Number Word Problems 2 at 0.3 Threshold:

      • File: filtered_number_word_problems2_Threshold.csv
      • Description: Another set of problems filtered at a 0.3 similarity threshold.
    4. Filtered Number Word Problems 2 at 0.5 Threshold:

      • File: filtered_number_word_problems_Threshold.csv
      • Description: Another set of problems filtered at a 0.5 similarity threshold.

    Why Different Similarity Thresholds?

    Different similarity thresholds (0.3 and 0.5) are used to provide flexibility in selecting problems based on their relevance to AIME and AMC problems. A lower threshold (0.3) includes a broader range of problems, ensuring a diverse set of questions, while a higher threshold (0.5) focuses on problems with stronger relevance, offering a more targeted and precise dataset. This allows users to choose the level of specificity that best fits their needs.
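The effect of the two thresholds can be illustrated with a small, self-contained sketch. It assumes the relevance score is the cosine similarity between TF-IDF vectors, which the description does not state explicitly; the function names and the toy problems are hypothetical and are not taken from the author's notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_by_similarity(candidates, references, threshold):
    """Keep candidates whose best similarity to any reference >= threshold."""
    vectors = tfidf_vectors(candidates + references)
    cand_vecs = vectors[:len(candidates)]
    ref_vecs = vectors[len(candidates):]
    return [c for c, cv in zip(candidates, cand_vecs)
            if max((cosine(cv, rv) for rv in ref_vecs), default=0.0) >= threshold]
```

With a single competition-style reference problem, a loosely related candidate may pass at 0.3 and be dropped at 0.5, which is the trade-off between breadth and precision described above.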

    For a detailed explanation of the preprocessing and filtering process, please refer to the Sigma Dolphin Filtered & Cleaned Notebook.

    Acknowledgements

    We extend our gratitude to all the original authors of the Sigma Dolphin dataset and the creators of the AIME and AMC problems. This project leverages the work of numerous researchers and datasets to build a comprehensive resource for AI-based problem solving in mathematics.

    Usage

    This dataset is intended for research and educational purposes. It can be used to train AI models for natural language processing and problem-solving tasks, specifically targeting maths word problems in competitive environments like AIME and AMC.

    Licensing

    This dataset is shared under the Computational Use of Data Agreement v1.0.


  2. GSM8K - Grade School Math 8K dataset for LLM

    • kaggle.com
    zip
    Updated May 21, 2024
    Cite
    Johnson chong (2024). GSM8K - Grade School Math 8K dataset for LLM [Dataset]. https://www.kaggle.com/datasets/johnsonhk88/gsm8k-grade-school-math-8k-dataset-for-llm
    Explore at:
    zip (5156809 bytes)
    Dataset updated
    May 21, 2024
    Authors
    Johnson chong
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary: GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the final answer. A bright middle school student should be able to solve every problem: from the paper, "Problems require no concepts beyond the level of early Algebra, and the vast majority of problems can be solved without explicitly defining a variable." Solutions are provided in natural language, as opposed to pure math expressions. From the paper: "We believe this is the most generally useful data format, and we expect it to shed light on the properties of large language models’ internal monologues"
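Because solutions are written in natural language, a model's output usually has to be parsed before it can be scored. The sketch below assumes the layout of the original GSM8K release, where intermediate calculations are annotated as `<<...>>` and the final answer follows a `####` marker; this Kaggle copy may be packaged differently, and the helper names are hypothetical.

```python
import re


def extract_final_answer(solution):
    """Return the number after the '####' marker as a string, or None."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators


def strip_calculator_annotations(solution):
    """Remove '<<expr=value>>' annotations, leaving the readable text."""
    return re.sub(r"<<.*?>>", "", solution)


# Illustrative solution text in the assumed format.
example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether.\n"
    "#### 72"
)
final_answer = extract_final_answer(example)
readable = strip_calculator_annotations(example)
```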

  3. Student Performance Data Set

    • kaggle.com
    zip
    Updated Mar 27, 2020
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Explore at:
    zip (12353 bytes)
    Dataset updated
    Mar 27, 2020
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. These data address student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

  4. Secondary School Science and Mathematics Teachers’ Self-Efficacy Beliefs...

    • data.mendeley.com
    Updated Nov 5, 2025
    Cite
    Thumah Mapulanga (2025). Secondary School Science and Mathematics Teachers’ Self-Efficacy Beliefs Regarding Learner-centred Instructional Practices [Dataset]. http://doi.org/10.17632/dvvdvw22mn.4
    Explore at:
    Dataset updated
    Nov 5, 2025
    Authors
    Thumah Mapulanga
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset reports on secondary school science and mathematics teacher self-efficacy beliefs to enact learner-centred instructional practices. This is based on the fact that teachers' self-efficacy beliefs and their ability to enact learner-centered instructional practices have been associated with effective teaching and improved learner academic performance. The data were collected from 257 participants using an online-based questionnaire that was distributed through WhatsApp platforms of participants enrolled in science or mathematics education programmes at three universities in Zambia. The data were collected from April to July 2025. The dataset consists of raw data in a Microsoft Excel file 'Raw data_Teachers self-efficacy beliefs about learner centred instructional practices' and a Microsoft Word document 'Analysed Data-Teachers' self-efficacy'. The raw data (sheet 1) shows participants’ demographic data (D1 to D7) and self-efficacy beliefs to enact learner-centred instructional practices SE1 to SE25. The analysed data contains results presented in Tables 1 to 8. There are two main reasons for presenting the analysed data in sheet 3 and this paper. To: (1) establish the reliability of the questionnaire survey, and (2) determine the level of the participants’ self-efficacy beliefs/confidence level to use learner-centred instructional practices, and the role of demographic characteristics on teachers’ self-efficacy beliefs to enact the learner-centred instructional practices. The data were analysed using the SPSS version 25 by computing descriptive and inferential statistics. The findings revealed that the questionnaire was very reliable and therefore can be used to measure secondary school science and mathematics teachers’ self-efficacy beliefs about enacting learner-centred instructional practices. 
    Furthermore, the findings on the level of teachers' self-efficacy beliefs indicate that teachers report very high self-efficacy beliefs to enact learner-centred instructional practices. The results show statistically significant differences in teachers' self-efficacy beliefs between in-service and pre-service teachers. However, no statistically significant differences were found based on gender, specialisation, teaching experience, highest qualifications, and qualifications being pursued. The data may provide insight into the teachers’ self-efficacy beliefs regarding the use of learner-centred instructional practices as measured in the questionnaire. The data may also provide a basis for planning large-scale studies targeted at improving the teaching and learning of science and mathematics. These data should be interpreted in the context of secondary school science and mathematics teachers in a developing country, Zambia. However, insights from the data contribute to the global discourse on self-efficacy beliefs in science and mathematics teaching.

  5. Evaluation through follow-up - pupils born in 1953

    • researchdata.se
    Updated Aug 15, 2024
    Cite
    Kjell Härnqvist; Sven-Erik Reuterberg; Allan Svensson; Airi Rovio-Johansson (2024). Evaluation through follow-up - pupils born in 1953 [Dataset]. https://researchdata.se/en/catalogue/dataset/snd0480-2
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    University of Gothenburg
    Authors
    Kjell Härnqvist; Sven-Erik Reuterberg; Allan Svensson; Airi Rovio-Johansson
    Time period covered
    1966 - 1973
    Area covered
    Sweden
    Description

    Since the beginning of the 1960s, Statistics Sweden, in collaboration with various research institutions, has carried out follow-up surveys in the school system. These surveys have taken place within the framework of the IS project (Individual Statistics Project) at the University of Gothenburg and the UGU project (Evaluation through follow-up of students) at the University of Teacher Education in Stockholm, which since 1990 have been merged into a research project called 'Evaluation through Follow-up'. The follow-up surveys are part of the central evaluation of the school and are based on large nationally representative samples from different cohorts of students.

    Evaluation through follow-up (UGU) is one of the country's largest research databases in the field of education. UGU is part of the central evaluation of the school and is based on large nationally representative samples from different cohorts of students. The longitudinal database contains information on nationally representative samples of school pupils from ten cohorts, born between 1948 and 2004. The sampling process was based on the student's birthday for the first two and on the school class for the other cohorts.

    For each cohort, data of mainly two types are collected. School administrative data is collected annually by Statistics Sweden during the time that pupils are in the general school system (primary and secondary school), for most cohorts starting in compulsory school year 3. This information is provided by the school offices and, among other things, includes characteristics of school, class, special support, study choices and grades. Information obtained has varied somewhat, e.g. due to changes in curricula. A more detailed description of this data collection can be found in reports published by Statistics Sweden and linked to datasets for each cohort.

    Survey data from the pupils is collected for the first time in compulsory school year 6 (for most cohorts). The year 6 questionnaire includes questions related to self-perception and interest in learning, attitudes to school, hobbies, school motivation and future plans. For some cohorts, questionnaire data are also collected in year 3 and year 9 in compulsory school and in upper secondary school.

    Furthermore, results from various intelligence tests and standardized knowledge tests are included in the year 6 data collection. The intelligence tests have been identical for all cohorts (except the cohort born in 1987, from which questionnaire data were first collected in year 9). The intelligence test consists of a verbal, a spatial and an inductive test, each containing 40 tasks and specially designed for the UGU project. The verbal test is a vocabulary test of the opposite type. The spatial test is a so-called ‘sheet metal folding test’ and the inductive test is made up of series of numbers. The reliability of the test, intercorrelations and connection with school grades are reported by Svensson (1971).

    For the first three cohorts (1948, 1953 and 1967), the standardized knowledge tests in year 6 consist of the standard tests in Swedish, mathematics and English that, up to and including the beginning of the 1980s, were offered to all pupils in compulsory school year 6. For the 1972 cohort, specially prepared tests in reading and mathematics were used. The reading test consists of 27 tasks and aimed to identify students with reading difficulties. The mathematics test, which was also offered to the fifth cohort (1977), includes 19 assignments. A revised version of the test was used for the cohort born in 1982, because the earlier test was judged to be somewhat too simple. Results on the mathematics test are not available for the 1987 cohort. The mathematics test was not offered to the students in the 1992 cohort, as the test did not seem to fully correspond with current curriculum intentions in mathematics. For further information, see the description of the dataset for each cohort.

    For several of the samples, questionnaires were also collected from the students' parents and teachers in year 6. The teacher questionnaire contains questions about the teacher, class size and composition, the teacher's assessments of the class's knowledge level, etc., school resources, working methods and parental involvement and questions about the existence of evaluations. The questionnaire for the guardians includes questions about the child's upbringing conditions, ambitions and wishes regarding the child's education, views on the school's objectives and the parents' own educational and professional situation.

    The students are followed up even after they have left primary school. Among other things, data are collected while they are in high school, including school administrative data such as choice of upper secondary school line/program and grades after completing studies. For some of the cohorts, in addition to school administrative data, questionnaire data were also collected from the students.

    The sample consisted of students born on the 5th, 15th and 25th of any month in 1953, a total of 10,723 students.
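The birthday-based sampling rule is simple enough to express as a membership test. The following is a minimal sketch with hypothetical function names; the expected fraction assumes births are spread evenly over the year, which is a simplification.

```python
from datetime import date

SAMPLE_DAYS = {5, 15, 25}  # days of the month named in the description


def in_1953_sample(birth_date):
    """True if a pupil with this birth date belongs to the 1953 sample."""
    return birth_date.year == 1953 and birth_date.day in SAMPLE_DAYS


def expected_sampling_fraction(year=1953):
    """Share of calendar days in the year that fall on a sampled day."""
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    return 12 * len(SAMPLE_DAYS) / days_in_year
```

Under that even-births assumption, the rule selects 36 of the 365 days in 1953, or roughly one pupil in ten.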

    The data obtained in 1966 were: 1. School administrative data (school form, class type, year and grades). 2. Information about the parents' profession and education, number of siblings, the distance between home and school, etc.

    This information was collected for 93% of all pupils born on those days. The shortfall is due to reduced resources at Statistics Sweden for follow-up work (reminders etc.). Annual data for the 1953 cohort were collected by Statistics Sweden up to and including academic year 1972/73.

    1. Answers to certain questions that shed light on students' school motivation, leisure activities and study and career plans. Some of the questions changed significantly compared to the cohort in 1948 due to the fact that they did not function satisfactorily from a metrological point of view.
    2. Results on three aptitude tests, one verbal, one spatial and one inductive.
    3. Standard test results in reading, writing, mathematics and English, which were offered to the students who belonged to year 6.

    The response rate for test and questionnaire data is 88%. Standard test results were received for just over 85% of those who took the tests.

    The sample included a total of 9955 students, for whom some form of information was obtained.

    Part of the "Individual Statistics Project" together with cohort 1953.

  6. MCLM

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    GUIJIN SON (2025). MCLM [Dataset]. https://huggingface.co/datasets/amphora/MCLM
    Dataset updated
    Feb 20, 2025
    Authors
    GUIJIN SON
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Multilingual Competition Level Math (MCLM)

    Link to Paper: https://arxiv.org/abs/2502.17407

    Overview: MCLM is a benchmark designed to evaluate advanced mathematical reasoning in a multilingual context. It features competition-level math problems across 55 languages, moving beyond standard word problems to challenge even state-of-the-art large language models.

      Dataset Composition
    

    MCLM is constructed from two main types of reasoning problems:

    Machine-translated… See the full description on the dataset page: https://huggingface.co/datasets/amphora/MCLM.

  7. Southern and Eastern Africa Consortium for Monitoring Educational Quality...

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    Southern and Eastern Africa Consortium for Monitoring Educational Quality (2019). Southern and Eastern Africa Consortium for Monitoring Educational Quality 2000 - Namibia [Dataset]. https://datacatalog.ihsn.org/catalog/4713
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Southern and Eastern Africa Consortium for Monitoring Educational Quality
    Time period covered
    2000
    Area covered
    Namibia
    Description

    Abstract

    In 1991 the International Institute for Educational Planning (IIEP) and a number of Ministries of Education in Southern and Eastern Africa began to work together in order to address training and research needs in Education. The focus for this work was on establishing long-term strategies for building the capacity of educational planners to monitor and evaluate the quality of their basic education systems. The first two educational policy research projects undertaken by SACMEQ (widely known as "SACMEQ I" and "SACMEQ II") were designed to provide detailed information that could be used to guide planning decisions aimed at improving the quality of education in primary school systems.

    During 1995-1998 seven Ministries of Education participated in the SACMEQ I Project. The SACMEQ II Project commenced in 1998 and the surveys of schools, involving 14 Ministries of Education, took place between 2000 and 2004. The survey was undertaken in schools in Botswana, Kenya, Lesotho, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, Swaziland, Tanzania, Uganda, Zambia and Zanzibar.

    Moving from the SACMEQ I Project (covering around 1100 schools and 20,000 pupils) to the SACMEQ II Project (covering around 2500 schools and 45,000 pupils) resulted in a major increase in the scale and complexity of SACMEQ's research and training programmes.

    SACMEQ's mission is to: a) Expand opportunities for educational planners to gain the technical skills required to monitor and evaluate the quality of their education systems; and b) Generate information that can be used by decision-makers to plan and improve the quality of education.

    Geographic coverage

    National coverage

    Analysis unit

    • Pupils
    • Teachers
    • Schools

    Universe

    The target population for SACMEQ's Initial Project was defined as "all pupils at the Grade 6 level in 1995 who were attending registered government or non-government schools". Grade 6 was chosen because it was the grade level where the basics of reading literacy were expected to have been acquired.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling: The "best" sample design for a particular project is one that provides levels of sampling accuracy that are acceptable in terms of the main aims of the project, while simultaneously limiting cost, logistic, and procedural demands to manageable levels. The major constraints that were established prior to the preparation of the sample designs for the SACMEQ II Project are listed below.

    Target Population: The target population definitions should focus on Grade 6 pupils attending registered mainstream government or non-government schools. In addition, the defined target population should be constructed by excluding no more than 5 percent of pupils from the desired target population.

    Bias Control: The sampling should conform to the accepted rules of scientific probability sampling. That is, the members of the defined target population should have a known and non-zero probability of selection into the sample so that any potential for bias in sample estimates due to variations from "epsem sampling" (equal probability of selection method) could be addressed through the use of appropriate sampling weights.

    Sampling Errors: The sample estimates for the main criterion variables should conform to the sampling accuracy requirement that the standard error of sampling for the pupil tests should be of a magnitude that is equal to, or smaller than, what would be achieved by employing a simple random sample of 400 pupils.

    Response Rates: Each SACMEQ country should aim to achieve an overall response rate for pupils of 80 percent. This figure was based on the wish to achieve or exceed a response rate of 90 percent for schools and a response rate of 90 percent for pupils within schools.

    Administrative and Financial Costs: The number of schools selected in each country should recognise limitations in the administrative and financial resources available for data collection.

    Other Constraints: The number of learners selected to participate in the data collection in each selected school should be set at a level that will maximise validity of the within-school data collection for the learner reading and mathematics tests.

    For Namibia, the desired target population was all learners enrolled in Grade 6 in the ninth month of the school year (i.e. in September 2000). The net enrolment ratio for the age group 7-13 years old who were enrolled in Grades 1 to 7 in Namibia in 2000 was 91.3 percent. However, in Namibia it was decided to exclude certain learners. These were learners in schools having fewer than 15 Grade 6 learners in them, learners in 'inaccessible' schools, and learners in special schools. In all, 884 learners from 82 schools were excluded, but this only amounted to 1.8 percent of all learners. In Namibia there were 849 primary schools having 48,567 learners. After excluding the 1.8 percent of learners, the defined population from which a sample had to be drawn consisted of 47,683 learners from 767 schools.
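The exclusion figures in this paragraph can be checked with a few lines of arithmetic. The variable names below are illustrative; all numbers come from the paragraph above.

```python
# Desired target population for Namibia, as reported in the text.
total_learners, total_schools = 48_567, 849
excluded_learners, excluded_schools = 884, 82

# Defined population after exclusions.
defined_learners = total_learners - excluded_learners
defined_schools = total_schools - excluded_schools

# Share of learners excluded, to compare with the 5 percent ceiling.
excluded_pct = 100 * excluded_learners / total_learners
```

The excluded share stays well below the 5 percent ceiling set in the Target Population constraint.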

    The number of schools required in the sample is in part a function of the intra-class correlation (rho), which indicates the proportion of total variation (in achievement in this case) that lies among schools. The following is the formula often used for estimating the value of rho in situations where two-stage cluster sampling is employed using (approximately) equal sized clusters.

    \[ \hat{\rho} = \frac{b \, s_a^2 - s^2}{(b - 1) \, s^2} \]

    where \( s_a^2 \) is the variance of cluster means, \( s^2 \) is the variance of the element values, and \( b \) is the cluster size. In SACMEQ I the rho had been 0.60 in Namibia. That is, 60 percent of the variation was among schools and only 40 percent within schools. Therefore, in the case of Namibia a rho of 0.60 was used. This meant drawing a sample of 248 schools.
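The formula above, and the step from rho to a number of schools, can be sketched in Python. Two things in this sketch are assumptions that the text does not state: the variances are computed as population variances, and the school count uses the standard design-effect relation deff = 1 + (b - 1) * rho with 20 learners sampled per school. Under those assumptions, a rho of 0.60 and an effective sample of 400 give the 248 schools quoted above.

```python
import math
from statistics import mean, pvariance


def estimate_rho(clusters):
    """Estimate rho from equal-sized clusters using the formula in the text.

    Population variances are an assumption; the text does not say which
    variance estimator was used.
    """
    b = len(clusters[0])
    if any(len(c) != b for c in clusters):
        raise ValueError("clusters must all have the same size")
    s_a2 = pvariance([mean(c) for c in clusters])     # variance of cluster means
    s2 = pvariance([x for c in clusters for x in c])  # variance of element values
    return (b * s_a2 - s2) / ((b - 1) * s2)


def schools_required(rho, cluster_size, effective_n=400):
    """Schools needed so the design matches a simple random sample of effective_n.

    Uses deff = 1 + (b - 1) * rho, a standard relation not quoted in the text.
    """
    deff = 1 + (cluster_size - 1) * rho
    total_learners = effective_n * deff
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(total_learners / cluster_size, 9))
```

For example, `schools_required(0.60, 20)` returns 248. The two extreme cases are a useful sanity check for `estimate_rho`: clusters that are internally identical give a rho of 1, and clusters with identical means give a negative rho.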

    The major aim of the sampling was to have the equivalent of a simple random sample of 400 learners. In Namibia, this was 767 for reading achievement and 810 for mathematics. Hence the sample was a very good one for Namibia. For SACMEQ I it had been 335 which was below the required 400. This was because SACMEQ I was the first sample survey in Namibia and at that time it was assumed that the rho was 0.30. It was not. In SACMEQ II the rhos were 0.60 for reading and 0.53 for mathematics. Thus, in 2000 the variation among schools was slightly lower than in 1995.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The data collection for SACMEQ’s Initial Project took place in October 1995 and involved the administration of questionnaires to pupils, teachers, and school heads. The pupil questionnaire contained questions about the pupils’ home backgrounds and their school life; the teacher questionnaire asked about classrooms, teaching practices, working conditions, and teacher housing; and the school head questionnaire collected information about teachers, enrolments, buildings, facilities, and management. A reading literacy test was also given to the pupils. The test was based on items that were selected after a trial-testing programme had been completed.

    Cleaning operations

    Data entry and data cleaning: A team of five persons from the University of Namibia Multi-Disciplinary Research Centre computer lab was appointed and trained in the use of WINDEM, a special data entry package to be used in SACMEQ. The numbers of keystrokes required to enter one copy of each data collection instrument were as follows: learner questionnaire: 150; learner reading test: 85; learner mathematics test: 65; teacher questionnaire: 587; teacher reading test: 51; teacher mathematics test: 43; school head questionnaire: 319; school form: 58; and learner name form: 51.

    In the case of Namibia the total number of keystrokes was as follows: learner questionnaire: 762,600; learner reading test: 429,080; learner mathematics test: 328,250; teacher questionnaire: 358,657; teacher reading test: 15,504; teacher mathematics test: 14,061; school head questionnaire: 86,130; school form: 39,150; and learner name form: 259,284. That is, a total of 2,292,716 keystrokes were required to enter all of the data for Namibia.

    An experienced keyboard operator can work at a rate of 25 keystrokes per minute (working from multi-paged questionnaires and stopping occasionally to clarify individual questionnaire entries with the supervisor). Assuming that this kind of work rate could be sustained for, say, around a maximum of six hours per day, then the whole data entry operation for Namibia was estimated to amount to around 255 person days of data entry work. This implied an estimated 10 weeks of work for the 5-person data entry team that operated in Namibia. However, the work was completed in 7 weeks because the data enterers worked extra hours.
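The workload estimate above can be re-derived from the keystroke totals. The sketch below uses only figures from the text, except for the five-day working week, which is an assumption made here to convert person days into team weeks.

```python
# Total keystrokes per instrument for Namibia, as listed in the text.
KEYSTROKES = {
    "learner questionnaire": 762_600,
    "learner reading test": 429_080,
    "learner mathematics test": 328_250,
    "teacher questionnaire": 358_657,
    "teacher reading test": 15_504,
    "teacher mathematics test": 14_061,
    "school head questionnaire": 86_130,
    "school form": 39_150,
    "learner name form": 259_284,
}

KEYSTROKES_PER_MINUTE = 25  # operator rate quoted in the text
HOURS_PER_DAY = 6           # sustained working hours quoted in the text
TEAM_SIZE = 5               # data entry team size quoted in the text
WORKDAYS_PER_WEEK = 5       # assumption: not stated in the text

total_keystrokes = sum(KEYSTROKES.values())
person_days = total_keystrokes / (KEYSTROKES_PER_MINUTE * 60 * HOURS_PER_DAY)
team_weeks = person_days / TEAM_SIZE / WORKDAYS_PER_WEEK
```

The total matches the 2,292,716 keystrokes reported, and the result rounds to 255 person days, or about 10 weeks for the five-person team.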

    At the end of this procedure the data files were sent by email to the unit 'Monitoring Educational Quality' at the IIEP in Paris. Many consistency checks were made for many variables as well as for the identification codes used. The IIEP team had many queries. The first data files were sent to Paris in May 2001 and after nine to-ings and fro-ings the files were finally declared to be clean on 25 January 2002.

    Response rate

    Response rates for pupils and schools respectively were 91.8 percent and 100 percent. The reason for the shortfall in learner numbers was absenteeism by some learners in some of the schools on the day of data collection. However, sampling weights were used to correct for disproportionality among strata in the calculation
